Mikhail Doroshenko

June 6, 2026

AI Benchmark Digest — 2026-06-06


=== DAILY ===

NEW BENCHMARKS (20)

- Pencil Puzzle Bench - Yajilin (Direct-ask Success Rate (%)): leader gpt-5.2 (High) (20.0), 51 models. PPBench direct-ask success rate on Yajilin loop-and-shading puzzles from the golden_300 split, testing exact constraint solving from puzz.link grids.
- Pencil Puzzle Bench - Slitherlink (Direct-ask Success Rate (%)): leader gpt-5.2 (High) (33.3), 51 models. PPBench direct-ask success rate on Slitherlink loop puzzles, where numbered cells constrain how a single continuous loop surrounds the grid.
- Pencil Puzzle Bench - Heyawake (Direct-ask Success Rate (%)): leader claude-opus-4-5-high (0.0), 51 models. PPBench direct-ask success rate on Heyawake room-shading puzzles, testing region constraints, connectivity, and line-of-sight reasoning.
- Pencil Puzzle Bench - Mashu (Direct-ask Success Rate (%)): leader gpt-5.2 (xHigh) (60.0), 51 models. PPBench direct-ask success rate on Mashu loop puzzles, where black and white pearls impose turn and straight-line constraints.
- Pencil Puzzle Bench - Shakashaka (Direct-ask Success Rate (%)): leader claude-sonnet-4-5 (0.0), 51 models. PPBench direct-ask success rate on Shakashaka triangle-shading puzzles, testing local clue satisfaction and global rectangle formation.
- Pencil Puzzle Bench - Nurikabe (Direct-ask Success Rate (%)): leader claude-opus-4-6 (Thinking) (33.3), 51 models. PPBench direct-ask success rate on Nurikabe island puzzles, where numbered islands must be separated by one connected wall region.
- Pencil Puzzle Bench - LITS (Direct-ask Success Rate (%)): leader gpt-5.2 (xHigh) (53.3), 51 models. PPBench direct-ask success rate on LITS tetromino-shading puzzles, testing region-wise shape placement and adjacency constraints.
- Pencil Puzzle Bench - Light Up (Direct-ask Success Rate (%)): leader claude-opus-4-6 (Thinking) (66.7), 51 models. PPBench direct-ask success rate on Light Up puzzles, where lamps must illuminate every open cell while satisfying numbered black-cell clues.
- Pencil Puzzle Bench - Nurimisaki (Direct-ask Success Rate (%)): leader claude-opus-4-6 (Thinking) (33.3), 51 models. PPBench direct-ask success rate on Nurimisaki puzzles, a Nurikabe-family grid task requiring connected-region reasoning around clue cells.
- Pencil Puzzle Bench - Shikaku (Direct-ask Success Rate (%)): leader gpt-5.2 (xHigh) (80.0), 51 models. PPBench direct-ask success rate on Shikaku rectangle-partitioning puzzles, where each numbered clue defines one rectangle of matching area.
- Pencil Puzzle Bench - Norinori (Direct-ask Success Rate (%)): leader gpt-5.2 (xHigh) (93.3), 51 models. PPBench direct-ask success rate on Norinori shading puzzles, testing room constraints and two-cell adjacency patterns.
- Pencil Puzzle Bench - Double Choco (Direct-ask Success Rate (%)): leader gemini-3.1-pro (6.7), 51 models. PPBench direct-ask success rate on Double Choco region-division puzzles, testing balanced partitioning under color and shape constraints.
- Pencil Puzzle Bench - Firefly (Direct-ask Success Rate (%)): leader gpt-5.2 (xHigh) (33.3), 51 models. PPBench direct-ask success rate on Firefly line-drawing puzzles, testing path construction from directional clues and grid constraints.
- Pencil Puzzle Bench - Sashigane (Direct-ask Success Rate (%)): leader mistral-large-2512 (0.0), 51 models. PPBench direct-ask success rate on Sashigane shape-partitioning puzzles, testing right-angle region construction from numbered and directional clues.
- Pencil Puzzle Bench - Sudoku (Direct-ask Success Rate (%)): leader gpt-5.2 (xHigh) (20.0), 51 models. PPBench direct-ask success rate on Sudoku puzzles, testing classic row, column, and box constraint satisfaction through exact move outputs.
- Pencil Puzzle Bench - Nurimaze (Direct-ask Success Rate (%)): leader claude-opus-4-6 (Thinking) (26.7), 51 models. PPBench direct-ask success rate on Nurimaze puzzles, testing maze-style path and shading constraints in a connected grid.
- Pencil Puzzle Bench - Tapa (Direct-ask Success Rate (%)): leader gpt-5.2 (xHigh) (60.0), 51 models. PPBench direct-ask success rate on Tapa shading puzzles, where clue numbers describe blocks of shaded neighboring cells.
- Pencil Puzzle Bench - Kurodoko (Direct-ask Success Rate (%)): leader gpt-5.2 (xHigh) (6.7), 51 models. PPBench direct-ask success rate on Kurodoko visibility puzzles, testing shading, sight-line counts, and connected unshaded cells.
- Pencil Puzzle Bench - Country (Direct-ask Success Rate (%)): leader gemini-3.1-pro (6.7), 51 models. PPBench direct-ask success rate on Country region puzzles, testing loop and region constraints over a partitioned grid.
- Pencil Puzzle Bench - Hitori (Direct-ask Success Rate (%)): leader claude-opus-4-6 (Thinking) (66.7), 51 models. PPBench direct-ask success rate on Hitori number-grid puzzles, where repeated numbers are shaded while preserving connectivity and non-adjacency constraints.
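
A note on reading the success rates above: every reported value (0.0, 6.7, 20.0, 26.7, 33.3, 53.3, 60.0, 66.7, 80.0, 93.3) matches a whole number of solved puzzles out of 15, rounded to one decimal. The digest does not state the per-type puzzle count, so the figure of 15 is an assumption inferred from the numbers, and the helper name `success_rate` is hypothetical. A minimal sketch of that arithmetic:

```python
def success_rate(solved: int, total: int = 15) -> float:
    """Return a success rate in percent, rounded to one decimal place.

    total=15 is an assumed per-puzzle-type sample size; the digest
    does not state it.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= solved <= total:
        raise ValueError("solved must be between 0 and total")
    return round(100.0 * solved / total, 1)


# Under the 15-puzzle assumption, one extra solved puzzle moves a score
# by about 6.7 points.
step = success_rate(1)
```

If that assumption holds, a leader at 6.7 solved a single puzzle, and differences of one step between models are within the granularity of the metric.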

NEW #1 LEADERS (24)

- LLM Stats (Multi-Challenge) (Score (%)): Nova 2 Pro (77.7) beat GPT-5 (69.6) by 8.1
- Ukrainian LLM - Global MMLU Full UK World Religions (Score (%)): MamayLM-Gemma-3-27B-IT-v2.0 (87.13) beat gemma-3-12B-pt (79.53) by 7.6
- Ukrainian LLM - Global MMLU Full UK High School US History (Score (%)): MamayLM-Gemma-3-27B-IT-v2.0 (91.67) beat MamayLM-Gemma-3-12B-IT-v1.0 (86.27) by 5.4
- Ukrainian LLM - Global MMLU Full UK Anatomy (Score (%)): MamayLM-Gemma-3-27B-IT-v2.0 (65.19) beat lapa-12B-pt (60.0) by 5.19
- Ukrainian LLM - Global MMLU Full UK Clinical Knowledge (Score (%)): MamayLM-Gemma-3-27B-IT-v2.0 (77.74) beat gemma-3-12B-pt (73.21) by 4.53
- Ukrainian LLM - Global MMLU Full UK Professional LAW (Score (%)): MamayLM-Gemma-3-27B-IT-v2.0 (51.5) beat gemma-3-12B-pt (47.07) by 4.43
- Ukrainian LLM - Global MMLU Full UK Humanities (Score (%)): MamayLM-Gemma-3-27B-IT-v2.0 (61.68) beat Qwen3-8B-Base (57.56) by 4.12
- Ukrainian LLM - Global MMLU Full UK Computer Security (Score (%)): MamayLM-Gemma-3-12B-IT-v2.0 (82.0) beat MamayLM-Gemma-3-12B-IT-v1.0-FP8-Static-Nadiia (78.0) by 4.0
- Ukrainian LLM - Global MMLU Full UK Global Facts (Score (%)): MamayLM-Gemma-3-27B-IT-v2.0 (52.0) beat Gemma 3 12B (IT) (48.0) by 4.0
- Ukrainian LLM - Global MMLU Full UK Miscellaneous (Score (%)): MamayLM-Gemma-3-27B-IT-v2.0 (83.52) beat MamayLM-Gemma-3-12B-IT-v1.0-FP8-Static-Nadiia (79.57) by 3.95
- Ukrainian LLM - Global MMLU Full UK Prehistory (Score (%)): MamayLM-Gemma-3-27B-IT-v2.0 (77.78) beat gemma-3-12B-pt (74.07) by 3.71
- Ukrainian LLM - Global MMLU Full UK Other (Score (%)): MamayLM-Gemma-3-27B-IT-v2.0 (74.57) beat gemma-3-12B-pt (71.16) by 3.41
- Ukrainian LLM - Global MMLU Full UK Business Ethics (Score (%)): MamayLM-Gemma-3-12B-IT-v2.0 (77.0) beat MamayLM-Gemma-3-12B-IT-v1.0 (74.0) by 3.0
- Ukrainian LLM - Global MMLU Full UK High School World History (Score (%)): MamayLM-Gemma-3-27B-IT-v2.0 (86.08) beat gemma-3-12B-pt (84.39) by 1.69
- Ukrainian LLM - Global MMLU Full UK High School Microeconomics (Score (%)): MamayLM-Gemma-3-27B-IT-v2.0 (84.45) beat Qwen3-8B-Base (82.77) by 1.68
- Ukrainian LLM - Global MMLU Full UK Marketing (Score (%)): MamayLM-Gemma-3-27B-IT-v2.0 (88.89) beat MamayLM-Gemma-3-12B-IT-v1.0 (87.61) by 1.28
- Ukrainian LLM - Global MMLU Full UK Professional Psychology (Score (%)): MamayLM-Gemma-3-27B-IT-v2.0 (70.1) beat gemma-3-12B-pt (69.12) by 0.98
- Ukrainian LLM - Global MMLU Full UK Public Relations (Score (%)): MamayLM-Gemma-3-12B-IT-v2.0 (68.18) beat lapa-12B-pt (67.27) by 0.91
- Ukrainian LLM - Global MMLU Full UK High School European History (Score (%)): MamayLM-Gemma-3-27B-IT-v2.0 (84.24) beat MamayLM-Gemma-3-12B-IT-v1.0-FP8-Static-Nadiia (83.64) by 0.6
- Ukrainian LLM - Global MMLU Full UK High School Macroeconomics (Score (%)): MamayLM-Gemma-3-27B-IT-v2.0 (76.67) beat gemma-3-12B-pt (76.15) by 0.52
- Ukrainian LLM - Global MMLU Full UK Sociology (Score (%)): MamayLM-Gemma-3-27B-IT-v2.0 (83.08) beat lapa-v0.1.2-instruct (82.59) by 0.49
- LLM Stats (OmniDocBench 1.5) (Score (%)): MiniMax-M3 (91.6) beat Qwen 3.6 Plus (91.2) by 0.4
- Ukrainian LLM - Global MMLU Full UK Professional Medicine (Score (%)): MamayLM-Gemma-3-27B-IT-v2.0 (80.15) beat gemma-3-12B-pt (79.78) by 0.37
- ForecastBench (Overall Score (higher is better)): Grok 4.20 (Beta, D) (68.1) beat green-tree (67.8) by 0.3
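
Each leader entry above follows the pattern "LEADER (score) beat RUNNER-UP (score) by MARGIN", where the margin is the leader's score minus the runner-up's. The sketch below re-derives that margin from the result part of an entry (the text after the benchmark name and colon). The function name `check_margin` and the regular expression are illustrative assumptions, not part of any AI Benchmark Hub tooling. `Decimal` is used so that values like 68.1 - 67.8 compare exactly.

```python
import re
from decimal import Decimal

# Assumed shape of the result part of an entry. Model names may contain
# their own parentheses, e.g. "Grok 4.20 (Beta, D)".
ENTRY = re.compile(
    r"^(?P<leader>.+) \((?P<lead>\d+(?:\.\d+)?)\) beat "
    r"(?P<runner_up>.+) \((?P<second>\d+(?:\.\d+)?)\) by "
    r"(?P<margin>\d+(?:\.\d+)?)$"
)


def check_margin(result: str) -> bool:
    """Return True if the stated margin equals leader minus runner-up."""
    match = ENTRY.match(result.strip())
    if match is None:
        raise ValueError(f"unrecognized entry: {result!r}")
    lead = Decimal(match["lead"])
    second = Decimal(match["second"])
    return lead - second == Decimal(match["margin"])


ok = check_margin("Grok 4.20 (Beta, D) (68.1) beat green-tree (67.8) by 0.3")
```

A check like this is mainly useful for catching transcription slips when entries are copied between digests.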


View on AI Benchmark Hub
