Mikhail Doroshenko

May 24, 2026

AI Benchmark Digest — 2026-05-24

=== DAILY ===

NEW BENCHMARKS (14)
- NanoGPT-Bench (% of Human Progress Recovered): leader Claude Opus 4.6 (9.3), 2 models
  Autonomous research benchmark built on the NanoGPT Speedrun, measuring how much of five months of human pretraining-speedup progress coding agents recover under a fixed H100 compute budget.
- CursorBench 3.1 (Score (%)): leader Claude Opus 4.7 (64.8), 7 models
  Cursor benchmark of ambiguous, multi-file coding tasks from real Cursor sessions, with models scored by task success percentage and average cost per task.
- SMDD-Bench (Pass Rate (%)): leader GPT-5.4 (Medium) (40.2), 7 models
  Small molecule drug design agent benchmark with sandboxed Python, Boltz structure prediction, and ADMET tooling. Measures pass rate across 502 computationally verifiable chemistry tasks.
- SMDD-Bench Diversity (Avg Successful): leader Claude Sonnet 4.6 (8.4), 7 models
  SMDD-Bench diversity slice measuring whether agents generate multiple distinct, novel, successful molecule designs across repeated Lead Optimization rollouts.
- Blueprint-Bench 2 (Connectivity Similarity Score): leader GPT 5.5 (0.362), 12 models
  Andon Labs spatial reasoning benchmark where agents convert apartment photographs into 2D floor plans, scored by normalized connectivity similarity against ground truth layouts.
- PACT (Lechmazur) (CMS Points): leader GPT-5.5 (high) (59.0), 25 models
  Pairwise Auction Conversation Testbed for multi-round buyer-seller bargaining. LLMs negotiate over 20 rounds with hidden private values, scored by Composite Model Score from head-to-head surplus capture.
- FormationEval (Accuracy (%)): leader gemini-3-pro-preview (99.8), 72 models
- Chinese Classical Bench (Average Score (%)): leader claude-opus-4-7 (66.21), 10 models
- Chinese Classical Bench - Translate Judge (Score (%)): leader claude-opus-4-7-thinking (80.2), 10 models
- Chinese Classical Bench - Punctuate Punct F1 (Score (%)): leader claude-opus-4-7 (80.02), 10 models
- Chinese Classical Bench - Char-Gloss Judge (Score (%)): leader claude-opus-4-7-thinking (73.6), 10 models
- Chinese Classical Bench - Idiom-Source Book EM (Score (%)): leader deepseek-3.2 (74.0), 10 models
- Chinese Classical Bench - Fill-In Exact (Score (%)): leader claude-opus-4-7-thinking (88.0), 10 models
- Chinese Classical Bench - Compress Efficiency (Score (%)): leader deepseek-3.2 (16.32), 9 models

NEW SCORES FROM TOP-10 MODELS (1)
- Gemini 3.1 Pro (High) on CLBench: 20.8 Solving Rate (%) (#8/36)

NEW #1 LEADERS (5)
- Evals for Every Language (Average Score (%)): Gemini 3.1 Pro (69.11) beat Gemini 2.5 Flash (62.59) by 6.52
- CLBench (Solving Rate (%)): GPT-5.4 (xHigh) (27.9) beat GPT-5.1 (High) (23.7) by 4.2
- LiveBench Logic With Navigation (Score): Qwen Max (84.0) beat Claude Opus 4.6 (Thinking) (80.0) by 4.0
- Spider 2.0-Lite (Accuracy (%)): DivSkill-SQL (73.13) beat SOMA-SQL (72.02) by 1.11
- PinchBench (Success Rate (%)): Grok 0.1 (92.07) beat Claude Opus 4.7 (91.58) by 0.49


