AI Benchmark Digest — 2026-05-24


            
May 24, 2026

=== DAILY ===
NEW BENCHMARKS (14)
  - NanoGPT-Bench (% of Human Progress Recovered): leader Claude Opus 4.6 (9.3), 2 models
      Autonomous research benchmark built on the NanoGPT Speedrun, measuring how much of five months of human pretraining-speedup progress coding agents recover under a fixed H100 compute budget.
  - CursorBench 3.1 (Score (%)): leader Claude Opus 4.7 (64.8), 7 models
      Cursor benchmark of ambiguous, multi-file coding tasks from real Cursor sessions, with models scored by task success percentage and average cost per task.
  - SMDD-Bench (Pass Rate (%)): leader GPT-5.4 (Medium) (40.2), 7 models
      Small molecule drug design agent benchmark with sandboxed Python, Boltz structure prediction, and ADMET tooling. Measures pass rate across 502 computationally verifiable chemistry tasks.
  - SMDD-Bench Diversity (Avg Successful): leader Claude Sonnet 4.6 (8.4), 7 models
      SMDD-Bench diversity slice measuring whether agents generate multiple distinct, novel, successful molecule designs across repeated Lead Optimization rollouts.
  - Blueprint-Bench 2 (Connectivity Similarity Score): leader GPT-5.5 (0.362), 12 models
      Andon Labs spatial reasoning benchmark where agents convert apartment photographs into 2D floor plans, scored by normalized connectivity similarity against ground truth layouts.
  - PACT (Lechmazur) (CMS Points): leader GPT-5.5 (high) (59.0), 25 models
      Pairwise Auction Conversation Testbed for multi-round buyer-seller bargaining. LLMs negotiate over 20 rounds with hidden private values, scored by Composite Model Score from head-to-head surplus capture.
  - FormationEval (Accuracy (%)): leader gemini-3-pro-preview (99.8), 72 models
  - Chinese Classical Bench (Average Score (%)): leader claude-opus-4-7 (66.21), 10 models
  - Chinese Classical Bench - Translate Judge (Score (%)): leader claude-opus-4-7-thinking (80.2), 10 models
  - Chinese Classical Bench - Punctuate Punct F1 (Score (%)): leader claude-opus-4-7 (80.02), 10 models
  - Chinese Classical Bench - Char-Gloss Judge (Score (%)): leader claude-opus-4-7-thinking (73.6), 10 models
  - Chinese Classical Bench - Idiom-Source Book EM (Score (%)): leader deepseek-3.2 (74.0), 10 models
  - Chinese Classical Bench - Fill-In Exact (Score (%)): leader claude-opus-4-7-thinking (88.0), 10 models
  - Chinese Classical Bench - Compress Efficiency (Score (%)): leader deepseek-3.2 (16.32), 9 models
NEW SCORES FROM TOP-10 MODELS (1)
  - Gemini 3.1 Pro (High) on CLBench: 20.8 Solving Rate (%) (#8/36)
NEW #1 LEADERS (5)
  - Evals for Every Language (Average Score (%)): Gemini 3.1 Pro (69.11) beat Gemini 2.5 Flash (62.59) by 6.52
  - CLBench (Solving Rate (%)): GPT-5.4 (xHigh) (27.9) beat GPT-5.1 (High) (23.7) by 4.2
  - LiveBench Logic With Navigation (Score): Qwen Max (84.0) beat Claude Opus 4.6 (Thinking) (80.0) by 4.0
  - Spider 2.0-Lite (Accuracy (%)): DivSkill-SQL (73.13) beat SOMA-SQL (72.02) by 1.11
  - PinchBench (Success Rate (%)): Grok 0.1 (92.07) beat Claude Opus 4.7 (91.58) by 0.49
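
The "beat ... by" margins above are plain score differences. As a minimal sketch for readers who want to recompute or sort them, the snippet below copies the five entries from this list; the names `LEADER_CHANGES`, `margin`, and `margins` are illustrative and not part of any AI Benchmark Hub tooling.

```python
# Leader changes copied from the list above:
# (benchmark, new leader, new score, previous leader, previous score)
LEADER_CHANGES = [
    ("Evals for Every Language", "Gemini 3.1 Pro", 69.11, "Gemini 2.5 Flash", 62.59),
    ("CLBench", "GPT-5.4 (xHigh)", 27.9, "GPT-5.1 (High)", 23.7),
    ("LiveBench Logic With Navigation", "Qwen Max", 84.0, "Claude Opus 4.6 (Thinking)", 80.0),
    ("Spider 2.0-Lite", "DivSkill-SQL", 73.13, "SOMA-SQL", 72.02),
    ("PinchBench", "Grok 0.1", 92.07, "Claude Opus 4.7", 91.58),
]


def margin(new_score, old_score, ndigits=2):
    """Absolute lead of the new #1 over the previous #1, in the benchmark's own units."""
    # Rounding hides floating-point noise such as 4.199999999999999.
    return round(new_score - old_score, ndigits)


def margins(changes):
    """Map each benchmark name to its lead margin."""
    return {name: margin(new, old) for name, _, new, _, old in changes}


if __name__ == "__main__":
    # Print the largest lead first.
    for name, lead in sorted(margins(LEADER_CHANGES).items(), key=lambda kv: -kv[1]):
        print(f"{name}: +{lead}")
```

The margins are in each benchmark's own metric, so they are not directly comparable across benchmarks: a 0.49-point lead in success rate and a 4.0-point lead in a raw score measure different things.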
