AI Benchmark Digest — 2026-05-29


            
May 29, 2026

=== DAILY ===
NEW BENCHMARKS (1)
  - DeepSWE (Pass@1 (%)): leader GPT-5.5 (xHigh) (70.0), 12 models
      DataCurve benchmark measuring frontier coding agents on original, long-horizon software engineering tasks. Reports pass rates for model configurations on realistic repository work.
NEW MODELS (1)
  - Claude Opus 4.8 — ELO 1801, #52/869 (above: DeepSeek V4 Flash (Max), below: GPT-5.2 (Medium))
      Clerk LLM Leaderboard: 91.3 (#1/19)
      Vellum - HumanEval: 88.6 (#1/36)
      Vellum - Humanity's Last Exam: 57.9 (#1/20)
      LLM Stats (DeepSearchQA): 93.1 (#1/6)
      LLM Stats (Include): 87.6 (#1/30)
      LLM Stats (OSWorld-Verified): 83.4 (#1/14)
      LLM Stats (ScreenSpot Pro): 87.9 (#1/22)
      LLM Stats (Toolathlon): 59.9 (#1/20)
      FrontierSWE: 83.0 (#1/11)
      Vals AI (Vals Index): 70.17 (#1/20)
NEW SCORES FROM TOP-10 MODELS (2)
  - GPT-5.5 (High) on WebDev Arena: 1478.93 Arena Score (#16/67)
  - GPT-5.5 (xHigh) on WebDev Arena: 1504.74 Arena Score (#12/67)
NEW #1 LEADERS (16)
  - AA GDPval (ELO): Claude Opus 4.8 (Adaptive Reasoning, Max Effort) (1889.8) beat GPT-5.5 (xHigh) (1769.3) by 120.5
  - Vellum - Humanity's Last Exam (Accuracy (%)): Claude Opus 4.8 (57.9) beat Gemini 3 Pro (45.8) by 12.1
  - Clerk LLM Leaderboard (Avg score (%)): Claude Opus 4.8 (91.3) beat GPT-5.4 (79.5) by 11.8
  - Vals AI Vibe Code Bench (Accuracy (%)): Claude Opus 4.8 (82.72) beat Claude Opus 4.7 (71.0) by 11.72
  - Epoch AI - Apex Agents (Score): gemini-3.5-flash_unknown (49.6) beat GPT-5.5 (xHigh) (38.4) by 11.2
  - LLM Stats (OSWorld-Verified) (Score (%)): Claude Opus 4.8 (83.4) beat Claude Mythos Preview (79.6) by 3.8
  - LLM Stats (Toolathlon) (Score (%)): Claude Opus 4.8 (59.9) beat Gemini 3.5 Flash (56.5) by 3.4
  - Vals AI Multimodal Index (Accuracy (%)): Claude Opus 4.8 (70.71) beat GPT-5.5 (67.77) by 2.94
  - Vals AI (Vals Index) (Accuracy (%)): Claude Opus 4.8 (70.17) beat GPT-5.5 (67.62) by 2.55
  - LLM Stats (DeepSearchQA) (Score (%)): Claude Opus 4.8 (93.1) beat Claude Opus 4.6 (91.3) by 1.8
  - LLM Stats (ScreenSpot Pro) (Score (%)): Claude Opus 4.8 (87.9) beat GPT-5.2 (86.3) by 1.6
  - LLM Stats (Include) (Score (%)): Claude Opus 4.8 (87.6) beat Qwen 3.7 Max (86.2) by 1.4
  - Artificial Analysis Intelligence Index (Intelligence Index): Claude Opus 4.8 (Adaptive Reasoning, Max Effort) (61.44) beat GPT-5.5 (xHigh) (60.24) by 1.2
  - PinchBench (Success Rate (%)): Claude Opus 4.8 Fast (94.49) beat Qwen Max (93.44) by 1.05
  - AA Humanity's Last Exam (Accuracy (%)): Claude Opus 4.8 (Adaptive Reasoning, Max Effort) (45.74) beat Gemini 3.1 Pro (Preview) (44.72) by 1.02
  - Vellum - HumanEval (Pass@1 (%)): Claude Opus 4.8 (88.6) beat Claude Opus 4.7 (87.6) by 1.0
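
The "by" figure in each entry above is the new leader's score minus the previous leader's score, and the entries are listed from largest margin to smallest. A minimal sketch of that arithmetic, using three entries from the list; the `LeaderChange` type and `margin` helper are hypothetical names for illustration, not part of any digest tooling:

```python
from typing import NamedTuple


class LeaderChange(NamedTuple):
    """One NEW #1 LEADERS entry (hypothetical structure)."""
    benchmark: str
    leader_score: float
    runner_up_score: float


def margin(change: LeaderChange) -> float:
    # Round to two decimals so float subtraction matches the digest's figures.
    return round(change.leader_score - change.runner_up_score, 2)


# Scores copied from the list above, deliberately out of order.
changes = [
    LeaderChange("Vellum - HumanEval", 88.6, 87.6),
    LeaderChange("AA GDPval", 1889.8, 1769.3),
    LeaderChange("Vals AI Vibe Code Bench", 82.72, 71.0),
]

# Largest margin first, matching the order used in the digest.
ranked = sorted(changes, key=margin, reverse=True)

for change in ranked:
    print(f"{change.benchmark}: +{margin(change)}")
```

Margins are in each benchmark's own unit (ELO points for AA GDPval, percentage points for most others), so the ordering compares raw differences rather than normalized ones.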
