AI Benchmark Digest — 2026-05-29
=== DAILY ===

NEW BENCHMARKS (1)
- DeepSWE (Pass@1 (%)): leader GPT-5.5 (xHigh) (70.0), 12 models
  DataCurve benchmark measuring frontier coding agents on original, long-horizon software engineering tasks. Reports pass rates for model configurations on realistic repository work.

NEW MODELS (1)
- Claude Opus 4.8 — ELO 1801, #52/869 (above: DeepSeek V4 Flash (Max), below: GPT-5.2 (Medium))
  Clerk LLM Leaderboard: 91.3 (#1/19)
  Vellum - HumanEval: 88.6 (#1/36)
  Vellum - Humanity's Last Exam: 57.9 (#1/20)
  LLM Stats (DeepSearchQA): 93.1 (#1/6)
  LLM Stats (Include): 87.6 (#1/30)
  LLM Stats (OSWorld-Verified): 83.4 (#1/14)
  LLM Stats (ScreenSpot Pro): 87.9 (#1/22)
  LLM Stats (Toolathlon): 59.9 (#1/20)
  FrontierSWE: 83.0 (#1/11)
  Vals AI (Vals Index): 70.17 (#1/20)

NEW SCORES FROM TOP-10 MODELS (2)
- GPT-5.5 (High) on WebDev Arena: 1478.93 Arena Score (#16/67)
- GPT-5.5 (xHigh) on WebDev Arena: 1504.74 Arena Score (#12/67)

NEW #1 LEADERS (16)
- AA GDPval (ELO): Claude Opus 4.8 (Adaptive Reasoning, Max Effort) (1889.8) beat GPT-5.5 (xHigh) (1769.3) by 120.5
- Vellum - Humanity's Last Exam (Accuracy (%)): Claude Opus 4.8 (57.9) beat Gemini 3 Pro (45.8) by 12.1
- Clerk LLM Leaderboard (Avg score (%)): Claude Opus 4.8 (91.3) beat GPT-5.4 (79.5) by 11.8
- Vals AI Vibe Code Bench (Accuracy (%)): Claude Opus 4.8 (82.72) beat Claude Opus 4.7 (71.0) by 11.72
- Epoch AI - Apex Agents (Score): gemini-3.5-flash_unknown (49.6) beat GPT-5.5 (xHigh) (38.4) by 11.2
- LLM Stats (OSWorld-Verified) (Score (%)): Claude Opus 4.8 (83.4) beat Claude Mythos Preview (79.6) by 3.8
- LLM Stats (Toolathlon) (Score (%)): Claude Opus 4.8 (59.9) beat Gemini 3.5 Flash (56.5) by 3.4
- Vals AI Multimodal Index (Accuracy (%)): Claude Opus 4.8 (70.71) beat GPT-5.5 (67.77) by 2.94
- Vals AI (Vals Index) (Accuracy (%)): Claude Opus 4.8 (70.17) beat GPT-5.5 (67.62) by 2.55
- LLM Stats (DeepSearchQA) (Score (%)): Claude Opus 4.8 (93.1) beat Claude Opus 4.6 (91.3) by 1.8
- LLM Stats (ScreenSpot Pro) (Score (%)): Claude Opus 4.8 (87.9) beat GPT-5.2 (86.3) by 1.6
- LLM Stats (Include) (Score (%)): Claude Opus 4.8 (87.6) beat Qwen 3.7 Max (86.2) by 1.4
- Artificial Analysis Intelligence Index (Intelligence Index): Claude Opus 4.8 (Adaptive Reasoning, Max Effort) (61.44) beat GPT-5.5 (xHigh) (60.24) by 1.2
- PinchBench (Success Rate (%)): Claude Opus 4.8 Fast (94.49) beat Qwen Max (93.44) by 1.05
- AA Humanity's Last Exam (Accuracy (%)): Claude Opus 4.8 (Adaptive Reasoning, Max Effort) (45.74) beat Gemini 3.1 Pro (Preview) (44.72) by 1.02
- Vellum - HumanEval (Pass@1 (%)): Claude Opus 4.8 (88.6) beat Claude Opus 4.7 (87.6) by 1.0