AI Benchmark Digest — 2026-06-12

=== DAILY ===
NEW BENCHMARKS (2)
- MathArena - ARXIV_FALSE May (Accuracy (%)): leader GPT-5.5 (xhigh) (50.0), 8 models
- MathArena - ARXIV May (Accuracy (%)): leader Claude-Fable-5 (max) (86.67), 8 models

NEW SCORES FROM TOP-10 MODELS (9)
- Claude Fable 5 on Lynchmark: 100.0 Pass Rate (%) (#1/13)
- Claude Fable 5 on MineBench: 1929.84 Elo Rating (#2/45)
- Claude Opus 4.8 on Chess Puzzles (Epoch AI): 34.0 Accuracy (%) (#12/43)
- Claude Opus 4.8 on Design Arena (Game Dev): 1250.0 Elo (#37/126)
- Claude Opus 4.8 on GRAB-Lite: 60.6 Overall Score (#6/38)
- Claude Opus 4.8 on OTIS Mock AIME 2024-25: 98.33 Accuracy (%) (#3/142)
- Claude Opus 4.8 on SimpleQA Verified: 39.5 Accuracy (%) (#24/52)
- GPT-5.5 on GRAB-Lite: 71.8 Overall Score (#2/38)
- Qwen 3.7 Max on Position Bias (Lechmazur): 34.8 Order Flip % (lower is better) (#10/36)

NEW #1 LEADERS (9)
- Chatbot Arena (Text-to-Video) (Elo): gemini-omni-flash (1527.0) beat dreamina-seedance-2.0-720p (1463.0) by 64.0
- Design Arena (UI Components) (Elo): Claude Fable 5 (1411.0) beat Claude Opus 4.7 (Thinking) (1355.0) by 56.0
- Design Arena (Game Dev) (Elo): Claude Fable 5 (1393.0) beat GPT-5.5 (1354.0) by 39.0
- Design Arena (SVG) (Elo): Claude Fable 5 (1384.0) beat prism (1366.0) by 18.0
- SEAL - SWE Atlas - Test Writing (Score): Fable-5 (Claude Code) xHigh (58.52) beat Opus 4.8 (Claude Code) (45.56) by 12.96
- MathArena - ARXIV April (Accuracy (%)): Claude 5 (70.73) beat GPT-5.5 (xHigh) (67.07) by 3.66
- GRAB-Lite (Overall Score): Claude Fable 5 (74.0) beat GPT-5.4 (71.0) by 3.0
- WeirdML (Average Score): Claude 5 (87.85) beat GPT-5.5 (xHigh) (84.91) by 2.94
- Chatbot Arena (Image-to-Video) (Elo): gemini-omni-flash (1475.0) beat Grok 1.5 (1473.0) by 2.0