Mikhail Doroshenko

June 10, 2026

AI Benchmark Digest — 2026-06-10

=== DAILY ===

NEW BENCHMARKS (1)
- SkateBench (Success Rate (%)): leader gemini-3.1-pro-preview (96.92), 28 models
  Skateboarding-domain knowledge benchmark that ranks models by how well they identify technical skateboard tricks from 390 trick definitions. SkateBench v2 reports success rate, cost, and speed.

NEW MODELS (2)
- Claude 5: ELO 1904, #22/983 (above: Claude Sonnet 4.6 (Thinking), below: Claude Fable 5)
  - LiveBench Olympiad: 92.18 (#1/124)
  - LiveBench Plot Unscrambling: 78.09 (#1/124)
  - LiveBench Python: 95.0 (#1/124)
  - Opper TaskBench: 96.4 (#1/85)
  - Vals AI (Vals Index): 75.14 (#1/25)
  - Vals AI Multimodal Index: 74.15 (#1/20)
  - Vals AI LegalBench: 88.56 (#1/114)
  - Vals AI CorpFin v2: 71.83 (#1/111)
  - Vals AI MedScribe: 88.52 (#1/62)
  - Vals AI ProofBench: 77.0 (#1/37)
- Claude Fable 5: ELO 1901, #23/983 (above: Claude 5, below: GPT-5 (Thinking, High))
  - Blueprint-Bench 2: 0.386 (#1/14)
  - LLM Stats (OSWorld-Verified): 85.0 (#1/16)
  - YC-Bench: 1977.6 (#1/21)
  - SEAL - MCP Atlas: 83.3 (#2/23)
  - Vellum - HumanEval: 95.0 (#2/38)
  - Vellum - GPQA: 94.1 (#3/57)
  - ClockBench: 35.0 (#4/27)
  - LLM Stats (GDPval-AA): 64.4 (#11/12)

NEW #1 LEADERS (55)
- YC-Bench (Net Worth ($K)): Claude Fable 5 (1977.6) beat Claude Opus 4.7 (1714.5) by 263.1
- Multi-turn Debate (Lechmazur) (Bradley-Terry Rating): Claude Fable 5 (High) (1770.9) beat Claude Opus 4.7 (High) (1717.1) by 53.8
- AA GDPval (ELO): Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (1932.47) beat Claude Opus 4.8 (Adaptive Reasoning, Max Effort) (1889.8) by 42.67
- Evals for Every Language - Language ay (Average Score (%)): step-3.7-flash-20260528 (77.14) beat Gemini 3.1 Pro (Preview) (62.91) by 14.23
- LiveBench Python (Score): Claude 5 (95.0) beat Claude Opus 4.5 (Thinking 64K, High) (2025-11-01) (85.0) by 10.0
- CursorBench 3.1 (Score (%)): Fable 5 Max (72.9) beat Claude Opus 4.7 (64.8) by 8.1
- AA Omniscience - Software Engineering (SWE) - Dart (Accuracy (%)): Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (88.0) beat GPT-5.3 Codex (xHigh) (80.0) by 8.0
- AA Omniscience - Software Engineering (SWE) - R (Accuracy (%)): Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (82.0) beat GPT-5.5 (Medium) (74.0) by 8.0
- AA Omniscience - Software Engineering (SWE) - Swift (Accuracy (%)): Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (100.0) beat GPT-5.5 (xHigh) (92.0) by 8.0
- Vals AI Vibe Code Bench (Accuracy (%)): Claude 5 (90.35) beat Claude Opus 4.8 (82.72) by 7.63
- AA Humanity's Last Exam (Accuracy (%)): Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (53.34) beat Claude Opus 4.8 (Adaptive Reasoning, Max Effort) (45.74) by 7.6
- AA Omniscience (Score): Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (40.15) beat Gemini 3.1 Pro (Preview) (32.93) by 7.22
- Vellum - HumanEval (Pass@1 (%)): Claude Mythos 5 (95.5) beat Claude Opus 4.8 (88.6) by 6.9
- Vellum - Humanity's Last Exam (Accuracy (%)): Claude Mythos 5 (64.5) beat Claude Opus 4.8 (57.9) by 6.6
- Evals for Every Language - Language crh (Average Score (%)): step-3.7-flash-20260528 (73.05) beat Gemini 3.1 Pro (Preview) (66.78) by 6.27
- AA Omniscience - Software Engineering (SWE) - Java (Accuracy (%)): Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (79.0) beat GPT-5.3 Codex (xHigh) (73.0) by 6.0
- Vals AI ProofBench (Accuracy (%)): Claude 5 (77.0) beat aristotle (71.0) by 6.0
- AA Omniscience - Business (Accuracy (%)): Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (55.0) beat GPT-5.5 (xHigh) (49.1) by 5.9
- AA Omniscience - Science, Engineering & Mathematics (Accuracy (%)): Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (57.1) beat GPT-5.5 (High) (52.3) by 4.8
- Vals AI (Vals Index) (Accuracy (%)): Claude 5 (75.14) beat Claude Opus 4.8 (70.36) by 4.78
- Vals AI IOI (Accuracy (%)): Claude 5 (72.25) beat GPT-5.4 (2026-03-05) (67.83) by 4.42
- AA Omniscience - Humanities & Social Sciences (Accuracy (%)): Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (60.9) beat Gemini 3 Pro (Preview) (High) (56.6) by 4.3
- AA Omniscience - Software Engineering (SWE) - Go (Accuracy (%)): Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (88.0) beat GPT-5.5 (High) (84.0) by 4.0
- Artificial Analysis Intelligence Index (Intelligence Index): Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (64.88) beat Claude Opus 4.8 (Adaptive Reasoning, Max Effort) (61.44) by 3.44
- Evals for Every Language - Language cv (Average Score (%)): gemma-4-31B-it-20260402 (69.3) beat Claude Opus 4.5 (65.91) by 3.39
- Vals AI CorpFin v2 (Accuracy (%)): Claude 5 (71.83) beat Grok 4.3 (68.53) by 3.3
- Vals AI Multimodal Index (Accuracy (%)): Claude 5 (74.15) beat Claude Opus 4.8 (70.89) by 3.26
- AA Omniscience - Software Engineering (SWE) (Accuracy (%)): Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (87.6) beat GPT-5.5 (xHigh) (84.4) by 3.2
- Evals for Every Language - MGSM (Average Score (%)): Claude Opus 4.8 (96.62) beat Claude Opus 4.6 (94.26) by 2.36
- Evals for Every Language - Language ban (Average Score (%)): step-3.7-flash-20260528 (69.03) beat Claude Opus 4.5 (66.71) by 2.32
- AA Terminal-Bench Hard (Accuracy (%)): Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (62.88) beat GPT-5.5 (xHigh) (60.61) by 2.27
- LiveBench Plot Unscrambling (Score): Claude 5 (78.09) beat GPT-5.5 (High) (76.28) by 1.81
- LLM Stats (OSWorld-Verified) (Score (%)): Claude Fable 5 (85.0) beat Claude Opus 4.8 (83.4) by 1.6
- AA Omniscience - Software Engineering (SWE) - Python (Accuracy (%)): Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (92.0) beat GPT-5.5 (xHigh) (90.5) by 1.5
- Evals for Every Language - Language chm (Average Score (%)): Claude Opus 4.7 (63.6) beat Gemini 3.1 Pro (Preview) (62.12) by 1.48
- Evals for Every Language - Language doi (Average Score (%)): Claude Opus 4.7 (71.84) beat Gemini 3 Pro (Preview) (70.38) by 1.46
- AA CritPt (Accuracy (%)): Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (28.57) beat GPT-5.5 (xHigh) (27.14) by 1.43
- Evals for Every Language - Language es (Average Score (%)): Gemini 3.1 Flash Lite (76.16) beat Claude Opus 4.6 (74.74) by 1.42
- AA SciCode (Accuracy (%)): Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (60.19) beat Gemini 3.1 Pro (Preview) (58.91) by 1.28
- Evals for Every Language - Language ace (Average Score (%)): step-3.7-flash-20260528 (72.48) beat Gemini 3.1 Pro (Preview) (71.2) by 1.28
- Evals for Every Language - MMLU (Average Score (%)): intellect-3-20251126 (100.0) beat Claude Sonnet 4.6 (98.73) by 1.27
- Evals for Every Language - ARC (Average Score (%)): intellect-3-20251126 (100.0) beat Gemini 3.1 Pro (Preview) (98.74) by 1.26
- Vals AI LegalBench (Accuracy (%)): Claude 5 (88.56) beat Gemini 3.1 Pro (Preview) (87.4) by 1.16
- Evals for Every Language - Language ca (Average Score (%)): Gemini 3.1 Flash Lite (76.29) beat Gemini 3 Pro (Preview) (75.26) by 1.03
- Opper TaskBench (Avg Task Score (%)): Claude 5 (96.4) beat Claude Opus 4.7 (95.4) by 1.0
- Evals for Every Language - Language ar (Average Score (%)): Claude Opus 4.8 (71.58) beat Claude Opus 4.5 (70.63) by 0.95
- Evals for Every Language - Language en (Average Score (%)): Gemini 3.1 Flash Lite (87.28) beat MiniMax-M2.5 (86.51) by 0.77
- Evals for Every Language - Language cy (Average Score (%)): Gemini 3.1 Flash Lite (82.03) beat Claude Sonnet 4.5 (81.38) by 0.65
- Evals for Every Language - Language am (Average Score (%)): Gemini 3.1 Flash Lite (68.6) beat Claude Opus 4.6 (68.01) by 0.59
- Vals AI MedScribe (Accuracy (%)): Claude 5 (88.52) beat GPT-5.1 (88.09) by 0.43
- Evals for Every Language - Language be (Average Score (%)): Claude Opus 4.8 (69.43) beat Gemini 3.1 Pro (Preview) (69.11) by 0.32
- Evals for Every Language - Language ceb (Average Score (%)): Gemini 3.1 Flash Lite (78.06) beat Gemini 3.1 Pro (Preview) (77.77) by 0.29
- Evals for Every Language - Language el (Average Score (%)): Gemini 3.1 Flash Lite (73.81) beat Claude Opus 4.5 (73.66) by 0.15
- Blueprint-Bench 2 (Connectivity Similarity Score): Claude Fable 5 (0.386) beat GPT-5.5 (0.37) by 0.02
- LiveBench Olympiad (Score): Claude 5 (92.18) beat Claude Opus 4.6 (Thinking) (High) (92.17) by 0.01
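
Each leader entry reports a margin that matches the new leader's score minus the previous leader's score, apparently rounded to two decimals (for example, 0.386 - 0.37 = 0.016 is listed as 0.02). A minimal sketch for checking that arithmetic on a single entry follows. The function names and the regular expression are illustrative assumptions, not part of the digest's own tooling, and the pattern assumes the "Winner (score) beat Runner-up (score) by margin" wording used in this list.

```python
import re

# Hypothetical pattern: "(<new>) beat ... (<old>) by <margin>" at the end of an entry.
_NUM = r"(\d+(?:\.\d+)?)"
_ENTRY = re.compile(rf"\({_NUM}\) beat .*\({_NUM}\) by {_NUM}$")


def parse_entry(entry: str) -> tuple[float, float, float]:
    """Return (new_leader_score, previous_leader_score, reported_margin)."""
    match = _ENTRY.search(entry.strip())
    if match is None:
        raise ValueError(f"unrecognized leader entry: {entry!r}")
    new, old, margin = (float(group) for group in match.groups())
    return new, old, margin


def margin_is_consistent(entry: str) -> bool:
    """Check that the reported margin equals new - old, rounded to two decimals."""
    new, old, margin = parse_entry(entry)
    return abs(round(new - old, 2) - margin) < 1e-9


if __name__ == "__main__":
    sample = (
        "YC-Bench (Net Worth ($K)): Claude Fable 5 (1977.6) "
        "beat Claude Opus 4.7 (1714.5) by 263.1"
    )
    print(parse_entry(sample), margin_is_consistent(sample))
```

Model names that carry their own parenthesized qualifiers, such as "(Thinking 64K, High) (2025-11-01)", are skipped by the pattern because only a bare number in parentheses directly before "beat" or "by" is treated as a score.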


