AI Benchmark Digest — 2026-06-13
New Scores From Top-10 Models (11)
- Claude 5 on Chess Puzzles (Epoch AI): 41.0 Accuracy (%) (#8/44)
- Claude 5 on OTIS Mock AIME 2024-25: 99.72 Accuracy (%) (#3/143)
- Claude 5 on SimpleQA Verified: 68.3 Accuracy (%) (#4/53)
- Claude Fable 5 on Epoch AI - Apex Agents: 45.0 Score (#3/46)
- Claude Fable 5 on Icelandic LLM - ARC-Challenge-IS: 72.95 Score (%) (#59/86)
- Claude Fable 5 on Icelandic LLM - Belebele-IS: 90.78 Score (%) (#36/86)
- Claude Fable 5 on Icelandic LLM - Inflection: 97.75 Score (%) (#2/86)
- Claude Fable 5 on Icelandic LLM - WinoGrande-IS: 96.05 Score (%) (#2/86)
- Claude Fable 5 on Icelandic LLM Leaderboard - Average: 87.4 Average Score (%) (#4/86)
- GPT-5.5 on Blueprint-Bench 2: 0.362 Connectivity Similarity Score (#2/14)
- Qwen 3.7 Max on Wolfram LLM Benchmarking Project: 67.5 Correct Functionality (%) (#14/483)
New #1 Leaders (6)
- Design Arena (Image) (Elo): riverflow-2.5-pro (1416.0) beat gpt-image-2 (1393.0) by 23.0.
- LLM Stats (MCP-Mark) (Score (%)): Kimi K2.7 Code (81.1) beat Qwen 3.7 Max (60.8) by 20.3.
- Icelandic LLM - WikiQA-IS (Score (%)): Claude Fable 5 (75.39) beat Gemini 3.1 Pro (Preview) (67.74) by 7.65.
- Icelandic LLM - GED (Score (%)): Claude Fable 5 (91.5) beat Claude Opus 4.7 (84.5) by 7.0.
- BIRD-SQL (Execution Accuracy (%)): Gemini-SQL2 (80.04) beat Gemini-SQL (Multitask SFT + Gemini-2.5-Pro) (77.14) by 2.9.
- Design Arena (Graphic Design) (Elo): riverflow-2.5-pro (1474.0) beat gpt-image-2 (1473.0) by 1.0.