Mikhail Doroshenko

June 13, 2026

AI Benchmark Digest — 2026-06-13

Daily

New Scores From Top-10 Models (11)

  • Claude 5 on Chess Puzzles (Epoch AI): 41.0 Accuracy (%) (#8/44)
  • Claude 5 on OTIS Mock AIME 2024-25: 99.72 Accuracy (%) (#3/143)
  • Claude 5 on SimpleQA Verified: 68.3 Accuracy (%) (#4/53)
  • Claude Fable 5 on Epoch AI - Apex Agents: 45.0 Score (#3/46)
  • Claude Fable 5 on Icelandic LLM - ARC-Challenge-IS: 72.95 Score (%) (#59/86)
  • Claude Fable 5 on Icelandic LLM - Belebele-IS: 90.78 Score (%) (#36/86)
  • Claude Fable 5 on Icelandic LLM - Inflection: 97.75 Score (%) (#2/86)
  • Claude Fable 5 on Icelandic LLM - WinoGrande-IS: 96.05 Score (%) (#2/86)
  • Claude Fable 5 on Icelandic LLM Leaderboard - Average: 87.4 Average Score (%) (#4/86)
  • GPT-5.5 on Blueprint-Bench 2: 0.362 Connectivity Similarity Score (#2/14)
  • Qwen 3.7 Max on Wolfram LLM Benchmarking Project: 67.5 Correct Functionality (%) (#14/483)

New #1 Leaders (6)

  • Design Arena (Image) (Elo): riverflow-2.5-pro (1416.0) beat gpt-image-2 (1393.0) by 23.0.
  • LLM Stats (MCP-Mark) (Score (%)): Kimi K2.7 Code (81.1) beat Qwen 3.7 Max (60.8) by 20.3.
  • Icelandic LLM - WikiQA-IS (Score (%)): Claude Fable 5 (75.39) beat Gemini 3.1 Pro (Preview) (67.74) by 7.65.
  • Icelandic LLM - GED (Score (%)): Claude Fable 5 (91.5) beat Claude Opus 4.7 (84.5) by 7.0.
  • BIRD-SQL (Execution Accuracy (%)): Gemini-SQL2 (80.04) beat Gemini-SQL (Multitask SFT + Gemini-2.5-Pro) (77.14) by 2.9.
  • Design Arena (Graphic Design) (Elo): riverflow-2.5-pro (1474.0) beat gpt-image-2 (1473.0) by 1.0.
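
The "beat ... by" margins above are simply the leader's score minus the runner-up's. As a rough sketch, assuming the figures are copied by hand from the list (the `leaders` table and `margin` helper are illustrative names, not part of AI Benchmark Hub), the arithmetic can be checked like this:

```python
# Leader and runner-up scores copied from the "New #1 Leaders" list.
# The structure and names here are illustrative assumptions.
leaders = [
    ("Design Arena (Image)", 1416.0, 1393.0),
    ("LLM Stats (MCP-Mark)", 81.1, 60.8),
    ("Icelandic LLM - WikiQA-IS", 75.39, 67.74),
    ("Icelandic LLM - GED", 91.5, 84.5),
    ("BIRD-SQL", 80.04, 77.14),
    ("Design Arena (Graphic Design)", 1474.0, 1473.0),
]


def margin(leader: float, runner_up: float) -> float:
    """Lead over the runner-up, rounded to hide floating-point noise."""
    return round(leader - runner_up, 2)


margins = {name: margin(first, second) for name, first, second in leaders}

# Print widest lead first, matching the order used in the digest.
for name, lead in sorted(margins.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: +{lead}")
```

Note that the margins mix units (Elo points and percentage points), so they are comparable only within a single benchmark, not across the list.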