Mikhail Doroshenko

June 17, 2026

AI Benchmark Digest — 2026-06-17

Daily

New Benchmarks (4)

  • LLM Stats (Finance Agent v2) (Score (%)): Gemini 3.5 Flash leads with 57.86 across 25 models.
  • LLM Stats (FrontierSWE) (Score (%)): Claude Fable 5 leads with 90.0 across 13 models.
  • LLM Stats (Legal Agent Benchmark) (Score (%)): Claude Fable 5 leads with 13.3 across 11 models.
  • LLM Stats (SkillsBench) (Score (%)): Qwen 3.7 Max leads with 59.2 across 5 models.

New Scores From Top-10 Models (12)

  • Claude Fable 5 on SWE-Marathon: 24.0 Pass@1 (%) (#2/11)
  • GLM-5.2 on BenchLM: 94.0 Overall Score (#3/124)
  • GLM-5.2 on LLM Stats (HMMT 2025): 94.4 Score (%) (#9/33)
  • GLM-5.2 on LLM Stats (HMMT Feb 26): 92.5 Score (%) (#6/11)
  • GLM-5.2 on LLM Stats (IMO-AnswerBench): 91.0 Score (%) (#2/18)
  • GLM-5.2 on LLM Stats (MCP Atlas): 76.8 Score (%) (#4/25)
  • GLM-5.2 on LLM Stats (Toolathlon): 48.2 Score (%) (#8/21)
  • GLM-5.2 on PinchBench: 87.79 Success Rate (%) (#18/52)
  • GLM-5.2 on RuneBench: 3230.0 Total Peak XP Rate (XP/min) (#4/25)
  • GLM-5.2 on SWE-Marathon: 13.0 Pass@1 (%) (#4/11)
  • GLM-5.2 on ZeroEval GPQA Diamond: 91.2 GPQA Diamond Score (#12/226)
  • Qwen 3.7 Max on LLM Stats (GDPval-AA): 1308.0 ELO (#12/33)

New #1 Leaders (15)

  • LLM Stats (DeepPlanning) (Score (%)): Qwen 3.7 Plus (62.3) beat Qwen 3.6 Plus (41.5) by 20.8.
  • Coding Agent Leaderboard - swe-bench-pro--ansible (Score (%)): Opus 4.8 + Claude Code (69.8) beat Sonnet 4.6 + Claude Code (50.0) by 19.8.
  • LLM Stats (MRCR v2) (Score (%)): Qwen 3.7 Plus (91.7) beat U2 (76.61) by 15.09.
  • Coding Agent Leaderboard (Score (%)): Opus 4.8 + Claude Code (78.3) beat Sonnet 4.6 + Claude Code (64.8) by 13.5.
  • Design Arena (Website) (Elo): silo (1357.0) beat Claude Fable 5 (1345.0) by 12.0.
  • Coding Agent Leaderboard - swe-bench-verified (Score (%)): Opus 4.8 + Claude Code (86.8) beat Sonnet 4.6 + Claude Code (79.6) by 7.2.
  • LLM Stats (ERQA) (Score (%)): Qwen 3.7 Plus (69.8) beat Qwen 3.6 Plus (65.7) by 4.1.
  • LLM Stats (SimpleVQA) (Score (%)): Qwen 3.7 Plus (81.7) beat GLM-5V Turbo (78.2) by 3.5.
  • LLM Stats (AIME 2026) (Score (%)): GLM-5.2 (99.2) beat Kimi K2.6 (96.4) by 2.8.
  • LLM Stats (IMO-AnswerBench) (Score (%)): Nemotron 3 Ultra (550B A55B) (92.3) beat Qwen 3.7 Max (90.0) by 2.3.
  • LLM Stats (NL2Repo) (Score (%)): GLM-5.2 (48.9) beat Qwen 3.7 Max (47.2) by 1.7.
  • LLM Stats (RealWorldQA) (Score (%)): Qwen 3.7 Plus (86.9) beat Qwen 3.6 Plus (85.4) by 1.5.
  • LLM Stats (LVBench) (Score (%)): Qwen 3.7 Plus (76.2) beat Kimi K2.5 (75.9) by 0.3.
  • LLM Stats (Video-MME) (Score (%)): Qwen 3.7 Plus (88.0) beat MiMo-V2.5 (87.7) by 0.3.
  • LLM Stats (MLVU) (Score (%)): Qwen 3.7 Plus (87.4) beat Qwen 3.5 122B A10B (87.3) by 0.1.
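
Each entry in the leaders list reports a margin that appears to be the new leader's score minus the previous leader's score. The sketch below shows one way a reader could recheck that arithmetic for lines copied from this digest. The regular expression and the helper name check_margin are illustrative assumptions based on the line format seen here, not part of AI Benchmark Hub or any published tooling.

```python
import math
import re

# Matches the tail of a leader line: "(new) beat <previous model> (old) by margin."
# Assumption: every leader line ends in this shape, as the entries in this digest do.
TAIL_RE = re.compile(
    r"\((\d+(?:\.\d+)?)\) beat .+ \((\d+(?:\.\d+)?)\) by (\d+(?:\.\d+)?)\.$"
)


def check_margin(line):
    """Return (new, old, reported, consistent) for one leader line."""
    match = TAIL_RE.search(line.strip())
    if match is None:
        raise ValueError(f"unrecognized leader line: {line!r}")
    new, old, reported = (float(group) for group in match.groups())
    # Compare with a small tolerance because the scores are parsed as floats.
    consistent = math.isclose(new - old, reported, abs_tol=1e-6)
    return new, old, reported, consistent


# Three lines copied verbatim from the list above.
lines = [
    "LLM Stats (DeepPlanning) (Score (%)): Qwen 3.7 Plus (62.3) beat Qwen 3.6 Plus (41.5) by 20.8.",
    "LLM Stats (MRCR v2) (Score (%)): Qwen 3.7 Plus (91.7) beat U2 (76.61) by 15.09.",
    "LLM Stats (IMO-AnswerBench) (Score (%)): Nemotron 3 Ultra (550B A55B) (92.3) beat Qwen 3.7 Max (90.0) by 2.3.",
]

for line in lines:
    new, old, reported, ok = check_margin(line)
    status = "ok" if ok else "mismatch"
    print(f"{new} - {old} vs reported {reported}: {status}")
```

The pattern only accepts purely numeric parentheses as scores, so a model name with its own parenthetical, such as "(550B A55B)", is skipped rather than misread as a score.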