AI Benchmark Digest — 2026-06-17
Daily
New Benchmarks (4)
- LLM Stats (Finance Agent v2) (Score (%)): Gemini 3.5 Flash leads with 57.86 across 25 models.
- LLM Stats (FrontierSWE) (Score (%)): Claude Fable 5 leads with 90.0 across 13 models.
- LLM Stats (Legal Agent Benchmark) (Score (%)): Claude Fable 5 leads with 13.3 across 11 models.
- LLM Stats (SkillsBench) (Score (%)): Qwen 3.7 Max leads with 59.2 across 5 models.
New Scores From Top-10 Models (12)
- Claude Fable 5 on SWE-Marathon: 24.0 Pass@1 (%) (#2/11)
- GLM-5.2 on BenchLM: 94.0 Overall Score (#3/124)
- GLM-5.2 on LLM Stats (HMMT 2025): 94.4 Score (%) (#9/33)
- GLM-5.2 on LLM Stats (HMMT Feb 26): 92.5 Score (%) (#6/11)
- GLM-5.2 on LLM Stats (IMO-AnswerBench): 91.0 Score (%) (#2/18)
- GLM-5.2 on LLM Stats (MCP Atlas): 76.8 Score (%) (#4/25)
- GLM-5.2 on LLM Stats (Toolathlon): 48.2 Score (%) (#8/21)
- GLM-5.2 on PinchBench: 87.79 Success Rate (%) (#18/52)
- GLM-5.2 on RuneBench: 3230.0 Total Peak XP Rate (XP/min) (#4/25)
- GLM-5.2 on SWE-Marathon: 13.0 Pass@1 (%) (#4/11)
- GLM-5.2 on ZeroEval GPQA Diamond: 91.2 GPQA Diamond Score (#12/226)
- Qwen 3.7 Max on LLM Stats (GDPval-AA): 1308.0 ELO (#12/33)
New #1 Leaders (15)
- LLM Stats (DeepPlanning) (Score (%)): Qwen 3.7 Plus (62.3) beat Qwen 3.6 Plus (41.5) by 20.8.
- Coding Agent Leaderboard - swe-bench-pro--ansible (Score (%)): Opus 4.8 + Claude Code (69.8) beat Sonnet 4.6 + Claude Code (50.0) by 19.8.
- LLM Stats (MRCR v2) (Score (%)): Qwen 3.7 Plus (91.7) beat U2 (76.61) by 15.09.
- Coding Agent Leaderboard (Score (%)): Opus 4.8 + Claude Code (78.3) beat Sonnet 4.6 + Claude Code (64.8) by 13.5.
- Design Arena (Website) (Elo): silo (1357.0) beat Claude Fable 5 (1345.0) by 12.0.
- Coding Agent Leaderboard - swe-bench-verified (Score (%)): Opus 4.8 + Claude Code (86.8) beat Sonnet 4.6 + Claude Code (79.6) by 7.2.
- LLM Stats (ERQA) (Score (%)): Qwen 3.7 Plus (69.8) beat Qwen 3.6 Plus (65.7) by 4.1.
- LLM Stats (SimpleVQA) (Score (%)): Qwen 3.7 Plus (81.7) beat GLM-5V Turbo (78.2) by 3.5.
- LLM Stats (AIME 2026) (Score (%)): GLM-5.2 (99.2) beat Kimi K2.6 (96.4) by 2.8.
- LLM Stats (IMO-AnswerBench) (Score (%)): Nemotron 3 Ultra (550B A55B) (92.3) beat Qwen 3.7 Max (90.0) by 2.3.
- LLM Stats (NL2Repo) (Score (%)): GLM-5.2 (48.9) beat Qwen 3.7 Max (47.2) by 1.7.
- LLM Stats (RealWorldQA) (Score (%)): Qwen 3.7 Plus (86.9) beat Qwen 3.6 Plus (85.4) by 1.5.
- LLM Stats (LVBench) (Score (%)): Qwen 3.7 Plus (76.2) beat Kimi K2.5 (75.9) by 0.3.
- LLM Stats (Video-MME) (Score (%)): Qwen 3.7 Plus (88.0) beat MiMo-V2.5 (87.7) by 0.3.
- LLM Stats (MLVU) (Score (%)): Qwen 3.7 Plus (87.4) beat Qwen 3.5 122B A10B (87.3) by 0.1.
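The margins in the list above are the new leader's score minus the previous leader's score, and the entries appear ordered from largest margin to smallest. A minimal sketch of that arithmetic follows, using four rows copied from the list; the function names `margin` and `rank_by_margin` are hypothetical and not part of any leaderboard tooling. Margins are in each benchmark's own unit (percentage points or Elo), so the cross-benchmark ordering is only a presentation choice.

```python
# (benchmark, new leader score, previous leader score), copied from the list above
leaders = [
    ("LLM Stats (MLVU)", 87.4, 87.3),
    ("Design Arena (Website)", 1357.0, 1345.0),
    ("LLM Stats (DeepPlanning)", 62.3, 41.5),
    ("LLM Stats (MRCR v2)", 91.7, 76.61),
]


def margin(new_score: float, old_score: float) -> float:
    """Lead of the new #1 over the previous #1, rounded to hide float noise."""
    return round(new_score - old_score, 2)


def rank_by_margin(entries):
    """Return (benchmark, margin) pairs, largest margin first."""
    scored = [(name, margin(new, old)) for name, new, old in entries]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


for name, lead in rank_by_margin(leaders):
    print(f"{name}: +{lead}")
```

Rounding to two decimals matches the precision the digest reports (for example 15.09 for MRCR v2) and avoids floating-point artifacts such as 20.799999999999997.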