Mikhail Doroshenko

June 7, 2026

AI Benchmark Digest — 2026-06-07

=== WEEKLY ===

NEW MODELS (2)

- MiniMax-M3: ELO 1762, #83/970 (above: Gemini 3 Flash (High), below: Claude Opus 4.5 (Non-reasoning))
    LLM Stats (OmniDocBench 1.5): 91.6 (#1/13)
    LLM Stats (Video-MME): 85.4 (#2/13)
    OpenClawProBench: 75.1 (#2/65)
    Vals AI MedScribe: 87.25 (#2/61)
    AA IFBench: 82.86 (#3/429)
    LLM Stats (Claw-Eval): 74.5 (#3/9)
    LLM Stats (NL2Repo): 42.13 (#3/7)
    AA GPQA Diamond: 92.93 (#4/501)
    Vals AI CorpFin v2: 68.1 (#4/110)
    Design Arena (3D): 1348.0 (#5/115)

- nemotron-3-ultra-550B-a55B: ELO 1587, #292/970 (above: GLM-4.5, below: Kimi K2 Instruct (0905))
    PinchBench: 90.58 (#10/49)
    Vals AI CorpFin v2: 65.46 (#16/110)
    Vals AI (Vals Index): 43.99 (#18/24)
    LiveBench Python: 75.0 (#24/122)
    LiveBench Paraphrase: 61.15 (#33/122)
    Vals AI TaxEval v2: 73.1 (#34/116)
    Bullshit Benchmark: 41.8 (#34/148)
    Vals AI MedCode: 38.62 (#35/62)
    AI Chess Leaderboard (Reasoning): 975.0 (#39/277)
    LiveBench Code Generation: 77.47 (#43/122)

NEW SCORES FROM TOP-10 MODELS (4)

- GPT-5.5 (xHigh) on IMO-Bench: 71.9 Advanced ProofBench Accuracy (%) (#4/12)
- GPT-5.5 Pro on IUMB: 100.0 Score (%) (#2/55)
- GPT-5.5 Pro (xHigh) on IMO-Bench: 88.1 Advanced ProofBench Accuracy (%) (#2/12)
- Gemini 3 Deep Think on IUMB: 87.5 Score (%) (#6/55)

NEW #1 LEADERS (10)

- EQ-Bench Creative Writing v3 (Elo): Claude Opus 4.7 (2050.8) beat GPT-5.4 (1906.0) by 144.8
- Chatbot Arena (Image-to-Video) (Elo): Grok 1.5 (1473.0) beat dreamina-seedance-2.0-720p (1462.0) by 11.0
- LLM Stats (Multi-Challenge) (Score (%)): Nova 2 Pro (77.7) beat GPT-5 (69.6) by 8.1
- MathArena - Kangaroo 2025 Levels 11-12 (Accuracy (%)): Claude Opus 4.8 (Thinking) (100.0) beat GPT-5.4 (xHigh) (98.33) by 1.67
- MathArena - APEX 2025 (Accuracy (%)): Claude Opus 4.8 (Thinking) (81.25) beat GPT-5.5 (xHigh) (80.21) by 1.04
- MathArena - Kangaroo 2025 Levels 7-8 (Accuracy (%)): Claude Opus 4.8 (Thinking) (96.67) beat GPT-5.4 (xHigh) (95.83) by 0.84
- MathArena - AIME 2026 (Accuracy (%)): Claude Opus 4.8 (Thinking) (100.0) beat GPT-5.4 (xHigh) (99.17) by 0.83
- LLM Stats (OmniDocBench 1.5) (Score (%)): MiniMax-M3 (91.6) beat Qwen 3.6 Plus (91.2) by 0.4
- GAIA (Accuracy (%)): CustomGPT.ai Research Lab v44 (93.36) beat Co-Sight Pro v1.0.1 (93.02) by 0.34
- ForecastBench (Overall Score (higher is better)): Grok 4.20 (Beta, D) (68.1) beat green-tree (67.9) by 0.2

