Mikhail Doroshenko

May 20, 2026

AI Benchmark Digest — 2026-05-20

=== DAILY ===

NEW MODELS (1)
- Gemini 3.5 Flash (High): ELO 1942, #9/609 (above: Claude Opus 4.7 (Thinking), below: GPT-5.5 (High))
  - AA MMMU-Pro: 84.28 (#1/190)
  - SEAL - MCP Atlas: 83.6 (#1/21)
  - AA Omniscience: 22.68 (#3/393)
  - AA Omniscience - Law: 57.4 (#4/393)
  - AA Omniscience - Software Engineering (SWE) - PHP: 84.0 (#4/393)
  - AA Humanity's Last Exam: 40.96 (#5/484)
  - AA GPQA Diamond: 92.22 (#6/488)
  - AA Omniscience - Science, Engineering & Mathematics: 50.1 (#6/393)
  - AA GDPval: 1655.7 (#7/365)
  - AA Omniscience - Humanities & Social Sciences: 52.3 (#7/393)

NEW SCORES FROM TOP-10 MODELS (34)
- GPT-5.5 (High) on Multi-turn Debate (Lechmazur): 1583.6 Bradley-Terry Rating (#5/29)
- Gemini 3.5 Flash (High) on AA CritPt: 13.14 Accuracy (%) (#8/393)
- Gemini 3.5 Flash (High) on AA GDPval: 1655.7 ELO (#7/365)
- Gemini 3.5 Flash (High) on AA GPQA Diamond: 92.22 Accuracy (%) (#6/488)
- Gemini 3.5 Flash (High) on AA Humanity's Last Exam: 40.96 Accuracy (%) (#5/484)
- Gemini 3.5 Flash (High) on AA IFBench: 76.33 Accuracy (%) (#17/416)
- Gemini 3.5 Flash (High) on AA Long Context Reasoning: 69.33 Accuracy (%) (#27/416)
- Gemini 3.5 Flash (High) on AA Omniscience: 22.68 Score (#3/393)
- Gemini 3.5 Flash (High) on AA Omniscience - Business: 45.8 Accuracy (%) (#8/393)
- Gemini 3.5 Flash (High) on AA Omniscience - Health: 40.2 Accuracy (%) (#14/393)
- Gemini 3.5 Flash (High) on AA Omniscience - Humanities & Social Sciences: 52.3 Accuracy (%) (#7/393)
- Gemini 3.5 Flash (High) on AA Omniscience - Law: 57.4 Accuracy (%) (#4/393)
- Gemini 3.5 Flash (High) on AA Omniscience - Science, Engineering & Mathematics: 50.1 Accuracy (%) (#6/393)
- Gemini 3.5 Flash (High) on AA Omniscience - Software Engineering (SWE): 65.5 Accuracy (%) (#16/393)
- Gemini 3.5 Flash (High) on AA Omniscience - Software Engineering (SWE) - C: 80.0 Accuracy (%) (#18/393)
- Gemini 3.5 Flash (High) on AA Omniscience - Software Engineering (SWE) - Dart: 60.0 Accuracy (%) (#14/393)
- Gemini 3.5 Flash (High) on AA Omniscience - Software Engineering (SWE) - Go: 50.0 Accuracy (%) (#32/393)
- Gemini 3.5 Flash (High) on AA Omniscience - Software Engineering (SWE) - HTML: 72.0 Accuracy (%) (#17/393)
- Gemini 3.5 Flash (High) on AA Omniscience - Software Engineering (SWE) - Java: 51.0 Accuracy (%) (#16/393)
- Gemini 3.5 Flash (High) on AA Omniscience - Software Engineering (SWE) - JavaScript: 71.82 Accuracy (%) (#14/393)
- Gemini 3.5 Flash (High) on AA Omniscience - Software Engineering (SWE) - Julia: 60.0 Accuracy (%) (#13/393)
- Gemini 3.5 Flash (High) on AA Omniscience - Software Engineering (SWE) - Kotlin: 56.0 Accuracy (%) (#22/393)
- Gemini 3.5 Flash (High) on AA Omniscience - Software Engineering (SWE) - PHP: 84.0 Accuracy (%) (#4/393)
- Gemini 3.5 Flash (High) on AA Omniscience - Software Engineering (SWE) - Python: 61.0 Accuracy (%) (#24/393)
- Gemini 3.5 Flash (High) on AA Omniscience - Software Engineering (SWE) - R: 56.0 Accuracy (%) (#18/393)
- Gemini 3.5 Flash (High) on AA Omniscience - Software Engineering (SWE) - Rust: 80.0 Accuracy (%) (#8/393)
- Gemini 3.5 Flash (High) on AA Omniscience - Software Engineering (SWE) - Swift: 72.0 Accuracy (%) (#20/393)
- Gemini 3.5 Flash (High) on AA Omniscience - Software Engineering (SWE) - TypeScript: 67.78 Accuracy (%) (#16/393)
- Gemini 3.5 Flash (High) on AA SciCode: 53.12 Accuracy (%) (#11/482)
- Gemini 3.5 Flash (High) on AA TAU-2 Bench: 95.32 Accuracy (%) (#20/407)
- Gemini 3.5 Flash (High) on AA Terminal-Bench Hard: 40.91 Accuracy (%) (#36/402)
- Gemini 3.5 Flash (High) on ARC-AGI-1: 92.5 Accuracy (%) (#16/143)
- Gemini 3.5 Flash (High) on ARC-AGI-2: 72.08 Accuracy (%) (#12/146)
- Gemini 3.5 Flash (High) on Artificial Analysis Intelligence Index: 55.33 Intelligence Index (#8/487)

NEW #1 LEADERS (5)
- LLM Stats (GDPval-AA) (Score (%)): Gemini 3.5 Flash (165600.0) beat Claude Sonnet 4.6 (163300.0) by 2300.0
- LLM Stats (MCP Atlas) (Score (%)): Gemini 3.5 Flash (83.6) beat Claude Opus 4.7 (77.3) by 6.3
- AA MMMU-Pro (Accuracy (%)): Gemini 3.5 Flash (high) (84.28) beat Gemini 3.1 Pro Preview (82.43) by 1.85
- SEAL - MCP Atlas (Score): gemini-3.5-flash (high) (83.6) beat Muse Spark (82.2) by 1.4
- LLM Stats (Toolathlon) (Score (%)): Gemini 3.5 Flash (56.5) beat GPT-5.5 (55.6) by 0.9
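
The "by" figures in the leaders list are the new leader's score minus the previous leader's score. A minimal sketch of that arithmetic follows, using only the scores quoted above. The names `LEADERS` and `margins` are hypothetical, and the relative margin (percent of the runner-up score) is an added illustration that the digest itself does not report.

```python
# Leader and runner-up scores copied from the NEW #1 LEADERS list.
# Tuple layout (an assumption for this sketch): (benchmark, leader, runner_up).
LEADERS = [
    ("LLM Stats (GDPval-AA)", 165600.0, 163300.0),
    ("LLM Stats (MCP Atlas)", 83.6, 77.3),
    ("AA MMMU-Pro", 84.28, 82.43),
    ("SEAL - MCP Atlas", 83.6, 82.2),
    ("LLM Stats (Toolathlon)", 56.5, 55.6),
]


def margins(rows):
    """Return (benchmark, absolute margin, margin as % of runner-up score)."""
    out = []
    for name, leader, runner_up in rows:
        diff = leader - runner_up
        # Round to two decimals to hide floating-point noise such as 6.2999...
        out.append((name, round(diff, 2), round(100 * diff / runner_up, 2)))
    return out


for name, diff, rel in margins(LEADERS):
    print(f"{name}: +{diff} ({rel}% over runner-up)")
```

The absolute margins are only comparable within one benchmark, since the leaderboards use different scales. The relative column is one rough way to compare them side by side.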

