Mikhail Doroshenko

May 17, 2026

AI Benchmark Digest — 2026-05-17

=== DAILY ===

NEW #1 LEADERS (1)
- OpenClawProBench (Overall Score (%)): intern-s2-preview (76.7) beat Sensenova 6.7 Flash Lite (73.7) by 3.0

=== WEEKLY ===

NEW SCORES FROM TOP-10 MODELS (3)
- Claude Opus 4.7 (Thinking) on SEAL Showdown: 1115.7 Arena Score (#12/47)
- Claude Opus 4.7 (Thinking) on WeirdML: 75.45 Average Score (#8/123)
- GPT-5.5 (xHigh) on Chatbot Arena (Code): 1501.0 Elo (#9/79)

NEW #1 LEADERS (16)
- MathArena - ARXIVLEAN March (Accuracy (%)): AlephProver (34.15) beat Aristotle (17.07) by 17.08
- OpenCompass Reasoning - Common (Score (%)): Doubao-Seed-2-0-Pro-260215 (high) (82.1) beat Gemini-3-Pro-Preview (73.6) by 8.5
- OpenCompass Math - College (Score (%)): Doubao-Seed-2-0-Pro-260215 (high) (83.8) beat Kimi-K2.5 (76.5) by 7.3
- OpenClawProBench (Overall Score (%)): intern-s2-preview (76.7) beat qwen3.5-397b-a17b (70.4) by 6.3
- Tau3-Bench Banking_Knowledge (Pass@1 (%)): GPT-5.5 (37.4) beat Distyl ButtonAgent (31.2) by 6.2
- OpenCompass Knowledge - Social Science (Score (%)): Gemini-3.1-Pro-Preview (97.5) beat Gemini-3-Pro-Preview (93.2) by 4.3
- OpenCompass LLM - Math (Score (%)): Doubao-Seed-2-0-Pro-260215 (high) (77.3) beat Qwen3-Max-2026-01-23 (73.2) by 4.1
- OpenCompass LLM - Reasoning (Score (%)): Doubao-Seed-2-0-Pro-260215 (high) (65.2) beat Gemini-3-Pro-Preview (61.5) by 3.7
- OpenCompass Math - Competition (Score (%)): Kimi-K2.6 (72.1) beat Qwen3-Max-2026-01-23 (70.0) by 2.1
- OpenCompass Reasoning - Academic (Score (%)): GPT-5.4-2026-03-05 (high) (52.0) beat GPT-5.2-2025-12-11 (high) (50.5) by 1.5
- VisuLogic (Overall Accuracy (%)): PEREA-1.0new (52.8) beat Human (51.4) by 1.4
- WeirdML (Average Score): gpt-5.5 (xhigh) (84.91) beat gpt-5.5 (high) (83.9) by 1.01
- GAIA (Accuracy (%)): Co-Sight Pro v1.0.1 (93.02) beat OPS-Agentic-Search (92.36) by 0.66
- OpenCompass Knowledge - Engineering (Score (%)): GPT-5.4-2026-03-05 (high) (96.2) beat Gemini-3-Pro-Preview (95.8) by 0.4
- AA TAU-2 Bench (Accuracy (%)): JT-35B-Flash (99.12) beat GLM-4.7-Flash (Reasoning) (98.8) by 0.32
- AISI Cyber TLO 10M (Avg Steps (/32)): GPT-5.5 (10.0) beat Claude Opus 4.6 (9.8) by 0.2

