Mikhail Doroshenko

May 14, 2026

AI Benchmark Digest — 2026-05-14

=== DAILY ===

NEW MODELS (4)

- Doubao-Seed-2-0-Pro-260215 (High): ELO 1781, #73/796 (above: GPT-5.2 (Low), below: GLM-5-Turbo)
    OpenCompass LLM - Reasoning: 65.2 (#1/23)
    OpenCompass LLM - Math: 77.3 (#1/23)
    OpenCompass Knowledge - Humanities: 95.0 (#1/23)
    OpenCompass Reasoning - Common: 82.1 (#1/23)
    OpenCompass Math - College: 83.8 (#1/23)
    OpenCompass LLM - Language: 77.3 (#3/23)
    OpenCompass Language - Creation: 77.1 (#3/23)
    OpenCompass Knowledge - Science: 94.6 (#3/23)
    OpenCompass LLM - Agent: 44.2 (#4/23)
    OpenCompass Language - NLP: 69.6 (#4/23)

- Doubao-Seed-2-0-Lite-260215 (High): ELO 1741, #103/796 (above: O3 Pro, below: DeepSeek V3.2 Speciale)
    OpenCompass Reasoning - Common: 78.1 (#2/23)
    OpenCompass Language - Creation: 77.1 (#4/23)
    OpenCompass LLM - Language: 74.4 (#6/23)
    OpenCompass LLM - Agent: 42.4 (#6/23)
    OpenCompass Agent - Tool Use: 42.4 (#6/23)
    OpenCompass Knowledge - Science: 91.7 (#7/23)
    OpenCompass LLM - Reasoning: 59.5 (#8/23)
    OpenCompass Language - NLP: 67.1 (#8/23)
    OpenCompass Language - Instruction Following: 72.5 (#8/23)
    OpenCompass Math - College: 77.1 (#8/23)

- Hy3-preview (High): ELO 1729, #110/796 (above: GPT-5.4 Mini (Medium), below: Kimi K2.6 (Non-reasoning))
    OpenCompass Math - College: 81.3 (#3/23)
    OpenCompass Language - Instruction Following: 76.0 (#4/23)
    OpenCompass LLM - Math: 74.5 (#5/23)
    OpenCompass Language - Creation: 75.4 (#5/23)
    OpenCompass LLM - Language: 74.4 (#7/23)
    OpenCompass Reasoning - Academic: 43.6 (#8/23)
    OpenCompass LLM - Reasoning: 58.5 (#10/23)
    OpenCompass Math - Competition: 67.6 (#10/23)
    OpenCompass LLM - Agent: 28.7 (#12/23)
    OpenCompass Reasoning - Common: 73.5 (#12/23)

- Ring-2.5-1T: ELO 1711, #119/796 (above: DeepSeek V4 Flash, below: Claude Sonnet 4 (Thinking 16K))
    OpenCompass Knowledge - Social Science: 92.9 (#5/23)
    OpenCompass Language - NLP: 65.4 (#11/23)
    OpenCompass Language - Creation: 68.8 (#12/23)
    OpenCompass Knowledge - Humanities: 90.0 (#12/23)
    OpenCompass LLM - Agent: 25.0 (#13/23)
    OpenCompass Math - College: 75.0 (#13/23)
    OpenCompass Agent - Tool Use: 25.0 (#13/23)
    OpenCompass LLM - Knowledge: 89.4 (#14/23)
    OpenCompass Knowledge - Engineering: 90.8 (#14/23)
    OpenCompass LLM - Language: 69.8 (#15/23)

NEW SCORES FROM TOP-10 MODELS (1)

- Claude Opus 4.7 (Thinking) on WeirdML: 75.45 Average Score (#8/123)

NEW #1 LEADERS (9)

- OpenCompass Reasoning - Common (Score (%)): Doubao-Seed-2-0-Pro-260215 (high) (82.1) beat Gemini-3-Pro-Preview (73.6) by 8.5
- OpenCompass Math - College (Score (%)): Doubao-Seed-2-0-Pro-260215 (high) (83.8) beat Kimi-K2.5 (76.5) by 7.3
- Tau3-Bench Banking_Knowledge (Pass@1 (%)): GPT-5.5 (37.4) beat Distyl ButtonAgent (31.2) by 6.2
- OpenCompass Knowledge - Social Science (Score (%)): Gemini-3.1-Pro-Preview (97.5) beat Gemini-3-Pro-Preview (93.2) by 4.3
- OpenCompass LLM - Math (Score (%)): Doubao-Seed-2-0-Pro-260215 (high) (77.3) beat Qwen3-Max-2026-01-23 (73.2) by 4.1
- OpenCompass LLM - Reasoning (Score (%)): Doubao-Seed-2-0-Pro-260215 (high) (65.2) beat Gemini-3-Pro-Preview (61.5) by 3.7
- OpenCompass Math - Competition (Score (%)): Kimi-K2.6 (72.1) beat Qwen3-Max-2026-01-23 (70.0) by 2.1
- OpenCompass Reasoning - Academic (Score (%)): GPT-5.4-2026-03-05 (high) (52.0) beat GPT-5.2-2025-12-11 (high) (50.5) by 1.5
- OpenCompass Knowledge - Engineering (Score (%)): GPT-5.4-2026-03-05 (high) (96.2) beat Gemini-3-Pro-Preview (95.8) by 0.4
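
The "by" figures in the NEW #1 LEADERS entries appear to be the new leader's score minus the previous leader's score, with entries ordered by that margin, largest first. A minimal Python sketch that reproduces them from the numbers listed above; the `leaders` list and the `margin` helper are illustrative names, not part of the digest or of AI Benchmark Hub:

```python
# (benchmark, new leader score, previous leader score), copied from the entries above.
leaders = [
    ("OpenCompass Reasoning - Common", 82.1, 73.6),
    ("OpenCompass Math - College", 83.8, 76.5),
    ("Tau3-Bench Banking_Knowledge", 37.4, 31.2),
    ("OpenCompass Knowledge - Social Science", 97.5, 93.2),
    ("OpenCompass LLM - Math", 77.3, 73.2),
    ("OpenCompass LLM - Reasoning", 65.2, 61.5),
    ("OpenCompass Math - Competition", 72.1, 70.0),
    ("OpenCompass Reasoning - Academic", 52.0, 50.5),
    ("OpenCompass Knowledge - Engineering", 96.2, 95.8),
]


def margin(new_score, old_score):
    """Score difference in points, rounded to one decimal (assumed convention)."""
    return round(new_score - old_score, 1)


# Pair each benchmark with its margin and sort largest first.
margins = sorted(
    ((name, margin(new, old)) for name, new, old in leaders),
    key=lambda item: item[1],
    reverse=True,
)

for name, points in margins:
    print(f"{name}: +{points}")
```

Rounding to one decimal keeps floating-point noise out of the result, since the listed scores carry one decimal place.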

