Mikhail Doroshenko

April 9, 2026

AI Benchmark Digest — 2026-04-09

=== DAILY ===

NEW BENCHMARKS (25)
- SWE-Arena (Elo Score): leader Voxtral Small 24B 2507 (1004.0), 37 models
- Long Code Arena (Mean Score (%)): leader GPT-o1 (95.7), 9 models
- DuckDB-NSQL (Execution Accuracy (%)): leader deepseek-chat (80.0), 99 models
- MEGA-Bench (Overall Score (%)): leader Gemini-2.5-pro-0325 (64.7), 44 models
- Video-MME-v2 (Avg Accuracy w/o sub (%)): leader Gemini-3-Pro (56.8), 47 models
- YC-Bench (Net Worth ($K)): leader Claude Opus 4.6 (1269.7), 13 models
- OpenClawProBench (Overall Score (%)): leader qwen3.5-plus (70.1), 34 models
- DABstep (Hard Level Accuracy (%)): leader claude haiku 4.5 (89.95), 28 models
- LMGame-Bench Super Mario Bros (Score): leader o3-2025-04-16 (3445.0), 12 models
- LMGame-Bench 2048 (Score): leader o1-2024-12-17 (7580.0), 25 models
- LMGame-Bench Tetris (Score): leader grok-4-0709 (125.7), 25 models
- LMGame-Bench Candy Crush (Score): leader o3-2025-04-16 (647.0), 25 models
- LMGame-Bench Sokoban (Score): leader gpt-5-thinking-high (11.0), 25 models
- LMGame-Bench Ace Attorney (Score): leader o1-2024-12-17 (16.0), 17 models
- UX Leaderboard (Overall Score): leader gpt-4o (87.8), 6 models
- Open Medical LLM (Average Accuracy (%)): leader medllama3-v20 (90.01), 181 models
- Open Medical LLM - MedMCQA (Accuracy (%)): leader medllama3-v20 (75.4), 181 models
- Open Medical LLM - MedQA (USMLE) (Accuracy (%)): leader medllama3-v20 (81.07), 181 models
- Open Medical LLM - MMLU Anatomy (Accuracy (%)): leader medllama3-v20 (91.85), 181 models
- Open Medical LLM - MMLU Clinical Knowledge (Accuracy (%)): leader medllama3-v20 (95.85), 181 models
- Open Medical LLM - MMLU College Biology (Accuracy (%)): leader medllama3-v20 (98.61), 181 models
- Open Medical LLM - MMLU College Medicine (Accuracy (%)): leader medllama3-v20 (94.8), 181 models
- Open Medical LLM - MMLU Medical Genetics (Accuracy (%)): leader medllama3-v20 (98.0), 181 models
- Open Medical LLM - MMLU Professional Medicine (Accuracy (%)): leader medllama3-v20 (98.9), 181 models
- Open Medical LLM - PubMedQA (Accuracy (%)): leader Flan-PaLM (79.0), 181 models

NEW MODELS (1)
- muse_spark — ELO 1824, #60/947 (above: GPT-5 Mini (High), below: O3 Pro)
  - Vals AI TaxEval v2: 77.68 (#1/102)
  - Vals AI Finance Agent: 60.6 (#2/42)
  - Vals AI (Vals Index): 65.66 (#3/37)
  - Vals AI Terminal-Bench 2.0: 59.55 (#3/49)
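If you script against digests like this, the "above:/below:" context on a debut entry is just the adjacent rows of the leaderboard sorted by ELO descending. A minimal sketch, assuming a simple (name, score) list; the neighbor scores here are illustrative placeholders, not values from the Hub:

```python
# Minimal sketch: find a model's rank and its rank neighbors on an
# ELO-sorted board. Neighbor scores are hypothetical placeholders.
leaderboard = [
    ("GPT-5 Mini (High)", 1831),  # placeholder score
    ("muse_spark", 1824),         # ELO from this digest
    ("O3 Pro", 1820),             # placeholder score
]

def rank_context(board, name):
    """Return (rank, above, below) for `name` on a descending board."""
    board = sorted(board, key=lambda e: e[1], reverse=True)
    i = next(k for k, (n, _) in enumerate(board) if n == name)
    above = board[i - 1][0] if i > 0 else None
    below = board[i + 1][0] if i + 1 < len(board) else None
    return i + 1, above, below

print(rank_context(leaderboard, "muse_spark"))
# -> (2, 'GPT-5 Mini (High)', 'O3 Pro'); on the full 947-model board
#    the same lookup would report rank #60.
```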

NEW #1 LEADERS (9)
- SEAL - SWE-Bench Pro (Private Dataset) (Score): claude-opus-4-6 (thinking) (47.1) beat gpt-5.2-2025-12-11 (23.81) by 23.29
- Design Arena (Slides) (Elo): honeydew (1270.0) beat gamma (1253.0) by 17.0
- SEAL - SWE-Bench Pro (Public Dataset) (Score): gpt-5.4-pro (xHigh)* (59.1) beat claude-opus-4-5-20251101 (45.89) by 13.21
- SEAL - TutorBench (Score): Muse Spark (68.55) beat gpt-5.4-pro-2026-03-05 (56.62) by 11.93
- LLM Stats (CharXiv-R) (Score (%)): Claude Mythos Preview (93.2) beat GPT-5.2 (82.1) by 11.1
- SEAL - MultiChallenge (Score): Muse Spark (75.52) beat gemini-3.1-pro-preview (71.37) by 4.15
- LLM Stats (BrowseComp) (Score (%)): Claude Mythos Preview (86.9) beat Gemini 3.1 Pro (85.9) by 1.0
- ZeroEval GPQA Diamond (GPQA Diamond Score): Claude Mythos Preview (94.6) beat Gemini 3.1 Pro (94.3) by 0.3
- LLM Stats (MMMLU) (Score (%)): Claude Mythos Preview (92.7) beat Gemini 3.1 Pro (92.6) by 0.1
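The "beat ... by" margins above are plain score differences between the new #1 and the displaced leader (e.g., 47.1 - 23.81 = 23.29). A hedged sketch of how such a changeover line could be produced from two leaderboard snapshots; the dict-based snapshot format is an assumption for illustration, not the Hub's actual pipeline:

```python
# Sketch: detect a #1 changeover between two snapshots of one
# leaderboard and report the margin. Snapshot format is assumed:
# {model_name: score}.
def top(snapshot):
    """Return the (model, score) pair with the highest score."""
    return max(snapshot.items(), key=lambda kv: kv[1])

def changeover_line(bench, before, after):
    old_name, old_score = top(before)
    new_name, new_score = top(after)
    if new_name == old_name:
        return None  # no new leader today
    margin = round(new_score - old_score, 2)
    return (f"{bench}: {new_name} ({new_score}) beat "
            f"{old_name} ({old_score}) by {margin}")

# These numbers reproduce the first entry in the list above.
before = {"gpt-5.2-2025-12-11": 23.81}
after = {"gpt-5.2-2025-12-11": 23.81, "claude-opus-4-6 (thinking)": 47.1}
print(changeover_line("SEAL - SWE-Bench Pro (Private Dataset)", before, after))
```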

View on AI Benchmark Hub
