AI Benchmark Digest — 2026-04-24
=== DAILY ===

NEW BENCHMARKS (8)
- MathArena - ARXIVLEAN March (Accuracy (%)): leader Aristotle (17.07), 6 models
- Anthropic ECI (AECI) (AECI Score): leader Claude Mythos Preview (159.2), 10 models
- LLM Stats (DynaMath) (Score (%)): leader Qwen3.6 Plus (88.0), 5 models
- LLM Stats (Finance Agent) (Score (%)): leader Claude Opus 4.7 (64.4), 5 models
- LLM Stats (HMMT Feb 26) (Score (%)): leader DeepSeek-V4-Pro-Max (95.2), 7 models
- LLM Stats (NL2Repo) (Score (%)): leader GLM-5.1 (42.7), 5 models
- Position Bias (Lechmazur) (Order Flip %, lower is better): leader Xiaomi MiMo V2 Pro (19.8), 27 models
- Kaggle DeepSearchQA (Google) (F1 Score (%)): leader Gemini Deep Research Agent (81.9), 13 models
NEW MODELS (10)
- DeepSeek V4 Pro (Reasoning, Max Effort) — ELO 1952, #22/1056 (above: Claude Opus 4.6 (Thinking), below: Claude Opus 4.6)
- DeepSeek-V4-Pro-Max — ELO 1945, #26/1056 (above: Gemini 3 Flash (High), below: Claude Opus 4.7)
  - LLM Stats (CodeForces): 100.0 (#1/14)
  - LLM Stats (CSimpleQA): 84.4 (#1/7)
  - LLM Stats (IMO-AnswerBench): 89.8 (#1/14)
  - LLM Stats (GDPval-AA): 155400.0 (#3/9)
  - LLM Stats (Toolathlon): 51.8 (#3/18)
- DeepSeek V4 Pro (Reasoning, High Effort) — ELO 1931, #34/1056 (above: Claude Opus 4.6 (Adaptive Reasoning, Max Effort), below: Qwen3.6 Max Preview)
- DeepSeek V4 Flash (Reasoning, Max Effort) — ELO 1896, #51/1056 (above: Claude Opus 4.7 (Non-reasoning, High Effort), below: Grok 4.20 0309 (Reasoning))
- DeepSeek V4 Flash (Reasoning, High Effort) — ELO 1874, #61/1056 (above: GPT-5 (Thinking), below: Gemini 3 Pro)
- DeepSeek-V4-Flash-Max — ELO 1868, #64/1056 (above: GPT-5 (High), below: Kimi K2.5 (Reasoning))
  - LLM Stats (CodeForces): 100.0 (#2/14)
  - LLM Stats (IMO-AnswerBench): 88.4 (#2/14)
- deepseek-v4-pro — ELO 1774, #128/1056 (above: GPT-5.1 Codex Mini (High), below: Gemini 3 Flash (Thinking))
- JSL-MedMNX-7B — ELO 1331, #800/1056 (above: Collaiborator-MEDLLM-Llama-3-8B-v2-6, below: MolmoE-1B)
- JSL-MedMNX-7B-SFT — ELO 1328, #810/1056 (above: Llama-3-Orca-1.0-8B, below: Yi-1.5-9B)
- Lumina-3.5 — ELO 1301, #874/1056 (above: Slime-7B, below: BioMistral-DARE-NS)
NEW #1 LEADERS (8)
- Design Arena (Logo) (Elo): gpt-image-2 (1418.0) beat chestnut (1308.0) by 110.0
- MathVision (Overall Accuracy (%)): GPT-5.4 (xhigh reasoning, w/ Python) (3rd-party eval) 🥇 (96.1) beat Gemini 2.5 Pro 🥇 (73.3) by 22.8
- LLM Stats (CodeForces) (Score (%)): DeepSeek-V4-Pro-Max (100.0) beat DeepSeek-V3.2-Speciale (90.0) by 10.0
- LLM Stats (IMO-AnswerBench) (Score (%)): DeepSeek-V4-Pro-Max (89.8) beat Kimi K2.6 (86.0) by 3.8
- FrontierMath - Tiers 1-3 (Accuracy (%, 290 problems)): GPT-5.5 Pro (high) (52.4) beat GPT-5.4 Pro (xhigh) (50.0) by 2.4
- FrontierMath - Tier 4 (Accuracy (%, 48 problems)): GPT-5.5 (xhigh) (39.6) beat GPT-5.4 Pro (Web App) (37.5) by 2.1
- Open CoT Leaderboard (Average CoT Gain (%)): DeepSeek-R1-Distill-Qwen-14B (17.65) beat DeepSeek-R1-Distill-Qwen-32B (16.92) by 0.73
- LLM Stats (CSimpleQA) (Score (%)): DeepSeek-V4-Pro-Max (84.4) beat Qwen3-235B-A22B-Instruct-2507 (84.3) by 0.1
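For readers interpreting the Elo figures above: a rating gap maps to an expected head-to-head win rate via the standard logistic Elo formula. This is a minimal sketch assuming the conventional 400-point scale factor; individual arenas may use different scales or tie handling, so the numbers are illustrative rather than an exact reading of any leaderboard.

```python
# Convert an Elo rating gap into an expected head-to-head win rate,
# using the standard logistic Elo formula: E = 1 / (1 + 10^(-diff/400)).
# The 400-point scale is the chess convention (an assumption here;
# arena-specific scale factors may differ).

def elo_win_probability(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of player A against player B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))

# Example: the 110-point Design Arena (Logo) gap between gpt-image-2
# (1418.0) and chestnut (1308.0) implies roughly a 65% expected win rate.
print(f"{elo_win_probability(1418.0, 1308.0):.2f}")  # prints 0.65
```

By the same formula, the 3-point gap between DeepSeek V4 Flash (Reasoning, Max Effort) at 1896 and its nearest listed neighbors is nearly a coin flip, which is why adjacent leaderboard positions are quoted alongside the raw ELO.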