AI Benchmark Digest — 2026-04-21
=== DAILY ===

NEW BENCHMARKS (4)
- ReasonScape R12 (ReasonScore): leader Qwen3.5-397B-A17B (AWQ, 16k) Thinking (951.64), 67 models
- LLM Stats (DeepSearchQA) (Score (%)): leader Claude Opus 4.6 (91.3), 5 models
- LLM Stats (MCP-Mark) (Score (%)): leader Kimi K2.6 (55.9), 5 models
- FrontierSWE (Dominance (%)): leader GPT-5.4 (74.0), 5 models

NEW MODELS (6)
- Claude Opus 4.7 (Thinking) — ELO 2092, #4/886 (above: GPT-5.4 Pro (xHigh), below: Claude Mythos Preview)
- Kimi K2.6 — ELO 1923, #32/886 (above: GPT-5.3 Codex (xHigh), below: Muse Spark)
    BridgeBench Debugging: 87.4 (#1/20)
    LLM Stats (AIME 2026): 96.4 (#1/11)
    LLM Stats (Claw-Eval): 80.9 (#1/6)
    LLM Stats (IMO-AnswerBench): 86.0 (#1/11)
    LLM Stats (MathVision): 93.2 (#1/26)
    LLM Stats (OJBench): 60.6 (#1/9)
    LLM Stats (V*): 96.9 (#1/6)
    LLM Stats (WideSearch): 80.8 (#1/8)
    LLM Stats (BrowseComp): 86.3 (#2/41)
    LLM Stats (Toolathlon): 50.0 (#2/15)
- JT-MINI — ELO 1652, #223/886 (above: Kimi K2 0905, below: v0-1.5-md)
- Collaiborator-MEDLLM-Llama-3-8B-v2-5 — ELO 1346, #657/886 (above: Collaiborator-MEDLLM-Llama-3-8B, below: Collaiborator-MEDLLM-Llama-3-8B-v2-6)
- ClinicalGPT-base-zh — ELO 1066, #866/886 (above: GPT-2 XL, below: pythia-2.8B-deduped)
- GPT2_PMC — ELO 1043, #880/886 (above: Healix-1.1B-V1-Chat-dDPO, below: mega-ar-525m-v0.07-ultraTBfw)

NEW #1 LEADERS (11)
- Multi-turn Debate (Lechmazur) (Bradley-Terry Rating): Claude Opus 4.7 (high reasoning) (1718.8) beat Claude Sonnet 4.6 (high reasoning) (1617.5) by 101.3
- LLM Stats (OJBench) (Score (%)): Kimi K2.6 (60.6) beat Kimi K2-Thinking-0905 (48.7) by 11.9
- LLM Stats (Claw-Eval) (Score (%)): Kimi K2.6 (80.9) beat GLM-5V-Turbo (75.0) by 5.9
- LLM Stats (MathVision) (Score (%)): Kimi K2.6 (93.2) beat Qwen3.6 Plus (88.0) by 5.2
- Design Arena (Game Dev) (Elo): claude-opus-4-7 (1364.0) beat claude-opus-4-6-thinking (1362.0) by 2.0
- LLM Stats (WideSearch) (Score (%)): Kimi K2.6 (80.8) beat Kimi K2.5 (79.0) by 1.8
- OTIS Mock AIME 2024-25 (Accuracy (%)): claude-opus-4-7_xhigh (97.8) beat gpt-5.2-2025-12-11_high (96.11) by 1.69
- Open Arabic LLM Leaderboard (Average Score (%)): Qwen3-8B-SFT-V2 (80.49) beat Karnak (79.37) by 1.12
- LLM Stats (AIME 2026) (Score (%)): Kimi K2.6 (96.4) beat GLM-5.1 (95.3) by 1.1
- LLM Stats (IMO-AnswerBench) (Score (%)): Kimi K2.6 (86.0) beat Step-3.5-Flash (85.4) by 0.6
- BridgeBench Debugging (Score): Kimi K2.6 (87.4) beat Claude Opus 4.6 (87.0) by 0.4
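
The "by" figures in the NEW #1 LEADERS list appear to be the new leader's score minus the previous leader's score, with entries ordered by that raw margin from largest to smallest. A minimal Python sketch of that arithmetic follows, using four entries copied from the list. The record layout and the `margin` helper are illustrative assumptions, not the digest's actual tooling.

```python
# Illustrative records (assumed layout): (benchmark, new leader score, previous leader score).
# Values are copied from the NEW #1 LEADERS list.
entries = [
    ("LLM Stats (AIME 2026)", 96.4, 95.3),
    ("Multi-turn Debate (Lechmazur)", 1718.8, 1617.5),
    ("OTIS Mock AIME 2024-25", 97.8, 96.11),
    ("BridgeBench Debugging", 87.4, 87.0),
]


def margin(leader: float, previous: float) -> float:
    """Lead over the previous #1, rounded to 2 decimals to hide float noise."""
    return round(leader - previous, 2)


# Order by raw margin, largest first, as the digest appears to do.
ranked = sorted(entries, key=lambda e: margin(e[1], e[2]), reverse=True)

for name, leader, previous in ranked:
    print(f"{name}: {leader} beat {previous} by {margin(leader, previous)}")
```

Raw margins mix units (rating points for Bradley-Terry and Elo, percentage points for score benchmarks), so this ordering shows the size of each lead in its own scale, not a comparison across benchmarks.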