AI Benchmark Digest — 2026-04-26
=== DAILY ===

NEW BENCHMARKS (5)
- AI Chess Leaderboard (Continuation) (Elo): leader gemini-3-pro-preview ˟ (1810.0), 214 models
- AI Chess Leaderboard (Reasoning) (Elo): leader gemini-3.1-pro-preview (1869.0), 265 models
- LLM Emergent Collusion (Collusion Rate (%)): leader Grok 4 (0709) (75.0), 13 models
- LLM Public Goods Game (Avg. Contribution (%)): leader Gemini 2.0 Flash Exp (45.2), 21 models
- Story Theory Bench (Score (%)): leader deepseek-v3.2 (92.2), 25 models

NEW MODELS (1)
- DeepSeek-v4-Pro (Max): ELO 1963, #19/1059 (above: GPT-5.2 (xHigh), below: GPT-5.4 (High))
  - MathArena - ArXiv Math Jan 2026: 73.91 (#2/28)
  - MathArena - APEX Shortlist 2025: 86.46 (#3/32)

NEW #1 LEADERS (4)
- UGI - Natural Intelligence (NatInt Score): gpt-5.5-2026-04-23 (reasoning_effort=high) (79.27) beat gemini-3.1-pro-preview (thinking_level=medium) (76.44) by 2.83
- Design Arena (Game Dev) (Elo): claude-opus-4-7 (1359.0) beat claude-opus-4-6-thinking (1357.0) by 2.0
- EQ-Bench Longform Writing (Writing Score (0-100)): claude-opus-4-7 (81.8) beat claude-sonnet-4-6 (79.9) by 1.9
- Design Arena (3D) (Elo): kimi-k2.6 (1368.0) beat claude-opus-4-6 (1367.0) by 1.0