AI Benchmark Digest — 2026-04-23
=== DAILY ===

NEW BENCHMARKS (8)
- MathArena - ARXIVLEAN March (Accuracy (%)): leader Aristotle (17.07), 6 models
- Anthropic ECI (AECI) (AECI Score): leader Claude Mythos Preview (159.2), 10 models
- LLM Stats (DynaMath) (Score (%)): leader Qwen3.6 Plus (88.0), 5 models
- LLM Stats (Finance Agent) (Score (%)): leader Claude Opus 4.7 (64.4), 5 models
- LLM Stats (HMMT Feb 26) (Score (%)): leader Kimi K2.6 (92.7), 5 models
- LLM Stats (NL2Repo) (Score (%)): leader GLM-5.1 (42.7), 5 models
- Position Bias (Lechmazur) (Order Flip % (lower is better)): leader Xiaomi MiMo V2 Pro (19.8), 27 models
- Kaggle DeepSearchQA (Google) (F1 Score (%)): leader Gemini Deep Research Agent (81.9), 13 models

NEW MODELS (204)
- GPT-5.5 (xHigh) — ELO 2054, #6/1064 (above: GPT-5.4 Pro (xHigh), below: Gemini 3 Pro (Thinking))
    ARC-AGI-2: 85.0 (#2/144)
- GPT-5.5 (Medium) — ELO 2031, #9/1064 (above: GPT-5.4 (xHigh), below: GPT-5.5 (High))
- GPT-5.5 (High) — ELO 2022, #10/1064 (above: GPT-5.5 (Medium), below: GPT-5 (Thinking, High))
- Claude Opus 4.7 (Adaptive Reasoning, Max Effort) — ELO 2003, #13/1064 (above: GPT-5.4 (Medium), below: Gemini 3.1 Pro (High))
    AA Omniscience: 26.17 (#2/367)
    Artificial Analysis Intelligence Index: 57.28 (#3/461)
    AA GDPval: 1752.68 (#3/339)
- GPT-5.5 (Low) — ELO 1938, #27/1064 (above: Gemini 3.1 Pro, below: GPT-5.2 Pro)
- Kimi K2.6 (Think) — ELO 1936, #29/1064 (above: GPT-5.2 Pro, below: GPT-5.2 (High))
    MathArena - ARXIV_FALSE March: 15.62 (#2/8)
    MathArena - ArXiv Math Jan 2026: 71.74 (#3/26)
    MathArena - Usamo 2026: 51.19 (#3/7)
- Claude Opus 4.6 (Adaptive Reasoning, Max Effort) — ELO 1935, #31/1064 (above: GPT-5.2 (High), below: Qwen3.6 Max Preview)
    AA APEX-Agents: 33.04 (#3/18)
- GLM-5.1 (Reasoning) — ELO 1927, #34/1064 (above: Kimi K2.6, below: Claude Sonnet 4.6 (Thinking))
- MiMo-V2.5-Pro — ELO 1911, #42/1064 (above: Gemini 3 Pro (High), below: GPT-5 Pro)
- Grok 4.20 0309 v2 (Reasoning) — ELO 1903, #45/1064 (above: GPT-5.3 Codex (xHigh), below: Muse Spark)
    AA IFBench: 81.2 (#2/390)
- Claude Opus 4.7 (Non-reasoning, High Effort) — ELO 1902, #47/1064 (above: Muse Spark, below: Grok 4.20 0309 (Reasoning))
- GLM-5 (Reasoning) — ELO 1896, #49/1064 (above: Grok 4.20 0309 (Reasoning), below: Gemini 3.1 Pro (Thinking))
- Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort) — ELO 1893, #51/1064 (above: Gemini 3.1 Pro (Thinking), below: Claude Opus 4.5 (Reasoning))
- Claude Opus 4.5 (Reasoning) — ELO 1888, #52/1064 (above: Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort), below: O3 Pro (High))
    AA MMLU-Pro: 89.5 (#3/345)
- Kimi K2.5 (Reasoning) — ELO 1871, #59/1064 (above: GPT-5 (High), below: GPT-5 Codex (High))
- GPT-5.5 — ELO 1861, #64/1064 (above: GPT-5.4 (Low), below: GPT-5.4)
    Vellum - ARC-AGI-2: 85.0 (#1/12)
    LLM Stats (Toolathlon): 55.6 (#1/16)
    LLM Stats (CyberGym): 81.8 (#2/6)
    LLM Stats (Graphwalks parents >128k): 58.5 (#2/6)
    LLM Stats (MCP Atlas): 75.3 (#2/15)
    LLM Stats (MRCR v2 (8-needle)): 74.0 (#2/9)
    LLM Stats (OSWorld-Verified): 78.7 (#2/12)
    Vellum - GPQA: 93.6 (#3/54)
    Vending-Bench 2: 7523.84 (#3/35)
    LLM Stats (Graphwalks BFS >128k): 45.4 (#3/7)
- GLM-4.7 (Reasoning) — ELO 1856, #69/1064 (above: GPT-5.4 Nano (xHigh), below: Qwen3.5 397B A17B (Reasoning))
- Qwen3.5 397B A17B (Reasoning) — ELO 1855, #70/1064 (above: GLM-4.7 (Reasoning), below: GPT-5.1 Codex)
- Claude Opus 4.6 (Non-reasoning, High Effort) — ELO 1845, #78/1064 (above: GPT-5.1 (Medium), below: kimi-k2.6 (Thinking))
- Qwen3.6 27B (Reasoning) — ELO 1839, #80/1064 (above: kimi-k2.6 (Thinking), below: DeepSeek V3.2 (Reasoning))
- DeepSeek V3.2 (Reasoning) — ELO 1835, #81/1064 (above: Qwen3.6 27B (Reasoning), below: MiMo-V2-Flash (Reasoning))
- MiMo-V2-Flash (Reasoning) — ELO 1835, #82/1064 (above: DeepSeek V3.2 (Reasoning), below: MiMo-V2-Flash (Feb 2026))
- GLM 5V Turbo (Reasoning) — ELO 1827, #85/1064 (above: GLM-5, below: nova-2-lite-v1)
    AA TAU-2 Bench: 98.5 (#2/381)
- GLM-5.1 (Non-reasoning) — ELO 1822, #87/1064 (above: nova-2-lite-v1, below: Kimi K2.5 (Thinking))
- Qwen3.6 35B A3B (Reasoning) — ELO 1814, #92/1064 (above: GPT-5.4 Mini (High), below: O3 Pro)
- Qwen3.5 27B (Reasoning) — ELO 1811, #95/1064 (above: GPT-5 (Low), below: Qwen3.5 122B A10B (Reasoning))
- Qwen3.5 122B A10B (Reasoning) — ELO 1808, #96/1064 (above: Qwen3.5 27B (Reasoning), below: KAT Coder Pro V2)
- KAT Coder Pro V2 — ELO 1808, #97/1064 (above: Qwen3.5 122B A10B (Reasoning), below: O3 (Medium))
- Claude Sonnet 4.6 (Non-reasoning, High Effort) — ELO 1805, #100/1064 (above: Qwen 3.6 Plus, below: O3 (High))
- Claude Sonnet 4.6 (Non-reasoning, Low Effort) — ELO 1788, #112/1064 (above: Step 3.5 Flash 2603, below: GPT-5.2)
- Claude Opus 4.5 (Non-reasoning) — ELO 1785, #116/1064 (above: Grok 4.1 Fast (Reasoning), below: Gemma 4 31B (Reasoning))
- Gemma 4 31B (Reasoning) — ELO 1785, #117/1064 (above: Claude Opus 4.5 (Non-reasoning), below: GPT-5.4 Mini (Medium))
- Qwen3.5 35B A3B (Reasoning) — ELO 1774, #127/1064 (above: Qwen 3.5 397B A17B, below: GPT-5 Mini (Medium))
- GLM-5 (Non-reasoning) — ELO 1771, #129/1064 (above: GPT-5 Mini (Medium), below: Kimi K2.5)
- Qwen3.5 397B A17B (Non-reasoning) — ELO 1767, #131/1064 (above: Kimi K2.5, below: O3 Pro (Medium))
- DeepSeek V3.1 Terminus (Reasoning) — ELO 1758, #141/1064 (above: Claude Sonnet 4 (Reasoning), below: GLM-5 (Thinking))
- NVIDIA Nemotron 3 Super 120B A12B (Reasoning) — ELO 1751, #150/1064 (above: Claude Opus 4.1 (Reasoning), below: Grok 4 Fast R)
- DeepSeek V3.2 Exp (Reasoning) — ELO 1750, #152/1064 (above: Grok 4 Fast R, below: GPT-5.4 Nano (Medium))
- Kimi K2.5 (Non-reasoning) — ELO 1749, #155/1064 (above: Gemini 2.5 Pro (Thinking), below: Qwen 3 Max (Thinking))
- Qwen3.5 122B A10B (Non-reasoning) — ELO 1747, #158/1064 (above: Qwen 3.5 Plus (Thinking), below: Grok 3 mini Reasoning (High))
- Qwen3.5 27B (Non-reasoning) — ELO 1745, #161/1064 (above: Gemma 4 31B, below: Claude Opus 4 (Reasoning))
- Qwen3 235B A22B 2507 (Reasoning) — ELO 1736, #169/1064 (above: Claude Opus 4 (Thinking), below: GPT-OSS-120B (High))
- K-EXAONE (Reasoning) — ELO 1735, #171/1064 (above: GPT-OSS-120B (High), below: GPT-5.1 Codex Mini)
- GLM-4.6 (Reasoning) — ELO 1734, #173/1064 (above: GPT-5.1 Codex Mini, below: GLM-4.7)
- DeepSeek V3.1 (Reasoning) — ELO 1724, #178/1064 (above: Claude Haiku 4.5 (Reasoning), below: Qwen3.5 9B (Reasoning))
- Qwen3.5 9B (Reasoning) — ELO 1723, #179/1064 (above: DeepSeek V3.1 (Reasoning), below: O3 (Low))
- Gemini 2.5 Flash Preview (Sep '25) (Reasoning) — ELO 1722, #182/1064 (above: Qwen 3.6 27B, below: O1 (High))
- Ling-2.6-1T — ELO 1720, #185/1064 (above: Qwen 3.5 27B, below: MiniMax-M2.5)
- DeepSeek R1 0528 (May '25) — ELO 1719, #187/1064 (above: MiniMax-M2.5, below: Claude Sonnet 4 (Thinking))
- GLM-4.7-Flash (Reasoning) — ELO 1717, #189/1064 (above: Claude Sonnet 4 (Thinking), below: Qwen 3.5 122B A10B)
    AA TAU-2 Bench: 98.8 (#1/381)
- Claude 3.7 Sonnet (Reasoning) — ELO 1712, #194/1064 (above: Nova 2.0 Pro Preview (Low), below: Claude Opus 4.1)
- Gemma 4 26B A4B (Reasoning) — ELO 1710, #197/1064 (above: Claude Haiku 4.5 (Thinking), below: Claude Sonnet 4.5 (Non-reasoning))
- Gemma 4 31B (Non-reasoning) — ELO 1704, #204/1064 (above: O1, below: O4 Mini (Medium))
- Qwen3.5 35B A3B (Non-reasoning) — ELO 1702, #206/1064 (above: O4 Mini (Medium), below: GLM-4.6)
- DeepSeek V3.2 (Non-reasoning) — ELO 1701, #208/1064 (above: GLM-4.6, below: Claude Opus 4)
- Qwen3 VL 235B A22B (Reasoning) — ELO 1697, #214/1064 (above: Qwen 3.5 Plus, below: O1 (Medium))
- GLM-4.7 (Non-reasoning) — ELO 1696, #216/1064 (above: O1 (Medium), below: qwen3.5-flash)
- Qwen3 Next 80B A3B (Reasoning) — ELO 1690, #218/1064 (above: qwen3.5-flash, below: GLM-4.5 (Reasoning))
- GLM-4.5 (Reasoning) — ELO 1690, #219/1064 (above: Qwen3 Next 80B A3B (Reasoning), below: Gemma 4 26B A4B)
- Grok 4.20 0309 (Non-reasoning) — ELO 1686, #224/1064 (above: Nova 2.0 Omni (Medium), below: DeepSeek V3.2)
- NVIDIA Nemotron 3 Nano 30B A3B (Reasoning) — ELO 1681, #228/1064 (above: Magistral Medium 1.2, below: Cogito v2.1 (Reasoning))
- Cogito v2.1 (Reasoning) — ELO 1681, #229/1064 (above: NVIDIA Nemotron 3 Nano 30B A3B (Reasoning), below: Gemini 2.5 Flash (Thinking))
- Grok 4.20 0309 v2 (Non-reasoning) — ELO 1680, #231/1064 (above: Gemini 2.5 Flash (Thinking), below: GPT-5 Mini)
- DeepSeek V3.2 Exp (Non-reasoning) — ELO 1676, #235/1064 (above: Mercury 2, below: Claude Sonnet 4 (Non-reasoning))
- Qwen3.5 4B (Reasoning) — ELO 1675, #237/1064 (above: Claude Sonnet 4 (Non-reasoning), below: Grok 4.20)
- Qwen3.6 35B A3B (Non-reasoning) — ELO 1670, #248/1064 (above: Arcee Trinity Large (Thinking), below: Mistral Small 4 (Reasoning))
- Mistral Small 4 (Reasoning) — ELO 1670, #249/1064 (above: Qwen3.6 35B A3B (Non-reasoning), below: Hermes 4 405B)
- Qwen3 VL 32B (Reasoning) — ELO 1669, #252/1064 (above: O4 Mini, below: Qwen3 Coder Next)
- Qwen3.5 9B (Non-reasoning) — ELO 1668, #254/1064 (above: Qwen3 Coder Next, below: GLM-4.5V)
- Qwen3 235B A22B 2507 Instruct — ELO 1663, #259/1064 (above: GPT-5 Nano (High), below: GLM 4.5 Air)
- Gemini 2.5 Flash Preview (Sep '25) (Non-reasoning) — ELO 1661, #261/1064 (above: GLM 4.5 Air, below: DeepSeek V3.1 Terminus (Non-reasoning))
- DeepSeek V3.1 Terminus (Non-reasoning) — ELO 1661, #262/1064 (above: Gemini 2.5 Flash Preview (Sep '25) (Non-reasoning), below: Gemma 4 26B A4B (Non-reasoning))
- Gemma 4 26B A4B (Non-reasoning) — ELO 1661, #263/1064 (above: DeepSeek V3.1 Terminus (Non-reasoning), below: Qwen 3 Next 80B A3B (Thinking))
- DeepSeek V3.1 (Non-reasoning) — ELO 1655, #269/1064 (above: GPT-5 Mini (Low), below: GLM-4.6 (Non-reasoning))
- GLM-4.6 (Non-reasoning) — ELO 1654, #270/1064 (above: DeepSeek V3.1 (Non-reasoning), below: MiniMax-M2)
- MiMo-V2-Flash (Non-reasoning) — ELO 1651, #279/1064 (above: GPT-OSS-120B (Low), below: Qwen3.5 Omni Flash)
- Qwen3 30B A3B 2507 (Reasoning) — ELO 1642, #288/1064 (above: Ling 2.6 Flash, below: Gemini 2.5 Flash-Lite Preview (Sep '25) (Reasoning))
- Gemini 2.5 Flash-Lite Preview (Sep '25) (Reasoning) — ELO 1639, #289/1064 (above: Qwen3 30B A3B 2507 (Reasoning), below: JT-MINI)
- Claude 3.7 Sonnet (Non-reasoning) — ELO 1635, #296/1064 (above: Qwen 3.5 9B, below: K2 Think V2)
- Llama Nemotron Super 49B v1.5 (Reasoning) — ELO 1628, #305/1064 (above: Qwen 3 VL 235B A22B, below: deepseek-v3p2 (Thinking))
- Qwen3 Coder 480B A35B Instruct — ELO 1624, #309/1064 (above: GPT-4.1, below: Qwen3 VL 30B A3B (Reasoning))
- Qwen3 VL 30B A3B (Reasoning) — ELO 1624, #310/1064 (above: Qwen3 Coder 480B A35B Instruct, below: HyperCLOVA X SEED Think (32B))
- GLM-4.6V (Reasoning) — ELO 1620, #318/1064 (above: Qwen 3 VL 32B (Thinking), below: Claude 3.7 Sonnet)
- Qwen3.5 4B (Non-reasoning) — ELO 1618, #323/1064 (above: Qwen3 Next 80B A3B Instruct, below: Nova 2.0 Lite (Low))
- Hermes 4 - Llama-3.1 405B (Reasoning) — ELO 1612, #334/1064 (above: MiniMax M1 40k, below: GPT-5.4 Nano)
- K-EXAONE (Non-reasoning) — ELO 1607, #341/1064 (above: Solar Pro 3, below: Ling-1T)
- Solar Open 100B (Reasoning) — ELO 1605, #343/1064 (above: Ling-1T, below: EXAONE 4.0 32B (Reasoning))
- EXAONE 4.0 32B (Reasoning) — ELO 1604, #344/1064 (above: Solar Open 100B (Reasoning), below: Mi:dm K 2.5 Pro Preview)
- Motif-2-12.7B-Reasoning — ELO 1599, #349/1064 (above: Grok 4.1 Fast, below: Gemini 2.5 Flash Lite (Thinking))
- DeepSeek R1 (Jan '25) — ELO 1596, #355/1064 (above: Mistral Large 3, below: Qwen3-4B-2507-Think)
- Gemini 2.0 Flash Thinking Experimental (Jan '25) — ELO 1596, #357/1064 (above: Qwen3-4B-2507-Think, below: Grok 3 Mini)
- GLM-4.7-Flash (Non-reasoning) — ELO 1595, #359/1064 (above: Grok 3 Mini, below: Gemini 2.5 Flash-Lite Preview (Sep '25) (Non-reasoning))
- Gemini 2.5 Flash-Lite Preview (Sep '25) (Non-reasoning) — ELO 1593, #360/1064 (above: GLM-4.7-Flash (Non-reasoning), below: Grok 4.1 Fast (Non-reasoning))
- Grok 4.1 Fast (Non-reasoning) — ELO 1592, #361/1064 (above: Gemini 2.5 Flash-Lite Preview (Sep '25) (Non-reasoning), below: Grok 4 Fast (Non-reasoning))
- Grok 4 Fast (Non-reasoning) — ELO 1592, #362/1064 (above: Grok 4.1 Fast (Non-reasoning), below: Nova 2.0 Pro Preview (Non-reasoning))
- Nova 2.0 Pro Preview (Non-reasoning) — ELO 1590, #363/1064 (above: Grok 4 Fast (Non-reasoning), below: GPT-OSS-20B (Low))
- Qwen3 4B 2507 (Reasoning) — ELO 1583, #371/1064 (above: GPT-5.4 Nano (Low), below: Gemini 2.5 Flash-Lite (Reasoning))
- Gemini 2.5 Flash-Lite (Reasoning) — ELO 1581, #372/1064 (above: Qwen3 4B 2507 (Reasoning), below: Qwen3 235B A22B (Reasoning))
- Qwen3 235B A22B (Reasoning) — ELO 1579, #373/1064 (above: Gemini 2.5 Flash-Lite (Reasoning), below: Gemini 2.5 Flash Lite)
- Qwen3 30B A3B 2507 Instruct — ELO 1568, #382/1064 (above: DeepSeek V3, below: Baidu Ernie 5.0)
- Hermes 4 - Llama-3.1 70B (Reasoning) — ELO 1566, #385/1064 (above: Qwen3 VL 30B A3B Instruct, below: GPT-4o (ChatGPT))
- Mistral Small 4 (Non-reasoning) — ELO 1559, #393/1064 (above: qwen3-32B (Thinking), below: Tri-21B-think Preview)
- NVIDIA Nemotron Nano 12B v2 VL (Reasoning) — ELO 1558, #395/1064 (above: Tri-21B-think Preview, below: GPT-OSS-20B)
- Gemma 4 E4B (Reasoning) — ELO 1556, #399/1064 (above: abab7, below: GPT-4o (March 2025, chatgpt-4o-latest))
- GPT-4o (March 2025, chatgpt-4o-latest) — ELO 1556, #400/1064 (above: Gemma 4 E4B (Reasoning), below: Qwen 3 VL 8B (Thinking))
- Solar Pro 2 (Preview) (Reasoning) — ELO 1548, #405/1064 (above: Median Human, below: GPT-5 Mini (Minimal))
- GPT-4o (Aug '24) — ELO 1547, #407/1064 (above: GPT-5 Mini (Minimal), below: Ring-flash-2.0)
- Nova 2.0 Lite (Non-reasoning) — ELO 1543, #411/1064 (above: Devstral Small 2, below: Hermes 4 - Llama-3.1 405B (Non-reasoning))
- Hermes 4 - Llama-3.1 405B (Non-reasoning) — ELO 1543, #412/1064 (above: Nova 2.0 Lite (Non-reasoning), below: Qwen 3 Coder 30B A3B)
- Qwen3 32B (Reasoning) — ELO 1542, #415/1064 (above: Nova Premier, below: GPT-4o)
- NVIDIA Nemotron Nano 9B V2 (Reasoning) — ELO 1538, #420/1064 (above: Qwen 3 32B, below: Solar Pro 2 (Reasoning))
- Solar Pro 2 (Reasoning) — ELO 1537, #421/1064 (above: NVIDIA Nemotron Nano 9B V2 (Reasoning), below: Qwen3 VL 8B (Reasoning))
- Qwen3 VL 8B (Reasoning) — ELO 1536, #422/1064 (above: Solar Pro 2 (Reasoning), below: Qwen3 Omni 30B A3B (Reasoning))
- Qwen3 Omni 30B A3B (Reasoning) — ELO 1536, #423/1064 (above: Qwen3 VL 8B (Reasoning), below: Qwen 3 14B)
- DeepSeek V3 (Dec '24) — ELO 1532, #428/1064 (above: Qwen2.5 VL 72B Instruct, below: Qwen 3 VL 8B)
- Claude 3.5 Sonnet (Oct '24) — ELO 1530, #433/1064 (above: Kimi-VL-A3B-Thinking-2506, below: Devstral)
- Llama 3.3 Nemotron Super 49B v1 (Reasoning) — ELO 1528, #436/1064 (above: Magistral Medium, below: Gemini 1.5 Pro)
- GLM-4.6V (Non-reasoning) — ELO 1527, #439/1064 (above: Qwen 2.5 32B, below: Sonar)
- GLM-4.5V (Reasoning) — ELO 1525, #441/1064 (above: Sonar, below: Solar Pro 2 (Preview) (Non-reasoning))
- Solar Pro 2 (Preview) (Non-reasoning) — ELO 1525, #442/1064 (above: GLM-4.5V (Reasoning), below: MiniMax-M1)
- Nova 2.0 Omni (Non-reasoning) — ELO 1524, #444/1064 (above: MiniMax-M1, below: Qwen3 30B A3B (Reasoning))
- Qwen3 30B A3B (Reasoning) — ELO 1523, #445/1064 (above: Nova 2.0 Omni (Non-reasoning), below: Llama 3.1 Nemotron Ultra 253B v1 (Reasoning))
- Llama 3.1 Nemotron Ultra 253B v1 (Reasoning) — ELO 1523, #446/1064 (above: Qwen3 30B A3B (Reasoning), below: Tri-21B-Think)
- Qwen3 14B (Reasoning) — ELO 1522, #448/1064 (above: Tri-21B-Think, below: Step3 VL 10B)
- Devstral Small (May '25) — ELO 1517, #453/1064 (above: Ministral 3 14B, below: GPT-4)
- Qwen2.5 Instruct 72B — ELO 1501, #477/1064 (above: Devstral Medium, below: Magistral Small)
- GPT-4o (May '24) — ELO 1499, #481/1064 (above: Qwen-VL-Max, below: Mistral Large)
- NVIDIA Nemotron Nano 9B V2 (Non-reasoning) — ELO 1498, #484/1064 (above: Grok Beta, below: Step-1.5V-mini)
- Gemini 2.5 Flash-Lite (Non-reasoning) — ELO 1496, #490/1064 (above: ERNIE 4.5 300B A47B, below: Ministral 3 8B)
- Qwen3 235B A22B (Non-reasoning) — ELO 1494, #495/1064 (above: Mistral Medium, below: Kimi-VL-A3B (Thinking))
- Gemini 2.0 Flash-Lite (Feb '25) — ELO 1492, #500/1064 (above: GPT-4o-mini (0718, detail-high), below: Llama 3.1 Tulu3 405B)
- Llama 3.1 Tulu3 405B — ELO 1492, #501/1064 (above: Gemini 2.0 Flash-Lite (Feb '25), below: Qwen2.5 VL 7B Instruct)
- Gemma 4 E4B (Non-reasoning) — ELO 1489, #507/1064 (above: Kimi-VL-A3B-Instruct, below: kimi-k2-0711-preview)
- Grok 2 (Dec '24) — ELO 1487, #514/1064 (above: MiniMonkey, below: Llama 3.3 70B)
- Qwen3 VL 4B (Reasoning) — ELO 1486, #517/1064 (above: elephant-alpha, below: Gemma 4 E2B (Reasoning))
- Gemma 4 E2B (Reasoning) — ELO 1486, #518/1064 (above: Qwen3 VL 4B (Reasoning), below: EXAONE 4.0 32B)
- Devstral Small (Jul '25) — ELO 1484, #522/1064 (above: Llama 3.2 90B, below: InternVL2.5-4B)
- Qwen3 4B 2507 Instruct — ELO 1483, #527/1064 (above: Qwen2.5 VL 32B Instruct, below: Llama 3.1 Nemotron 70B)
- Qwen3.5 2B (Reasoning) — ELO 1481, #533/1064 (above: Mistral Small 3.1, below: VARCO-VISION-14B)
- Llama 3.3 Instruct 70B — ELO 1480, #540/1064 (above: MiniCPM-V-2.6, below: Nova Pro)
- Solar Pro 2 (Non-reasoning) — ELO 1476, #547/1064 (above: DeepSeek R1 0528 Qwen3 8B, below: granite-vision-3.3-2B)
- Llama Nemotron Super 49B v1.5 (Non-reasoning) — ELO 1469, #554/1064 (above: Aria, below: EXAONE 4.0 32B (Non-reasoning))
- EXAONE 4.0 32B (Non-reasoning) — ELO 1469, #555/1064 (above: Llama Nemotron Super 49B v1.5 (Non-reasoning), below: InternVL3-2B)
- Qwen2.5 Coder Instruct 32B — ELO 1467, #561/1064 (above: InternVL2.5-2B-MPO, below: Gemini 1.5 Flash)
- Mistral Large 2 (Nov '24) — ELO 1466, #565/1064 (above: Gemini 1.5 Flash-8B, below: GPT-4v (0409, detail-low))
- Llama 3.1 Instruct 405B — ELO 1463, #567/1064 (above: GPT-4v (0409, detail-low), below: GPT-4o Mini)
- NVIDIA Nemotron 3 Nano 30B A3B (Non-reasoning) — ELO 1460, #575/1064 (above: LLaVA-OneVision-7B (SI), below: InternVL2-2B)
- Qwen3 14B (Non-reasoning) — ELO 1458, #578/1064 (above: Ovis1.5-Gemma2-9B, below: InternVL2-8B)
- Llama 3.2 Instruct 90B (Vision) — ELO 1454, #586/1064 (above: VITA-1.5, below: Nanbeige4-3B-Thinking-2511)
- Qwen3 8B (Reasoning) — ELO 1450, #591/1064 (above: OmChat-v2.0-13B, below: InternLM-XComposer2.5)
- Qwen3 4B (Non-reasoning) — ELO 1448, #599/1064 (above: Gemma 2 27B (IT), below: Flash-VL-2B-Dynamic-ISS)
- GLM-4.5V (Non-reasoning) — ELO 1445, #601/1064 (above: Flash-VL-2B-Dynamic-ISS, below: Ovis1.6-Llama3.2-3B)
- Llama 3.1 Nemotron Instruct 70B — ELO 1442, #613/1064 (above: Qwen 2.5 VL 3B, below: Qwen 2.5 Coder 14B)
- Gemma 4 E2B (Non-reasoning) — ELO 1440, #615/1064 (above: Qwen 2.5 Coder 14B, below: Yi-Vision)
- Gemma 3 27B Instruct — ELO 1439, #619/1064 (above: Mixtral 8x22B, below: Qwen-VL-Plus)
- Qwen3 30B A3B (Non-reasoning) — ELO 1437, #624/1064 (above: Nanbeige4.1-3B, below: QTuneVL1-2B)
- Qwen3.5 2B (Non-reasoning) — ELO 1435, #628/1064 (above: InternVL2.5-8B, below: WeMM)
- Qwen3 Omni 30B A3B Instruct — ELO 1434, #631/1064 (above: Llama 4 Scout, below: Llama 3.3 Nemotron Super 49B v1 (Non-reasoning))
- Llama 3.3 Nemotron Super 49B v1 (Non-reasoning) — ELO 1434, #632/1064 (above: Qwen3 Omni 30B A3B Instruct, below: Qwen3 32B (Non-reasoning))
- Qwen3 32B (Non-reasoning) — ELO 1434, #633/1064 (above: Llama 3.3 Nemotron Super 49B v1 (Non-reasoning), below: MiniCPM-Llama3-V2.5)
- Llama 3.1 Nemotron Nano 4B v1.1 (Reasoning) — ELO 1431, #639/1064 (above: Qwen 2.5 Coder 32B, below: Granite 4.0 H Small)
- Qwen2 Instruct 72B — ELO 1424, #655/1064 (above: Gemma 3n E4B, below: MiniCPM-V-2)
- NVIDIA Nemotron Nano 12B v2 VL (Non-reasoning) — ELO 1420, #662/1064 (above: CogVLM2-19B-Chat, below: Ovis2-1B)
- Qwen3 4B (Reasoning) — ELO 1419, #664/1064 (above: Ovis2-1B, below: Hermes 4 - Llama-3.1 70B (Non-reasoning))
- Hermes 4 - Llama-3.1 70B (Non-reasoning) — ELO 1417, #665/1064 (above: Qwen3 4B (Reasoning), below: Aquila-VL-2B)
- DeepHermes 3 - Mistral 24B Preview (Non-reasoning) — ELO 1415, #670/1064 (above: InternVL2.5-1B-MPO, below: DeepSeek V2.5)
- Llama 3.1 Instruct 70B — ELO 1407, #684/1064 (above: Yi 1.5 34B, below: OLMo 3 7B (Thinking))
- Qwen3 8B (Non-reasoning) — ELO 1395, #693/1064 (above: Cambrian-13B, below: DeepSeek-R1-Distill-Qwen-7B)
- Gemma 3 12B Instruct — ELO 1394, #699/1064 (above: Jamba 1.5 Mini, below: GPT-3.5)
- Mistral Large (Feb '24) — ELO 1386, #708/1064 (above: Qwen 2.5 7B, below: Mistral Small (Sep '24))
- Mistral Small (Sep '24) — ELO 1385, #709/1064 (above: Mistral Large (Feb '24), below: Jamba 1.7 Large)
- Qwen2.5 Coder Instruct 7B — ELO 1383, #712/1064 (above: Mini-InternVL-Chat-2B-V1.5, below: Qwen 2 VL 7B)
- Mixtral 8x22B Instruct — ELO 1377, #727/1064 (above: Janus-Pro-7B, below: LLaVA-Next-Interleave-7B)
- Gemma 3n E4B Instruct Preview (May '25) — ELO 1376, #730/1064 (above: Molmo 7B-O, below: 360VL-70B)
- Mistral Large 2 (Jul '24) — ELO 1368, #745/1064 (above: Monkey-Chat, below: Falcon2-VLM-11B)
- Phi-4 Multimodal Instruct — ELO 1363, #753/1064 (above: InternLM-XComposer2-1.8B, below: Yi 34B (Chat))
- Exaone 4.0 1.2B (Reasoning) — ELO 1361, #755/1064 (above: Yi 34B (Chat), below: SOLAR-10.7B-Instruct-v1.0)
- Llama 3.1 Instruct 8B — ELO 1355, #762/1064 (above: Phi-3.5-Vision, below: Qwen 2.5 3B)
- Mistral Small (Feb '24) — ELO 1352, #772/1064 (above: CogVLM-17B-Chat, below: Gemini 1.0 Pro)
- Qwen3 1.7B (Reasoning) — ELO 1350, #774/1064 (above: Gemini 1.0 Pro, below: bagel-8B-v1.0)
- Sarvam M (Reasoning) — ELO 1349, #776/1064 (above: bagel-8B-v1.0, below: LLaVA-LLaMA-3-8B)
- Llama 3.2 Instruct 11B (Vision) — ELO 1332, #808/1064 (above: Collaiborator-MEDLLM-Llama-3-8B-v2-6, below: VILA1.5-3B)
- Exaone 4.0 1.2B (Non-reasoning) — ELO 1325, #830/1064 (above: Llama 2 70B, below: Yi-VL-6B)
- Command-R+ (Apr '24) — ELO 1323, #837/1064 (above: aya-23-35B, below: Yi-1.5-dolphin-9B)
- Gemma 3 4B Instruct — ELO 1314, #855/1064 (above: lft_8b_v2, below: Mistral-7B-v0.1)
- Qwen3.5 0.8B (Reasoning) — ELO 1308, #868/1064 (above: Moondream2, below: InternLM-XComposer)
- Phi-4 Mini Instruct — ELO 1307, #871/1064 (above: Starling-LM-7B-beta, below: Mantis-8B-clip-llama3)
- Llama 3 Instruct 70B — ELO 1306, #875/1064 (above: Granite 4.0 H 1B, below: Peagle-9B)
- Gemma 3n E4B Instruct — ELO 1294, #897/1064 (above: Nous-Hermes-2-Mistral-7B-DPO, below: Mistral 7B Instruct)
- Llama 3.2 Instruct 3B — ELO 1285, #904/1064 (above: LFM2.5-1.2B (Thinking), below: Nemotron-4 15B)
- DeepHermes 3 - Llama-3.1 8B Preview (Non-reasoning) — ELO 1281, #913/1064 (above: Apollo-6B, below: Granite 3.3 8B (Non-reasoning))
- Granite 3.3 8B (Non-reasoning) — ELO 1280, #914/1064 (above: DeepHermes 3 - Llama-3.1 8B Preview (Non-reasoning), below: BioMistral-Zephyr-Beta-SLERP)
- Qwen3 0.6B (Reasoning) — ELO 1279, #916/1064 (above: BioMistral-Zephyr-Beta-SLERP, below: Myrrh_solar_10.7b_3.0)
- Qwen3.5 0.8B (Non-reasoning) — ELO 1278, #921/1064 (above: AlphaMonarch-7B, below: BioMistralMerged)
- Qwen3 1.7B (Non-reasoning) — ELO 1274, #926/1064 (above: Apollo-7B, below: SmolVLM-500M)
- Command-R (Mar '24) — ELO 1248, #954/1064 (above: BioMistral-7B-TIES, below: Mantis-8B-Fuyu)
- Llama 3 Instruct 8B — ELO 1242, #958/1064 (above: LFM2 2.6B, below: Qwen2.5-1.5B-Instruct)
- MediKAI — ELO 1235, #963/1064 (above: PaLM 62B, below: JSL-MedPhi2-2.7B)
- Qwen3 0.6B (Non-reasoning) — ELO 1223, #969/1064 (above: Baichuan-2-13B-Base, below: aya-23-8B)
- Gemma 3n E2B Instruct — ELO 1217, #973/1064 (above: Gemma 3 1B, below: OLMo 2 32B)
- Phi-3 Mini Instruct 3.8B — ELO 1203, #982/1064 (above: medicine-chat, below: Phi-2)
- Gemma 3 1B Instruct — ELO 1174, #997/1064 (above: BioMistral-7B, below: xgen-7B-8k-base)
- Llama 3.2 Instruct 1B — ELO 1142, #1016/1064 (above: falcon-7B, below: Chameleon-30B)
- EMO-2B — ELO 1127, #1025/1064 (above: Qwen-1.8B, below: gpt-neox-20B)

NEW SCORES FROM TOP-10 MODELS (1)
- GPT-5.5 (xHigh) on ARC-AGI-2: 85.0 Accuracy (%) (#2/144)

NEW #1 LEADERS (22)
- Design Arena (Graphic Design) (Elo): gpt-image-2 (1477.0) beat chestnut (1354.0) by 123.0
- Design Arena (Image) (Elo): gpt-image-2 (1405.0) beat chestnut (1329.0) by 76.0
- Design Arena (Image Editing) (Elo): gpt-image-2 (1380.0) beat hazel (1327.0) by 53.0
- AA GDPval (ELO): GPT-5.5 (xhigh) (1784.53) beat Claude Opus 4.7 (max) (1752.68) by 31.85
- Vals AI Finance Agent (Accuracy (%)): qwen3.6-max-preview (86.67) beat claude-opus-4-7 (64.37) by 22.3
- Vellum - ARC-AGI-2 (Score (%)): GPT-5.5 (85.0) beat Claude Opus 4.6 (68.8) by 16.2
- Design Arena (UI Components) (Elo): claude-opus-4-7-thinking (1394.0) beat claude-opus-4-6 (1383.0) by 11.0
- Chatbot Arena (Document) (Elo): claude-opus-4-6-thinking (1528.0) beat claude-opus-4-7 (1521.0) by 7.0
- LiveBench Integrals With Game (Score): gpt-5.5-high (100.0) beat gpt-5.4-xhigh (93.0) by 7.0
- AA APEX-Agents (Pass@1 (%)): GPT-5.5 (xhigh) (37.7) beat GPT-5.4 (xhigh) (33.3) by 4.4
- AA CritPt (Accuracy (%)): GPT-5.5 (xhigh) (27.1) beat GPT-5.4 (xhigh) (23.4) by 3.7
- Terminal-Bench 2.0 (Accuracy (%)): gpt-5.5_unknown (82.0) beat gemini-3.1-pro-preview (78.4) by 3.6
- LLM Stats (BrowseComp) (Score (%)): GPT-5.5 Pro (90.1) beat Claude Mythos Preview (86.9) by 3.2
- AA Terminal-Bench Hard (Accuracy (%)): GPT-5.5 (xhigh) (60.6) beat GPT-5.4 (xhigh) (57.6) by 3.0
- Artificial Analysis Intelligence Index (Intelligence Index): GPT-5.5 (xhigh) (60.24) beat Claude Opus 4.7 (max) (57.28) by 2.96
- LiveBench Plot Unscrambling (Score): gpt-5.5-high (76.28) beat gemini-3.1-pro-preview-high (74.13) by 2.15
- LiveBench Table Join (Score): gpt-5.5-xhigh (54.5) beat gemini-3.1-pro-preview-high (52.35) by 2.15
- LLM Stats (Toolathlon) (Score (%)): GPT-5.5 (55.6) beat GPT-5.4 (54.6) by 1.0
- LLM Stats (VideoMME w sub.) (Score (%)): Qwen3.6-27B (87.7) beat Qwen3.5-122B-A10B (87.3) by 0.4
- Vals AI CorpFin v2 (Accuracy (%)): gpt-5.5 (68.42) beat kimi-k2.5-thinking (68.26) by 0.16
- LLM Stats (EmbSpatialBench) (Score (%)): Qwen3.6-27B (84.6) beat Qwen3.5-27B (84.5) by 0.1
- LLM Stats (RefSpatialBench) (Score (%)): Qwen3.6-27B (70.0) beat Qwen3 VL 235B A22B Thinking (69.9) by 0.1
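
Each NEW #1 LEADERS entry follows the pattern "<benchmark> (<metric>): <winner> (<score>) beat <runner-up> (<score>) by <margin>", where the margin is the difference between the two scores. The Python sketch below shows one way to parse an entry and recheck that arithmetic. The names `LeaderChange`, `parse_leader_entry`, and `margin_is_consistent` are hypothetical helpers written for this illustration, not part of any tool that produces the digest, and the pattern is inferred only from the entries listed above.

```python
import re
from typing import NamedTuple, Optional

# A plain decimal number such as 1784.53 or 7.0 (format assumed from the digest).
NUM = r"-?\d+(?:\.\d+)?"

# Matches the part after the first ": " in an entry:
#   <winner> (<score>) beat <runner-up> (<score>) by <margin>
ENTRY_RE = re.compile(
    rf"^(?P<winner>.+) \((?P<w>{NUM})\) beat "
    rf"(?P<runner_up>.+) \((?P<r>{NUM})\) by (?P<margin>{NUM})$"
)


class LeaderChange(NamedTuple):
    """One parsed leader entry (hypothetical container for this sketch)."""
    label: str            # benchmark name plus metric, e.g. "AA GDPval (ELO)"
    winner: str
    winner_score: float
    runner_up: str
    runner_up_score: float
    margin: float

    def margin_is_consistent(self, tol: float = 0.005) -> bool:
        # The reported margin should equal the score gap up to rounding.
        gap = abs(self.winner_score - self.runner_up_score)
        return abs(gap - self.margin) < tol


def parse_leader_entry(line: str) -> Optional[LeaderChange]:
    """Parse one entry; return None if it does not fit the assumed pattern."""
    text = line.strip()
    if text.startswith("- "):
        text = text[2:]
    # Labels contain nested parentheses, so split on the first ": " instead
    # of trying to match the label with a regular expression.
    label, sep, rest = text.partition(": ")
    if not sep:
        return None
    m = ENTRY_RE.match(rest)
    if m is None:
        return None
    return LeaderChange(
        label, m["winner"], float(m["w"]),
        m["runner_up"], float(m["r"]), float(m["margin"]),
    )


if __name__ == "__main__":
    sample = ("- AA GDPval (ELO): GPT-5.5 (xhigh) (1784.53) "
              "beat Claude Opus 4.7 (max) (1752.68) by 31.85")
    entry = parse_leader_entry(sample)
    print(entry)
    print("margin consistent:", entry.margin_is_consistent())
```

Splitting on the first ": " keeps labels such as "Vellum - ARC-AGI-2 (Score (%))" intact, since their nested parentheses are awkward to capture with a single pattern. The tolerance absorbs floating-point error and the digest's two-decimal rounding.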