AI Benchmark Digest — 2026-05-04
=== DAILY ===

NEW MODELS (62)
- Doubao Seed Code — ELO 1645, #209/778 (above: Qwen 3 235B A22B 2507 (Reasoning), below: K-EXAONE (Reasoning))
- K-EXAONE (Reasoning) — ELO 1645, #210/778 (above: Doubao Seed Code, below: O4 Mini (High))
- Gemini 2.5 Flash Preview (Sep '25) (Reasoning) — ELO 1638, #221/778 (above: Nova 2.0 Pro Preview (Low), below: DeepSeek V3.1 (Thinking))
- Gemma 4 31B (Non-reasoning) — ELO 1626, #232/778 (above: Kimi K2.5 (Non-reasoning), below: Gemma 4 26B A4B (Reasoning))
- ERNIE 5.0 Thinking Preview — ELO 1622, #240/778 (above: DeepSeek V3.2, below: Claude Opus 4)
- EXAONE 4.5 33B — ELO 1619, #245/778 (above: qwen3.5-flash, below: Grok 4 Fast (Reasoning))
- Nemotron Cascade 2 30B A3B — ELO 1591, #288/778 (above: Gemini 2.5 Flash (Thinking), below: GLM-4.7 (Non-reasoning))
- Gemini 2.5 Flash Preview (Sep '25) (Non-reasoning) — ELO 1579, #309/778 (above: Tencent HY 2.0 Instruct, below: DeepSeek V3.1 (Non-reasoning))
- Gemma 4 26B A4B (Non-reasoning) — ELO 1579, #311/778 (above: DeepSeek V3.1 (Non-reasoning), below: MiniMax M1 80k)
- JT-MINI — ELO 1567, #329/778 (above: GPT-5.4 Mini (Low), below: Hermes 4 405B)
- MiniMax M1 40k — ELO 1551, #347/778 (above: Claude Haiku 4.5, below: Gemini 2.5 Flash-Lite Preview (Sep '25) (Reasoning))
- K2 Think V2 — ELO 1550, #349/778 (above: Gemini 2.5 Flash-Lite Preview (Sep '25) (Reasoning), below: DeepSeek V3.1)
- HyperCLOVA X SEED Think (32B) — ELO 1546, #354/778 (above: Ling-1T, below: GPT-4.1)
- Mi:dm K 2.5 Pro — ELO 1536, #369/778 (above: Qwen 3.5 4B (Non-reasoning), below: Qwen 3.5 4B)
- Gemini 2.0 Flash Thinking Experimental (Jan '25) — ELO 1534, #373/778 (above: Qwen 3.5 9B, below: Kimi K2)
- K-EXAONE (Non-reasoning) — ELO 1527, #380/778 (above: Qwen 3 VL 32B (Thinking), below: Qwen 3 VL 235B A22B (Thinking))
- Solar Pro 3 — ELO 1525, #383/778 (above: LongCat-Flash-Chat, below: GLM-4.7 Flash (Non-reasoning))
- Solar Open 100B (Reasoning) — ELO 1521, #388/778 (above: Qwen 3 Next 80B A3B, below: GPT-5.4 Nano)
- Mi:dm K 2.5 Pro Preview — ELO 1514, #392/778 (above: GPT-OSS-120B, below: Claude 3.5 Sonnet)
- EXAONE 4.0 32B (Reasoning) — ELO 1511, #396/778 (above: Devstral 2, below: Gemini 2.5 Flash-Lite Preview (Sep '25) (Non-reasoning))
- GPT-4o (ChatGPT) — ELO 1493, #416/778 (above: Mistral Large 3, below: GPT-4.1 Mini)
- GPT-4o (March 2025, chatgpt-4o-latest) — ELO 1492, #418/778 (above: GPT-4.1 Mini, below: GPT-5.4 Nano (Low))
- Gemma 4 E4B (Reasoning) — ELO 1475, #438/778 (above: Hermes 4 - Llama-3.1 70B (Reasoning), below: GPT-4o (Aug '24))
- Solar Pro 2 (Preview) (Reasoning) — ELO 1474, #441/778 (above: abab7, below: GPT-4 Turbo)
- Solar Pro 2 (Reasoning) — ELO 1456, #463/778 (above: Gemini 2.0 Flash Lite, below: Solar Pro 2 (Preview) (Non-reasoning))
- Solar Pro 2 (Preview) (Non-reasoning) — ELO 1456, #464/778 (above: Solar Pro 2 (Reasoning), below: Qwen 3 VL 8B)
- Llama 3.3 Nemotron Super 49B v1 (Reasoning) — ELO 1448, #476/778 (above: Nova 2.0 Omni (Non-reasoning), below: Qwen 3 14B (Reasoning))
- Step3 VL 10B — ELO 1446, #479/778 (above: Qwen 2.5 Max, below: Qwen3 Omni 30B A3B (Reasoning))
- Tri-21B-Think — ELO 1441, #489/778 (above: Pixtral Large, below: DeepSeek R1 Distill Llama 70B)
- NVIDIA Nemotron 3 Nano 4B — ELO 1436, #496/778 (above: Qwen 3 30B A3B (Thinking), below: Qwen 3 30B A3B)
- Gemini 2.0 Flash-Lite (Feb '25) — ELO 1433, #498/778 (above: Qwen 3 30B A3B, below: QwQ-32B)
- Granite 4.1 30B — ELO 1431, #501/778 (above: K2-V2 (Low), below: Qwen 1.5 110B)
- Llama 3.1 Tulu3 405B — ELO 1425, #509/778 (above: Mistral-Small-Instruct-2409, below: Command A)
- Gemma 4 E2B (Reasoning) — ELO 1405, #531/778 (above: Llama 4 Maverick, below: Gemma 4 E4B (Non-reasoning))
- Gemma 4 E4B (Non-reasoning) — ELO 1405, #532/778 (above: Gemma 4 E2B (Reasoning), below: Llama 3.1 70B)
- Solar Pro 2 (Non-reasoning) — ELO 1403, #538/778 (above: Qwen 3 8B, below: ERNIE-4.5-21B-A3B (Thinking))
- Gemini 1.5 Flash-8B — ELO 1400, #540/778 (above: ERNIE-4.5-21B-A3B (Thinking), below: WizardLM-2 8x22B)
- QwQ 32B-Preview — ELO 1385, #556/778 (above: Qwen 3 4B 2507, below: DeepSeek R1 0528 Qwen3 8B)
- EXAONE 4.0 32B (Non-reasoning) — ELO 1382, #559/778 (above: Qwen 3 14B (Non-reasoning), below: Qwen 3 4B (Non-reasoning))
- Llama 3.3 Nemotron Super 49B v1 (Non-reasoning) — ELO 1364, #576/778 (above: Jamba 1.6 Large, below: Ministral 3 14B)
- Gemma 4 E2B (Non-reasoning) — ELO 1355, #586/778 (above: Mistral Small, below: OLMo 3 32B (Thinking))
- Llama 3.1 Nemotron Nano 4B v1.1 (Reasoning) — ELO 1347, #593/778 (above: Ministral 3 3B, below: Hermes 4 - Llama-3.1 70B (Non-reasoning))
- DeepHermes 3 - Mistral 24B Preview (Non-reasoning) — ELO 1343, #596/778 (above: Phi-3-small-8k-instruct, below: DeepSeek V2)
- Qwen2.5 Coder Instruct 7B — ELO 1312, #622/778 (above: Gemma 3 4B, below: Command-R+)
- Ling-mini-2.0 — ELO 1305, #627/778 (above: OLMo 3 7B (Thinking), below: Gemma 2 27B)
- Gemma 3n E4B Instruct Preview (May '25) — ELO 1304, #629/778 (above: Gemma 2 27B, below: Llama 3.1 8B)
- Granite 4.1 3B — ELO 1302, #632/778 (above: Qwen 2.5 7B, below: Claude 2.1)
- Jamba Reasoning 3B — ELO 1275, #641/778 (above: Claude 3 Sonnet, below: Qwen 1.5 14B)
- LFM 40B — ELO 1265, #650/778 (above: Ministral-8B-Instruct-2410, below: Qwen 3 1.7B (Thinking))
- Exaone 4.0 1.2B (Reasoning) — ELO 1259, #655/778 (above: SOLAR-10.7B-Instruct-v1.0, below: LFM2 8B A1B)
- Llama 2 Chat 13B — ELO 1242, #667/778 (above: Qwen-14B, below: Mistral 7B Instruct)
- Exaone 4.0 1.2B (Non-reasoning) — ELO 1234, #671/778 (above: Qwen3.5 0.8B (Reasoning), below: Llama 3.2 3B)
- Granite 4.0 H 1B — ELO 1205, #688/778 (above: Llama 3 3B, below: Qwen 3 0.6B)
- Molmo 7B-D — ELO 1204, #690/778 (above: Qwen 3 0.6B, below: LFM2.5-1.2B)
- DeepHermes 3 - Llama-3.1 8B Preview (Non-reasoning) — ELO 1198, #694/778 (above: Starling-LM-7B-beta, below: Granite 4.0 1B)
- Granite 4.0 1B — ELO 1197, #695/778 (above: DeepHermes 3 - Llama-3.1 8B Preview (Non-reasoning), below: Granite 3.3 8B (Non-reasoning))
- OLMo 2 32B — ELO 1141, #728/778 (above: LFM2 1.2B, below: SmolLM2-1.7B-Instruct)
- Phi-3 Mini Instruct 3.8B — ELO 1126, #733/778 (above: Qwen2.5-0.5B-Instruct, below: Gemma 3 1B)
- Granite 4.0 350M — ELO 1103, #739/778 (above: internlm-7B, below: mpt-30B)
- Gemma 3 270M — ELO 1088, #744/778 (above: vicuna-13B-v1.1, below: Yi 6B (Base))
- Granite 4.0 H 350M — ELO 1077, #748/778 (above: falcon-7B, below: Baichuan-2-7B-Base)
- OLMo 2 7B — ELO 1071, #752/778 (above: opt-13B, below: Llama 3.2 1B)
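
Each NEW MODELS entry above follows the same pattern: model name, ELO score, rank out of the total, and the two neighbouring models. For readers who want to load these entries into a script, the following is a minimal sketch of one way to parse a single entry. The names `Entry`, `ENTRY_RE`, and `parse_entry` are hypothetical and not part of the digest; the pattern is inferred from the entries shown here and may not cover every future variation.

```python
import re
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    """One NEW MODELS entry (hypothetical structure, inferred from the digest)."""
    name: str
    elo: int
    rank: int
    total: int
    above: str
    below: str


# \u2014 is the long dash that separates the model name from its score.
# The leading "- " bullet is optional so the pattern works with or without it.
ENTRY_RE = re.compile(
    r"^(?:-\s+)?(?P<name>.+?) \u2014 ELO (?P<elo>\d+), "
    r"#(?P<rank>\d+)/(?P<total>\d+) "
    r"\(above: (?P<above>.+), below: (?P<below>.+)\)$"
)


def parse_entry(line: str) -> Optional[Entry]:
    """Return the parsed entry, or None if the line does not match the pattern."""
    m = ENTRY_RE.match(line.strip())
    if m is None:
        return None
    return Entry(
        name=m["name"],
        elo=int(m["elo"]),
        rank=int(m["rank"]),
        total=int(m["total"]),
        above=m["above"],
        below=m["below"],
    )


# Sample taken from the list above; the dash is written as an escape.
sample = (
    "- K-EXAONE (Reasoning) \u2014 ELO 1645, #210/778 "
    "(above: Doubao Seed Code, below: O4 Mini (High))"
)
entry = parse_entry(sample)
```

Model names can themselves contain parentheses and " - " (for example "Hermes 4 - Llama-3.1 70B (Reasoning)"), so the sketch anchors on the fixed markers "ELO", "above:", and "below:" instead of splitting on punctuation.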

NEW SCORES FROM TOP-10 MODELS (1)
- GPT-5.5 Pro on VoxelBench: 2125.0 Rating (#1/37)

NEW #1 LEADERS (1)
- VoxelBench (Rating): GPT-5.5 Pro (2125.0) beat GPT-5.5 (xHigh) (2022.0) by 103.0