AI Benchmark Digest — 2026-05-04

May 4, 2026

=== DAILY ===
NEW MODELS (62)
  - Doubao Seed Code — ELO 1645, #209/778 (above: Qwen 3 235B A22B 2507 (Reasoning), below: K-EXAONE (Reasoning))
  - K-EXAONE (Reasoning) — ELO 1645, #210/778 (above: Doubao Seed Code, below: O4 Mini (High))
  - Gemini 2.5 Flash Preview (Sep '25) (Reasoning) — ELO 1638, #221/778 (above: Nova 2.0 Pro Preview (Low), below: DeepSeek V3.1 (Thinking))
  - Gemma 4 31B (Non-reasoning) — ELO 1626, #232/778 (above: Kimi K2.5 (Non-reasoning), below: Gemma 4 26B A4B (Reasoning))
  - ERNIE 5.0 Thinking Preview — ELO 1622, #240/778 (above: DeepSeek V3.2, below: Claude Opus 4)
  - EXAONE 4.5 33B — ELO 1619, #245/778 (above: qwen3.5-flash, below: Grok 4 Fast (Reasoning))
  - Nemotron Cascade 2 30B A3B — ELO 1591, #288/778 (above: Gemini 2.5 Flash (Thinking), below: GLM-4.7 (Non-reasoning))
  - Gemini 2.5 Flash Preview (Sep '25) (Non-reasoning) — ELO 1579, #309/778 (above: Tencent HY 2.0 Instruct, below: DeepSeek V3.1 (Non-reasoning))
  - Gemma 4 26B A4B (Non-reasoning) — ELO 1579, #311/778 (above: DeepSeek V3.1 (Non-reasoning), below: MiniMax M1 80k)
  - JT-MINI — ELO 1567, #329/778 (above: GPT-5.4 Mini (Low), below: Hermes 4 405B)
  - MiniMax M1 40k — ELO 1551, #347/778 (above: Claude Haiku 4.5, below: Gemini 2.5 Flash-Lite Preview (Sep '25) (Reasoning))
  - K2 Think V2 — ELO 1550, #349/778 (above: Gemini 2.5 Flash-Lite Preview (Sep '25) (Reasoning), below: DeepSeek V3.1)
  - HyperCLOVA X SEED Think (32B) — ELO 1546, #354/778 (above: Ling-1T, below: GPT-4.1)
  - Mi:dm K 2.5 Pro — ELO 1536, #369/778 (above: Qwen 3.5 4B (Non-reasoning), below: Qwen 3.5 4B)
  - Gemini 2.0 Flash Thinking Experimental (Jan '25) — ELO 1534, #373/778 (above: Qwen 3.5 9B, below: Kimi K2)
  - K-EXAONE (Non-reasoning) — ELO 1527, #380/778 (above: Qwen 3 VL 32B (Thinking), below: Qwen 3 VL 235B A22B (Thinking))
  - Solar Pro 3 — ELO 1525, #383/778 (above: LongCat-Flash-Chat, below: GLM-4.7 Flash (Non-reasoning))
  - Solar Open 100B (Reasoning) — ELO 1521, #388/778 (above: Qwen 3 Next 80B A3B, below: GPT-5.4 Nano)
  - Mi:dm K 2.5 Pro Preview — ELO 1514, #392/778 (above: GPT-OSS-120B, below: Claude 3.5 Sonnet)
  - EXAONE 4.0 32B (Reasoning) — ELO 1511, #396/778 (above: Devstral 2, below: Gemini 2.5 Flash-Lite Preview (Sep '25) (Non-reasoning))
  - GPT-4o (ChatGPT) — ELO 1493, #416/778 (above: Mistral Large 3, below: GPT-4.1 Mini)
  - GPT-4o (March 2025, chatgpt-4o-latest) — ELO 1492, #418/778 (above: GPT-4.1 Mini, below: GPT-5.4 Nano (Low))
  - Gemma 4 E4B (Reasoning) — ELO 1475, #438/778 (above: Hermes 4 - Llama-3.1 70B (Reasoning), below: GPT-4o (Aug '24))
  - Solar Pro 2 (Preview) (Reasoning) — ELO 1474, #441/778 (above: abab7, below: GPT-4 Turbo)
  - Solar Pro 2 (Reasoning) — ELO 1456, #463/778 (above: Gemini 2.0 Flash Lite, below: Solar Pro 2 (Preview) (Non-reasoning))
  - Solar Pro 2 (Preview) (Non-reasoning) — ELO 1456, #464/778 (above: Solar Pro 2 (Reasoning), below: Qwen 3 VL 8B)
  - Llama 3.3 Nemotron Super 49B v1 (Reasoning) — ELO 1448, #476/778 (above: Nova 2.0 Omni (Non-reasoning), below: Qwen 3 14B (Reasoning))
  - Step3 VL 10B — ELO 1446, #479/778 (above: Qwen 2.5 Max, below: Qwen3 Omni 30B A3B (Reasoning))
  - Tri-21B-Think — ELO 1441, #489/778 (above: Pixtral Large, below: DeepSeek R1 Distill Llama 70B)
  - NVIDIA Nemotron 3 Nano 4B — ELO 1436, #496/778 (above: Qwen 3 30B A3B (Thinking), below: Qwen 3 30B A3B)
  - Gemini 2.0 Flash-Lite (Feb '25) — ELO 1433, #498/778 (above: Qwen 3 30B A3B, below: QwQ-32B)
  - Granite 4.1 30B — ELO 1431, #501/778 (above: K2-V2 (Low), below: Qwen 1.5 110B)
  - Llama 3.1 Tulu3 405B — ELO 1425, #509/778 (above: Mistral-Small-Instruct-2409, below: Command A)
  - Gemma 4 E2B (Reasoning) — ELO 1405, #531/778 (above: Llama 4 Maverick, below: Gemma 4 E4B (Non-reasoning))
  - Gemma 4 E4B (Non-reasoning) — ELO 1405, #532/778 (above: Gemma 4 E2B (Reasoning), below: Llama 3.1 70B)
  - Solar Pro 2 (Non-reasoning) — ELO 1403, #538/778 (above: Qwen 3 8B, below: ERNIE-4.5-21B-A3B (Thinking))
  - Gemini 1.5 Flash-8B — ELO 1400, #540/778 (above: ERNIE-4.5-21B-A3B (Thinking), below: WizardLM-2 8x22B)
  - QwQ 32B-Preview — ELO 1385, #556/778 (above: Qwen 3 4B 2507, below: DeepSeek R1 0528 Qwen3 8B)
  - EXAONE 4.0 32B (Non-reasoning) — ELO 1382, #559/778 (above: Qwen 3 14B (Non-reasoning), below: Qwen 3 4B (Non-reasoning))
  - Llama 3.3 Nemotron Super 49B v1 (Non-reasoning) — ELO 1364, #576/778 (above: Jamba 1.6 Large, below: Ministral 3 14B)
  - Gemma 4 E2B (Non-reasoning) — ELO 1355, #586/778 (above: Mistral Small, below: OLMo 3 32B (Thinking))
  - Llama 3.1 Nemotron Nano 4B v1.1 (Reasoning) — ELO 1347, #593/778 (above: Ministral 3 3B, below: Hermes 4 - Llama-3.1 70B (Non-reasoning))
  - DeepHermes 3 - Mistral 24B Preview (Non-reasoning) — ELO 1343, #596/778 (above: Phi-3-small-8k-instruct, below: DeepSeek V2)
  - Qwen2.5 Coder Instruct 7B — ELO 1312, #622/778 (above: Gemma 3 4B, below: Command-R+)
  - Ling-mini-2.0 — ELO 1305, #627/778 (above: OLMo 3 7B (Thinking), below: Gemma 2 27B)
  - Gemma 3n E4B Instruct Preview (May '25) — ELO 1304, #629/778 (above: Gemma 2 27B, below: Llama 3.1 8B)
  - Granite 4.1 3B — ELO 1302, #632/778 (above: Qwen 2.5 7B, below: Claude 2.1)
  - Jamba Reasoning 3B — ELO 1275, #641/778 (above: Claude 3 Sonnet, below: Qwen 1.5 14B)
  - LFM 40B — ELO 1265, #650/778 (above: Ministral-8B-Instruct-2410, below: Qwen 3 1.7B (Thinking))
  - Exaone 4.0 1.2B (Reasoning) — ELO 1259, #655/778 (above: SOLAR-10.7B-Instruct-v1.0, below: LFM2 8B A1B)
  - Llama 2 Chat 13B — ELO 1242, #667/778 (above: Qwen-14B, below: Mistral 7B Instruct)
  - Exaone 4.0 1.2B (Non-reasoning) — ELO 1234, #671/778 (above: Qwen3.5 0.8B (Reasoning), below: Llama 3.2 3B)
  - Granite 4.0 H 1B — ELO 1205, #688/778 (above: Llama 3 3B, below: Qwen 3 0.6B)
  - Molmo 7B-D — ELO 1204, #690/778 (above: Qwen 3 0.6B, below: LFM2.5-1.2B)
  - DeepHermes 3 - Llama-3.1 8B Preview (Non-reasoning) — ELO 1198, #694/778 (above: Starling-LM-7B-beta, below: Granite 4.0 1B)
  - Granite 4.0 1B — ELO 1197, #695/778 (above: DeepHermes 3 - Llama-3.1 8B Preview (Non-reasoning), below: Granite 3.3 8B (Non-reasoning))
  - OLMo 2 32B — ELO 1141, #728/778 (above: LFM2 1.2B, below: SmolLM2-1.7B-Instruct)
  - Phi-3 Mini Instruct 3.8B — ELO 1126, #733/778 (above: Qwen2.5-0.5B-Instruct, below: Gemma 3 1B)
  - Granite 4.0 350M — ELO 1103, #739/778 (above: internlm-7B, below: mpt-30B)
  - Gemma 3 270M — ELO 1088, #744/778 (above: vicuna-13B-v1.1, below: Yi 6B (Base))
  - Granite 4.0 H 350M — ELO 1077, #748/778 (above: falcon-7B, below: Baichuan-2-7B-Base)
  - OLMo 2 7B — ELO 1071, #752/778 (above: opt-13B, below: Llama 3.2 1B)
NEW SCORES FROM TOP-10 MODELS (1)
  - GPT-5.5 Pro on VoxelBench: 2125.0 Rating (#1/37)
NEW #1 LEADERS (1)
  - VoxelBench (Rating): GPT-5.5 Pro (2125.0) beat GPT-5.5 (xHigh) (2022.0) by 103.0
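The digest does not say how the VoxelBench Rating is computed. Assuming it follows the standard Elo convention (a logistic curve on a 400-point scale), the following minimal sketch shows what the 103.0-point margin in the entry above would imply for a head-to-head comparison. The function name `expected_score` and the 400-point scale are illustrative assumptions, not part of the source.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under an assumed standard Elo logistic model."""
    # A positive gap in A's favour pushes the result above 0.5.
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# VoxelBench ratings taken from the NEW #1 LEADERS entry.
gpt_55_pro = 2125.0
gpt_55_xhigh = 2022.0

margin = gpt_55_pro - gpt_55_xhigh
p = expected_score(gpt_55_pro, gpt_55_xhigh)
print(f"margin = {margin:.1f}, expected score = {p:.3f}")
# prints "margin = 103.0, expected score = 0.644"
```

Under this assumption, a 103-point lead corresponds to an expected score of roughly 0.64 for GPT-5.5 Pro. Two models with equal ratings, such as Doubao Seed Code and K-EXAONE (Reasoning) at 1645, would sit at exactly 0.5.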

Don't miss what's next. Subscribe to Mikhail Doroshenko.