AI Benchmark Digest — 2026-05-08
=== DAILY ===

NEW BENCHMARKS (8)
- EuroEval Albanian NLU (NLU Average Score (%)): leader gemini-3.1-pro-preview (61.17), 208 models
- EuroEval Bosnian NLU (NLU Average Score (%)): leader Ministral-3-14B-Reasoning-2512 (66.0), 214 models
- EuroEval Albanian Knowledge (Knowledge Average Score (%)): leader gemini-3-flash-preview#thinking (96.46), 167 models
- EuroEval Albanian Common Sense Reasoning (Common Sense Reasoning Average Score (%)): leader gemini-3.1-pro-preview (85.24), 155 models
- MoNaCo (F1): leader o3 (61.18), 15 models
- IMO-Bench (Advanced ProofBench Accuracy (%)): leader Aletheia (91.9), 9 models
- ChartMuseum (Overall Accuracy (%)): leader Gemini-3.1-Pro (80.7), 22 models
- SvelteBench (Average pass@1 (%)): leader claude-opus-4-6 (100.0), 123 models

NEW #1 LEADERS (3)
- SEAL - AudioMultiChallenge - Audio Output (Score): gpt-realtime-2 (xHigh) (48.45) beat gemini-3.1-flash-live-preview (Thinking) (36.06) by 12.39
- Story Theory Bench (Score (%)): glm-5 (99.6) beat deepseek-v3.2 (92.2) by 7.4
- SEAL - SWE Atlas - Codebase QnA (Score): GPT 5.5 (Codex) (45.43) beat Gpt 5.4 xHigh (Codex) (40.8) by 4.63
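
The "by" figures in the new-leader entries read as the new leader's score minus the score of the model it beat. A minimal sketch of that check follows, assuming this is how the margins are derived. The helper name `margin` and the `entries` list are hypothetical and only restate the three entries from this digest.

```python
from decimal import Decimal


def margin(leader_score: str, beaten_score: str) -> Decimal:
    """Return leader score minus beaten score, using exact decimal arithmetic."""
    return Decimal(leader_score) - Decimal(beaten_score)


# (benchmark, new leader score, beaten model score, margin as printed in the digest)
# Scores are kept as strings so Decimal parses them exactly.
entries = [
    ("SEAL - AudioMultiChallenge - Audio Output", "48.45", "36.06", "12.39"),
    ("Story Theory Bench", "99.6", "92.2", "7.4"),
    ("SEAL - SWE Atlas - Codebase QnA", "45.43", "40.8", "4.63"),
]

for name, new, old, reported in entries:
    computed = margin(new, old)
    # Flag any entry whose printed margin differs from the recomputed one.
    status = "ok" if computed == Decimal(reported) else "MISMATCH"
    print(f"{name}: {computed} ({status})")
```

Decimal is used instead of float so that differences such as 99.6 minus 92.2 compare exactly against the printed value without rounding noise.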