AI Benchmark Digest — 2026-04-11
=== DAILY ===

NEW BENCHMARKS (20)
- YKS 2025 LLM Leaderboard (Total Score, out of 200): leader GPT-5 (194.0), 8 models
- KOFFVQA (Overall Score, %): leader gemini-2.5-pro-exp-03-25 (89.67), 82 models
- StickToYourRole (Cardinal Score): leader Qwen2.5-VL-72B-Instruct (83.6), 32 models
- TRAIL GAIA (Joint Accuracy): leader Gemini-2.5-Pro-Preview-05-06 (18.3), 8 models
- TRAIL SWE (Joint Accuracy): leader Gemini-2.5-Pro-Preview-05-06 (5.0), 5 models
- IFEval Leaderboard (Final Score): leader Llama 3 70B (83.31), 27 models
- FACTS Leaderboard (Combined Score, %): leader DeepSeek-R1-Distill-Qwen-14B (45.76), 34 models
- CPTU Bench (Average Score, 1-5): leader Qwen3.5-27B thinking (API) (4.34), 93 models
- Polish EQ-Bench (EQ-Bench Score): leader Mistral-Large-Instruct-2407 (78.07), 102 models
- ShaderMatch (Clone Match Rate, %): leader starcoder2-15b (18.78), 49 models
- FMNB Leaderboard (Score): leader Llama-3.3-70B-Instruct (100.0), 49 models
- Open Italian LLM Leaderboard (F1 Score): leader Gemma_QA_ITA_v3 (73.19), 16 models
- Subquadratic LLM Leaderboard (Average Score): leader v5-EagleX-v2-7B-HF (48.66), 39 models
- ClinicBench (Average Score): leader Task-specific SOTA (69.44), 23 models
- Q-Bench (Overall Accuracy, %): leader BlueImage-GPT (Close-Source) (83.48), 25 models
- ChineseSafe Benchmark (Accuracy, %): leader Deepexi-Guard-3B (78.26), 55 models
- Open Persian LLM Leaderboard (Average Score, %): leader Llama-3.3-70B-Instruct (70.36), 31 models
- RABBITS (B4BQA Score, %): leader GPT-4 (99.71), 26 models
- Evals for Every Language (Average Score, %): leader gemini-2.5-flash (62.59), 34 models
- Compl-AI Board (Average Compliance Score, %): leader gpt-4-1106-preview (86.44), 15 models
NEW #1 LEADERS (6)
- Design Arena (Video) (Elo): seedance-2.0 (1342.0) beat grok-imagine-video (1309.0) by 33.0
- Chatbot Arena (Vision) (Arena Score): claude-opus-4-6-thinking (1302.0) beat claude-opus-4-6 (1295.0) by 7.0
- ASCIIBench (ELO Rating): claude-opus-4.5 (1668.0) beat claude-opus-4.1 (1663.0) by 5.0
- Vals AI SAGE (Accuracy, %): gemma-4-31b-it (55.03) beat claude-opus-4-5-20251101-thinking (52.09) by 2.94
- SEAL - Humanity's Last Exam (Score): gemini-3.1-pro-preview (thinking high) (46.44) beat gpt-5.4-pro-2026-03-05 (44.32) by 2.12
- SEAL - Humanity's Last Exam (Text Only) (Score): gemini-3.1-pro-preview (thinking high) (47.31) beat gpt-5.4-pro-2026-03-05 (45.32) by 1.99