AI Benchmark Digest — 2026-04-20
=== DAILY ===

NEW BENCHMARKS (30)
- ArtifactsBenchmark (Average Score): leader SWE-Bench (2294.0), 30 models
- BIRD-Interact (c-Interact) (Normalized Reward): leader Gemini-2.5-Pro (20.92), 7 models
- BIRD-Interact (a-Interact) (Normalized Reward): leader GPT-5 (25.52), 7 models
- BIRD-CRITIC (Score): leader o3-mini-2025-01-31 (34.5), 6 models
- MathBench (Average Score): leader GPT-4o-2024-05-13 (70.9), 34 models
- Arena-Hard v2 (GPT-4.1 Judge) (Win Rate (%)): leader o3-2025-04-16 (87.0), 28 models
- Arena-Hard Creative Writing (Win Rate (%)): leader gemini-2.5 (90.8), 28 models
- Chatbot Arena (Document) (Elo): leader claude-opus-4-7 (1521.0), 20 models
- MMSI-Bench (Accuracy (%)): leader Gemini-3-pro (49.2), 38 models
- EmoBench-M (Average Score): leader Gemini-3.0-Pro (70.5), 26 models
- OrchestrationBench (Average Score): leader claude-opus-4-7 (Bedrock) (85.07), 17 models
- LLM-AggreFact (Balanced Accuracy (%)): leader Bespoke-Minicheck-7B (77.41), 39 models
- DROP (F1 Score): leader o1 (90.2), 31 models
- LLM Stats (Claw-Eval) (Score (%)): leader GLM-5V-Turbo (75.0), 5 models
- LLM Stats (EmbSpatialBench) (Score (%)): leader Qwen3.5-27B (84.5), 5 models
- LLM Stats (RefSpatialBench) (Score (%)): leader Qwen3 VL 235B A22B Thinking (69.9), 5 models
- LLM Stats (ZEROBench-Sub) (Score (%)): leader Qwen3.5-122B-A10B (36.2), 5 models
- HindiGen (3C3H Score (%)): leader o3-2025-04-16 (85.56), 35 models
- Arabic IFEval (Arabic Accuracy (%)): leader claude-3.5-sonnet (75.9), 40 models
- Turkish MMLU (Accuracy (%)): leader gpt-4o (84.8), 66 models
- IgakuQA119 (Accuracy (%)): leader Gemini-2.5-Pro (97.0), 27 models
- PM-LLM-Benchmark (Score): leader gpt-5.4-2026-03-05-XHIGH (37.8), 128 models
- ResearcherBench (Coverage (%)): leader OpenAI Deep Research (70.3), 10 models
- CyberMetric (Accuracy (%)): leader GPT-4o (92.45), 25 models
- CyberBench (NLP) (Avg Score (%)): leader GPT-4 (74.03), 13 models
- TACTL (Accuracy (%)): leader DeepSeek-R1 (95.9), 8 models
- SecCodePLT (Score (%)): leader CodeLlama-34B-Instruct (67.88), 6 models
- CyberSecEval-3 (Score (%)): leader Llama-3-70B (44.14), 6 models
- RedCode (Score (%)): leader Llama-3.1-70B-Instruct (75.54), 15 models
- NYU CTF Bench (Pass@1 (%)): leader Claude-3.5-Sonnet (16.25), 3 models
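
Note on Pass@1 (reported by NYU CTF Bench above): the digest does not say how each benchmark estimates it, but Pass@k is conventionally computed with the unbiased estimator of Chen et al. (2021) over n samples per task, of which c pass. A minimal Python sketch under that assumption:

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).

        n -- total samples generated per task
        c -- number of those samples that passed
        k -- the k in pass@k
        """
        if n - c < k:
            # Not enough failing samples to fill a k-subset, so pass@k is 1.
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    # Example: 3 of 20 attempts solve a task.
    print(pass_at_k(n=20, c=3, k=1))  # 0.15

For k=1 this reduces to the plain solve rate c/n, averaged across tasks.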

NEW MODELS (6)
- Qwen3.6 Max Preview — ELO 1954, #19/902 (above: GPT-5.2 (Medium), below: GPT-5.2 Codex (xHigh))
- medllama3-v10 — ELO 1371, #638/902 (above: LLaVA-LLaMA-3-8B, below: GPT-3)
- medllama3-v11 — ELO 1364, #648/902 (above: suzume-llama-3-8B-multilingual, below: ollama_v7)
- Collaiborator-MEDLLM-Llama-3-8B-v2-1 — ELO 1351, #679/902 (above: Emu3_chat, below: Qwen-14B-Chat)
- Collaiborator-MEDLLM-Llama-3-8B-v2-4 — ELO 1349, #685/902 (above: JSL-MedMNX-7B-SFT, below: LFM2 8B A1B)
- Collaiborator-MEDLLM-Llama-3-8B-v2-3 — ELO 1346, #692/902 (above: Medical-Llama3-8B, below: Llama 2 Chat 13B)

NEW #1 LEADERS (1)
- Chatbot Arena (Vision) (Arena Score): claude-opus-4-7-thinking (1307.0) beat claude-opus-4-6-thinking (1302.0) by 5.0
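
Note on the 5.0-point gap: Arena scores are Elo-style ratings, so a small rating difference translates to only a slight head-to-head edge. A sketch assuming the standard Elo logistic with a 400-point scale (the usual convention; the leaderboard's exact model fit may differ):

    def elo_expected_score(rating_a: float, rating_b: float) -> float:
        """Expected score of A vs. B under the standard Elo logistic (400-point scale)."""
        return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

    # claude-opus-4-7-thinking (1307.0) vs. claude-opus-4-6-thinking (1302.0)
    print(elo_expected_score(1307.0, 1302.0))  # ~0.507

So today's new #1 is expected to win only about 50.7% of head-to-head matchups against the model it displaced.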