AI Benchmark Digest — 2026-05-09

=== DAILY ===

NEW BENCHMARKS (8)
- Factory Code Review Benchmark (Mean F1 (%)): leader GPT-5.2 (60.5), 13 models
  Factory benchmark for code review quality, scoring model comments against expected findings with mean F1 across realistic pull request review tasks.
- EuroEval Albanian NLU (NLU Average Score (%)): leader gemini-3.1-pro-preview (61.17), 208 models
  Albanian-language EuroEval natural-language-understanding suite, separating NLU task performance from the broader all-task EuroEval aggregate.
- EuroEval Bosnian NLU (NLU Average Score (%)): leader Ministral-3-14B-Reasoning-2512 (66.0), 214 models
  Bosnian-language EuroEval natural-language-understanding suite, separating NLU task performance from the broader all-task EuroEval aggregate.
- EuroEval Albanian Knowledge (Knowledge Average Score (%)): leader gemini-3-flash-preview#thinking (96.46), 167 models
  EuroEval Albanian knowledge category: language-specific factual or domain-knowledge tasks from EuroEval's public albanian_all.csv leaderboard, scored as the average task score for each model.
- EuroEval Albanian Common Sense Reasoning (Common Sense Reasoning Average Score (%)): leader gemini-3.1-pro-preview (85.24), 155 models
  EuroEval Albanian common-sense reasoning category: language-specific commonsense tasks from EuroEval's public albanian_all.csv leaderboard, scored as the average task score for each model.
- IMO-Bench (Advanced ProofBench Accuracy (%)): leader Aletheia (91.9), 9 models
  Advanced IMO-ProofBench leaderboard for rigorous mathematical proof writing on olympiad-level problems.
- ChartMuseum (Overall Accuracy (%)): leader Gemini-3.1-Pro (80.7), 22 models
  Chart question-answering benchmark over real-world charts, testing visual, textual, and synthesis reasoning.
- SvelteBench (Average pass@1 (%)): leader claude-opus-4-6 (100.0), 123 models
  Frontend coding benchmark for Svelte component tasks, scored by average pass@1.

NEW MODELS (1)
- Grok 4.3 (Non-reasoning) — ELO 1647, #259/843 (above: O1 Preview, below: DeepSeek V3.2 Exp)
  AA GDPval: 1306.14 (#52/360)
  AA MMMU-Pro: 64.8 (#88/188)
  AA Omniscience: -32.3 (#121/388)
  Artificial Analysis Intelligence Index: 31.02 (#139/482)
  AA SciCode: 37.4 (#146/477)
  AA TAU-2 Bench: 65.8 (#148/402)
  AA Terminal-Bench Hard: 18.9 (#149/397)
  AA IFBench: 47.6 (#165/411)
  AA CritPt: 0.0 (#182/388)
  AA Humanity's Last Exam: 6.5 (#226/479)

NEW SCORES FROM TOP-10 MODELS (1)
- GPT-5.5 (xHigh) on Wolfram LLM Benchmarking Project: 68.8 Correct Functionality (%) (#6/451)

NEW #1 LEADERS (4)
- FrontierMath - Tier 4 (Accuracy (%, 48 problems)): AI co-mathematician (47.9) beat GPT-5.5 Pro (xhigh) (39.6) by 8.3
- METR Benchmark (50% Time Horizon (hours)): claude mythos preview early (17.41) beat claude opus 4 6 (11.98) by 5.43
- METR Benchmark (80% Horizon) (80% Time Horizon (hours)): claude mythos preview early (3.1) beat gemini 3 1 pro (1.5) by 1.6
- ForecastBench (Overall Score (higher is better)): green tree (68.2) beat Cassi ensemble_2_crowdadj (67.8) by 0.4
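
The "by" margins in the NEW #1 LEADERS entries are plain differences between the new and previous top scores. A minimal Python sketch for rechecking them follows. The scores are copied from the entries above; the tuple layout, the shortened benchmark labels, and the helper names `margin` and `implied_solved` are illustrative assumptions, not part of any leaderboard tooling.

```python
# Recompute the margins quoted in the NEW #1 LEADERS entries.
# Scores are copied from the digest; labels are shortened for readability.
leaders = [
    # (benchmark label, new leader score, previous leader score)
    ("FrontierMath - Tier 4", 47.9, 39.6),
    ("METR Benchmark (50% Time Horizon)", 17.41, 11.98),
    ("METR Benchmark (80% Horizon)", 3.1, 1.5),
    ("ForecastBench", 68.2, 67.8),
]


def margin(new_score, old_score):
    """Difference between two scores, rounded to hide float noise."""
    return round(new_score - old_score, 2)


def implied_solved(accuracy_pct, n_problems=48):
    """Whole-number solved count implied by an accuracy percentage.

    This is an inference from the stated problem count (48 for
    FrontierMath Tier 4), not a figure reported in the digest.
    """
    return round(accuracy_pct / 100 * n_problems)


margins = {name: margin(new, old) for name, new, old in leaders}

for name, value in margins.items():
    print(f"{name}: +{value}")

print("FrontierMath Tier 4 implied solved counts:",
      implied_solved(47.9), "vs", implied_solved(39.6))
```

Under these assumptions, the FrontierMath Tier 4 accuracies would correspond to 23 and 19 of the 48 problems, so the 8.3-point gap amounts to roughly four problems.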