AI Benchmark Digest — 2026-05-03
=== DAILY ===

NEW BENCHMARKS (9)
- Open-R1 Eval Leaderboard (Average Accuracy (%)): leader Qwen3-32B (73.74), 37 models
- SeaEval (Average Score (%)): leader GPT4o_0513 (72.86), 30 models
- FastEval (Total Score): leader GPT-4-0613 (77.78), 33 models
- SciEvalKit (Scientific Capability Score): leader Gemini-3-Pro (48.74), 10 models
- LLM Benchmarker Suite (Average Score (%)): leader LLaMA-2 (70B) (62.53), 8 models
- LMArena Preference Proxy (Evaluator Accuracy (%)): leader gemma-2-9b-it (64.63), 4 models
- LLMZSZL Leaderboard (Score): leader Qwen2.5-72B-Instruct (69.06), 99 models
- Swahili LLM Leaderboard (Average Score (%)): leader Swahili Gemma (61.32), 5 models
- MMLU-by-task Leaderboard (MMLU Average (%)): leader FashionGPT-70B-V1.1 (70.99), 1257 models
NEW SCORES FROM TOP-10 MODELS (1)
- GPT-5.5 Pro on VoxelBench: 2119.0 Rating (#1/37)
NEW #1 LEADERS (6)
- VoxelBench (Rating): GPT-5.5 Pro (2119.0) beat GPT-5.5 (xHigh) (2024.0) by 95.0
- Design Arena (Video Editing) (Elo): happy-horse-1.0 (1333.0) beat wan-v2.7-v2v (1322.0) by 11.0
- Spider 2.0-DBT (Accuracy (%)): Databao Agent (58.82) beat SignalPilot Agent (51.56) by 7.26
- Chess Puzzles (Epoch AI) (Accuracy (%)): gpt-5.5-pro-pre-release_xhigh (64.0) beat gpt-5.4-pro-2026-03-05_xhigh (58.6) by 5.4
- WeirdML (Average Score): gpt-5.5 (high) (83.9) beat gpt-5.3-codex (xhigh) (79.3) by 4.6
- BridgeBench Hallucination (Score): Grok 4.3 (79.8) beat Gemini 3.1 Pro (79.1) by 0.7
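
Each margin above is simply the new leader's score minus the previous leader's score on the same metric. A minimal Python sketch (hypothetical, not part of any digest tooling) that reproduces the quoted margins from the figures above:

# Recompute each "beat ... by" margin as leader score minus runner-up score.
# Scores are copied verbatim from the NEW #1 LEADERS list.
new_leaders = [
    # (benchmark, leader score, runner-up score)
    ("VoxelBench", 2119.0, 2024.0),
    ("Design Arena (Video Editing)", 1333.0, 1322.0),
    ("Spider 2.0-DBT", 58.82, 51.56),
    ("Chess Puzzles (Epoch AI)", 64.0, 58.6),
    ("WeirdML", 83.9, 79.3),
    ("BridgeBench Hallucination", 79.8, 79.1),
]

for benchmark, leader, runner_up in new_leaders:
    margin = round(leader - runner_up, 2)  # round to suppress float noise (e.g. 7.26)
    print(f"{benchmark}: margin {margin}")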