Mikhail Doroshenko

May 7, 2026

AI Benchmark Digest — 2026-05-07

=== DAILY ===

NEW BENCHMARKS (19)
- LIBRA - MatreshkaNames * (Dataset Total Score (%)): leader Qwen_Qwen3-30B-A3B-Instruct-2507 (81.2), 7 models
- LIBRA - ruSciPassageCount * (Dataset Total Score (%)): leader Qwen_Qwen3-30B-A3B-Instruct-2507 (25.77), 7 models
- LIBRA - ru2WikiMultihopQA * (Dataset Total Score (%)): leader Qwen_Qwen3-30B-A3B-Instruct-2507 (66.63), 7 models
- LIBRA - LongContextMultiQ * (Dataset Total Score (%)): leader 01-ai_Yi-34B-200K (53.14), 7 models
- LIBRA - LibrusecMHQA * (Dataset Total Score (%)): leader Qwen_Qwen3-30B-A3B-Instruct-2507 (51.0), 7 models
- LIBRA - ruBABILongQA3 * (Dataset Total Score (%)): leader Qwen_Qwen3-30B-A3B-Instruct-2507 (38.38), 7 models
- Kernel Arena - KernelBench HIP (Mean Correctness+Speedup): leader GPT-5.2 (15.463), 11 models
- Kernel Arena - WaferBench NVFP4 (Mean Correctness+Speedup): leader Gemini 3.1 Pro (2.274), 4 models
- MathArena - ARXIV_FALSE April (Accuracy (%)): leader GPT-5.5 (xhigh) (72.13), 6 models
- MathArena - ARXIV April (Accuracy (%)): leader GPT-5.5 (xhigh) (65.48), 6 models
- METR Benchmark (80% Horizon) (80% Time Horizon (hours)): leader gemini 3 1 pro (1.5), 25 models
- LLM Stats (HealthBench) (Score (%)): leader Kimi K2-Thinking-0905 (58.0), 5 models
- SCORE Robustness (Accuracy) (Average Accuracy (%)): leader Llama-3.1-70B-Instruct (67.02), 6 models
- SCORE Robustness (Consistency) (Average Consistency Rate (%)): leader Llama-3.1-70B-Instruct (72.39), 6 models
- Multilingual MMLU Leaderboard (Average Accuracy (%)): leader Claude-3.5-Sonnet (77.39), 17 models
- Pinocchio Italian Leaderboard (Average Accuracy (%)): leader gemma-2-27b-it (70.97), 45 models
- Ukrainian LLM Leaderboard (Average Score (%)): leader gemma-4-26B-A4B-it (reasoning) (63.29), 13 models
- Arabic Broad Leaderboard (Average Score (0-10)): leader gemini-3-pro-preview (9.204), 87 models
- Darija Chatbot Arena (Elo Rating): leader GPT-4o (1404.8), 13 models

NEW #1 LEADERS (3)
- FoodTruckBench (Net Worth ($)): GPT-5.5 (61408.0) beat Claude Opus 4.6 (49519.0) by 11889.0
- ASCIIBench (ELO Rating): claude-opus-4.5 (1656.0) beat claude-opus-4.1 (1651.0) by 5.0
- Kaggle FACTS Parametric (Score (%)): Gemini 3.1 Pro Preview (78.96) beat GPT-5.5 (78.04) by 0.92

