Mikhail Doroshenko

March 25, 2026

AI Benchmark Digest — 2026-03-25

Last 24 Hours

The past day of AI benchmarking was dominated by the arrival of specialized logic tests and a significant shake-up of the visual reasoning and video generation leaderboards.

  • Logic and Factuality Testing: The new SudokuBench suite debuted to test multi-step and single-shot reasoning; GPT-5 High leads on 4x4 grids at 80.0%, but performance drops sharply on 9x9 grids (25.7%). Simultaneously, the Kaggle FACTS series launched to measure grounding and search capabilities, with Gemini 3.1 Pro Preview dominating the Search (85.56%) and Parametric (78.93%) categories.
  • Visual Reasoning Breakthrough: GPT-5.4-2026-03-05 (reasoning effort = high) made a massive debut, seizing the #1 spot on SEAL - VisualToolBench (VTB) with a score of 29.17. It narrowly edged out the new Gemini 3.1 Pro Preview (28.97) and Claude Opus 4-6 Thinking (27.52).
  • Video Generation Shift: A new contender, Libra, has taken the lead on the Design Arena (Video) leaderboard. With an Elo of 1342.0, it dethroned Grok-Imagine-Video by a 16-point margin (see the sketch after this list for what a gap of that size implies head-to-head).
  • High-Accuracy Coding and Tools: GLM-5-Turbo entered the AA TAU-2 Bench at #2 with a near-perfect 98.5% accuracy, while GPT-5.4 secured the #2 spot on CLAW-Eval with an 80.6% score.
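
For context on what a 16-point Elo gap means: Design Arena does not publish its rating parameters, so the sketch below assumes the conventional Elo logistic with a 400-point scale, and elo_win_prob is a hypothetical helper rather than the leaderboard's actual code.

    def elo_win_prob(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
        # Expected probability that A beats B under the standard Elo logistic.
        # The 400-point scale is the chess convention, assumed here because
        # Design Arena's actual parameters are not published.
        return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))

    # Libra (1342.0) vs. a challenger 16 points back, i.e. an implied ~1326.0:
    print(round(elo_win_prob(1342.0, 1326.0), 3))  # ~0.523

Under those assumptions, the new leader's 16-point margin corresponds to winning roughly 52% of head-to-head matchups: a real but thin edge.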

Last 7 Days

The AI benchmark landscape saw a major shake-up this week as OpenAI’s latest iterations dominated high-reasoning and agentic tasks, while Google and Qwen maintained strongholds in specialized industry domains.

  • OpenAI’s GPT-5.4 Series Sweeps the Board: The new gpt-5.4-pro-2026-03-05 has established a dominant lead across the SEAL suite, taking the #1 spot on EnigmaEval (23.82), MultiChallenge (69.23), and the notoriously difficult Humanity's Last Exam (44.32). Meanwhile, gpt-5.4-2026-03-05-medium claimed the top spot on SWE-rebench with a 62.81% resolution rate, edging out the previous leader by 5.59 percentage points.
  • New Benchmarks Target Agency and Logic: Several new evaluations debuted this week, including Tau3-Bench, which tests model performance in retail and telecom environments and where Qwen3.5-397B-A17B leads with scores of up to 97.8%. We also saw the introduction of SudokuBench, a multi-step logic test on which GPT-5 High leads 9x9 puzzles with a 25.7% solve rate, and Kaggle FACTS, a grounding and search evaluation dominated by Gemini 3.1 Pro Preview (85.56% in Search).
  • Google and Anthropic Fight Back in Reasoning: While OpenAI took many leads, Google’s Gemini 3.1 Pro Preview secured #1 on Kaggle FACTS with a 67.71% overall average. Anthropic’s Claude Sonnet 4.6 (high reasoning) debuted at the top of Lechmazur’s Multi-turn Debate leaderboard with a 1617.5 Bradley-Terry rating, demonstrating its strength in sustained rhetorical logic (see the Bradley-Terry sketch after this list).
  • Specialized Model Breakthroughs: In the open-weights and niche sectors, kimi-k2.5 surged to #1 on Multi-Docker-Eval with 41.82%, while calme-3.2-instruct-78b took the lead on the Open LLM Leaderboard with an average score of 52.08. For visual and document tasks, Nanonets OCR-3 claimed the IDP Leaderboard (83.1%) and Libra took the lead in Design Arena (Video) with a 1342.0 Elo.
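
A note on the Bradley-Terry rating behind that debate result: Bradley-Terry models assign each model a latent strength, and head-to-head win probability depends only on the two strengths. The leaderboard's exact parameterization is not published, so the Elo-style 400-point scale and the 1500-rated reference opponent below are illustrative assumptions, and bt_win_prob is a hypothetical helper.

    def bt_win_prob(rating_i: float, rating_j: float, scale: float = 400.0) -> float:
        # Bradley-Terry: P(i beats j) = p_i / (p_i + p_j), mapping ratings to
        # strengths via p = 10 ** (rating / scale). The 400-point scale is
        # borrowed from the Elo convention and is an assumption here.
        p_i = 10.0 ** (rating_i / scale)
        p_j = 10.0 ** (rating_j / scale)
        return p_i / (p_i + p_j)

    # Claude Sonnet 4.6 (1617.5) vs. a hypothetical 1500-rated opponent:
    print(round(bt_win_prob(1617.5, 1500.0), 3))  # ~0.663

Under those assumptions, a 1617.5 rating implies winning about two-thirds of debates against a 1500-rated opponent.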

View on AI Benchmark Hub
