Mikhail Doroshenko

March 30, 2026

AI Benchmark Digest — 2026-03-30

Last 24 Hours

No significant benchmark changes in the last 24 hours.


Last 7 Days

The past week in AI benchmarking was dominated by the arrival of the GPT-5.4 and Gemini 3.1 families, which took top spots on leaderboards spanning reasoning, coding, and multimodal grounding.

  • New Reasoning and Logic Benchmarks: Several high-difficulty evaluations debuted this week. On MathArena - USAMO 2026, GPT-5.4 (xhigh) took a clear lead with 95.24% accuracy. SudokuBench tested multi-step logic across grid sizes, with GPT-5 High leading the 9x9 single-shot category (25.7%). On Multi-turn Debate (Lechmazur), Claude Sonnet 4.6 (high reasoning) ranked first with a Bradley-Terry rating of 1617.5.
  • Kaggle FACTS and Grounding: The new Kaggle FACTS series introduced rigorous testing for factual accuracy. Gemini 3.1 Pro Preview took the lead in Kaggle FACTS Search (85.56%), while GPT-5.2 topped the Kaggle FACTS Grounding pillar (76.17%).
  • GPT-5.4 and Gemini 3.1 Debuts: The leaderboard landscape shifted as gpt-5.4-2026-03-05-medium debuted at #1 on SWE-rebench (62.81%), and gemini-3.1-pro-preview secured #1 on SEAL - MultiChallenge (71.37%). Other notable entries include DeepSeek V3.2 taking #2 on MageBench S2 and claude-opus-4-6 landing at #2 on Chatbot Arena (Vision).
  • Major Leaderboard Upsets: GPT-5.4 Thinking (avg@5) set a new high on IUMB with a score of 135.4%, 35.4 points above the previous leader. In specialized tasks, Distyl ButtonAgent (31.2%) overtook GPT-5.2 on Tau3-Bench Banking_Knowledge, and Nanonets OCR-3 (83.1%) claimed the top spot on the IDP Leaderboard.
  • Humanity's Last Exam: gpt-5.4-pro-2026-03-05 took the top spot on SEAL - Humanity's Last Exam with 44.32%, displacing the previous leader, gemini-3-pro-preview, by 6.8 points.
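
For readers unfamiliar with the Bradley-Terry rating cited for Multi-turn Debate above, here is a minimal sketch of how such ratings can be fitted from pairwise results. It is an illustration only, not the leaderboard's actual method: the input format, the function names, the toy data, and the Elo-like 1500/400 display scale are all assumptions.

```python
import math


def fit_bradley_terry(wins, iterations=200):
    """Estimate Bradley-Terry strengths from pairwise win counts.

    `wins` maps (winner, loser) -> number of wins (hypothetical format).
    Uses the classic iterative (minorization-maximization) update.
    Assumes every player has at least one win; otherwise a strength
    would collapse to zero and the log-based rescaling would fail.
    """
    players = sorted({name for pair in wins for name in pair})
    strength = {name: 1.0 for name in players}
    for _ in range(iterations):
        updated = {}
        for p in players:
            # Total wins recorded for player p.
            total_wins = sum(n for (winner, _), n in wins.items() if winner == p)
            # Sum over opponents of games_played / (strength_p + strength_q).
            denom = 0.0
            for q in players:
                if q == p:
                    continue
                games = wins.get((p, q), 0) + wins.get((q, p), 0)
                denom += games / (strength[p] + strength[q])
            updated[p] = total_wins / denom if denom > 0 else strength[p]
        # Strengths are only defined up to a constant factor, so rescale
        # them to a geometric mean of 1 after every pass.
        log_mean = sum(math.log(s) for s in updated.values()) / len(updated)
        scale = math.exp(log_mean)
        strength = {name: s / scale for name, s in updated.items()}
    return strength


def win_probability(strength, a, b):
    """Model probability that `a` beats `b`."""
    return strength[a] / (strength[a] + strength[b])


def to_rating(strength, name, base=1500.0, scale=400.0):
    """Map a strength to an Elo-like display rating (assumed scale)."""
    return base + scale * math.log10(strength[name])


# Toy data, invented for illustration: model_a beat model_b 7 times out of 10.
toy_wins = {("model_a", "model_b"): 7, ("model_b", "model_a"): 3}
strengths = fit_bradley_terry(toy_wins)
p_a = win_probability(strengths, "model_a", "model_b")
rating_a = to_rating(strengths, "model_a")
rating_b = to_rating(strengths, "model_b")
```

With only two players, the fitted win probability simply reproduces the observed win rate. The model becomes more informative once many players are connected through overlapping matchups, since it can then rank pairs that never met directly.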

