Mikhail Doroshenko

AI Benchmark Digest — 2026-03-25

Last 24 Hours

The last 24 hours in AI benchmarking have been dominated by the rollout of the Kaggle FACTS suite and a shakeup in visual tool manipulation. Here is the breakdown:

  • The Kaggle FACTS suite has debuted, introducing four distinct evaluations for grounding and retrieval. Gemini 3.1 Pro Preview has claimed early dominance, leading both the Kaggle FACTS Search (85.56%) and Kaggle FACTS Parametric (78.93%) leaderboards.
  • A new leader emerged on the SEAL - VisualToolBench (VTB) leaderboard: gpt-5.4-2026-03-05 (reasoning effort = high) took the top spot with a score of 29.17, ousting the previous leader by +2.32 points.
  • High-reasoning models are flooding the visual tool space. Close behind the new leader on SEAL - VisualToolBench (VTB), gemini-3.1-pro-preview and claude-opus-4-6-thinking made strong debuts, placing #2 (28.97) and #3 (27.52), respectively.
  • GLM 5 Turbo made a near-perfect debut on the AA TAU-2 Bench, securing the #2 position with a 98.5% accuracy score.
  • Grounding and multimodality gaps are widening. While GPT-5.2 leads Kaggle FACTS Grounding at 76.17%, the Kaggle FACTS Multimodal benchmark remains a challenge for the field, with Gemini 2.5 Pro leading at a much lower 46.86%.

Last 7 Days

The AI benchmark landscape saw a massive shakeup this week as OpenAI's latest iterations dominated the leaderboards, while Google and specialized open-source models carved out significant wins in grounding and coding.

  • OpenAI’s GPT-5.4 Surge: The new gpt-5.4-pro-2026-03-05 has effectively staged a "clean sweep" of the SEAL evaluations, taking the #1 spot in EnigmaEval (23.82), MultiChallenge (69.23), TutorBench (56.62), and the notoriously difficult Humanity's Last Exam (44.32). Additionally, gpt-5.4-2026-03-05-medium seized the lead on SWE-rebench with a 62.81% resolution rate, edging out previous versions by over 5 points.
  • Google’s Grounding Dominance: Gemini 3.1 Pro Preview made a strong debut, immediately taking the lead on the Kaggle FACTS (Google) average score (67.71%). It specifically excelled in the new Kaggle FACTS Search (85.56%) and Kaggle FACTS Parametric (78.93%) benchmarks, which test a model's ability to retrieve and verify information accurately.
  • New Industry-Specific Benchmarks: A suite of "Tau" agentic benchmarks arrived to test real-world task automation. Qwen3.5-397B-A17B dominated the Tau3-Bench Telecom (97.8%) and Retail (84.4%) tracks, while Claude Opus 4.5 took the lead on Tau3-Bench Airline with an 84.0% Pass@1 rate.
  • Open-Source and Specialized Wins: The Open LLM Leaderboard has a new king in calme-3.2-instruct-78b, which jumped to #1 with a 52.08 average score. In the coding and infrastructure space, kimi-k2.5 successfully took the lead on Multi-Docker-Eval with a 41.82% resolution rate, while Nanonets OCR-3 reclaimed the top spot on the IDP Leaderboard (83.1%).
  • Reasoning and Logic Milestones: High-reasoning models are pushing boundaries in niche logic tests. gpt-5.4-pro-2026-03-05_xhigh took the lead in Chess Puzzles (Epoch AI) with 58.6% accuracy, while Claude Sonnet 4.6 established itself as the premier debater.

View on AI Benchmark Hub
