Mikhail Doroshenko

March 25, 2026

AI Benchmark Digest — 2026-03-25

Last 24 Hours

It has been a high-velocity 24 hours in the evaluation space, with next-generation iterations from OpenAI and Google dominating the leaderboards. Here is the breakdown:

  • Kaggle FACTS Suite Debuts: Four new benchmarks from the Kaggle FACTS Grounding series have launched to test model reliability. GPT-5.2 currently leads the Kaggle FACTS Grounding test with a 76.17% score, while Gemini 3.1 Pro Preview has established dominance in Kaggle FACTS Search (85.56%) and Kaggle FACTS Parametric (78.93%).
  • OpenAI Reclaims the Visual Tooling Crown: The new gpt-5.4-2026-03-05 (reasoning effort = high) has seized the #1 spot on SEAL - VisualToolBench (VTB) with a score of 29.17. This represents a significant +2.32 lead over the previous leader, gemini-3-pro-preview.
  • Video Generation Shakeup: A new model, libra, made a massive debut on the Design Arena (Video) leaderboard. With an Elo of 1341.0, it successfully dethroned grok-imagine-video by a margin of 15 points.
  • Reasoning and Logic Gains: The claude-opus-4-6-thinking model entered the top 3 on SEAL - VisualToolBench (VTB) (27.52), while GLM-5-Turbo posted a near-perfect 98.5% accuracy on the AA TAU-2 Bench, landing at #2.
  • Multimodal Excellence: Gemini 2.5 Pro has set the pace for the new Kaggle FACTS Multimodal benchmark, leading a field of 21 models with a score of 46.86%.
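The video-arena results above are reported as Elo ratings, where a point gap translates into an expected head-to-head win rate. As a rough guide to what a 15-point margin means, here is a minimal sketch of the standard Elo expectation formula (whether Design Arena uses this exact formula, and grok-imagine-video's implied 1326.0 rating, are assumptions):

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score for player A against player B.

    A 400-point gap corresponds to ~10:1 expected odds.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

# libra (1341.0) vs. an assumed 1326.0 for grok-imagine-video:
# a 15-point Elo edge corresponds to only about a 52% expected win rate.
print(round(elo_expected_score(1341.0, 1326.0), 4))  # → 0.5216
```

In other words, a 15-point Elo lead is statistically meaningful on a large vote sample but still implies nearly even odds in any single matchup.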

Last 7 Days

The AI benchmark landscape saw a massive shakeup this week as OpenAI’s latest iterations dominated high-reasoning tasks, while Google’s newest previews reclaimed the lead in grounding and search.

  • New Specialized Benchmarks: A wave of industry-specific and grounding evaluations arrived. Tau3-Bench debuted with sub-tests for Airline, Retail, Telecom, and Banking Knowledge, testing agentic performance across sectors. Additionally, the Kaggle FACTS suite was introduced to measure Grounding, Multimodal, and Search capabilities, providing a rigorous look at factual consistency.
  • OpenAI’s GPT-5.4 Surge: The gpt-5.4-pro-2026-03-05 family made a dominant debut, sweeping the SEAL leaderboards. It took the #1 spot in EnigmaEval (23.82), MultiChallenge (69.23), and the notoriously difficult Humanity's Last Exam (44.32). Meanwhile, gpt-5.4-2026-03-05-medium claimed the throne on SWE-rebench with a 62.81% resolution rate, edging out previous versions by over 5%.
  • Google’s Gemini 3.1 Pro Preview: Google’s latest Gemini 3.1 Pro Preview showed exceptional strength in information retrieval and grounding, taking the lead on Kaggle FACTS (Google) with a 67.71% average score and dominating the Kaggle FACTS Search sub-test at 85.56%. It also secured a strong #2 spot on SWE-rebench (62.32%).
  • Open-Source and Niche Victories: The Open LLM Leaderboard saw a new champion in calme-3.2-instruct-78b, which jumped to #1 with an average score of 52.08. In specialized coding and vision tasks, kimi-k2.5 took the lead on Multi-Docker-Eval (41.82%), and Nanonets OCR-3 established itself as the top model on the IDP Leaderboard with an 83.1% average.
  • Reasoning and Creative Shifts: Claude Sonnet 4.6 (high reasoning) took the lead in the Multi-turn Debate (Lechmazur) with a 1617.5 rating, while libra debuted at #1 on the Design Arena (Video) leaderboard with a 1341.0 Elo.

View on AI Benchmark Hub
