Mikhail Doroshenko

March 28, 2026

AI Benchmark Digest — 2026-03-28

Last 24 Hours

Here is your daily update on the AI benchmark landscape:

  • DeepSeek Continues Momentum: The newly released DeepSeek V3.2 made a strong debut on MageBench S2, securing the #2 spot with a rating of 1616.0 and reinforcing the model's competitive edge in specialized reasoning and coding tasks.
  • Claude Reclaims the Throne: In a significant shift for agentic reliability, claude-opus-4.6 has seized the lead on PinchBench with a 93.3% Success Rate, ousting gpt-5.4 from the top position by a delta of +2.83%.
  • Stable Benchmark Environment: No new benchmarks were introduced in the last 24 hours, leaving the focus entirely on the shifting rankings among top-tier frontier models.

Last 7 Days

The AI benchmark landscape saw a massive shakeup this week as OpenAI’s GPT-5.4 and Google’s Gemini 3.1 Pro Preview went head-to-head across several new reasoning and grounding evaluations.

  • Reasoning and Logic Benchmarks: A suite of new tests debuted to push high-reasoning models to their limits. SudokuBench was introduced to test multi-step logic, with GPT-5 High dominating the 9x9 single-shot category at a 25.7% solve rate. Meanwhile, the ARC-AGI-3 benchmark remains a formidable wall: GPT-5.4 (High) leads, but with an accuracy of just 0.26%.
  • The Rise of Gemini 3.1: Google’s Gemini 3.1 Pro Preview made a powerful entrance, seizing the #1 spot on Kaggle FACTS (Google) with a 67.71% average score and taking the lead on SEAL - MultiChallenge (71.37). Its specialized variant, Gemini 3.1 Flash Live Preview (Thinking), also claimed the top spot in SEAL - AudioMultiChallenge - Audio Output with a score of 36.06.
  • GPT-5.4 Dominates SEAL and Coding: OpenAI's gpt-5.4-pro-2026-03-05 swept the SEAL leaderboards, taking the lead in EnigmaEval (23.82), TutorBench (56.62), and the grueling Humanity's Last Exam (44.32). Additionally, gpt-5.4-2026-03-05-medium reclaimed the top spot on SWE-rebench, resolving 62.81% of issues.
  • Specialized Agent Success: New niche leaders emerged as Distyl ButtonAgent took the lead on Tau3-Bench Banking_Knowledge with a 31.2% Pass@1 (see the sketch after this list), while Nanonets OCR-3 moved to #1 on the IDP Leaderboard with an 83.1% average score.
  • Grounding and Search: The new Kaggle FACTS suite provided a granular look at model reliability. GPT-5.2 leads in Grounding (76.17%), while Gemini 3.1 Pro Preview leads in Search (85.56%) and Gemini 2.5 Pro tops Parametric knowledge (63.…%).
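
Several results above are reported as Pass@1. This digest doesn't define the metric, but a common way to compute it is the unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021). The Python sketch below is a minimal illustration only; the sample counts (n=1000, c=312) are hypothetical, chosen to reproduce the 31.2% figure cited for Distyl ButtonAgent, and say nothing about how Tau3-Bench actually runs its evaluation.

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k estimator (Chen et al., 2021).

        n: total samples generated per task
        c: number of those samples that passed
        k: the k in pass@k
        """
        if n - c < k:
            return 1.0  # fewer failures than draws: a pass is guaranteed
        # 1 minus the probability that all k drawn samples are failures
        return 1.0 - comb(n - c, k) / comb(n, k)

    # Hypothetical: 312 passing runs out of 1,000 attempts.
    # For k=1 this reduces to the empirical success rate.
    print(f"{pass_at_k(n=1000, c=312, k=1):.1%}")  # -> 31.2%

For k=1 the estimator collapses to c/n, so a 31.2% Pass@1 simply means the agent succeeded on its first attempt 31.2% of the time on average.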

View on AI Benchmark Hub
