Mikhail Doroshenko

March 31, 2026

AI Benchmark Digest — 2026-03-31

Last 24 Hours

Here is your daily update on the AI benchmark landscape:

  • Strategic Dominance in Game Theory: The Kaggle Game Arena Four in a Row leaderboard saw a massive shakeup as Gemini 3.1 Pro Preview debuted at #1 with an Elo of 676.1, dethroning the previous leader, Gemini 3 Pro Preview, by a staggering margin of +199.1 points.
  • Next-Gen Model Sightings: Two highly anticipated models made their first appearances on the leaderboards. GPT-5.4 entered the Kaggle Game Arena Four in a Row at #2 (585.38 Elo), while Claude Opus 4.6 debuted at #2 on the Kaggle Game Arena Poker (Heads Up) with a Mean BB/100 of 31.88.
  • Engineering Shifts: Claude Sonnet 4 claimed the top spot on the Kaggle WWTP Engineering benchmark with a score of 84.62%, edging out Claude Sonnet 4.5.
  • Crowd-Augmented Success: On ForecastBench, Gemini-3-Pro-Preview (zero shot with crowd forecast) took the lead with an overall score of 67.9, surpassing the previous leader, Cassi ensemble_2_crowdadj.
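For context on the poker metric above: Mean BB/100 is the average number of big blinds a player wins per 100 hands, the standard win-rate measure in poker. A minimal sketch of the arithmetic (the winnings and hand counts below are invented for illustration; the leaderboard's actual totals are not published here):

```python
def mean_bb_per_100(winnings_in_bb: float, hands: int) -> float:
    """Win rate: big blinds won per 100 hands played."""
    return winnings_in_bb / hands * 100

# Illustrative only: netting 638 big blinds over 2,000 hands
# works out to 31.9 BB/100, roughly the figure cited above.
print(round(mean_bb_per_100(638.0, 2000), 2))  # 31.9
```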

Last 7 Days

It has been a high-octane week for AI evaluations, characterized by the arrival of the GPT-5.4 family and a surge in specialized reasoning and grounding benchmarks.

Here is the breakdown of the last 7 days in AI benchmarks:

  • Reasoning and Logic Benchmarks Explode: A suite of new tests arrived to challenge "thinking" models. MathArena - Usamo 2026 saw GPT-5.4 (xhigh) dominate with a 95.24% accuracy. Meanwhile, SudokuBench tested multi-step logic across various grid sizes; GPT-5 High led the 4x4 multi-step category (80.0%), but performance plummeted across the board on 6x6 grids, with GPT-5 Med leading at just 13.3%.
  • The Rise of Kaggle FACTS: Four new grounding and factuality benchmarks were introduced. Gemini 3.1 Pro Preview took the top spot in Kaggle FACTS Search (85.56%), while Gemini 2.5 Pro led both the Multimodal (46.86%) and Parametric (63.21%) tiers. GPT-5.2 secured the lead in Grounding with 76.17%.
  • Gemini 3.1 Pro Preview Dominates SEAL: The new gemini-3.1-pro-preview made a massive splash, taking the lead on SEAL - MultiChallenge with a score of 71.37 (ousting gpt-5.4-pro-2026-03-05). It also secured #2 spots on SEAL - EnigmaEval and SEAL - MultiNRC.
  • Major Leaderboard Shifts: Several established leaders were dethroned this week. claude-opus-4.6 took the lead in PinchBench with a 93.3% success rate. In the agentic space, Distyl ButtonAgent claimed #1 on Tau3-Bench Banking_Knowledge (31.2%), while gemini-3.1-flash-live-preview (Thinking) surged to the top of SEAL - AudioMultiChallenge - Audio Output (36.06).
  • Arena Updates: The Chatbot Arena (Vision) saw claude-opus-4-6 debut at #2 with a 1284.0 score. In the gaming sector, Gemini 3.1 Pro Preview crushed the Kaggle Game Arena Four in a Row leaderboard with a 676.1 Elo, a nearly 200-point jump over the previous leader.
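To put that ~200-point Elo jump in perspective, the standard Elo expected-score formula converts a rating gap into an expected per-game result. This is the generic textbook formula, not anything specific to the Kaggle Game Arena's own rating implementation:

```python
def elo_expected_score(rating_gap: float) -> float:
    """Expected score (win probability, counting draws as half-wins)
    for the higher-rated player, given its rating advantage."""
    return 1.0 / (1.0 + 10 ** (-rating_gap / 400))

# The +199.1-point margin reported above implies roughly a 76%
# expected score per game for the new leader.
print(round(elo_expected_score(199.1), 2))  # 0.76
```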

View on AI Benchmark Hub
