AI Benchmark Digest — 2026-03-31
Last 24 Hours
Here is your daily update on the AI benchmark landscape:
- Strategic Dominance in Game Theory: The Kaggle Game Arena Four in a Row leaderboard saw a major shakeup as Gemini 3.1 Pro Preview debuted at #1 with a 676.1 Elo, dethroning the previous leader, Gemini 3 Pro Preview, by a margin of +199.1 points.
- Next-Gen Model Sightings: Two highly anticipated models made their first appearances on the leaderboards. GPT-5.4 entered the Kaggle Game Arena Four in a Row at #2 (585.38 Elo), while Claude Opus 4.6 debuted at #2 on the Kaggle Game Arena Poker (Heads Up) with a 31.88 Mean BB/100.
- Engineering and Forecasting Shifts: Claude Sonnet 4 claimed the top spot on the Kaggle WWTP Engineering benchmark with a score of 84.62%, edging out Claude Sonnet 4.5.
- Crowd-Augmented Success: On ForecastBench, Gemini-3-Pro-Preview (zero shot with crowd forecast) took the lead with an overall score of 67.9, surpassing the previous leader, Cassi ensemble_2_crowdadj.
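The rankings above mix two scoring systems: chess-style Elo ratings for the game arenas and mean big blinds won per 100 hands (BB/100) for heads-up poker. As a rough illustration of what these numbers mean — assuming the standard Elo expected-score formula, which Kaggle's actual rating method may not match exactly — here is a minimal sketch:

```python
# Sketches of the two metrics cited above. The Elo expected-score formula
# is the standard one; Kaggle's exact rating system may differ.

def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score (win probability, ignoring draws) for player A."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def mean_bb_per_100(total_big_blinds_won: float, hands_played: int) -> float:
    """Poker win rate: big blinds won per 100 hands played."""
    return total_big_blinds_won / hands_played * 100.0

if __name__ == "__main__":
    # A +199.1 Elo gap (676.1 vs. 477.0) implies roughly a 76% expected score
    # for the higher-rated model in a head-to-head game.
    print(f"{elo_expected_score(676.1, 477.0):.2f}")  # → 0.76
    # Hypothetical example: winning 319 big blinds over 1,000 hands
    # is a 31.9 BB/100 win rate, comparable to the poker figure above.
    print(f"{mean_bb_per_100(319, 1000):.1f}")  # → 31.9
```

Under this formula every 400 points of rating difference multiplies the expected-score odds by 10, so a ~200-point jump is a very large margin for a debut.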
Last 7 Days
It has been a high-octane week for AI evaluations, marked by the arrival of the GPT-5.4 family and a surge in specialized reasoning and grounding benchmarks.
Here is the breakdown of the last 7 days:
- Reasoning and Logic Benchmarks Explode: A suite of new tests arrived to challenge "thinking" models. MathArena - USAMO 2026 saw GPT-5.4 (xhigh) dominate with 95.24% accuracy. Meanwhile, SudokuBench tested multi-step logic across various grid sizes: GPT-5 High led the 4x4 multi-step category (80.0%), but performance plummeted across the board on 6x6 grids, with GPT-5 Med leading at just 13.3%.
- The Rise of Kaggle FACTS: Four new grounding and factuality benchmarks were introduced. Gemini 3.1 Pro Preview took the top spot in Kaggle FACTS Search (85.56%), while Gemini 2.5 Pro led both the Multimodal (46.86%) and Parametric (63.21%) tiers. GPT-5.2 secured the lead in Grounding with 76.17%.
- Gemini 3.1 Pro Preview Dominates SEAL: The new gemini-3.1-pro-preview made a massive splash, taking the lead on SEAL - MultiChallenge with a score of 71.37 (ousting gpt-5.4-pro-2026-03-05). It also secured #2 spots on SEAL - EnigmaEval and SEAL - MultiNRC.
- Major Leaderboard Shifts: Several established leaders were dethroned this week. claude-opus-4.6 took the lead in PinchBench with a 93.3% success rate. In the agentic space, Distyl ButtonAgent claimed #1 on Tau3-Bench Banking_Knowledge (31.2%), while gemini-3.1-flash-live-preview (Thinking) surged to the top of SEAL - AudioMultiChallenge - Audio Output (36.06).
- Arena Updates: The Chatbot Arena (Vision) saw claude-opus-4-6 debut at #2 with a 1284.0 score. In the gaming sector, Gemini 3.1 Pro Preview crushed the Kaggle Game Arena Four in a Row leaderboard with a 676.1 Elo, a nearly 200-point jump over the previous leader.
Don't miss what's next. Subscribe to Mikhail Doroshenko.