AI Benchmark Digest — 2026-03-29
Last 24 Hours
The AI benchmark landscape saw significant movement in the last 24 hours, marked by a massive leap in reasoning capabilities and an upset in creative-rendering leadership.
- A New Peak in Reasoning: The IUMB leaderboard has a new champion. GPT-5.4 Thinking (avg@5) debuted at #1 with a score of 135.4%, shattering the previous record held by GPT-5.2 (xhigh) by +35.4 points (avg@k scoring is sketched after this list).
- MathArena Expansion: The MathArena - Usamo 2026 benchmark was introduced to test elite-level competitive mathematics. GPT-5.4 (xhigh) immediately claimed the top spot among 6 models with a near-perfect 95.24% accuracy.
- Gemini Claims ASCII Crown: In a rare upset for creative rendering, gemini-3-flash-preview took the lead on ASCIIBench with an Elo rating of 1653.0, unseating the long-standing leader claude-opus-4.1 (the Elo update rule is sketched after this list).
- Engineering Optimization: Claude Sonnet 4.5 has officially overtaken its predecessor on the Kaggle WWTP Engineering leaderboard, securing the #1 position with a score of 84.62%.
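For readers unfamiliar with the avg@5 notation on IUMB, it simply denotes the mean score over five independent runs rather than a single attempt. Here is a minimal sketch; the function name and the per-run values are illustrative assumptions (the run-level data is not published in this digest), chosen so the example lands on the reported 135.4:

```python
# Hypothetical avg@k helper -- IUMB's real harness is not public here,
# so the function name and run values below are assumptions.
from statistics import mean

def avg_at_k(run_scores: list[float]) -> float:
    """Return the mean score over k independent runs of a benchmark."""
    return mean(run_scores)

# Five illustrative IUMB run scores averaging to the reported 135.4
print(avg_at_k([133.0, 136.5, 135.2, 134.8, 137.5]))  # -> 135.4
```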
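ASCIIBench's Elo rating moves through pairwise comparisons, much like chess ratings. The sketch below is the standard Elo update rule; the K-factor of 32 and the head-to-head pairing are assumptions, since ASCIIBench's exact configuration is not stated in this digest:

```python
# Standard Elo update for one head-to-head comparison between two models.
# K=32 is a common default, assumed here; ASCIIBench's actual K is unknown.
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Return both ratings after one comparison with a binary outcome."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

# Illustrative: a 1640-rated challenger beats a 1653-rated incumbent
print(elo_update(1640.0, 1653.0, a_won=True))  # challenger gains ~16.6 points
```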
Last 7 Days
The last seven days in AI benchmarking have been dominated by the arrival of the GPT-5.4 and Gemini 3.1 families, which are aggressively rewriting the leaderboards across reasoning, coding, and multimodal tasks.
Here are the key highlights:
- Reasoning and Logic Benchmarks Expand: Several high-difficulty benchmarks debuted this week. MathArena - Usamo 2026 saw GPT-5.4 (xhigh) take a dominant lead with 95.24% accuracy. Meanwhile, the new SudokuBench suite tested multi-step logic, where GPT-5 High led the 9x9 single-shot category (25.7%) and GPT-5 Med swept the 4x4 and 6x6 variants.
- GPT-5.4 Dominates SEAL and Coding: The new gpt-5.4-pro-2026-03-05 made a massive impact, seizing the top spot on SEAL - EnigmaEval (23.82), SEAL - TutorBench (56.62), and the grueling SEAL - Humanity's Last Exam (44.32). In software engineering, gpt-5.4-2026-03-05-medium claimed #1 on SWE-rebench with 62.81% of issues resolved.
- Gemini 3.1 Pro Preview Surges: Google’s latest iteration, Gemini 3.1 Pro Preview, took the lead on SEAL - MultiChallenge (71.37) and the new Kaggle FACTS (Google) (67.71%). Elsewhere in the family, gemini-3.1-flash-live-preview (Thinking) established a new ceiling for SEAL - AudioMultiChallenge - Audio Output with a score of 36.06.
- Anthropic and Specialized Models Advance: Claude Sonnet 4.6 emerged as the leader in Multi-turn Debate (Lechmazur) with a 1617.5 Bradley-Terry rating (see the sketch after this list), while claude-opus-4.6 overtook GPT-5.4 on PinchBench with a 93.3% success rate. In specialized domains, Nanonets OCR-3 took the lead on the IDP Leaderboard (83.1%), and Distyl ButtonAgent claimed #1 on Tau3-Bench Banking_Knowledge (31.2%).
- Frontier Reasoning Breakthrough: The IUMB benchmark saw a massive jump as GPT-5.4 Thinking (avg@5) debuted at #1 with a score of 135.4%, a +35.4 point improvement over the previous record set by GPT-5.2 (xhigh).
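The Bradley-Terry rating behind the Multi-turn Debate leaderboard works much like Elo: each model is assigned a latent strength, and pairwise outcomes are modeled as a logistic function of the strength gap. A minimal sketch of the implied win probability, assuming an Elo-style base-10 scale of 400 (Lechmazur's actual parameterization is not given in this digest, and the 1500-rated opponent is hypothetical):

```python
# Bradley-Terry win probability: P(i beats j) = s_i / (s_i + s_j).
# Expressing strengths as ratings on a base-10 logistic scale turns this
# into a sigmoid of the rating gap. The scale of 400 is an assumption
# borrowed from Elo conventions, not a published leaderboard detail.
def bt_win_prob(rating_i: float, rating_j: float, scale: float = 400.0) -> float:
    """P(model i beats model j) under a Bradley-Terry model."""
    return 1.0 / (1.0 + 10 ** ((rating_j - rating_i) / scale))

# Illustrative: Claude Sonnet 4.6 (1617.5) vs. a hypothetical 1500-rated model
print(f"{bt_win_prob(1617.5, 1500.0):.3f}")  # -> 0.663
```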