AI Benchmark Digest — 2026-03-27
Last 24 Hours
Here is your summary of AI benchmark activity over the last 24 hours:
- Audio Intelligence Breakthrough: gemini-3.1-flash-live-preview (Thinking) debuted at #1 on the SEAL - AudioMultiChallenge - Audio Output benchmark with a score of 36.06, unseating gpt-realtime-1.5 by a margin of +1.33.
- New Leader in Financial Agents: The Distyl ButtonAgent surged to the top of the Tau3-Bench Banking_Knowledge leaderboard with a 31.2 Pass@1 (%) score, overtaking the previous leader, GPT-5.2, by a substantial +5.7. Notably, the new GPT-5.4 also debuted on this benchmark, listed at #2 with a matching 31.2 score.
- Vision and Search Rankings Shift: High-profile updates hit the Arena leaderboards, with claude-opus-4-6 capturing #2 on Chatbot Arena (Vision) with a 1284.0 score. Simultaneously, gemini-3.1-pro-grounding took the #2 position on Chatbot Arena (Search) with a 1219.0 score, signaling intensified competition in grounded retrieval.
- Specialized Benchmark Debuts: Several models made strong entries into niche evaluations, including MiniMax M2.5 (medium) taking #2 on MageBench S2 (1616.0 Rating) and Gemini 3 Flash Preview securing #2 on Kaggle FACTS Search with an 81.04% score. Additionally, foresight-v3 entered the ProphetArena at #3 with a 0.9315 Brier-based score.
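ProphetArena's entry above is quoted as a "Brier-based score." For context, a raw Brier score is the mean squared error between forecast probabilities and binary outcomes, where lower is better; ProphetArena's published number (0.9315, reading as higher-is-better) evidently applies its own aggregation or rescaling, so the sketch below illustrates only the underlying metric, not the leaderboard's exact formula.

```python
def brier_score(probs, outcomes):
    """Raw Brier score: mean squared error between predicted
    probabilities and binary outcomes (0.0 is perfect, 1.0 is worst)."""
    assert len(probs) == len(outcomes)
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

# Hypothetical forecasts for three yes/no events
forecasts = [0.9, 0.2, 0.7]  # predicted probability the event happens
results = [1, 0, 1]          # what actually happened
print(round(brier_score(forecasts, results), 4))  # → 0.0467
```

A well-calibrated forecaster drives this number toward zero; a leaderboard that rewards higher values has necessarily inverted or transformed it.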
Last 7 Days
The last seven days in AI benchmarking have seen a massive reshuffling of the leaderboards, dominated by the emergence of the GPT-5.4 family and Gemini 3.1 Pro Preview.
- New Reasoning and Factuality Benchmarks: Several high-difficulty evaluations debuted this week. SudokuBench tests multi-step logic, where GPT-5 High leads the 9x9 single-shot category with a 25.7% solve rate. The Kaggle FACTS suite now measures grounding and parametric memory, with Gemini 3.1 Pro Preview dominating Search (85.56%) and Gemini 2.5 Pro leading Grounding (76.17%). Additionally, ARC-AGI-3 debuted as a peak-difficulty test, with GPT-5.4 (High) currently leading at a modest 0.26% accuracy.
- GPT-5.4 and Gemini 3.1 Dominance: New models have immediately claimed top spots across diverse domains. gpt-5.4-pro-2026-03-05 took #1 on SEAL - Humanity's Last Exam (44.32) and SEAL - EnigmaEval (23.82). Meanwhile, Gemini 3.1 Pro Preview secured #1 on SEAL - MultiChallenge (71.37) and Kaggle FACTS (Google) (67.71%).
- Major Leaderboard Shifts in Coding and Logic: gpt-5.4-2026-03-05-medium seized the lead on SWE-rebench with 62.81% resolved, a significant +5.59 jump over the previous leader. In the specialized Multi-Docker-Eval, kimi-k2.5 took the top spot with 41.82%, while gpt-5.4-pro-2026-03-05_xhigh outperformed the field in Chess Puzzles (Epoch AI) with 58.6% accuracy.
- Specialized Agent and Vision Breakthroughs: Distyl ButtonAgent claimed the lead in Tau3-Bench Banking_Knowledge (31.2%), while Nanonets OCR-3 moved to #1 on the IDP Leaderboard (83.1). In creative evaluations, claude-opus-4-6-thinking took the lead in Design Arena (Data Viz) with a 1373.0 Elo.
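Several of the figures above are Elo-style ratings (1284.0, 1219.0, 1616.0, 1373.0). A rating gap translates into an expected head-to-head score via the standard Elo formula; the sketch below uses hypothetical ratings, since ratings are only comparable within a single leaderboard, not across Arenas.

```python
def elo_expected_score(r_a, r_b):
    """Expected score for player A vs player B under the standard
    Elo model (win probability plus half the draw probability)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

# Hypothetical: a 100-point rating gap on the same leaderboard
print(round(elo_expected_score(1300, 1200), 3))  # → 0.64
```

So a 100-point lead corresponds to winning roughly 64% of head-to-head matchups, which is why even the modest rating deltas in these Arena standings represent meaningful preference gaps.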
Don't miss what's next. Subscribe to Mikhail Doroshenko.