AI Benchmark Digest — 2026-03-26
Last 24 Hours
The last 24 hours in AI benchmarking have been dominated by a high-stakes shuffle at the top of the leaderboards, featuring a strong debut from Google's latest preview and a major score correction that reshuffled the OCR rankings.
- Google’s New Contender: The gemini-3.1-pro-preview made a powerful entrance, securing the #1 spot on SEAL - MultiChallenge with a score of 71.37. It also claimed #2 placements on SEAL - EnigmaEval (19.76) and SEAL - MultiNRC (64.74).
- Multi-Task Leadership Shift: By taking the lead on SEAL - MultiChallenge, gemini-3.1-pro-preview unseated gpt-5.4-pro-2026-03-05 by a significant +2.14-point margin.
- OCR Breakthrough: Kimi K2.5 has seized the lead on LLM Stats (OCRBench) with a score of 92.3%, overtaking Qwen3 VL 235B A22B Instruct following a massive score correction.
- Video Generation Shakeup: In the creative space, grok-imagine-video (1329.0 Elo) climbed to the #1 position on Design Arena (Video), successfully displacing libra.
- Internal Google Rivalry: On the Kaggle FACTS Parametric benchmark, Gemini 2.5 Pro reclaimed the lead with a score of 63.21%, pushing Gemini 3.1 Pro Preview down to second place.
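For context on arena-style scores like the 1329.0 figure above: Design Arena's video ranking is Elo-based, and while its exact fitting procedure isn't public here, the standard Elo update illustrates how such ratings move after each pairwise comparison. This is a minimal sketch with a conventional K-factor of 32, not the arena's actual implementation.

```python
def elo_expected(r_a: float, r_b: float) -> float:
    """Expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Return (new_r_a, new_r_b) after one comparison.

    score_a is 1.0 for an A win, 0.0 for a loss, 0.5 for a tie.
    K=32 is a common default; real arenas tune K or fit ratings in batch.
    """
    e_a = elo_expected(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b

# Two evenly matched models split the expected score 50/50.
print(elo_expected(1329.0, 1329.0))  # → 0.5
```

Note that the update is zero-sum: whatever rating the winner gains, the loser gives up, which is why a single upset between close models shifts the board by only a few points.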
Last 7 Days
The last seven days in AI benchmarking have been dominated by the arrival of the GPT-5.4 family and Gemini 3.1 Pro Preview, driving a sweeping shift across reasoning and coding leaderboards.
- New Specialized Benchmarks: Several high-stakes evaluations debuted this week, most notably Kaggle FACTS, which tests grounding and search capabilities across four categories. SudokuBench was also introduced to stress-test multi-step logic, where GPT-5 High leads the 9x9 single-shot category with a 25.7% solve rate. Additionally, the Multi-turn Debate (Lechmazur) benchmark now ranks conversational reasoning, with Claude Sonnet 4.6 (high reasoning) taking the top spot with a 1617.5 Bradley-Terry rating.
- GPT-5.4 Dominance: The new gpt-5.4-pro-2026-03-05 and its variants have seized control of the SEAL suite, taking #1 spots in SEAL - EnigmaEval (23.82), SEAL - TutorBench (56.62), and the notoriously difficult SEAL - Humanity's Last Exam (44.32). In coding, gpt-5.4-2026-03-05-medium claimed the lead on SWE-rebench with 62.81% resolved, a significant +5.59 jump over the previous leader.
- Gemini 3.1 Pro Preview Emerges: Google’s latest Gemini 3.1 Pro Preview has proven to be a formidable challenger, taking the lead on SEAL - MultiChallenge with a score of 71.37 and topping the Kaggle FACTS (Google) leaderboard with a 67.71% average score. It also secured a strong #2 placement on SWE-rebench (62.32%).
- Kimi and Specialized Leaders: Kimi K2.5 made a splash by taking the lead on Multi-Docker-Eval (41.82% resolved) and LLM Stats (OCRBench) (92.3%). Meanwhile, Nanonets OCR-3 claimed the top position on the IDP Leaderboard with an 83.1% average score, and Famou-Agent 2.0 surged to the top of MLE-bench with a 64.44% medal rate.
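The Multi-turn Debate leaderboard above reports Bradley-Terry ratings rather than raw win rates. As a hedged sketch (not Lechmazur's actual pipeline), the Bradley-Terry model gives each model a strength p_i with P(i beats j) = p_i / (p_i + p_j), and strengths can be fit from a pairwise win matrix with the classic Zermelo/minorization-maximization iteration:

```python
def fit_bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths from pairwise win counts.

    wins[i][j] = number of times model i beat model j.
    Returns strengths normalized to sum to 1 (MM / Zermelo iteration).
    """
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            w_i = sum(wins[i])  # total wins for model i
            # Each pairing contributes its comparison count, weighted
            # by the current combined strength of the two models.
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            new_p.append(w_i / denom if denom else p[i])
        total = sum(new_p)
        p = [x / total for x in new_p]
    return p

# Toy example: model 0 beat model 1 in 8 of 10 debates.
p = fit_bradley_terry([[0, 8], [2, 0]])
print(round(p[0], 2))  # → 0.8
```

Leaderboards typically map these strengths onto an Elo-like scale (e.g. via 400·log10(p_i) plus an offset), which is presumably how a headline number like 1617.5 is produced; the exact scaling here is an assumption.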