Mikhail Doroshenko

April 3, 2026

AI Benchmark Digest — 2026-04-03

Last 24 Hours

The last 24 hours in AI benchmarking have been dominated by the sudden arrival of "future-dated" model variants and a massive sweep by the Qwen family. Here are the key highlights:

  • Massive Expansion of Agentic and Specialized Benchmarks: A wave of high-stakes evaluations debuted, including SEAL - Agentic Tool Use (Enterprise), led by o1 (December 2024) with a score of 70.14, and HAL USACO, where GPT-5 Medium (August 2025) took the top spot with 69.71%. Other notable additions include React Native Evals and LiveOIBench, testing mobile development and human-level reasoning, respectively.
  • Qwen3.6 Plus Dominates LLM Stats: The new Qwen3.6 Plus has staged a near-total takeover of the LLM Stats suite. It seized the lead in 10 different categories, including LLM Stats (PolyMATH) (77.4%), LLM Stats (MCP Atlas) (74.1%), and LLM Stats (C-Eval) (93.3%), consistently outperforming previous Qwen3.5 and GPT-5 iterations.
  • GPT-5.4 and Gemini 3 Flash Make Strong Debuts: New high-tier models are climbing the ranks quickly. GPT-5.4 debuted at #1 on the Kaggle Game Arena Werewolf with an Equilibrium Rating of 0.00078, while Gemini 3 Flash Preview surged to the lead on Kaggle FACTS Parametric with a score of 72.26%, a +9.05-point jump over the previous leader.
  • Major Shifts in Coding and Reasoning Arenas: The competitive landscape shifted as claude-opus-4-6-thinking took the lead in Chatbot Arena (Code) with an Elo of 1546.0. Meanwhile, gpt-5.4-mini established dominance in strategic reasoning, taking #1 on GACL - Battleship with a 91.18 normalized score.

Last 7 Days

No significant benchmark changes in the last 7 days.

