Mikhail Doroshenko

May 27, 2026

AI Benchmark Digest — 2026-05-27

=== DAILY ===

NEW SCORES FROM TOP-10 MODELS (1)
- GPT-5.4 (xHigh) on Creative Writing (Lechmazur): 3.2 Mean Score (#2/25)

NEW #1 LEADERS (11)
- LLM Chess (Saplin) (ELO): GPT-5.5 (Medium) (1532.2) beat Gemini 3.1 Pro (1511.4) by 20.8
- LLM Stats (PolyMATH) (Score (%)): Qwen 3.7 Max (86.5) beat Qwen 3.6 Plus (77.4) by 9.1
- LLM Stats (MCP-Mark) (Score (%)): Qwen 3.7 Max (60.8) beat Kimi K2.6 (55.9) by 4.9
- LLM Stats (NL2Repo) (Score (%)): Qwen 3.7 Max (47.2) beat GLM-5.1 (42.7) by 4.5
- LLM Stats (MMLU-ProX) (Score (%)): Qwen 3.7 Max (87.0) beat Qwen 3.6 Plus (84.7) by 2.3
- LLM Stats (HMMT Feb 26) (Score (%)): Qwen 3.7 Max (97.1) beat DeepSeek V4 Pro (Max) (95.2) by 1.9
- LLM Stats (MAXIFE) (Score (%)): Qwen 3.7 Max (89.2) beat Qwen 3.6 Plus (88.2) by 1.0
- LLM Stats (Include) (Score (%)): Qwen 3.7 Max (86.2) beat Qwen 3.5 397B A17B (85.6) by 0.6
- LLM Stats (IMO-AnswerBench) (Score (%)): Qwen 3.7 Max (90.0) beat DeepSeek V4 Pro (Max) (89.8) by 0.2
- Creative Writing (Lechmazur) (Mean Score): GPT-5.5 (Thinking, xHigh) (3.2) beat GPT-5.5 (3.0) by 0.2
- LLM Stats (MMLU-Redux) (Score (%)): Qwen 3.7 Max (95.0) beat Qwen 3.5 397B A17B (94.9) by 0.1


