Mikhail Doroshenko

April 22, 2026

AI Benchmark Digest — 2026-04-22


=== DAILY ===

NEW BENCHMARKS (3)
- ReasonScape R12 (ReasonScore): leader Qwen3.5-397B-A17B (AWQ, 16k) Thinking (951.64), 67 models
- LLM Stats (DeepSearchQA) (Score (%)): leader Claude Opus 4.6 (91.3), 5 models
- LLM Stats (MCP-Mark) (Score (%)): leader Kimi K2.6 (55.9), 5 models

NEW MODELS (6)
- Ling 2.6 Flash: ELO 1658, #219/891 (above: Grok 4.1 Fast, below: Hermes 4 405B)
- JSL-MedMNX-7B: ELO 1343, #669/891 (above: Collaiborator-MEDLLM-Llama-3-8B-v2-6, below: Command R)
- Collaiborator-MEDLLM-Llama-3-8B-v2-1: ELO 1342, #673/891 (above: Collaiborator-MEDLLM-Llama-3-8B, below: Yi-1.5-9B)
- JSL-MedMNX-7B-SFT: ELO 1339, #678/891 (above: Llama-3-Orca-1.0-8B, below: Collaiborator-MEDLLM-Llama-3-8B-v2-4)
- BioLing-7B-Dare: ELO 1294, #748/891 (above: DeepHermes 3 - Llama-3.1 8B, below: LFM2.5-1.2B-Instruct)
- JSL-MedPhi2-2.7B: ELO 1244, #798/891 (above: Phi-3 Mini, below: Gemma 3n E2B)

NEW #1 LEADERS (6)
- Chatbot Arena (Text-to-Image) (Elo): gpt-image-2 (medium) (1512.0) beat gemini-3.1-flash-image-preview (nano-banana-2) [web-search] (1264.0) by 248.0
- Chatbot Arena (Image Edit) (Elo): gpt-image-2 (medium) (1513.0) beat chatgpt-image-latest-high-fidelity (20251216) (1392.0) by 121.0
- OSWorld (Success Rate (%)): Holo3-35B-A3B (82.56) beat Opus 4.5 (74.48) by 8.08
- Spider 2.0-DBT (Accuracy (%)): SignalPilot Agent (51.56) beat Databao Agent (44.11) by 7.45
- Design Arena (3D) (Elo): kimi-k2.6 (1381.0) beat claude-opus-4-6 (1376.0) by 5.0
- SEAL Showdown (Arena Score): gemini-3-pro-preview (1306.8) beat gpt-4o-audio-preview-2025-06-03 (1305.3) by 1.5


