AI Benchmark Digest — 2026-06-21

June 21, 2026

Daily
New Benchmarks (19)

Physical AI Bench - Understanding Overall (Overall Score (%)): Cosmos-Reason2-32B leads with 70.8 across 25 models.
  Physical AI Bench understanding track evaluating multimodal reasoning about physical scenarios across robotics, autonomous driving, space, time, and physics.
JMMMU-Pro - Overall (Accuracy (%)): Gemini 3 Pro leads with 87.045 across 14 models.
  JMMMU-Pro overall multimodal reasoning accuracy on Japanese cultural and academic questions.
JMMMU-Pro - Culture Specific (Accuracy (%)): Gemini 3 Pro leads with 95.0 across 14 models.
  JMMMU-Pro culture-specific accuracy on Japanese multimodal questions requiring local cultural knowledge.
JMMMU-Pro - Culture Agnostic (Accuracy (%)): Gemini 3 Pro leads with 80.417 across 14 models.
  JMMMU-Pro culture-agnostic accuracy on Japanese multimodal academic questions.
JMMMU-Pro - Japanese Art (Accuracy (%)): Gemini 3 Pro leads with 91.333 across 14 models.
  JMMMU-Pro Japanese art category accuracy.
JMMMU-Pro - Japanese Heritage (Accuracy (%)): Gemini 3 Pro leads with 96.667 across 14 models.
  JMMMU-Pro Japanese heritage category accuracy.
JMMMU-Pro - Japanese History (Accuracy (%)): Gemini 3 Pro leads with 95.333 across 14 models.
  JMMMU-Pro Japanese history category accuracy.
JMMMU-Pro - World History (Accuracy (%)): Gemini 3 Pro leads with 96.667 across 14 models.
  JMMMU-Pro world history category accuracy.
MMLongBench-Doc - Accuracy (Accuracy (%)): Claude 4.5 Opus leads with 61.9 across 19 models.
  MMLongBench-Doc accuracy for multimodal long-document understanding over lengthy document images and text.
OmniGAIA - Overall (Overall Accuracy (%)): Orchestra-o1-GPT-5 leads with 72.8 across 18 models.
  OmniGAIA overall accuracy on multimodal general-assistant questions spanning geography, technology, history, finance, sports, art, movies, science, and food.
OmniGAIA - Geo (Geography Accuracy (%)): Orchestra-o1-GPT-5 leads with 72.5 across 16 models.
  OmniGAIA geography category accuracy.
OmniGAIA - Tech (Technology Accuracy (%)): Orchestra-o1-GPT-5 leads with 69.4 across 16 models.
  OmniGAIA technology category accuracy.
OmniGAIA - History (History Accuracy (%)): Orchestra-o1-GPT-5 leads with 75.8 across 16 models.
  OmniGAIA history category accuracy.
OmniGAIA - Finance (Finance Accuracy (%)): Gemini-3-Pro leads with 72.0 across 16 models.
  OmniGAIA finance category accuracy.
OmniGAIA - Sport (Sport Accuracy (%)): Orchestra-o1-GPT-5 leads with 83.8 across 16 models.
  OmniGAIA sports category accuracy.
OmniGAIA - Art (Art Accuracy (%)): Orchestra-o1-GPT-5 leads with 63.9 across 16 models.
  OmniGAIA art category accuracy.
OmniGAIA - Movie (Movie Accuracy (%)): Orchestra-o1-GPT-5 leads with 69.7 across 16 models.
  OmniGAIA movie category accuracy.
OmniGAIA - Science (Science Accuracy (%)): Orchestra-o1-GPT-5 leads with 73.1 across 16 models.
  OmniGAIA science category accuracy.
OmniGAIA - Food (Food Accuracy (%)): Gemini-3-Pro leads with 88.9 across 16 models.
  OmniGAIA food category accuracy.

Weekly
New Benchmarks (39)

OpenAI GPT-5 System Card - HealthBench (Score (%)): GPT-5 (Thinking) leads with 67.2 across 7 models.
  OpenAI GPT-5 system-card benchmark for HealthBench.
OpenAI GPT-5 System Card - HealthBench Hard (Score (%)): GPT-5 (Thinking) leads with 46.2 across 7 models.
  OpenAI GPT-5 system-card benchmark for HealthBench Hard.
OpenAI GPT-5 System Card - HealthBench Consensus (Score (%)): GPT-5 Mini (Thinking) leads with 96.5 across 7 models.
  OpenAI GPT-5 system-card benchmark for HealthBench Consensus.
OpenAI GPT-5 System Card - MMLU Language Arabic (Accuracy): O3 (High) leads with 0.904 across 3 models.
  OpenAI GPT-5 system-card benchmark for MMLU Language Arabic.
OpenAI GPT-5 System Card - MMLU Language Bengali (Accuracy): GPT-5 (Thinking) leads with 0.892 across 3 models.
  OpenAI GPT-5 system-card benchmark for MMLU Language Bengali.
OpenAI GPT-5 System Card - MMLU Language Chinese (Accuracy): GPT-5 (Thinking) leads with 0.902 across 3 models.
  OpenAI GPT-5 system-card benchmark for MMLU Language Chinese.
OpenAI GPT-5 System Card - MMLU Language French (Accuracy): O3 (High) leads with 0.906 across 3 models.
  OpenAI GPT-5 system-card benchmark for MMLU Language French.
OpenAI GPT-5 System Card - MMLU Language German (Accuracy): O3 (High) leads with 0.905 across 3 models.
  OpenAI GPT-5 system-card benchmark for MMLU Language German.
OpenAI GPT-5 System Card - MMLU Language Hindi (Accuracy): GPT-5 (Thinking) leads with 0.899 across 3 models.
  OpenAI GPT-5 system-card benchmark for MMLU Language Hindi.
OpenAI GPT-5 System Card - MMLU Language Indonesian (Accuracy): GPT-5 (Thinking) leads with 0.909 across 3 models.
  OpenAI GPT-5 system-card benchmark for MMLU Language Indonesian.
OpenAI GPT-5 System Card - MMLU Language Italian (Accuracy): O3 (High) leads with 0.912 across 3 models.
  OpenAI GPT-5 system-card benchmark for MMLU Language Italian.
OpenAI GPT-5 System Card - MMLU Language Japanese (Accuracy): GPT-5 (Thinking) leads with 0.898 across 3 models.
  OpenAI GPT-5 system-card benchmark for MMLU Language Japanese.
OpenAI GPT-5 System Card - MMLU Language Korean (Accuracy): GPT-5 (Thinking) leads with 0.896 across 3 models.
  OpenAI GPT-5 system-card benchmark for MMLU Language Korean.
OpenAI GPT-5 System Card - MMLU Language Portuguese (Accuracy): GPT-5 (Thinking) leads with 0.91 across 3 models.
  OpenAI GPT-5 system-card benchmark for MMLU Language Portuguese.
OpenAI GPT-5 System Card - MMLU Language Spanish (Accuracy): O3 (High) leads with 0.911 across 3 models.
  OpenAI GPT-5 system-card benchmark for MMLU Language Spanish.
OpenAI GPT-5 System Card - MMLU Language Swahili (Accuracy): GPT-5 (Thinking) leads with 0.88 across 3 models.
  OpenAI GPT-5 system-card benchmark for MMLU Language Swahili.
OpenAI GPT-5 System Card - MMLU Language Yoruba (Accuracy): GPT-5 (Thinking) leads with 0.806 across 3 models.
  OpenAI GPT-5 system-card benchmark for MMLU Language Yoruba.
OpenAI GPT-5 System Card - BBQ Ambiguous (Accuracy): GPT-5 (Thinking) leads with 0.93 across 3 models.
  OpenAI GPT-5 system-card benchmark for BBQ Ambiguous.
OpenAI GPT-5 System Card - BBQ Disambiguated (Accuracy): GPT-5 (Thinking) leads with 0.88 across 3 models.
  OpenAI GPT-5 system-card benchmark for BBQ Disambiguated.
Physical AI Bench - Understanding Overall (Overall Score (%)): Cosmos-Reason2-32B leads with 70.8 across 25 models.
  Physical AI Bench understanding track evaluating multimodal reasoning about physical scenarios across robotics, autonomous driving, space, time, and physics.
JMMMU-Pro - Overall (Accuracy (%)): Gemini 3 Pro leads with 87.045 across 14 models.
  JMMMU-Pro overall multimodal reasoning accuracy on Japanese cultural and academic questions.
JMMMU-Pro - Culture Specific (Accuracy (%)): Gemini 3 Pro leads with 95.0 across 14 models.
  JMMMU-Pro culture-specific accuracy on Japanese multimodal questions requiring local cultural knowledge.
JMMMU-Pro - Culture Agnostic (Accuracy (%)): Gemini 3 Pro leads with 80.417 across 14 models.
  JMMMU-Pro culture-agnostic accuracy on Japanese multimodal academic questions.
JMMMU-Pro - Japanese Art (Accuracy (%)): Gemini 3 Pro leads with 91.333 across 14 models.
  JMMMU-Pro Japanese art category accuracy.
JMMMU-Pro - Japanese Heritage (Accuracy (%)): Gemini 3 Pro leads with 96.667 across 14 models.
  JMMMU-Pro Japanese heritage category accuracy.
JMMMU-Pro - Japanese History (Accuracy (%)): Gemini 3 Pro leads with 95.333 across 14 models.
  JMMMU-Pro Japanese history category accuracy.
JMMMU-Pro - World History (Accuracy (%)): Gemini 3 Pro leads with 96.667 across 14 models.
  JMMMU-Pro world history category accuracy.
MMLongBench-Doc - Accuracy (Accuracy (%)): Claude 4.5 Opus leads with 61.9 across 19 models.
  MMLongBench-Doc accuracy for multimodal long-document understanding over lengthy document images and text.
LLM Stats (TAU3-Bench) (Score (%)): MiMo-V2.5-Pro leads with 72.9 across 5 models.
  LLM Stats aggregate of Tau3-Bench agentic customer-service tasks across retail, telecom, airline, and banking-knowledge domains.
OmniGAIA - Overall (Overall Accuracy (%)): Orchestra-o1-GPT-5 leads with 72.8 across 18 models.
  OmniGAIA overall accuracy on multimodal general-assistant questions spanning geography, technology, history, finance, sports, art, movies, science, and food.
OmniGAIA - Geo (Geography Accuracy (%)): Orchestra-o1-GPT-5 leads with 72.5 across 16 models.
  OmniGAIA geography category accuracy.
OmniGAIA - Tech (Technology Accuracy (%)): Orchestra-o1-GPT-5 leads with 69.4 across 16 models.
  OmniGAIA technology category accuracy.
OmniGAIA - History (History Accuracy (%)): Orchestra-o1-GPT-5 leads with 75.8 across 16 models.
  OmniGAIA history category accuracy.
OmniGAIA - Finance (Finance Accuracy (%)): Gemini-3-Pro leads with 72.0 across 16 models.
  OmniGAIA finance category accuracy.
OmniGAIA - Sport (Sport Accuracy (%)): Orchestra-o1-GPT-5 leads with 83.8 across 16 models.
  OmniGAIA sports category accuracy.
OmniGAIA - Art (Art Accuracy (%)): Orchestra-o1-GPT-5 leads with 63.9 across 16 models.
  OmniGAIA art category accuracy.
OmniGAIA - Movie (Movie Accuracy (%)): Orchestra-o1-GPT-5 leads with 69.7 across 16 models.
  OmniGAIA movie category accuracy.
OmniGAIA - Science (Science Accuracy (%)): Orchestra-o1-GPT-5 leads with 73.1 across 16 models.
  OmniGAIA science category accuracy.
OmniGAIA - Food (Food Accuracy (%)): Gemini-3-Pro leads with 88.9 across 16 models.
  OmniGAIA food category accuracy.

New Models (35)

GPT-5.4 Pro (xHigh) — ELO 2980, #6/1344, above Gemini 3 Deep Think, below Claude Fable 5
FrontierMath - Tiers 1-3 (v2): 82.46 (#4/30)
FrontierMath - Tier 4 (v2): 58.54 (#5/31)

Claude Fable 5 — ELO 2955, #7/1344, above GPT-5.4 Pro, below Qwen 3.7 Max
Chatbot Arena (Search): 1237.0 (#3/31)
Epoch AI - ECI: 160.87 (#4/381)

Qwen 3.7 Max — ELO 2748, #8/1344, above Claude Fable 5, below Claude Opus 4.8
LLM Stats (GDPval-AA): 1308.0 (#12/33)

Claude Opus 4.8 — ELO 2678, #9/1344, above Qwen 3.7 Max, below GPT-5.5
Vals AI Vibe Code Bench: 82.72 (#2/66)
Vals AI Terminal-Bench 2.1: 71.91 (#4/35)
Chatbot Arena (Search): 1203.0 (#11/31)

GPT-5.5 — ELO 2502, #10/1344, above Claude Opus 4.8, below Claude Opus 4.6
LLM Stats (GDPval-AA): 1135.0 (#23/33)

GPT-5.4 — ELO 2370, #15/1344, above Gemini 3.1 Pro (Preview), below GPT-5.3 Codex
LLM Stats (GDPval-AA): 1429.0 (#6/33)

Nemotron 3 Ultra — ELO 2352, #18/1344, above GPT-5 Pro, below Claude Opus 4.7
LLM Stats (IMO-AnswerBench): 92.3 (#1/18)
LLM Stats (LongBench v2): 61.9 (#3/16)
LLM Stats (MMLU-ProX): 83.0 (#5/32)
LLM Stats (Multi-Challenge): 63.8 (#6/29)
LLM Stats (WMT24++): 83.7 (#6/23)
LLM Stats (Finance Agent): 53.7 (#8/8)
LLM Stats (GDPval-AA): 1183.0 (#18/33)
ZeroEval GPQA Diamond: 87.0 (#34/226)
LLM Stats (BrowseComp): 44.4 (#40/49)

Claude Opus 4.7 — ELO 2349, #19/1344, above Nemotron 3 Ultra, below GPT-5.2 Pro
SEAL - SWE Atlas - Codebase QnA: 40.32 (#4/16)
LLM Stats (GDPval-AA): 1542.0 (#4/33)
SEAL - SWE Atlas - Test Writing: 38.52 (#7/17)

Gemini 3.5 Flash — ELO 2333, #22/1344, above GPT-5, below Qwen 3.7 Plus
EQ-Bench Longform Writing: 71.8 (#17/116)

Qwen 3.7 Plus — ELO 2325, #23/1344, above Gemini 3.5 Flash, below GLM-5.2
LLM Stats (DeepPlanning): 62.3 (#1/9)
LLM Stats (ERQA): 69.8 (#1/20)
LLM Stats (LVBench): 76.2 (#1/21)
LLM Stats (MLVU): 87.4 (#1/10)
LLM Stats (MRCR v2): 91.7 (#1/8)
LLM Stats (RealWorldQA): 86.9 (#1/23)
LLM Stats (SimpleVQA): 81.7 (#1/11)
LLM Stats (Video-MME): 88.0 (#1/15)
LLM Stats (MathVision): 90.3 (#2/29)
LLM Stats (MAXIFE): 88.8 (#2/11)

GLM-5.2 — ELO 2280, #24/1344, above Qwen 3.7 Plus, below GPT-5.2
LLM Stats (AIME 2026): 99.2 (#1/17)
LLM Stats (NL2Repo): 48.9 (#1/9)
NYT Connections Older Models: 92.7 (#1/108)
Vending-Bench 2: 8313.78 (#2/49)
LLM Stats (IMO-AnswerBench): 91.0 (#2/18)
LiveBench Python: 90.0 (#3/126)
FrontierSWE: 74.0 (#3/14)
LiveBench TypeScript: 65.0 (#4/125)
LLM Stats (MCP Atlas): 76.8 (#4/25)
RuneBench: 3230.0 (#4/25)

Qwen Max — ELO 2249, #27/1344, above O3 Pro, below Gemini 3 Flash
Epoch AI - ECI: 154.12 (#48/381)

DeepSeek V4 Pro — ELO 2226, #32/1344, above Seed 2.0 Pro, below Claude Opus 4.5
SEAL - SWE Atlas - Codebase QnA: 27.15 (#10/16)
SEAL - SWE Atlas - Test Writing: 27.05 (#15/17)

Kimi K2.6 — ELO 2179, #40/1344, above Gemini 3 Flash (Preview), below Claude Opus 4.5 (20251101)
Vals AI Multimodal Index: 56.43 (#8/21)
Vals AI CorpFin v2: 66.74 (#9/116)
Vals AI LiveCodeBench: 86.77 (#9/122)
Vals AI Finance Agent v2: 44.9 (#10/28)
Vals AI (Vals Index): 55.17 (#11/30)
Vals AI SAGE: 50.22 (#11/61)
Vals AI MMMU: 86.3 (#11/76)
Vals AI Finance Agent: 57.06 (#12/51)
Vals AI Terminal-Bench 2.0: 57.3 (#13/68)
Vals AI MMLU-Pro: 87.57 (#13/115)

MiniMax-M3 — ELO 2169, #42/1344, above Claude Opus 4.5 (20251101), below GPT-5.1
LLM Stats (GDPval-AA): 1431.0 (#5/33)
Vellum - GPQA: 93.0 (#7/58)
Vellum - HumanEval: 80.5 (#8/39)
Vending-Bench 2: 2157.77 (#31/49)
NYT Connections Extended: 74.2 (#31/85)

kimi-k2.7-code — ELO 2152, #46/1344, above Rombos-LLM-V2.5-Qwen-72B, below O3
RuneBench: 3099.0 (#6/25)
Lynchmark: 75.0 (#10/15)
Agent Arena - Steerability: 7.31 (#12/28)
Vending-Bench 2: 5082.94 (#15/49)
AA GDPval: 1198.9 (#18/52)
GDPval-AA: 1199.0 (#18/52)
Chatbot Arena (Code): 1478.0 (#19/89)
AA CritPt: 10.0 (#20/414)
Agent Arena - Confirmed Success: 3.22 (#21/28)
AA Omniscience - Science, Engineering & Mathematics: 44.8 (#21/414)

Grok 4.3 — ELO 2101, #62/1344, above Doubao-1.5-Pro, below Gemini 2.5 Pro
LLM Stats (GDPval-AA): 1100.0 (#25/33)

GLM-5.1 — ELO 2097, #64/1344, above Gemini 2.5 Pro, below O3 (2025-04-16)
Vals AI Finance Agent: 57.66 (#10/51)
Vals AI Finance Agent v2: 44.79 (#11/28)
Vals AI SWE-bench Verified: 76.4 (#12/46)
Vals AI (Vals Index): 52.45 (#13/30)
LLM Stats (GDPval-AA): 1281.0 (#14/33)
Vals AI ProofBench: 22.22 (#15/43)
Vals AI LegalBench: 84.39 (#17/119)
Vals AI Terminal-Bench 2.0: 53.93 (#17/68)
Vals AI Terminal-Bench 2.1: 56.93 (#17/35)
Vals AI MMLU-Pro: 86.9 (#23/115)

Step 3.7 Flash — ELO 2044, #85/1344, above Grok 4.20, below Gemini 3.1 Flash Lite
AA APEX-Agents: 14.82 (#15/25)

Qwen 3.6 Plus — ELO 2020, #94/1344, above Claude Sonnet 4 (20250514), below Qwen 3.5 397B A17B
LLM Stats (GDPval-AA): 1160.0 (#21/33)

Qwen 3.5 397B A17B — ELO 2020, #95/1344, above Qwen 3.6 Plus, below Gemini 2.5 Pro (Preview 06-05)
LLM Stats (GDPval-AA): 961.0 (#29/33)

GPT-5.4 Mini — ELO 2005, #106/1344, above MiniMax M1 40k, below Claude Opus 4
LLM Stats (GDPval-AA): 1190.0 (#17/33)

Qwen 3.6 27B — ELO 1989, #110/1344, above Gemma 4 31B (IT), below GPT-5.1 Codex Mini
LLM Stats (GDPval-AA): 1158.0 (#22/33)

Qwen 3.5 122B A10B — ELO 1981, #115/1344, above O4 Mini, below GPT-5.3 Chat
LLM Stats (GDPval-AA): 985.0 (#27/33)

Command A+ — ELO 1876, #171/1344, above Mistral Medium 3.1, below Grok 4.1
LLM Stats (MathVista): 80.6 (#4/36)
LLM Stats (CharXiv-D): 88.0 (#5/14)
LLM Stats (WMT24++): 81.0 (#7/23)
LLM Stats (CharXiv-R): 52.7 (#35/40)

GPT-5.4 Nano — ELO 1860, #178/1344, above MiMo-V2-Flash, below Qwen 3 Next 80B A3B
LLM Stats (GDPval-AA): 1115.0 (#24/33)

MiniMax-M2.5 — ELO 1837, #199/1344, above Gemini 2.0 Flash, below Ling-2.6-1T
Vals AI SWE-bench Verified: 74.2 (#18/46)
Vals AI Terminal-Bench 2.0: 41.57 (#30/68)
Vals AI MedQA: 92.53 (#31/95)
Vals AI IOI: 6.67 (#34/55)
Vals AI ProofBench: 4.0 (#39/43)
Vals AI GPQA: 82.07 (#39/116)
Vals AI CaseLaw v2: 53.48 (#40/54)
Vals AI Finance Agent: 38.58 (#41/51)
Vals AI LiveCodeBench: 79.21 (#53/122)
Vals AI CorpFin v2: 59.6 (#60/116)

Claude Haiku 4.5 — ELO 1831, #203/1344, above GPT-OSS-120B, below Gemini 2.5 Flash Lite (Preview 09-2025)
LLM Stats (GDPval-AA): 902.0 (#32/33)

Mistral Medium 3.5 — ELO 1793, #228/1344, above Trinity Large, below GPT-4 (0613)
LLM Stats (GDPval-AA): 926.0 (#31/33)

Gemma 4 31B — ELO 1736, #266/1344, above Qwen 3 32B, below Qwen2.5-32B-Instruct-CFT
LLM Stats (GDPval-AA): 783.0 (#33/33)

Qwen 3.6 35B A3B — ELO 1723, #275/1344, above Gemma 3 27B (IT), below trinity-large-preview
LLM Stats (GDPval-AA): 1056.0 (#26/33)

nemotron-3-ultra-550B-a55B — ELO 1712, #285/1344, above Qwen 3.5 35B A3B, below Solar Open 100B
Design Arena (3D): 1203.0 (#51/120)
Design Arena (Game Dev): 1200.0 (#67/132)
Design Arena (UI Components): 1162.0 (#74/126)
Design Arena (Data Viz): 1139.0 (#90/128)

Laguna XS.2 — ELO 1664, #341/1344, above Gemini 2.0 Flash Lite (001), below Qwen 3.5 9B
Chatbot Arena (Code): 1298.0 (#70/89)

Laguna M.1 — ELO 1651, #358/1344, above Llama 2 70B, below Virtuoso-Lite
Chatbot Arena (Code): 1347.0 (#60/89)

Gemma 4 12B — ELO 1400, #813/1344, above Yi 34B (Base), below maestrale-chat-v0.4-beta
Wolfram LLM Benchmarking Project: 22.8 (#380/489)

New Scores From Top-10 Models (9)

Claude Fable 5 on Chatbot Arena (Search): 1237.0 Arena Score (#3/31)
Claude Fable 5 on Epoch AI - ECI: 160.87 ECI Score (#4/381)
Claude Opus 4.8 on Chatbot Arena (Search): 1203.0 Arena Score (#11/31)
Claude Opus 4.8 on Vals AI Terminal-Bench 2.1: 71.91 Accuracy (%) (#4/35)
Claude Opus 4.8 on Vals AI Vibe Code Bench: 82.72 Accuracy (%) (#2/66)
GPT-5.4 Pro on FrontierMath - Tier 4 (v2): 58.54 Accuracy (%, 41 private v2 problems) (#5/31)
GPT-5.4 Pro on FrontierMath - Tiers 1-3 (v2): 82.46 Accuracy (%, 285 private v2 problems) (#4/30)
GPT-5.5 on LLM Stats (GDPval-AA): 1135.0 ELO (#23/33)
Qwen 3.7 Max on LLM Stats (GDPval-AA): 1308.0 ELO (#12/33)

New #1 Leaders (24)

WDCD R3 Pressure Integrity (Score (%)): Qwen 3 Max (190.0) beat Claude Opus 4.7 (100.0) by 90.0.
LLM Stats (MRCR v2) (Score (%)): Qwen 3.7 Plus (91.7) beat Gemma 4 31B (66.4) by 25.3.
LLM Stats (DeepPlanning) (Score (%)): Qwen 3.7 Plus (62.3) beat Qwen 3.6 Plus (41.5) by 20.8.
Coding Agent Leaderboard - swe-bench-pro--ansible (Score (%)): Opus 4.8 + Claude Code (69.8) beat Sonnet 4.6 + Claude Code (50.0) by 19.8.
Coding Agent Leaderboard (Score (%)): Opus 4.8 + Claude Code (78.3) beat Sonnet 4.6 + Claude Code (64.8) by 13.5.
Design Arena (Website) (Elo): silo (1357.0) beat Claude Fable 5 (1345.0) by 12.0.
Coding Agent Leaderboard - swe-bench-verified (Score (%)): Opus 4.8 + Claude Code (86.8) beat Sonnet 4.6 + Claude Code (79.6) by 7.2.
WDCD R2 In-Document Resistance (Score (%)): Gemini 2.5 Pro (90.0) beat Grok 4 (84.0) by 6.0.
Agent Security League - Security Correctness (Security Correctness (%)): Claude Fable 5 (29.0) beat GPT-5.5 (24.0) by 5.0.
Terminal-Bench 2.1 (Claude Code) (Accuracy (%)): Claude Fable 5 (83.1) beat Claude Opus 4.8 (78.9) by 4.2.
LLM Stats (ERQA) (Score (%)): Qwen 3.7 Plus (69.8) beat Qwen 3.6 Plus (65.7) by 4.1.
LLM Stats (SimpleVQA) (Score (%)): Qwen 3.7 Plus (81.7) beat GLM-5V Turbo (78.2) by 3.5.
LLM Stats (AIME 2026) (Score (%)): GLM-5.2 (99.2) beat Kimi K2.6 (96.4) by 2.8.
LLM Stats (IMO-AnswerBench) (Score (%)): Nemotron 3 Ultra (92.3) beat Qwen 3.7 Max (90.0) by 2.3.
Terminal-Bench 2.1 (Terminus 2) (Accuracy (%)): Claude Fable 5 (80.4) beat GPT-5.5 (78.2) by 2.2.
Epoch AI - ECI (ECI Score): Claude Fable 5 (Max) (160.87) beat GPT-5.5 Pro (xHigh) (158.9) by 1.97.
LLM Stats (NL2Repo) (Score (%)): GLM-5.2 (48.9) beat Qwen 3.7 Max (47.2) by 1.7.
LLM Stats (RealWorldQA) (Score (%)): Qwen 3.7 Plus (86.9) beat Qwen 3.6 Plus (85.4) by 1.5.
Wolfram LLM Benchmarking Project (Correct Functionality (%)): Claude Fable 5 thinking max (73.3) beat Claude Opus 4.7 (Thinking) (72.5) by 0.8.
LLM Stats (LVBench) (Score (%)): Qwen 3.7 Plus (76.2) beat Kimi K2.5 (75.9) by 0.3.
LLM Stats (Video-MME) (Score (%)): Qwen 3.7 Plus (88.0) beat MiMo-V2.5 (87.7) by 0.3.
NYT Connections Older Models (Score (%)): GLM-5.2 (92.7) beat Sherlock Think Alpha (92.5) by 0.2.
Agent Arena - Tool Hallucination (Tool Hallucination (%)): Grok 4.3 (High) (0.11) beat Grok 4.3 (0.26) by 0.15.
LLM Stats (MLVU) (Score (%)): Qwen 3.7 Plus (87.4) beat Qwen 3.5 122B A10B (87.3) by 0.1.
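
The margins in the list above appear to be the absolute gap between the new leader's score and the runner-up's score. A minimal Python sketch, assuming the line layout used in this digest, can recheck that arithmetic. The LEADER_LINE pattern and the check_margin helper are hypothetical names for illustration and are not part of any AI Benchmark Hub tooling.

```python
import math
import re

# Hypothetical pattern for one "New #1 Leaders" line, assuming the layout:
#   <label>: <leader> (<score>) beat <runner-up> (<score>) by <margin>.
LEADER_LINE = re.compile(
    r"^(?P<label>.+?): "
    r"(?P<leader>.+) \((?P<lead>\d+(?:\.\d+)?)\) beat "
    r"(?P<runner_up>.+) \((?P<prev>\d+(?:\.\d+)?)\) by "
    r"(?P<margin>\d+(?:\.\d+)?)\.$"
)


def check_margin(line):
    """Return True/False if the stated margin matches the two scores,
    or None if the line does not follow the assumed layout."""
    match = LEADER_LINE.match(line.strip())
    if match is None:
        return None
    lead = float(match["lead"])
    prev = float(match["prev"])
    margin = float(match["margin"])
    # Assumption: the margin is an absolute difference, so it also covers
    # metrics where a lower value wins.
    return math.isclose(abs(lead - prev), margin, abs_tol=1e-6)


if __name__ == "__main__":
    sample = (
        "LLM Stats (MRCR v2) (Score (%)): "
        "Qwen 3.7 Plus (91.7) beat Gemma 4 31B (66.4) by 25.3."
    )
    print(check_margin(sample))
```

The absolute difference is used because at least one metric in the list (tool hallucination rate) looks like a lower-is-better score, where the leader has the smaller number.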
