AI Benchmark Digest — 2026-06-16

June 16, 2026

Daily
New Benchmarks (7)

SWE-Marathon (Pass@1 (%)): Claude Opus 4.8 leads with 26.0 across 9 models.
  Long-horizon software engineering benchmark where coding agents work on realistic repository tasks under marathon-scale time budgets, reporting pass@1 for end-to-end completed tasks.
InferenceBench (Speedup Score): Claude Fable 5 (Low) leads with 8.74 across 22 models.
  Benchmark for coding agents optimizing inference workloads. Agents tune serving configurations and implementation choices across latency, throughput, and all-in-one scenarios.
AgenticVBench (Average Success (%)): Claude Fable 5 leads with 32.4 across 9 models.
  Agentic video benchmark where autonomous agents perform multi-step video repurposing, sequencing, repair, and assembly tasks, scored by average task success.
TERMS-Bench (Mean Utility): GLM 5.1 leads with 11.7 across 15 models.
  Negotiation benchmark for LLM agents bargaining over terms under changing utility, urgency, and no-deal regimes, reporting mean utility and agreement metrics.
Structured Output Benchmark (Overall (%)): GPT-5.4 leads with 87.0 across 28 models.
  Structured-output benchmark measuring schema-constrained generation with value accuracy, faithfulness, JSON validity, path recall, type safety, and perfect-output rates.
BenGER (Aggregate Accuracy (%)): Gemini 3.1 Pro leads with 77.0 across 12 models.
  German-law benchmark for subsumption-based legal reasoning, evaluating model answers across Benchathon, ZJS, and doctrinal-principles corpora.
BenchLM (Overall Score): Claude Mythos 5 leads with 99.0 across 123 models.
  Composite LLM leaderboard aggregating current model performance across agentic, coding, reasoning, grounded multimodal, knowledge, multilingual, instruction-following, and math categories.

New Scores From Top-10 Models (3)

Claude Fable 5 on Chatbot Arena (Search): 1237.0 Arena Score (#3/31)
Claude Fable 5 on Epoch AI - ECI: 160.87 ECI Score (#3/380)
Claude Opus 4.8 on Chatbot Arena (Search): 1203.0 Arena Score (#11/31)

New #1 Leaders (2)

LLM Stats (MRCR v2) (Score (%)): U2 (76.61) beat Gemma 4 31B (66.4) by 10.21.
Epoch AI - ECI (ECI Score): Claude Fable 5 (Max) (160.87) beat GPT-5.5 Pro (xHigh) (158.9) by 1.97.
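
The lead margins quoted above are plain differences between the new leader's score and the previous leader's score. A minimal sketch for re-checking them follows; the helper name `lead_margin` is hypothetical and not part of any AI Benchmark Hub tooling, and the scores are passed as strings so that `Decimal` subtracts them exactly rather than through binary floats.

```python
from decimal import Decimal


def lead_margin(leader_score: str, previous_score: str) -> Decimal:
    """Return leader minus previous leader as an exact decimal difference."""
    return Decimal(leader_score) - Decimal(previous_score)


# Scores copied from the two "New #1 Leaders" entries in this digest.
print(lead_margin("76.61", "66.4"))    # 10.21
print(lead_margin("160.87", "158.9"))  # 1.97
```

Both results match the margins reported in the entries.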

