AI Benchmark Digest — 2026-06-16
Daily
New Benchmarks (7)
- SWE-Marathon (Pass@1 (%)): Claude Opus 4.8 leads with 26.0 across 9 models. Long-horizon software engineering benchmark where coding agents work on realistic repository tasks under marathon-scale time budgets, reporting pass@1 for end-to-end completed tasks.
- InferenceBench (Speedup Score): Claude Fable 5 (Low) leads with 8.74 across 22 models. Benchmark for coding agents optimizing inference workloads. Agents tune serving configurations and implementation choices across latency, throughput, and all-in-one scenarios.
- AgenticVBench (Average Success (%)): Claude Fable 5 leads with 32.4 across 9 models. Agentic video benchmark where autonomous agents perform multi-step video repurposing, sequencing, repair, and assembly tasks, scored by average task success.
- TERMS-Bench (Mean Utility): GLM 5.1 leads with 11.7 across 15 models. Negotiation benchmark for LLM agents bargaining over terms under changing utility, urgency, and no-deal regimes, reporting mean utility and agreement metrics.
- Structured Output Benchmark (Overall (%)): GPT-5.4 leads with 87.0 across 28 models. Structured-output benchmark measuring schema-constrained generation with value accuracy, faithfulness, JSON validity, path recall, type safety, and perfect-output rates.
- BenGER (Aggregate Accuracy (%)): Gemini 3.1 Pro leads with 77.0 across 12 models. German-law benchmark for subsumption-based legal reasoning, evaluating model answers across Benchathon, ZJS, and doctrinal-principles corpora.
- BenchLM (Overall Score): Claude Mythos 5 leads with 99.0 across 123 models. Composite LLM leaderboard aggregating current model performance across agentic, coding, reasoning, grounded multimodal, knowledge, multilingual, instruction-following, and math categories.
New Scores From Top-10 Models (3)
- Claude Fable 5 on Chatbot Arena (Search): 1237.0 Arena Score (#3/31)
- Claude Fable 5 on Epoch AI - ECI: 160.87 ECI Score (#3/380)
- Claude Opus 4.8 on Chatbot Arena (Search): 1203.0 Arena Score (#11/31)
New #1 Leaders (2)
- LLM Stats (MRCR v2) (Score (%)): U2 (76.61) beat Gemma 4 31B (66.4) by 10.21.
- Epoch AI - ECI (ECI Score): Claude Fable 5 (Max) (160.87) beat GPT-5.5 Pro (xHigh) (158.9) by 1.97.
Don't miss what's next. Subscribe to Mikhail Doroshenko: