Mikhail Doroshenko

June 16, 2026

AI Benchmark Digest — 2026-06-16

Daily

New Benchmarks (7)

  • SWE-Marathon (Pass@1 (%)): Claude Opus 4.8 leads with 26.0 across 9 models. Long-horizon software engineering benchmark where coding agents work on realistic repository tasks under marathon-scale time budgets, reporting pass@1 for end-to-end completed tasks.
  • InferenceBench (Speedup Score): Claude Fable 5 (Low) leads with 8.74 across 22 models. Benchmark for coding agents optimizing inference workloads. Agents tune serving configurations and implementation choices across latency, throughput, and all-in-one scenarios.
  • AgenticVBench (Average Success (%)): Claude Fable 5 leads with 32.4 across 9 models. Agentic video benchmark where autonomous agents perform multi-step video repurposing, sequencing, repair, and assembly tasks, scored by average task success.
  • TERMS-Bench (Mean Utility): GLM 5.1 leads with 11.7 across 15 models. Negotiation benchmark for LLM agents bargaining over terms under changing utility, urgency, and no-deal regimes, reporting mean utility and agreement metrics.
  • Structured Output Benchmark (Overall (%)): GPT-5.4 leads with 87.0 across 28 models. Structured-output benchmark measuring schema-constrained generation with value accuracy, faithfulness, JSON validity, path recall, type safety, and perfect-output rates.
  • BenGER (Aggregate Accuracy (%)): Gemini 3.1 Pro leads with 77.0 across 12 models. German-law benchmark for subsumption-based legal reasoning, evaluating model answers across Benchathon, ZJS, and doctrinal-principles corpora.
  • BenchLM (Overall Score): Claude Mythos 5 leads with 99.0 across 123 models. Composite LLM leaderboard aggregating current model performance across agentic, coding, reasoning, grounded multimodal, knowledge, multilingual, instruction-following, and math categories.

New Scores From Top-10 Models (3)

  • Claude Fable 5 on Chatbot Arena (Search): 1237.0 Arena Score (#3/31)
  • Claude Fable 5 on Epoch AI - ECI: 160.87 ECI Score (#3/380)
  • Claude Opus 4.8 on Chatbot Arena (Search): 1203.0 Arena Score (#11/31)

New #1 Leaders (2)

  • LLM Stats (MRCR v2) (Score (%)): U2 (76.61) beat Gemma 4 31B (66.4) by 10.21 percentage points.
  • Epoch AI - ECI (ECI Score): Claude Fable 5 (Max) (160.87) beat GPT-5.5 Pro (xHigh) (158.9) by 1.97 points.
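
The lead margins above are simply the new leader's score minus the previous leader's score. A minimal Python sketch that recomputes them from the figures in this digest follows; the tuple layout and the helper name lead_margin are illustrative assumptions, not part of AI Benchmark Hub.

```python
# Scores copied from the "New #1 Leaders" entries in this digest.
# Layout (assumed for illustration):
# (benchmark, new leader, new score, previous leader, previous score)
new_leaders = [
    ("LLM Stats (MRCR v2)", "U2", 76.61, "Gemma 4 31B", 66.4),
    ("Epoch AI - ECI", "Claude Fable 5 (Max)", 160.87, "GPT-5.5 Pro (xHigh)", 158.9),
]


def lead_margin(new_score: float, old_score: float) -> float:
    """Return the lead margin, rounded to two decimals to hide float noise."""
    return round(new_score - old_score, 2)


# Map each benchmark name to its recomputed margin.
margins = {
    benchmark: lead_margin(new_score, old_score)
    for benchmark, _, new_score, _, old_score in new_leaders
}

for benchmark, margin in margins.items():
    print(f"{benchmark}: margin {margin}")
```

Rounding to two decimals matches the precision the digest reports and avoids floating-point artifacts in the subtraction.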