Mikhail Doroshenko

June 18, 2026

AI Benchmark Digest — 2026-06-18


Daily

New Benchmarks (9)

  • AISI Cyber Cooling Tower 10M (Avg Steps (/7)): Claude Opus 4.6 leads with 0.1 of 7 steps across 7 models. AISI cyber range "Cooling Tower": a 7-step industrial-control-network attack simulation. Reports average steps completed at a 10M-token budget.
  • AISI Cyber Cooling Tower 100M (Avg Steps (/7)): Claude Opus 4.6 leads with 1.4 of 7 steps across 5 models. AISI cyber range "Cooling Tower": a 7-step industrial-control-network attack simulation. Reports average steps completed at a 100M-token budget.
  • OpenAI CTF (Professional) (pass@12 (%)): GPT-5.5 leads with 96.3 across 3 models. OpenAI system-card subset of professional capture-the-flag tasks, reporting pass@12 over offensive-security rollouts with a Linux tool harness.
  • CVE-Bench (pass@1 (%)): GPT-5.5 leads with 93.1 across 4 models. Cybersecurity benchmark for autonomous web vulnerability exploitation across 40 critical CVEs in zero-day and one-day settings.
  • OpenAI Cyber Ranges (Combined Pass Rate (%)): GPT-5.5 leads with 93.33 across 4 models. OpenAI internal cyber-range suite measuring end-to-end cyber operations across realistic emulated networks.
  • ExploitGym (Successful Intended Exploits (#)): Claude Mythos Preview leads with 157 across 7 models. Real-world cybersecurity agent benchmark measuring whether AI agents can turn known software vulnerabilities into working, intended exploits across userspace, V8, and Linux kernel targets.
  • CyScenarioBench (Average Success Rate (%)): Claude Mythos 5 leads with 36.7 across 9 models. Scenario-based offensive security benchmark from Irregular, measuring whether agents can plan and complete full multi-stage attack scenarios in realistic environments.
  • Lyptus Cyber Time Horizons - InterCode-CTF (pass@1 at 2M tokens (%)): Claude Opus 4.6 leads with 100.0 across 3 models. Lyptus Research offensive cyber time-horizon run of InterCode-CTF, measuring pass@1 on CTF tasks at a 2M-token budget.
  • Lyptus Cyber Time Horizons - NL2Bash (pass@1 at 2M tokens (%)): GPT-5.3 Codex leads with 100.0 across 3 models. Lyptus Research offensive cyber time-horizon run of NL2Bash, measuring pass@1 command-generation success at a 2M-token budget.
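
Several entries above report pass@k figures (pass@12, pass@1). For readers unfamiliar with the metric, here is a minimal sketch of the commonly used unbiased pass@k estimator, assuming each task gets n independent attempts of which c succeed. The digest does not say how each source computed its number, so this may differ from what any given benchmark actually does; the function names and the sample results are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task from n attempts with c successes.

    Uses the standard combinatorial estimator 1 - C(n - c, k) / C(n, k):
    the probability that a random size-k subset of the n attempts
    contains at least one success.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Fewer than k failures: every size-k subset must include a success.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average per-task pass@k over (n, c) pairs, as a percentage."""
    if not results:
        raise ValueError("results must not be empty")
    return 100.0 * sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical data: three tasks, 12 attempts each, (attempts, successes).
    sample = [(12, 3), (12, 0), (12, 12)]
    print(f"pass@12: {benchmark_pass_at_k(sample, 12):.2f}%")
    print(f"pass@1:  {benchmark_pass_at_k(sample, 1):.2f}%")
```

When n equals k, as in a 12-attempt pass@12 run, the estimator reduces to "at least one attempt solved the task", which is why pass@12 scores sit well above pass@1 scores on the same tasks and the two should not be compared directly.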

New Scores From Top-10 Models (2)

  • GPT-5.4 Pro on FrontierMath - Tier 4 (v2): 58.54% accuracy on 41 private v2 problems (rank #5 of 30)
  • GPT-5.4 Pro on FrontierMath - Tiers 1-3 (v2): 82.46% accuracy on 285 private v2 problems (rank #4 of 29)

New #1 Leaders (2)

  • Terminal-Bench 2.1 (Claude Code) (Accuracy (%)): Claude 5 Fable (83.1) beat Claude Opus 4.8 (78.9) by 4.2 points.
  • Terminal-Bench 2.1 (Terminus 2) (Accuracy (%)): Claude 5 Fable (80.4) beat GPT-5.5 (78.2) by 2.2 points.