AI Benchmark Digest — 2026-06-18

        June 18, 2026

New Benchmarks (9)

AISI Cyber Cooling Tower 10M (Avg Steps (/7)): Claude Opus 4.6 leads with 0.1 across 7 models.
  AISI cyber range: "Cooling Tower" — a 7-step industrial-control-network attack simulation. Reports average steps completed at a 10M token budget.
AISI Cyber Cooling Tower 100M (Avg Steps (/7)): Claude Opus 4.6 leads with 1.4 across 5 models.
  AISI cyber range: "Cooling Tower" — a 7-step industrial-control-network attack simulation. Reports average steps completed at a 100M token budget.
OpenAI CTF (Professional) (pass@12 (%)): GPT-5.5 leads with 96.3 across 3 models.
  OpenAI system-card subset of professional capture-the-flag tasks, reporting pass@12 over offensive-security rollouts with a Linux tool harness.
CVE-Bench (pass@1 (%)): GPT-5.5 leads with 93.1 across 4 models.
  Cybersecurity benchmark for autonomous web vulnerability exploitation across 40 critical CVEs in zero-day and one-day settings.
OpenAI Cyber Ranges (Combined Pass Rate (%)): GPT-5.5 leads with 93.33 across 4 models.
  OpenAI internal cyber-range suite measuring end-to-end cyber operations across realistic emulated networks.
ExploitGym (Successful Intended Exploits (#)): Claude Mythos Preview leads with 157.0 across 7 models.
  Real-world cybersecurity agent benchmark measuring whether AI agents can turn known software vulnerabilities into working, intended exploits across userspace, V8, and Linux kernel targets.
CyScenarioBench (Average Success Rate (%)): Claude Mythos 5 leads with 36.7 across 9 models.
  Scenario-based offensive-security benchmark from Irregular, measuring whether agents can plan and complete full multi-stage attack scenarios in realistic environments.
Lyptus Cyber Time Horizons - InterCode-CTF (pass@1 at 2M tokens (%)): Claude Opus 4.6 leads with 100.0 across 3 models.
  Lyptus Research offensive cyber time-horizon run of InterCode-CTF, measuring pass@1 on CTF tasks at a 2M token budget.
Lyptus Cyber Time Horizons - NL2Bash (pass@1 at 2M tokens (%)): GPT-5.3 Codex leads with 100.0 across 3 models.
  Lyptus Research offensive cyber time-horizon run of NL2Bash, measuring command-generation success at a 2M token budget.
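
Several entries above report pass@k metrics (pass@1, pass@12). The digest does not say how each source computed them, so the following is only a minimal sketch of the commonly used unbiased pass@k estimator: given n sampled attempts on a task, of which c succeeded, it estimates the chance that at least one of k randomly drawn attempts succeeds. The function name and the sample numbers are illustrative assumptions, not taken from any of the listed benchmarks.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task from n attempts with c successes.

    Assumption: this is the standard combinatorial estimator
    1 - C(n - c, k) / C(n, k); the benchmarks listed above may
    compute their reported numbers differently.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Fewer than k failures means every size-k draw contains a success.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical task: 4 attempts, 2 successes, evaluated at k = 2.
print(round(pass_at_k(4, 2, 2), 4))  # 0.8333
```

A benchmark-level score would then be the mean of this value over all tasks. With k = 1 the estimator reduces to the plain success fraction c / n.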

New Scores From Top-10 Models (2)

GPT-5.4 Pro on FrontierMath - Tier 4 (v2): 58.54 Accuracy (%, 41 private v2 problems) (#5/30)
GPT-5.4 Pro on FrontierMath - Tiers 1-3 (v2): 82.46 Accuracy (%, 285 private v2 problems) (#4/29)
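
The two accuracy figures above can be read back as problem counts. This is a rough sketch that assumes each problem is scored as simply right or wrong in a single run, which the digest does not state; the helper names are hypothetical.

```python
def implied_correct(accuracy_pct: float, n_problems: int) -> int:
    """Nearest whole number of solved problems for a reported accuracy."""
    return round(accuracy_pct / 100 * n_problems)


def accuracy_from_count(correct: int, n_problems: int) -> float:
    """Accuracy in percent, rounded to two decimals like the digest."""
    return round(100 * correct / n_problems, 2)


# Tier 4 (v2): 58.54% of 41 problems.
print(implied_correct(58.54, 41), accuracy_from_count(24, 41))      # 24 58.54
# Tiers 1-3 (v2): 82.46% of 285 problems.
print(implied_correct(82.46, 285), accuracy_from_count(235, 285))  # 235 82.46
```

Under that assumption the reported scores correspond to 24 of 41 and 235 of 285 problems, and both counts round back to the published percentages.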

New #1 Leaders (2)

Terminal-Bench 2.1 (Claude Code) (Accuracy (%)): Claude 5 Fable (83.1) beat Claude Opus 4.8 (78.9) by 4.2.
Terminal-Bench 2.1 (Terminus 2) (Accuracy (%)): Claude 5 Fable (80.4) beat GPT-5.5 (78.2) by 2.2.
