Mikhail Doroshenko

May 25, 2026

AI Benchmark Digest — 2026-05-25

=== DAILY ===

NEW BENCHMARKS (6)

- LLMEval-Logic Base (Accuracy (%)): leader Seed 2.0 Pro (Thinking) (75.5), 14 models
- LLMEval-Logic Hard (Accuracy (%)): leader Gemini 3.1 Pro (Thinking) (37.5), 14 models
- LLMEval-Logic Hard Sub-Q (Accuracy (%)): leader Claude Opus 4.6 (Thinking) (76.6), 14 models
- LLMEval-Logic Formalization Free (Accuracy (%)): leader Gemini 3.1 Pro (Thinking) (45.1), 14 models
- LLMEval-Logic Formalization Fixed (Accuracy (%)): leader GPT-5.4 Pro (No-Think) (60.2), 14 models
- ExploitBench v8-bench (Mean Capability (%)): leader Claude Mythos Preview (69.0), 9 models
  V8 exploitation ladder benchmark measuring how far AI agents climb from code reachability through crash reproduction, exploit primitives, and arbitrary code execution. Reports mean capability across 41 V8 bug environments.


