AI Benchmark Digest — 2026-05-25
=== DAILY ===

NEW BENCHMARKS (6)
- LLMEval-Logic Base (Accuracy (%)): leader Seed 2.0 Pro (Thinking) (75.5), 14 models
- LLMEval-Logic Hard (Accuracy (%)): leader Gemini 3.1 Pro (Thinking) (37.5), 14 models
- LLMEval-Logic Hard Sub-Q (Accuracy (%)): leader Claude Opus 4.6 (Thinking) (76.6), 14 models
- LLMEval-Logic Formalization Free (Accuracy (%)): leader Gemini 3.1 Pro (Thinking) (45.1), 14 models
- LLMEval-Logic Formalization Fixed (Accuracy (%)): leader GPT-5.4 Pro (No-Think) (60.2), 14 models
- ExploitBench v8-bench (Mean Capability (%)): leader Claude Mythos Preview (69.0), 9 models
  V8 exploitation ladder benchmark measuring how far AI agents climb from code reachability through crash reproduction, exploit primitives, and arbitrary code execution. Reports mean capability across 41 V8 bug environments.