AI Benchmark Digest — 2026-05-13
=== DAILY ===

NEW BENCHMARKS (2)
- ProgramBench (Resolved (%)): leader GPT-5.5 (xHigh) (0.5), 13 models
  Meta and Stanford benchmark testing whether language-model agents can rebuild complete programs from only a compiled binary and documentation. Agents use mini-SWE-agent across 200 open-source program recreation tasks and are scored by hidden behavioral tests.
- ProgramBench Almost (Almost (%)): leader GPT-5.5 (xHigh) (13.5), 13 models
  Companion ProgramBench metric that counts near-complete program recreations: tasks where the generated implementation passes most hidden behavioral tests but does not fully resolve the benchmark task.

NEW MODELS (1)
- JT-35B-Flash: ELO 1693, #141/799 (above: Qwen 3.6 27B (Reasoning), below: Claude 3.7 Sonnet (Thinking 16K))
  AA TAU-2 Bench: 99.1 (#1/405)
  AA Omniscience - Software Engineering (SWE) - Go: 36.0 (#50/391)
  AA Omniscience - Software Engineering (SWE) - Java: 29.0 (#58/391)
  AA Omniscience - Software Engineering (SWE) - HTML: 48.0 (#60/391)
  AA Omniscience - Software Engineering (SWE) - JavaScript: 41.82 (#75/391)
  AA GPQA Diamond: 82.9 (#76/486)
  AA Omniscience - Software Engineering (SWE) - C: 53.0 (#78/391)
  AA Omniscience - Software Engineering (SWE) - PHP: 38.0 (#79/391)
  AA Omniscience - Software Engineering (SWE) - TypeScript: 36.67 (#82/391)
  AA Omniscience - Software Engineering (SWE): 35.0 (#83/391)

NEW SCORES FROM TOP-10 MODELS (1)
- GPT-5.5 (xHigh) on WeirdML: 84.91 Average Score (#1/121)

NEW #1 LEADERS (2)
- WeirdML (Average Score): GPT-5.5 (xHigh) (84.91) beat GPT-5.5 (High) (83.9) by 1.01
- AA TAU-2 Bench (Accuracy (%)): JT-35B-Flash (99.1) beat GLM-4.7-Flash (Reasoning) (98.8) by 0.3