AI Benchmark Digest — 2026-04-14
=== DAILY ===
NEW BENCHMARKS (11)
- SpacetimeDB LLM Benchmark (TypeScript) (Task Pass Rate (%)): leader Claude Opus 4.6 (89.4), 10 models
- SpacetimeDB LLM Benchmark (C#) (Task Pass Rate (%)): leader Claude Sonnet 4.6 (96.2), 10 models
- SpacetimeDB LLM Benchmark (Rust) (Task Pass Rate (%)): leader Claude Opus 4.6 (100.0), 10 models
- BridgeBench UI (Score): leader Claude Sonnet 4.6 (81.5), 9 models
- BridgeBench Security (Score): leader Claude Sonnet 4.6 (85.3), 19 models
- BridgeBench Debugging (Score): leader Claude Opus 4.6 (87.0), 17 models
- BridgeBench Refactoring (Score): leader Qwen 3.6 Plus (74.8), 8 models
- BridgeBench Hallucination (Score): leader Grok 4.20 Reasoning (91.8), 28 models
- AISI Cyber TLO 10M (Avg Steps (/32)): leader Claude Opus 4.6 (9.8), 9 models
- AISI Cyber TLO 100M (Avg Steps (/32)): leader Claude Mythos Preview (22.0), 7 models
- AISI Cyber CTF (Success Rate (%)): leader GPT-5.3 Codex (75.0), 15 models

NEW MODELS (1)
- Step 3.5 Flash 2603: ELO 1819, #70/925 (above: KAT-Coder-Pro V2, below: GPT-5.4 Mini (Medium))