AI Benchmark Digest — 2026-05-01
=== DAILY ===

NEW BENCHMARKS (56)
- LIBRA - Passkey (Dataset Total Score (%)): leader GLM-4 9B Chat (100.0), 17 models
- LIBRA - MatreshkaYesNo (Dataset Total Score (%)): leader GPT-4o (80.0), 17 models
- LIBRA - MatreshkaNames (Dataset Total Score (%)): leader GPT-4o (51.67), 17 models
- LIBRA - PasskeyWithLibrusec (Dataset Total Score (%)): leader GLM-4 9B Chat (100.0), 17 models
- LIBRA - LibrusecHistory (Dataset Total Score (%)): leader GPT-4o (97.5), 17 models
- LIBRA - ruGSM100 (Dataset Total Score (%)): leader GPT-4o (100.0), 17 models
- LIBRA - ruSciPassageCount (Dataset Total Score (%)): leader GPT-4o (35.0), 17 models
- LIBRA - ru2WikiMultihopQA (Dataset Total Score (%)): leader GPT-4o (76.67), 17 models
- LIBRA - LongContextMultiQ (Dataset Total Score (%)): leader GPT-4o (36.67), 17 models
- LIBRA - ruSciAbstractRetrieval (Dataset Total Score (%)): leader GLM-4 9B Chat (77.81), 17 models
- LIBRA - ruTREC (Dataset Total Score (%)): leader GPT-4o (75.0), 17 models
- LIBRA - ruSciFi (Dataset Total Score (%)): leader GPT-4o (75.0), 17 models
- LIBRA - LibrusecMHQA (Dataset Total Score (%)): leader GPT-4o (50.0), 17 models
- LIBRA - ruBABILongQA1 (Dataset Total Score (%)): leader GPT-4o (78.33), 17 models
- LIBRA - ruBABILongQA2 (Dataset Total Score (%)): leader GPT-4o (36.67), 17 models
- LIBRA - ruBABILongQA3 (Dataset Total Score (%)): leader Llama 3.1 8B (29.65), 17 models
- LIBRA - ruBABILongQA4 (Dataset Total Score (%)): leader GPT-4o (78.95), 17 models
- LIBRA - ruBABILongQA5 (Dataset Total Score (%)): leader GPT-4o (90.0), 17 models
- LIBRA - ruQuALITY (Dataset Total Score (%)): leader GPT-4o (83.33), 17 models
- LIBRA - ruTPO (Dataset Total Score (%)): leader GPT-4o (100.0), 17 models
- LIBRA - ruQasper (Dataset Total Score (%)): leader GPT-4o (31.72), 17 models
- Wolfram LLM Benchmarking Project (Correct Functionality (%)): leader Claude Opus 4.7 thinking on (72.5), 443 models
- MathArena - Project Euler 943-970 (Accuracy (%, direct Project Euler problems 943-970)): leader GPT-5.4 (xhigh) (87.5), 17 models
- MathArena - Project Euler 971-984 (Accuracy (%, direct Project Euler problems 971-984)): leader Claude-Opus-4.6 (High) (92.86), 10 models
- MathArena - Project Euler 985-988 (Accuracy (%, direct Project Euler problems 985-988)): leader Gemini 3.1 Pro Preview (100.0), 5 models
- OpenVLM OCRBench (Score (normalized)): leader JT-VL-Chat-V3.0 (95.0), 285 models
- Vals AI Vibe Code Bench (Accuracy (%)): leader claude-opus-4-7 (71.0), 41 models
- EuroEval Albanian (Average Score (%)): leader gemini-3.1-pro-preview (65.43), 209 models
- EuroEval Bosnian (Average Score (%)): leader gpt-4.1-mini-2025-04-14 (63.93), 218 models
- EuroEval Bulgarian (Average Score (%)): leader gemini-3-pro-preview (74.47), 219 models
- EuroEval Catalan (Average Score (%)): leader gemini-2.5-flash#thinking (68.12), 219 models
- EuroEval Croatian (Average Score (%)): leader gemini-3-pro-preview (69.99), 218 models
- EuroEval Czech (Average Score (%)): leader gemini-2.5-pro (70.02), 236 models
- EuroEval Danish (Average Score (%)): leader gpt-5-2025-08-07#high (78.81), 454 models
- EuroEval Dutch (Average Score (%)): leader Llama-3.1-405B (78.43), 350 models
- EuroEval Estonian (Average Score (%)): leader gemini-2.5-pro (62.38), 258 models
- EuroEval Faroese (Average Score (%)): leader gemini-3-pro-preview (70.72), 391 models
- EuroEval Finnish (Average Score (%)): leader gpt-5-2025-08-07#high (72.92), 382 models
- EuroEval French (Average Score (%)): leader gemini-3-pro-preview (74.38), 383 models
- EuroEval German (Average Score (%)): leader Qwen3-235B-A22B-Thinking-2507-FP8 (68.41), 329 models
- EuroEval Greek (Average Score (%)): leader gpt-5-2025-08-07 (72.28), 209 models
- EuroEval Hungarian (Average Score (%)): leader gemini-2.5-pro (67.51), 208 models
- EuroEval Icelandic (Average Score (%)): leader gpt-5-2025-08-07 (70.59), 399 models
- EuroEval Italian (Average Score (%)): leader Qwen3-235B-A22B-Thinking-2507-FP8 (73.12), 435 models
- EuroEval Latvian (Average Score (%)): leader gpt-5-2025-08-07 (70.85), 238 models
- EuroEval Lithuanian (Average Score (%)): leader Qwen3-235B-A22B-Thinking-2507-FP8 (66.49), 235 models
- EuroEval Norwegian (Average Score (%)): leader gpt-5-2025-08-07 (76.81), 466 models
- EuroEval Polish (Average Score (%)): leader gpt-5-2025-08-07 (71.84), 241 models
- EuroEval Portuguese (Average Score (%)): leader Qwen3-235B-A22B-Thinking-2507-FP8 (73.86), 445 models
- EuroEval Romanian (Average Score (%)): leader gpt-5-2025-08-07 (72.03), 212 models
- EuroEval Serbian (Average Score (%)): leader gpt-5-2025-08-07 (72.24), 209 models
- EuroEval Slovak (Average Score (%)): leader gemini-3-pro-preview (68.36), 208 models
- EuroEval Slovene (Average Score (%)): leader claude-sonnet-4-5-20250929#thinking (67.68), 208 models
- EuroEval Spanish (Average Score (%)): leader Qwen3-235B-A22B-Thinking-2507-FP8 (68.78), 419 models
- EuroEval Swedish (Average Score (%)): leader gpt-5-2025-08-07#high (78.64), 410 models
- EuroEval Ukrainian (Average Score (%)): leader gpt-5-2025-08-07 (67.31), 205 models
NEW MODELS (3)
- Grok 4.3 — ELO 1826, #104/1285 (above: MiMo-V2-Omni-0327, below: Kimi K2.6 (Non-reasoning)); AA IFBench: 81.3 (#2/409)
- Mistral Medium 3.5 — ELO 1749, #189/1285 (above: DeepSeek V4 Flash (Non-reasoning), below: Claude Sonnet 4 (Thinking 16K))
- Hy3-preview (Non-reasoning) — ELO 1711, #233/1285 (above: Gemini 2.5 Pro, below: Qwen 3.6 27B)
NEW #1 LEADERS (5)
- Design Arena (Data Viz) (Elo): mimo-v2.5-pro (1375.0) beat claude-sonnet-4-6 (1346.0) by 29.0
- Vals AI CaseLaw v2 (Accuracy (%)): grok-4.3 (79.31) beat gpt-5.1-2025-11-13 (73.42) by 5.89
- Vals AI Terminal-Bench 2.0 (Accuracy (%)): gpt-5.5 (73.2) beat claude-opus-4-7 (68.54) by 4.66
- OpenClawProBench (Overall Score (%)): qwen3.5-397b-a17b (70.4) beat qwen3.5-plus (70.1) by 0.3
- Vals AI CorpFin v2 (Accuracy (%)): grok-4.3 (68.53) beat gpt-5.5 (68.42) by 0.11
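A note on reading the Elo margins above: a rating gap can be translated into an expected head-to-head win rate via the standard Elo formula (a logistic curve with a 400-point scale factor). The digest itself does not publish win rates, so this is a sketch under that standard-formula assumption; the function name `elo_win_prob` is ours, not from any of the leaderboards listed.

```python
def elo_win_prob(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo model:
    1 / (1 + 10^((Rb - Ra) / 400))."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# Design Arena (Data Viz) entry above: mimo-v2.5-pro at 1375.0
# vs. claude-sonnet-4-6 at 1346.0, a 29.0-point gap.
p = elo_win_prob(1375.0, 1346.0)
print(f"expected win rate: {p:.3f}")  # roughly 0.54: a slim edge
```

Under this model a 29-point lead implies only a ~54% expected win rate, so a small Elo margin signals near-parity rather than dominance.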