AI Benchmark Digest — 2026-04-29
=== DAILY ===
NEW BENCHMARKS (29)
- OpenVLM MME (Overall Score): leader InternVL3-78B (2538.6), 235 models
- OpenVLM ScienceQA Test (Accuracy (%)): leader InternVL2.5-78B-MPO (99.5), 218 models
- OpenVLM POPE (Overall (%)): leader InternVL2.5-26B-MPO (90.5), 216 models
- OpenVLM SEED-Bench 2 Plus (Accuracy (%)): leader Qwen2.5-VL-72B (73.8), 211 models
- OpenVLM COCO Captions (CIDEr): leader Emu2_chat (109.2), 211 models
- OpenVLM MMT-Bench (Accuracy (%)): leader InternVL3-78B (72.6), 207 models
- OpenVLM A-Bench (Accuracy (%)): leader Qwen2.5-VL-72B (81.0), 160 models
- OpenVLM MTVQA (Accuracy (%)): leader GPT-4.1-mini-20250414 (36.8), 157 models
- OpenVLM OCR-VQA (Accuracy (%)): leader Kimi-VL-A3B-Instruct (82.0), 118 models
- OpenVLM SEED-Bench 2 (Accuracy (%)): leader GPT-4.1-20250414 (76.0), 59 models
- OpenVLM VCR (Overall Jaccard (%)): leader Qwen2-VL-7B (75.6), 48 models
- CLEM Clemscore (Clemscore (%)): leader claude-sonnet-4-5-azure-high (90.1), 31 models
- CLEM AdventureGame (Game Clemscore (%)): leader gpt-5.2-azure-high (99.17), 31 models
- CLEM Clean Up (Game Clemscore (%)): leader gpt-5.2-azure-high (100.0), 31 models
- CLEM Codenames (Game Clemscore (%)): leader gpt-5.2-azure-high (87.69), 31 models
- CLEM Deal or No Deal (Game Clemscore (%)): leader gpt-5.2-azure-high (99.12), 31 models
- CLEM GuessWhat (Game Clemscore (%)): leader gpt-5.2-azure-high (93.33), 31 models
- CLEM Hot Air Balloon (Game Clemscore (%)): leader claude-sonnet-4-5-azure-high (95.53), 31 models
- CLEM ImageGame (Game Clemscore (%)): leader gpt-5.2-2025-12-11 (99.92), 31 models
- CLEM MatchIt ASCII (Game Clemscore (%)): leader claude-sonnet-4-5-20250929 (100.0), 31 models
- CLEM PrivateShared (Game Clemscore (%)): leader claude-sonnet-4-5-20250929 (98.7), 31 models
- CLEM ReferenceGame (Game Clemscore (%)): leader claude-sonnet-4-5-20250929 (100.0), 31 models
- CLEM Taboo (Game Clemscore (%)): leader claude-sonnet-4-5-azure-low (98.33), 31 models
- CLEM TextMapWorld (Game Clemscore (%)): leader gemini-3-flash (91.35), 31 models
- CLEM TextMapWorld GraphReasoning (Game Clemscore (%)): leader claude-sonnet-4-5-azure-low (86.34), 31 models
- CLEM TextMapWorld SpecificRoom (Game Clemscore (%)): leader Llama-3.1-70B-Instruct (100.0), 31 models
- CLEM Wordle (Game Clemscore (%)): leader kimi-k2-thinking (73.0), 31 models
- CLEM Wordle with Clue (Game Clemscore (%)): leader claude-sonnet-4-5-azure-high (82.5), 31 models
- CLEM Wordle with Critic (Game Clemscore (%)): leader claude-sonnet-4-5-azure-high (86.11), 31 models
NEW MODELS (1)
- JSL-MedMNX-7B-SFT: ELO 1309, #841/1055 (above: Llama-3-Orca-1.0-8B, below: Collaiborator-MEDLLM-Llama-3-8B-v2-4)
NEW #1 LEADERS (1)
- Epoch AI - ECI (ECI Score): GPT-5.5 Pro (xhigh) (158.67) beat GPT-5.4 Pro (xhigh) (158.29) by 0.38