Mikhail Doroshenko

May 10, 2026

AI Benchmark Digest — 2026-05-10


=== DAILY ===

NEW BENCHMARKS (43)

- AA Global-MMLU-Lite - Arabic (Accuracy (%)): leader Gemini 3.1 Pro Preview (93.0), 119 models
- AA Global-MMLU-Lite - Bengali (Accuracy (%)): leader Gemini 3.1 Pro Preview (92.17), 119 models
- AA Global-MMLU-Lite - German (Accuracy (%)): leader Gemini 3.1 Pro Preview (94.75), 119 models
- AA Global-MMLU-Lite - English (Accuracy (%)): leader Claude Opus 4.6 (Adaptive Reasoning, Max Effort) (95.17), 120 models
- AA Global-MMLU-Lite - Spanish (Accuracy (%)): leader Gemini 3.1 Pro Preview (94.42), 118 models
- AA Global-MMLU-Lite - French (Accuracy (%)): leader Gemini 3.1 Pro Preview (93.67), 118 models
- AA Global-MMLU-Lite - Hindi (Accuracy (%)): leader Gemini 3.1 Pro Preview (93.67), 117 models
- AA Global-MMLU-Lite - Indonesian (Accuracy (%)): leader Gemini 3.1 Pro Preview (93.67), 118 models
- AA Global-MMLU-Lite - Italian (Accuracy (%)): leader Gemini 3.1 Pro Preview (94.58), 117 models
- AA Global-MMLU-Lite - Japanese (Accuracy (%)): leader Gemini 3.1 Pro Preview (93.67), 116 models
- AA Global-MMLU-Lite - Korean (Accuracy (%)): leader Claude Opus 4.6 (Adaptive Reasoning, Max Effort) (93.0), 116 models
- AA Global-MMLU-Lite - Burmese (Accuracy (%)): leader Gemini 3.1 Pro Preview (91.17), 111 models
- AA Global-MMLU-Lite - Portuguese (Accuracy (%)): leader Gemini 3.1 Pro Preview (94.25), 113 models
- AA Global-MMLU-Lite - Swahili (Accuracy (%)): leader Gemini 3.1 Pro Preview (92.33), 112 models
- AA Global-MMLU-Lite - Yoruba (Accuracy (%)): leader Gemini 3.1 Pro Preview (88.75), 112 models
- AA Global-MMLU-Lite - Chinese (Accuracy (%)): leader Claude Opus 4.6 (Adaptive Reasoning, Max Effort) (93.58), 113 models
- AA Omniscience - Business (Accuracy (%)): leader GPT-5.5 (xhigh) (49.1), 388 models
- AA Omniscience - Health (Accuracy (%)): leader GPT-5.5 (medium) (48.8), 388 models
- AA Omniscience - Humanities & Social Sciences (Accuracy (%)): leader Gemini 3 Pro Preview (high) (56.6), 388 models
- AA Omniscience - Law (Accuracy (%)): leader Gemini 3 Pro Preview (high) (64.3), 388 models
- AA Omniscience - Science, Engineering & Mathematics (Accuracy (%)): leader GPT-5.5 (high) (52.3), 388 models
- AA Omniscience - Software Engineering (SWE) (Accuracy (%)): leader GPT-5.5 (xhigh) (84.4), 388 models
- AA Omniscience - Software Engineering (SWE) - C (Accuracy (%)): leader GPT-5.5 (high) (92.0), 388 models
- AA Omniscience - Software Engineering (SWE) - Dart (Accuracy (%)): leader GPT-5.3 Codex (xhigh) (80.0), 388 models
- AA Omniscience - Software Engineering (SWE) - Go (Accuracy (%)): leader GPT-5.5 (high) (84.0), 388 models
- AA Omniscience - Software Engineering (SWE) - HTML (Accuracy (%)): leader GPT-5.5 (medium) (90.0), 388 models
- AA Omniscience - Software Engineering (SWE) - Java (Accuracy (%)): leader GPT-5.3 Codex (xhigh) (73.0), 388 models
- AA Omniscience - Software Engineering (SWE) - JavaScript (Accuracy (%)): leader GPT-5.3 Codex (xhigh) (90.91), 388 models
- AA Omniscience - Software Engineering (SWE) - Julia (Accuracy (%)): leader GPT-5.4 (low) (88.0), 388 models
- AA Omniscience - Software Engineering (SWE) - Kotlin (Accuracy (%)): leader GPT-5.3 Codex (xhigh) (90.0), 388 models
- AA Omniscience - Software Engineering (SWE) - PHP (Accuracy (%)): leader GPT-5.5 (medium) (92.0), 388 models
- AA Omniscience - Software Engineering (SWE) - Python (Accuracy (%)): leader GPT-5.5 (xhigh) (90.5), 388 models
- AA Omniscience - Software Engineering (SWE) - R (Accuracy (%)): leader GPT-5.5 (medium) (74.0), 388 models
- AA Omniscience - Software Engineering (SWE) - Rust (Accuracy (%)): leader GPT-5.5 (xhigh) (92.0), 388 models
- AA Omniscience - Software Engineering (SWE) - Swift (Accuracy (%)): leader GPT-5.5 (xhigh) (92.0), 388 models
- AA Omniscience - Software Engineering (SWE) - TypeScript (Accuracy (%)): leader GPT-5.5 (xhigh) (91.11), 388 models
- EuroEval Albanian NLU - MMS SQ (Sentiment classification Score (%)): leader gemini-3-flash-preview#no-thinking (32.13), 196 models. Task column for the MMS SQ dataset, measuring sentiment classification, from the public albanian_nlu.csv leaderboard.
- EuroEval Albanian NLU - WikiANN SQ (Named entity recognition Score (%)): leader multilingual-e5-large (86.6), 200 models. Task column for the WikiANN SQ dataset, measuring named entity recognition, from the public albanian_nlu.csv leaderboard.
- EuroEval Albanian NLU - ScaLA SQ (Linguistic acceptability Score (%)): leader gemini-3.1-pro-preview (78.55), 166 models. Task column for the ScaLA SQ dataset, measuring linguistic acceptability, from the public albanian_nlu.csv leaderboard.
- EuroEval Albanian NLU - MultiWikiQA SQ (Reading comprehension Score (%)): leader Qwen3.5-9B-Base (70.8), 200 models. Task column for the MultiWikiQA SQ dataset, measuring reading comprehension, from the public albanian_nlu.csv leaderboard.
- EuroEval Bosnian NLU - MMS BS (Sentiment classification Score (%)): leader gpt-4.1-mini-2025-04-14 (56.43), 208 models. Task column for the MMS BS dataset, measuring sentiment classification, from the public bosnian_nlu.csv leaderboard.
- EuroEval Bosnian NLU - WikiANN BS (Named entity recognition Score (%)): leader multilingual-e5-large (84.87), 212 models. Task column for the WikiANN BS dataset, measuring named entity recognition, from the public bosnian_nlu.csv leaderboard.
- EuroEval Bosnian NLU - Multi Wiki QA BS (Reading comprehension Score (%)): leader Olmo-3-1125-32B (78.64), 211 models. Task column for the Multi Wiki QA BS dataset, measuring reading comprehension, from the public bosnian_nlu.csv leaderboard.
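The EuroEval entries above are each a single task column of a public CSV leaderboard (e.g. albanian_nlu.csv), with the leader being the top score in that column. A minimal sketch of that lookup, assuming a layout with a "model" column plus one numeric column per task (the actual column names in the EuroEval files may differ):

```python
# Sketch: find the per-task leader in a EuroEval-style CSV leaderboard.
# The digest only names the file (albanian_nlu.csv); the "model" and
# task column names used here are assumptions for illustration.
import csv

def column_leader(path: str, score_column: str) -> tuple[str, float]:
    """Return (model, best score) for one task column, skipping blank cells."""
    best_model, best_score = "", float("-inf")
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            cell = (row.get(score_column) or "").strip()
            if not cell:
                continue  # model was not evaluated on this task
            score = float(cell)
            if score > best_score:
                best_model, best_score = row["model"], score
    return best_model, best_score
```

Skipping blank cells matters because the model counts differ per column (196, 200, 166, ...): not every model has been run on every task.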

=== WEEKLY ===

NEW BENCHMARKS (43): identical to the daily list above.

NEW MODELS (38)

- GLM-5V Turbo (Reasoning) — ELO 1738, #102/798 (above: MiMo-V2-Flash (Reasoning), below: DeepSeek V3.2 Speciale). AA TAU-2 Bench: 98.5 (#3/402); AA GDPval: 1330.87 (#43/360); AA MMMU-Pro: 72.8 (#44/188); AA SciCode: 43.5 (#52/477); Artificial Analysis Intelligence Index: 42.85 (#56/482); AA Terminal-Bench Hard: 32.6 (#79/397); AA Omniscience: -18.98 (#80/388); AA Long Context Reasoning: 61.0 (#84/411); AA Humanity's Last Exam: 15.8 (#91/479); AA GPQA Diamond: 80.9 (#96/483)
- ERNIE 5.0 Thinking Preview — ELO 1631, #214/798 (above: iFlytek-Spark-X1.5-0106-NoThink, below: Qwen 3.6 27B). AA LiveCodeBench: 81.2 (#24/343); AA Global-MMLU-Lite: 86.5 (#33/120); AA AIME 2025: 85.0 (#46/269); AA MMLU-Pro: 83.0 (#60/345); AA CritPt: 1.4 (#68/388); AA MMMU-Pro: 64.6 (#90/188); AA TAU-2 Bench: 83.9 (#94/402); AA Humanity's Last Exam: 12.7 (#116/479); AA Terminal-Bench Hard: 25.0 (#119/397); AA GPQA Diamond: 77.7 (#124/483)
- K-EXAONE (Reasoning) — ELO 1603, #245/798 (above: GPT-5 (Minimal), below: Gemini 2.5 Flash (Thinking)). AA AIME 2025: 90.3 (#25/269); AA LiveCodeBench: 76.8 (#41/343); AA MMLU-Pro: 83.8 (#44/345); AA CritPt: 1.1 (#76/388); AA Global-MMLU-Lite: 78.86 (#80/120); AA IFBench: 64.7 (#85/411); AA Humanity's Last Exam: 13.1 (#111/479); AA Long Context Reasoning: 55.7 (#117/411); AA GPQA Diamond: 78.3 (#119/483); AA TAU-2 Bench: 74.3 (#121/402)
- EXAONE 4.5 33B — ELO 1578, #277/798 (above: GPT-4.5, below: Gemini 2.5 Flash). AA MMMU-Pro: 67.3 (#77/188); AA GPQA Diamond: 79.4 (#106/483); AA IFBench: 58.0 (#107/411); AA TAU-2 Bench: 78.1 (#112/402); AA CritPt: 0.3 (#128/388); AA Humanity's Last Exam: 11.6 (#131/479); AA Terminal-Bench Hard: 20.5 (#144/397); Artificial Analysis Intelligence Index: 30.23 (#147/482); AA Long Context Reasoning: 49.3 (#150/411); AA GDPval: 812.72 (#163/360)
- K2-V2 (High) — ELO 1562, #294/798 (above: GLM-4.6V (Reasoning), below: Claude 3.7 Sonnet). AA AIME 2025: 78.3 (#71/269); AA LiveCodeBench: 69.4 (#76/343); AA Global-MMLU-Lite: 78.6 (#82/120); AA IFBench: 60.1 (#102/411); AA MMLU-Pro: 78.6 (#135/345); AA Humanity's Last Exam: 9.8 (#157/479); AA Long Context Reasoning: 33.3 (#211/411); AA Terminal-Bench Hard: 9.8 (#212/397); AA GPQA Diamond: 68.1 (#222/483); Artificial Analysis Intelligence Index: 20.61 (#232/482)
- Solar Open 100B (Reasoning) — ELO 1555, #307/798 (above: Qwen 3 VL 32B (Thinking), below: Gemma 4 26B A4B (Reasoning)). AA Global-MMLU-Lite: 81.58 (#61/120); AA IFBench: 57.7 (#110/411); AA Humanity's Last Exam: 9.2 (#170/479); AA TAU-2 Bench: 48.2 (#180/402); AA Long Context Reasoning: 36.0 (#195/411); AA CritPt: 0.0 (#204/388); AA GDPval: 666.33 (#207/360); Artificial Analysis Intelligence Index: 21.67 (#224/482); AA GPQA Diamond: 65.7 (#243/483); AA Omniscience: -54.1 (#262/388)
- JT-MINI — ELO 1546, #324/798 (above: Llama-Poro-2-70B-SFT, below: O3 Mini (Low)). AA TAU-2 Bench: 93.0 (#40/402); AA Terminal-Bench Hard: 18.2 (#154/397); AA GDPval: 831.97 (#157/360); Artificial Analysis Intelligence Index: 25.37 (#187/482); AA Humanity's Last Exam: 6.6 (#223/479); AA GPQA Diamond: 67.6 (#225/483); AA CritPt: 0.0 (#263/388); AA IFBench: 36.7 (#277/411); AA SciCode: 27.2 (#292/477); AA Long Context Reasoning: 11.7 (#308/411)
- K2 Think V2 — ELO 1545, #328/798 (above: Hermes 4 - Llama-3.1 405B (Non-reasoning), below: DeepSeek V3). AA IFBench: 62.8 (#94/411); AA Omniscience: -33.92 (#125/388); AA Long Context Reasoning: 52.7 (#135/411); AA Humanity's Last Exam: 9.5 (#165/479); AA GPQA Diamond: 71.3 (#192/483); Artificial Analysis Intelligence Index: 24.12 (#201/482); AA GDPval: 607.98 (#222/360); AA SciCode: 33.0 (#223/477); AA Terminal-Bench Hard: 6.8 (#240/397); AA CritPt: 0.0 (#252/388)
- HyperCLOVA X SEED Think (32B) — ELO 1537, #342/798 (above: Nova Premier, below: Claude 3.5 Sonnet). AA TAU-2 Bench: 87.4 (#68/402); AA Global-MMLU-Lite: 78.6 (#83/120); AA LiveCodeBench: 62.9 (#107/343); AA AIME 2025: 59.0 (#118/269); AA MMLU-Pro: 78.5 (#137/345); AA Terminal-Bench Hard: 12.1 (#194/397); AA GDPval: 678.83 (#199/360); Artificial Analysis Intelligence Index: 23.72 (#204/482); AA Omniscience: -52.87 (#255/388); AA CritPt: 0.0 (#257/388)
- Mi:dm K 2.5 Pro — ELO 1527, #352/798 (above: phi4, below: Llama 3.1 Nemotron Ultra 253B v1 (Reasoning)). AA TAU-2 Bench: 86.5 (#75/402); AA AIME 2025: 76.7 (#77/269); AA LiveCodeBench: 65.6 (#92/343); AA Global-MMLU-Lite: 74.23 (#94/120); AA MMLU-Pro: 80.9 (#97/345); AA IFBench: 49.3 (#155/411); AA Humanity's Last Exam: 7.7 (#195/479); AA GPQA Diamond: 70.1 (#200/483); Artificial Analysis Intelligence Index: 23.06 (#213/482); AA GDPval: 643.11 (#213/360)
- Motif-2-12.7B (Reasoning) — ELO 1520, #366/798 (above: Grok 4.1 Fast, below: Llama 3.1 405B). AA AIME 2025: 80.3 (#65/269); AA LiveCodeBench: 65.1 (#97/343); AA IFBench: 57.0 (#113/411); AA MMLU-Pro: 79.6 (#122/345); AA Humanity's Last Exam: 8.2 (#183/479); AA TAU-2 Bench: 46.5 (#185/402); AA GPQA Diamond: 69.5 (#210/483); Artificial Analysis Intelligence Index: 19.08 (#244/482); AA CritPt: 0.0 (#250/388); AA GDPval: 485.33 (#255/360)
- Mi:dm K 2.5 Pro Preview — ELO 1517, #371/798 (above: Qwen 3 4B 2507 (Thinking), below: Llama 3.1 70B). AA Global-MMLU-Lite: 81.43 (#63/120); AA AIME 2025: 78.7 (#70/269); AA MMLU-Pro: 81.3 (#92/345); AA LiveCodeBench: 57.6 (#125/343); AA Humanity's Last Exam: 8.8 (#175/479); AA TAU-2 Bench: 49.4 (#177/402); AA IFBench: 45.6 (#180/411); AA GPQA Diamond: 72.2 (#185/483); AA SciCode: 29.7 (#251/477); AA CritPt: 0.0 (#255/388)
- K2-V2 (Medium) — ELO 1512, #382/798 (above: Qwen 3 VL 32B, below: Nemotron 3 Nano Omni 30B-A3B (Reasoning)). AA Global-MMLU-Lite: 76.7 (#87/120); AA AIME 2025: 64.7 (#107/269); AA IFBench: 55.1 (#122/411); AA LiveCodeBench: 54.1 (#137/343); AA MMLU-Pro: 76.1 (#165/345); AA Terminal-Bench Hard: 8.3 (#220/397); AA Omniscience: -49.97 (#222/388); AA GDPval: 578.73 (#227/360); AA Long Context Reasoning: 28.0 (#232/411); AA CritPt: 0.0 (#251/388)
- Granite 4.1 30B — ELO 1491, #425/798 (above: Gemini 2.0 Flash Lite, below: Qwen 2.5 14B). AA IFBench: 44.4 (#191/411); AA TAU-2 Bench: 42.1 (#198/402); AA CritPt: 0.0 (#228/388); AA GDPval: 495.5 (#253/360); AA Long Context Reasoning: 18.7 (#273/411); AA Terminal-Bench Hard: 2.3 (#310/397); AA SciCode: 25.8 (#315/477); Artificial Analysis Intelligence Index: 14.69 (#324/482); AA Omniscience: -67.78 (#342/388); AA GPQA Diamond: 48.1 (#354/483)
- K-EXAONE (Non-reasoning) — ELO 1487, #432/798 (above: Mistral Small 3.2, below: GPT-5 Mini (Minimal)). AA MMLU-Pro: 81.0 (#94/345); AA Global-MMLU-Lite: 71.03 (#104/120); AA AIME 2025: 44.0 (#150/269); AA Long Context Reasoning: 47.0 (#157/411); AA TAU-2 Bench: 59.1 (#162/402); AA GDPval: 767.0 (#174/360); Artificial Analysis Intelligence Index: 23.41 (#207/482); AA GPQA Diamond: 69.5 (#209/483); AA Terminal-Bench Hard: 6.8 (#239/397); AA CritPt: 0.0 (#242/388)
- K2-V2 (Low) — ELO 1483, #444/798 (above: Devstral Small 2, below: Ministral 3 8B). AA Global-MMLU-Lite: 71.44 (#103/120); AA AIME 2025: 35.3 (#173/269); AA LiveCodeBench: 39.3 (#187/343); AA MMLU-Pro: 71.3 (#212/345); AA Omniscience: -48.07 (#212/388); AA IFBench: 41.0 (#233/411); AA CritPt: 0.0 (#254/388); AA Long Context Reasoning: 19.0 (#271/411); AA Terminal-Bench Hard: 4.5 (#277/397); AA GDPval: 367.48 (#285/360)
- Solar Pro 2 (Reasoning) — ELO 1479, #450/798 (above: GLM-4.5V, below: ERNIE 4.5 300B A47B). AA MATH-500: 96.7 (#30/193); AA Global-MMLU-Lite: 79.61 (#78/120); AA MMLU-Pro: 80.5 (#107/345); AA LiveCodeBench: 61.6 (#113/343); AA AIME 2025: 61.3 (#115/269); AA CritPt: 0.0 (#206/388); AA Humanity's Last Exam: 7.0 (#213/479); AA GPQA Diamond: 68.7 (#215/483); AA SciCode: 30.2 (#246/477); AA TAU-2 Bench: 28.1 (#251/402)
- Gemma 4 E4B (Reasoning) — ELO 1474, #458/798 (above: Yi 1.5 34B, below: Hunyuan A13B-Instruct). AA Omniscience: -20.05 (#82/388); AA CritPt: 0.6 (#104/388); AA MMMU-Pro: 51.4 (#143/188); AA IFBench: 44.2 (#193/411); AA Terminal-Bench Hard: 8.3 (#218/397); AA Long Context Reasoning: 30.7 (#222/411); Artificial Analysis Intelligence Index: 18.76 (#250/482); AA GPQA Diamond: 57.6 (#297/483); AA GDPval: 304.3 (#312/360); AA TAU-2 Bench: 20.8 (#314/402)
- EXAONE 4.0 32B (Reasoning) — ELO 1473, #461/798 (above: Mistral Small 3, below: Tri-21B-Think Preview). AA MATH-500: 97.7 (#21/193); AA LiveCodeBench: 74.7 (#48/343); AA AIME 2025: 80.0 (#68/269); AA MMLU-Pro: 81.8 (#82/345); AA Global-MMLU-Lite: 73.46 (#97/120); AA Humanity's Last Exam: 10.5 (#145/479); AA GPQA Diamond: 73.9 (#167/483); AA SciCode: 34.4 (#203/477); AA CritPt: 0.0 (#240/388); AA GDPval: 499.86 (#249/360)
- Tri-21B-Think Preview — ELO 1473, #462/798 (above: EXAONE 4.0 32B (Reasoning), below: GPT-4o Mini). AA TAU-2 Bench: 93.3 (#38/402); AA IFBench: 47.1 (#169/411); Artificial Analysis Intelligence Index: 19.99 (#236/482); AA Humanity's Last Exam: 5.7 (#257/479); AA CritPt: 0.0 (#259/388); AA Omniscience: -55.28 (#267/388); AA Long Context Reasoning: 14.7 (#294/411); AA GDPval: 337.02 (#299/360); AA Terminal-Bench Hard: 2.3 (#315/397); AA GPQA Diamond: 53.8 (#320/483)
- Tri-21B-Think — ELO 1468, #468/798 (above: Ling-flash-2.0, below: Hermes 3 - Llama-3.1 70B). AA TAU-2 Bench: 81.0 (#103/402); AA IFBench: 54.6 (#124/411); AA CritPt: 0.3 (#132/388); AA Humanity's Last Exam: 6.1 (#241/479); Artificial Analysis Intelligence Index: 18.62 (#258/482); AA GPQA Diamond: 60.1 (#279/483); AA GDPval: 374.11 (#282/360); AA Long Context Reasoning: 11.0 (#312/411); AA Omniscience: -63.3 (#321/388); AA Terminal-Bench Hard: 0.8 (#342/397)
- GPT-4o (March 2025, chatgpt-4o-latest) — ELO 1449, #500/798 (above: Phi-3-small-8k-instruct, below: DeepSeek R1 0528 Qwen3 8B). AA MATH-500: 89.3 (#73/193); AA MMLU-Pro: 80.3 (#110/345); AA SciCode: 36.6 (#165/477); AA LiveCodeBench: 42.5 (#170/343); AA AIME 2025: 25.7 (#196/269); AA GPQA Diamond: 65.5 (#247/483); Artificial Analysis Intelligence Index: 18.56 (#260/482); AA Humanity's Last Exam: 5.0 (#305/479)
- Llama 3.3 Nemotron Super 49B v1 (Reasoning) — ELO 1448, #502/798 (above: DeepSeek R1 0528 Qwen3 8B, below: Qwen 2.5 Coder 14B). AA MATH-500: 95.9 (#36/193); AA AIME 2025: 54.7 (#132/269); AA MMLU-Pro: 78.5 (#136/345); AA CritPt: 0.0 (#215/388); AA Humanity's Last Exam: 6.5 (#227/479); AA LiveCodeBench: 27.7 (#238/343); AA GPQA Diamond: 64.3 (#251/483); Artificial Analysis Intelligence Index: 18.49 (#262/482); AA TAU-2 Bench: 26.9 (#262/402); AA IFBench: 38.1 (#262/411)
- Solar Pro 2 (Non-reasoning) — ELO 1435, #524/798 (above: Mistral Small, below: NVIDIA Nemotron Nano 9B V2). AA MATH-500: 88.9 (#76/193); AA Global-MMLU-Lite: 75.34 (#91/120); AA LiveCodeBench: 42.4 (#172/343); AA MMLU-Pro: 75.0 (#178/345); AA AIME 2025: 30.0 (#186/269); AA CritPt: 0.0 (#203/388); AA TAU-2 Bench: 31.9 (#230/402); AA GDPval: 447.04 (#265/360); AA Terminal-Bench Hard: 4.5 (#273/397); AA IFBench: 33.7 (#306/411)
- Llama 3.3 Nemotron Super 49B v1 (Non-reasoning) — ELO 1408, #560/798 (above: mistral-nemo-minitron-8B-instruct, below: OLMo 3 7B (Thinking)). AA MATH-500: 77.5 (#113/193); AA CritPt: 0.0 (#216/388); AA Omniscience: -49.68 (#219/388); AA MMLU-Pro: 69.8 (#221/345); AA LiveCodeBench: 28.0 (#235/343); AA AIME 2025: 7.7 (#237/269); AA IFBench: 39.5 (#247/411); AA Long Context Reasoning: 11.3 (#309/411); AA GPQA Diamond: 51.7 (#330/483); Artificial Analysis Intelligence Index: 14.35 (#336/482)
- NVIDIA Nemotron 3 Nano 4B — ELO 1388, #586/798 (above: DeepSeek R1 Distill Qwen 14B, below: Yi 34B (Chat)). AA IFBench: 58.2 (#106/411); AA CritPt: 0.0 (#211/388); AA Terminal-Bench Hard: 6.8 (#238/397); AA TAU-2 Bench: 28.1 (#252/402); AA GDPval: 476.83 (#258/360); AA Long Context Reasoning: 16.7 (#286/411); AA Humanity's Last Exam: 4.8 (#323/479); Artificial Analysis Intelligence Index: 14.68 (#325/482); AA GPQA Diamond: 51.3 (#338/483); AA Omniscience: -71.53 (#351/388)
- Granite 4.1 3B — ELO 1380, #595/798 (above: granite-20B-code-base, below: DeepSeek V2.5). AA CritPt: 0.0 (#232/388); AA GDPval: 366.32 (#286/360); AA IFBench: 33.7 (#307/411); AA Terminal-Bench Hard: 2.3 (#312/397); AA TAU-2 Bench: 19.6 (#323/402); AA Long Context Reasoning: 3.0 (#341/411); AA Omniscience: -77.38 (#370/388); AA SciCode: 11.9 (#412/477); Artificial Analysis Intelligence Index: 8.54 (#435/482); AA GPQA Diamond: 31.4 (#441/483)
- Gemma 4 E2B (Reasoning) — ELO 1376, #604/798 (above: normistral-11B-warm, below: Nova Micro). AA Omniscience: -23.98 (#94/388); AA MMMU-Pro: 44.6 (#160/188); AA CritPt: 0.0 (#170/388); AA IFBench: 38.0 (#265/411); AA Long Context Reasoning: 15.0 (#292/411); AA Terminal-Bench Hard: 3.0 (#299/397); Artificial Analysis Intelligence Index: 15.21 (#309/482); AA TAU-2 Bench: 20.8 (#315/402); AA Humanity's Last Exam: 4.8 (#322/479); AA GDPval: 272.59 (#338/360)
- Llama 3.1 Nemotron Nano 4B v1.1 (Reasoning) — ELO 1351, #630/798 (above: falcon-180B, below: SOLAR-10.7B-Instruct-v1.0). AA MATH-500: 94.7 (#41/193); AA AIME 2025: 50.0 (#140/269); AA LiveCodeBench: 49.3 (#153/343); AA MMLU-Pro: 55.6 (#283/345); AA Humanity's Last Exam: 5.1 (#289/479); Artificial Analysis Intelligence Index: 14.43 (#334/482); AA Long Context Reasoning: 0.0 (#358/411); AA TAU-2 Bench: 11.7 (#362/402); AA IFBench: 25.5 (#375/411); AA GPQA Diamond: 40.8 (#393/483)
- Ling-mini-2.0 — ELO 1346, #635/798 (above: Tiny Aya Global, below: Qwen 3.5 2B (Reasoning)). AA AIME 2025: 49.3 (#142/269); AA LiveCodeBench: 42.9 (#169/343); AA MMLU-Pro: 67.1 (#243/345); AA CritPt: 0.0 (#284/388); AA Humanity's Last Exam: 5.0 (#304/479); AA GPQA Diamond: 56.2 (#306/483); AA Long Context Reasoning: 6.7 (#329/411); AA GDPval: 264.15 (#341/360); AA Terminal-Bench Hard: 0.8 (#345/397); AA TAU-2 Bench: 13.2 (#356/402)
- Jamba Reasoning 3B — ELO 1320, #657/798 (above: granite-3.0-3B-a800m-instruct, below: LLaMA-33B). AA IFBench: 52.4 (#137/411); AA AIME 2025: 10.7 (#231/269); AA LiveCodeBench: 21.0 (#267/343); AA CritPt: 0.0 (#268/388); AA MMLU-Pro: 57.7 (#274/345); AA Long Context Reasoning: 7.0 (#323/411); AA TAU-2 Bench: 15.8 (#342/402); AA Terminal-Bench Hard: 0.8 (#344/397); AA GDPval: 257.67 (#345/360); AA Humanity's Last Exam: 4.6 (#347/479)
- Exaone 4.0 1.2B (Reasoning) — ELO 1266, #696/798 (above: Gemma 4 E4B, below: Exaone 4.0 1.2B (Non-reasoning)). AA AIME 2025: 50.3 (#139/269); AA LiveCodeBench: 51.6 (#143/343); AA CritPt: 0.0 (#241/388); AA Humanity's Last Exam: 5.8 (#251/479); AA MMLU-Pro: 58.8 (#268/345); AA GDPval: 296.88 (#317/360); AA GPQA Diamond: 51.5 (#336/483); AA TAU-2 Bench: 16.4 (#338/402); AA Long Context Reasoning: 0.0 (#370/411); AA Terminal-Bench Hard: 0.0 (#377/397)
- Exaone 4.0 1.2B (Non-reasoning) — ELO 1262, #697/798 (above: Exaone 4.0 1.2B (Reasoning), below: comma-v0.1-1t). AA AIME 2025: 24.0 (#200/269); AA LiveCodeBench: 29.3 (#226/343); AA CritPt: 0.0 (#239/388); AA Humanity's Last Exam: 5.8 (#250/479); AA MMLU-Pro: 50.0 (#294/345); AA GDPval: 298.76 (#316/360); AA TAU-2 Bench: 20.5 (#318/402); AA Long Context Reasoning: 0.0 (#369/411); AA Terminal-Bench Hard: 0.0 (#376/397); AA IFBench: 25.3 (#376/411)
- Granite 4.0 1B — ELO 1258, #701/798 (above: PaLM 62B, below: Gemma 1.1 7B). AA CritPt: 0.0 (#234/388); AA AIME 2025: 6.3 (#244/269); AA Humanity's Last Exam: 5.1 (#292/479); AA TAU-2 Bench: 22.8 (#294/402); AA MMLU-Pro: 32.5 (#331/345); AA LiveCodeBench: 4.7 (#333/343); AA Long Context Reasoning: 4.0 (#340/411); AA GDPval: 259.61 (#342/360); AA Terminal-Bench Hard: 0.0 (#373/397); AA Omniscience: -81.82 (#377/388)
- Granite 4.0 H 350M — ELO 1137, #759/798 (above: starcoder2-3B, below: Phi-1.5). AA CritPt: 0.0 (#227/388); AA Humanity's Last Exam: 6.4 (#228/479); AA AIME 2025: 1.3 (#262/269); AA GDPval: 294.09 (#319/360); AA LiveCodeBench: 1.9 (#339/343); AA MMLU-Pro: 12.7 (#343/345); AA TAU-2 Bench: 14.6 (#349/402); AA Long Context Reasoning: 0.0 (#366/411); AA Terminal-Bench Hard: 0.0 (#369/397); AA Omniscience: -87.25 (#387/388)
- OLMo 2 32B — ELO 1037, #780/798 (above: Dolly V2 12B, below: Phi-3 Mini Instruct 3.8B). AA AIME 2025: 3.3 (#256/269); AA IFBench: 38.1 (#264/411); AA MMLU-Pro: 51.1 (#292/345); AA LiveCodeBench: 6.8 (#328/343); AA Terminal-Bench Hard: 0.0 (#391/397); AA Long Context Reasoning: 0.0 (#393/411); Artificial Analysis Intelligence Index: 10.57 (#397/482); AA TAU-2 Bench: 0.0 (#401/402); AA GPQA Diamond: 32.8 (#429/483); AA SciCode: 8.0 (#437/477)
- Phi-3 Mini Instruct 3.8B — ELO 1025, #781/798 (above: OLMo 2 32B, below: DiscoLM-70B). AA MATH-500: 45.7 (#172/193); AA AIME 2025: 0.3 (#265/269); AA MMLU-Pro: 43.5 (#308/345); AA LiveCodeBench: 11.6 (#308/343); AA Long Context Reasoning: 2.0 (#345/411); AA Humanity's Last Exam: 4.4 (#372/479); AA IFBench: 23.9 (#382/411); AA Terminal-Bench Hard: 0.0 (#388/397); AA TAU-2 Bench: 0.0 (#398/402); Artificial Analysis Intelligence Index: 10.1 (#407/482)
- OLMo 2 7B — ELO 958, #787/798 (above: hplt2c_nld_checkpoints, below: Qwen2.5-Math-1.5B). AA AIME 2025: 0.7 (#263/269); AA Humanity's Last Exam: 5.5 (#265/479); AA MMLU-Pro: 28.2 (#334/345); AA LiveCodeBench: 4.1 (#335/343); AA IFBench: 24.4 (#381/411); AA Terminal-Bench Hard: 0.0 (#390/397); AA Long Context Reasoning: 0.0 (#391/411); AA TAU-2 Bench: 0.0 (#399/402); Artificial Analysis Intelligence Index: 9.3 (#423/482); AA GPQA Diamond: 28.8 (#455/483)

NEW SCORES FROM TOP-10 MODELS (7)

- Claude Mythos Preview on METR Benchmark: 17.41 50% Time Horizon (hours) (#1/26)
- GPT-5.4 (xHigh) on OpenClawProBench: 68.0 Overall Score (%) (#8/59)
- GPT-5.5 (xHigh) on OpenClawProBench: 69.3 Overall Score (%) (#4/59)
- GPT-5.5 (xHigh) on Wolfram LLM Benchmarking Project: 68.8 Correct Functionality (%) (#6/451)
- GPT-5.5 Pro on Epoch AI - ECI: 159.5 ECI Score (#3/365)
- GPT-5.5 Pro on PinchBench: 18.11 Success Rate (%) (#39/41)
- GPT-5.5 Pro on VoxelBench: 2107.0 Rating (#1/37)

NEW #1 LEADERS (14)

- FoodTruckBench (Net Worth ($)): GPT-5.5 (61408.0) beat Claude Opus 4.6 (49519.0) by 11889.0
- LIBRA - ruBABILongQA2 (Dataset Total Score (%)): Qwen_Qwen3-30B-A3B-Instruct-2507 (64.72) beat GPT-4o (36.67) by 28.05
- LIBRA - ruQuALITY (Dataset Total Score (%)): 01-ai_Yi-9B-200K (95.9) beat GPT-4o (83.33) by 12.57
- SEAL - AudioMultiChallenge - Audio Output (Score): gpt-realtime-2 (xHigh) (48.45) beat gemini-3.1-flash-live-preview (Thinking) (36.06) by 12.39
- FrontierSWE (Dominance (%)): GPT-5.5 (83.0) beat Claude Opus 4.7 (74.0) by 9.0
- FrontierMath - Tier 4 (Accuracy (%, 48 problems)): AI co-mathematician (47.9) beat GPT-5.5 Pro (xhigh) (39.6) by 8.3
- Story Theory Bench (Score (%)): glm-5 (99.6) beat deepseek-v3.2 (92.2) by 7.4
- Kaggle FACTS Parametric (Score (%)): Gemini 3.1 Pro Preview (78.96) beat Gemini 3 Flash Preview (72.26) by 6.7
- SEAL - SWE Atlas - Codebase QnA (Score): GPT 5.5 (Codex) (45.43) beat Gpt 5.4 xHigh (Codex) (40.8) by 4.63
- LIBRA - ruSciAbstractRetrieval (Dataset Total Score (%)): Qwen_Qwen3-30B-A3B-Instruct-2507 (81.5) beat GLM-4 9B Chat (77.81) by 3.69
- Kaggle FACTS (Google) (Avg Score (%)): GPT-5.5 (71.19) beat Gemini 3.1 Pro Preview (67.71) by 3.48
- LIBRA - ruBABILongQA1 (Dataset Total Score (%)): Qwen_Qwen3-30B-A3B-Instruct-2507 (80.5) beat GPT-4o (78.33) by 2.17
- Android Bench (Score (%)): GPT 5.5 (74.0) beat GPT-5.4 (72.4) by 1.6
- ForecastBench (Overall Score (higher is better)): green tree (68.2) beat Cassi ensemble_2_crowdadj (67.8) by 0.4
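Each "beat X by Y" entry above is just the gap between the new #1 and the previous leader, sorted by margin. A minimal sketch of that computation (a hypothetical helper for illustration, not the digest's actual pipeline):

```python
# Sketch: compute the leader, runner-up, and margin for one benchmark,
# as in the "beat X by Y" lines of the NEW #1 LEADERS section.
def leader_margin(scores: dict[str, float]) -> tuple[str, str, float]:
    """Return (leader, runner_up, margin) given model -> score for one benchmark."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (leader, top), (runner_up, second) = ranked[0], ranked[1]
    return leader, runner_up, round(top - second, 2)

# With the FoodTruckBench numbers from the digest:
# leader_margin({"GPT-5.5": 61408.0, "Claude Opus 4.6": 49519.0})
# -> ("GPT-5.5", "Claude Opus 4.6", 11889.0)
```

Note this assumes "higher is better" for every metric, which holds for all 14 entries listed here (ForecastBench says so explicitly).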


View on AI Benchmark Hub
