AI Benchmark Digest — 2026-04-10
=== DAILY ===

NEW BENCHMARKS (9)
- WebApp1K (Pass@1 (%)): leader o3-mini (96.1), 34 models
- WebApp1K Duo (Pass@1 (%)): leader gpt-5 (79.38), 49 models
- RepairBench (Plausible@1 (%)): leader o4-mini-2025-04-16-high (50.3), 39 models
- AA APEX-Agents (Pass@1 (%)): leader GPT-5.4 (xhigh) (33.3), 17 models
- Context-Bench Filesystem (Rubric Score (%)): leader gpt-5.2-codex-xhigh (93.0), 15 models
- Context-Bench Skills (Task Completion (%)): leader gpt-5.2-2025-12-11 (xhigh) (85.31), 22 models
- Translation (Lechmazur) (Mean Score): leader GPT-5 (medium reasoning) (8.69), 8 models
- Deception Effectiveness (Lechmazur) (Deception Score): leader Claude 3.5 Sonnet (1.099), 18 models
- Deception Resistance (Lechmazur) (Vulnerability Score, lower is better): leader Claude 3 Opus (0.277), 18 models
NEW MODELS (52)
- medllama3-v20 — ELO 1719, #138/949 (above: O1 Pro, below: Nova 2.0 Lite (Medium))
  - Open Medical LLM: 90.01 (#1/181)
  - Open Medical LLM - MedMCQA: 75.4 (#1/181)
  - Open Medical LLM - MedQA (USMLE): 81.07 (#1/181)
  - Open Medical LLM - MMLU Anatomy: 91.85 (#1/181)
  - Open Medical LLM - MMLU Clinical Knowledge: 95.85 (#1/181)
  - Open Medical LLM - MMLU College Biology: 98.61 (#1/181)
  - Open Medical LLM - MMLU College Medicine: 94.8 (#1/181)
  - Open Medical LLM - MMLU Medical Genetics: 98.0 (#1/181)
  - Open Medical LLM - MMLU Professional Medicine: 98.9 (#1/181)
- OpenBioLLMLlama-70B — ELO 1589, #285/949 (above: DeepSeek V3, below: O3 Mini (Low))
  - Open Medical LLM: 86.06 (#2/181)
  - Open Medical LLM - MedMCQA: 74.01 (#2/181)
  - Open Medical LLM - MedQA (USMLE): 78.16 (#2/181)
  - Open Medical LLM - MMLU Anatomy: 83.9 (#2/181)
  - Open Medical LLM - MMLU Clinical Knowledge: 92.93 (#2/181)
  - Open Medical LLM - MMLU College Biology: 93.83 (#2/181)
  - Open Medical LLM - MMLU College Medicine: 85.75 (#2/181)
  - Open Medical LLM - MMLU Medical Genetics: 93.2 (#2/181)
  - Open Medical LLM - MMLU Professional Medicine: 93.75 (#2/181)
  - Open Medical LLM - PubMedQA: 78.97 (#2/181)
- orpo_med_v3 — ELO 1437, #549/949 (above: DeepSeek V2.5, below: Llama-medx_v3.1)
- JSL-MedLlama-3-8B-v2.0 — ELO 1436, #552/949 (above: Mixtral 8x22B, below: Llama-medx_v3)
- orpo_med_v2 — ELO 1434, #554/949 (above: Llama-medx_v3, below: Llama-3-OpenBioMed-8B-slerp-v0.3)
- Llama-3-OpenBioMed-8B-slerp-v0.3 — ELO 1433, #555/949 (above: orpo_med_v2, below: Llama3-merge-biomed-8B)
- Llama-3-Galen-8B-32k-v1 — ELO 1429, #562/949 (above: Qwen 2.5 Turbo, below: Qwen-VL-Plus)
- OpenBioLLM-Llama3-8B — ELO 1427, #564/949 (above: Qwen-VL-Plus, below: Qwen2.5-Coder-32B)
  - Open Medical LLM - MMLU Medical Genetics: 86.1 (#3/181)
- JSL-MedLlama-3-8B-v1.0 — ELO 1423, #574/949 (above: orpo_med_v0, below: OLMo 3 7B (Thinking))
- orpo_v2 — ELO 1422, #576/949 (above: OLMo 3 7B (Thinking), below: Ling-mini-2.0)
- Llama3-OpenBioLLM-8B — ELO 1419, #580/949 (above: Llama-3-MixSense-v1.1, below: Llama 3 70B)
- ollama_v9 — ELO 1416, #586/949 (above: InternVL2-1B, below: Med-ChimeraLlama-3.1k_5_epoch)
- Llama-3-OpenBioMed-8B-dare-ties-v1.0 — ELO 1405, #621/949 (above: XVERSE-V-13B, below: Llama3-Aloe-8B-Alpha)
- ollama_v6 — ELO 1404, #624/949 (above: medllama3-v4, below: llama-3-merged-linear)
- JSL-Med-Sft-Llama-3-8B — ELO 1402, #632/949 (above: ChimeraLlama-3-8B-v3, below: Gemma 3 4B)
- medllama3-v16 — ELO 1401, #637/949 (above: LLaVA-Next-Llama3, below: Llama-medx_v0)
- ollama-3-8B — ELO 1400, #644/949 (above: MiniCPM-V-2, below: ollama_v5)
- ollama_v5 — ELO 1400, #645/949 (above: ollama-3-8B, below: Mantis-8B-Idefics2)
- JSL-MedMNX-7B-v2.0 — ELO 1395, #655/949 (above: LLaVA-Next-Vicuna-13B, below: suzume-llama-3-8B-multilingual)
- Yi-9B-Forest-DPO-v1.0 — ELO 1394, #658/949 (above: Master-Yi-9B, below: ai-medical-model-32bit)
- ollama_v7 — ELO 1392, #663/949 (above: medllama3-v11, below: StableBeluga2)
- tnayajv2.0 — ELO 1390, #667/949 (above: TransCore-M, below: GPT-3.5 Turbo)
- Collaiborator-MEDLLM-Llama-3-8B — ELO 1380, #684/949 (above: Exaone 4.0 1.2B, below: Collaiborator-MEDLLM-Llama-3-8B-v2-5)
- Collaiborator-MEDLLM-Llama-3-8B-v2-5 — ELO 1380, #685/949 (above: Collaiborator-MEDLLM-Llama-3-8B, below: Collaiborator-MEDLLM-Llama-3-8B-v2-6)
- Collaiborator-MEDLLM-Llama-3-8B-v2-6 — ELO 1380, #686/949 (above: Collaiborator-MEDLLM-Llama-3-8B-v2-5, below: JSL-MedMNX-7B)
- JSL-MedMNX-7B — ELO 1380, #687/949 (above: Collaiborator-MEDLLM-Llama-3-8B-v2-6, below: PaLM 540B)
- Collaiborator-MEDLLM-Llama-3-8B-v2-1 — ELO 1379, #689/949 (above: PaLM 540B, below: Command-R+)
- Llama-3-Orca-1.0-8B — ELO 1377, #691/949 (above: Command-R+, below: JSL-MedMNX-7B-SFT)
- JSL-MedMNX-7B-SFT — ELO 1377, #692/949 (above: Llama-3-Orca-1.0-8B, below: Collaiborator-MEDLLM-Llama-3-8B-v2-4)
- Collaiborator-MEDLLM-Llama-3-8B-v2-4 — ELO 1377, #693/949 (above: JSL-MedMNX-7B-SFT, below: Qwen 3 1.7B)
- Collaiborator-MEDLLM-Llama-3-8B-v2-3 — ELO 1374, #699/949 (above: Llama-3-Smaug-8B, below: LLaVA-InternLM2-20B (QLoRA))
- Collaiborator-MEDLLM-Llama-3-8B-v1 — ELO 1373, #701/949 (above: LLaVA-InternLM2-20B (QLoRA), below: SOLAR-10.7B-Instruct-v1.0)
- Yi-1.5-dolphin-9B — ELO 1367, #713/949 (above: Sarvam M, below: Starling-LM-7B-beta)
- lft_8b_v2 — ELO 1363, #718/949 (above: Power-Llama-3-7B-Instruct, below: OLMo 3 7B)
- tnayaj — ELO 1360, #728/949 (above: Jamba 1.6 Mini, below: Yi-VL-34B)
- Myrrh_solar_10.7b_3.0 — ELO 1354, #740/949 (above: Sakura-SOLAR-Instruct-CarbonVillain-en-10.7B-v2-slerp, below: Parrot-7B)
- Med-Yi-1.5-9B — ELO 1351, #746/949 (above: LLaVA-OneVision-0.5B, below: Llama 3 8B)
- Lumina-3.5 — ELO 1350, #751/949 (above: VILA1.5-3B, below: BioMistral-DARE-NS)
- Hercules-3.1-Mistral-7B — ELO 1346, #765/949 (above: AlphaMonarch-7B, below: LFM 40B)
- MAmmoTH2-8B-Plus — ELO 1332, #786/949 (above: BioMistral-7B-Zephyr-Beta-SLERP, below: internlm-20B)
- BioLing-7B-Dare — ELO 1331, #790/949 (above: Janus-1.3B, below: Granite 4.0 Micro)
- Apollo-6B — ELO 1329, #795/949 (above: Slime-7B, below: mPLUG-Owl2)
- BioMistral-Zephyr-Beta-SLERP — ELO 1327, #799/949 (above: Mantis-8B-clip-llama3, below: Emu2_chat)
- Bio-Mistralv2-Squared — ELO 1326, #801/949 (above: Emu2_chat, below: gemma-7B)
- Apollo-7B — ELO 1322, #810/949 (above: Llama 3 3B, below: Molmo2-8B)
- MedMistral-instruct — ELO 1320, #814/949 (above: Granite 3.3 8B, below: BioMistral-7B-SLERP)
- MediKAI — ELO 1283, #847/949 (above: Apertus 8B Instruct, below: Jamba 1.7 Mini)
- JSL-MedPhi2-2.7B — ELO 1282, #849/949 (above: Jamba 1.7 Mini, below: Gopher (280B))
- EMO-2B — ELO 1174, #898/949 (above: Qwen3 Coder, below: RedPajama-INCITE-7B-Base)
- MELT-TinyLlama-1.1B-Chat-v1.0 — ELO 1158, #912/949 (above: Gemma 3 270M, below: Grok Code Fast)
- Healix-1.1B-V1-Chat-dDPO — ELO 1077, #940/949 (above: mega-ar-126m-4k, below: GPT2_PMC)
- GPT2_PMC — ELO 1076, #941/949 (above: Healix-1.1B-V1-Chat-dDPO, below: mega-ar-525m-v0.07-ultraTBfw)
NEW #1 LEADERS (10) - Open Medical LLM - MMLU College Medicine (Accuracy (%)): medllama3-v20 (94.8) beat Flan-PaLM (76.3) by 18.5 - Open Medical LLM - MMLU Professional Medicine (Accuracy (%)): medllama3-v20 (98.9) beat Flan-PaLM (83.8) by 15.1 - Open Medical LLM - MMLU Anatomy (Accuracy (%)): medllama3-v20 (91.85) beat Llama-medx_v3.2 (77.04) by 14.81 - Open Medical LLM (Average Accuracy (%)): medllama3-v20 (90.01) beat Llama-medx_v3.2 (75.42) by 14.59 - Open Medical LLM - MedMCQA (Accuracy (%)): medllama3-v20 (75.4) beat Llama-medx_v3.1 (61.3) by 14.1 - Open Medical LLM - MMLU Clinical Knowledge (Accuracy (%)): medllama3-v20 (95.85) beat Llama-medx_v3.2 (82.26) by 13.59 - Open Medical LLM - MMLU Medical Genetics (Accuracy (%)): medllama3-v20 (98.0) beat orpo_med_v0 (85.0) by 13.0 - Open Medical LLM - MedQA (USMLE) (Accuracy (%)): medllama3-v20 (81.07) beat llama3-8B-slerp-med-chinese (68.11) by 12.96 - Open Medical LLM - MMLU College Biology (Accuracy (%)): medllama3-v20 (98.61) beat Flan-PaLM (88.9) by 9.71 - Design Arena (Slides) (Elo): claude-pptx-opus (1271.0) beat honeydew (1270.0) by 1.0
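For context on how much an Elo gap like the 1-point Design Arena (Slides) margin actually means, here is a minimal sketch using the standard Elo expected-score formula (E = 1 / (1 + 10^(-d/400))). This assumes the arena uses conventional 400-scale Elo; the leaderboard's exact rating scheme may differ.

```python
def elo_expected_score(delta: float) -> float:
    """Expected head-to-head win probability for the higher-rated side,
    given an Elo rating gap `delta` (standard 400-point logistic scale)."""
    return 1.0 / (1.0 + 10.0 ** (-delta / 400.0))

# A 1-point gap (claude-pptx-opus 1271 vs honeydew 1270) is essentially
# a coin flip, while a 100-point gap implies roughly 64% expected wins.
print(round(elo_expected_score(1271 - 1270), 4))
print(round(elo_expected_score(100), 4))
```

In other words, the smallest margin on today's leaders list is statistically negligible, whereas the double-digit accuracy margins on the Open Medical LLM tracks are direct percentage-point differences, not Elo.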