12B Beats GPT-4, Distilled Students Surpass Teachers
- Generative recommendation's "generalization advantage" degrades into token-level memorization on closer inspection. Per-instance fusion of the two paradigms beats committing to either one.
- Security compliance audits may be the ideal agent use case: the standards are explicit and human experts are scarce. Domain fine-tuning catches risks that general models miss, but context windows remain the real bottleneck.
- Long-horizon web agents fail because they lack intermediate checkpoints. Subgoal decomposition lifts a 12B open model from 6.4% to 43% success rate, surpassing GPT-4-class systems.
- Discrete diffusion finally has a working distillation method. D-MMD is validated on both text and image domains, with the student model outperforming its teacher.
Also Notable
- Do 2D Foundation Models Actually Understand 3D? — Systematic probing of implicit 3D capabilities across multiple models, using an agent framework to guide full 3D scene generation.
- Logic-Flow-Guided Active Grounding in Long Videos — Avoids brute-force frame-by-frame parsing, cutting compute costs substantially.
- Precise Identity and Attribute Binding in Multi-Person Video — Tackles the persistent attribute-misassignment problem in multi-person scenes.
- Single-Cell Foundation Models Transfer to Spatial Transcriptomics — Predicts gene expression directly from tissue slice images, lowering spatial omics costs.
- Audio-Visual Navigation in Continuous Environments — Drops the dependency on precomputed room impulse responses, moving sound-guided navigation closer to real deployment.
- Multimodal Graph Networks for Style-Consistent Indoor Scenes — Joint geometry and appearance generation using rectified flow.
- Failure Mode Decomposition for Autonomous Driving Mapping — Diagnostic framework distinguishing whether a model memorizes input features or genuinely generalizes.
- VLM Attribute Disentanglement for Cross-Domain Person Re-ID — Uses vision-language model attribute separation to improve retrieval robustness in lifelong learning settings.
- LED Blinking + Event Cameras for Millisecond Motion Capture — Bypasses traditional frame-rate limits, achieving millisecond-level motion timing precision.