12B Beats GPT-4, Distilled Students Surpass Teachers
- Generative recommendation's "generalization advantage" degrades into token-level memorization on closer inspection. Per-instance fusion of the two paradigms beats committing to either one.
- Security compliance audits may be the ideal agent use case: the standards are explicit and human experts are scarce. Domain fine-tuning catches risks that general models miss, but context windows remain the real bottleneck.
- Long-horizon web agents fail because they lack intermediate checkpoints. Subgoal decomposition lifts a 12B open model from 6.4% to 43% success rate, surpassing GPT-4-class systems.
- Discrete diffusion finally has a working distillation method. D-MMD is validated on both text and image domains, with the student model outperforming its teacher.
Also Notable
- Do 2D Foundation Models Actually Understand 3D? — Systematic probing of implicit 3D capabilities across multiple models, using an agent framework to guide full 3D scene generation.
- Logic-Flow-Guided Active Grounding in Long Videos — Avoids brute-force frame-by-frame parsing, cutting compute costs substantially.
- Precise Identity and Attribute Binding in Multi-Person Video — Tackles the persistent attribute-misassignment problem in multi-person scenes.
- Single-Cell Foundation Models Transfer to Spatial Transcriptomics — Predicts gene expression directly from tissue slice images, lowering spatial omics costs.
- Audio-Visual Navigation in Continuous Environments — Drops the dependency on precomputed room impulse responses, moving sound-guided navigation closer to real deployment.
- Multimodal Graph Networks for Style-Consistent Indoor Scenes — Joint geometry and appearance generation using rectified flow.
- Failure Mode Decomposition for Autonomous Driving Mapping — Diagnostic framework distinguishing whether a model memorizes input features or genuinely generalizes.
- VLM Attribute Disentanglement for Cross-Domain Person Re-ID — Uses vision-language model attribute separation to improve retrieval robustness in lifelong learning settings.
- LED Blinking + Event Cameras for Millisecond Motion Capture — Bypasses traditional frame-rate limits, achieving millisecond-level motion timing precision.