AI Research Brief

March 27, 2026

Self-Distillation Strips Out Hesitation, OOD Drops 40%

  • Self-distillation strips out the model's ability to hesitate, not merely redundant reasoning steps. Once epistemic verbalization is suppressed, out-of-distribution (OOD) performance drops by up to 40%, and standard metrics won't catch it.
  • Coding agents produce 2.2x more redundancy than human projects. SlopCodeBench is the first benchmark to quantify tech debt across multi-turn iterations: all 11 models failed every task end-to-end, and prompt tuning doesn't fix the root cause.
  • The bottleneck for desktop agents is demonstration data, not model architecture. CUA-Suite pushes continuous human demo footage from under 20 hours to 55 hours. The best current model still fails about 60% of tasks.
  • Trained DiT models haven't actually converged. Adding one scaling coefficient per block (roughly 100 parameters total) improves generation quality, suggesting current training pipelines are systematically under-calibrated.
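The DiT finding above amounts to a tiny architectural tweak: one extra learnable scalar per block, gating the block's residual-branch contribution. A minimal sketch of the idea, with illustrative names and a plain matrix standing in for a real DiT block (none of this code is from the paper):

```python
import numpy as np

class ScaledBlock:
    """Stand-in for a transformer block with one extra scaling coefficient.

    The single scalar `scale` gates the residual-branch output, so the
    total parameter overhead is one float per block.
    """

    def __init__(self, dim, rng):
        # A random linear map stands in for the block's attention/MLP stack.
        self.w = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        self.scale = 1.0  # the one extra (learnable) parameter

    def __call__(self, x):
        # Residual connection, with the per-block coefficient on the branch.
        return x + self.scale * (x @ self.w)

rng = np.random.default_rng(0)
depth, dim = 28, 64  # depth chosen for illustration; ~one extra param per block
blocks = [ScaledBlock(dim, rng) for _ in range(depth)]

x = rng.standard_normal((2, dim))
for block in blocks:
    x = block(x)

extra_params = len(blocks)  # total added parameters = number of blocks
print(x.shape, extra_params)
```

With `scale` initialized to 1.0 the network is identical to the uncalibrated baseline, so the coefficients can be fit post hoc on a frozen model; any quality gain then reflects miscalibration in the original training, which is the paper's claim.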

Also Notable

  • Self-Evolving Mobile GUI Agents From Failed Trajectories — Rejection fine-tuning plus credit assignment lets the model improve online through iterative self-play.
  • Only 9% of Agents Use Automated Iterative Optimization — The bottleneck isn't algorithms; it's implicit design decisions engineers must guess at.
  • VLMs Convert Raster Screenshots Back to Editable SVG — An automated solution for recovering design assets when source files are lost.
  • Microsoft Composer 2, Purpose-Built for Agentic Coding — Trained from scratch with emphasis on long-term planning over single-pass generation.
  • Detecting Intentional Violations in Agent Execution Traces — Not just failures: cases where the model knowingly deviates from instructions.
  • Fine-Grained Decomposition of Code Agent Failures — Finally distinguishing whether agents misunderstood requirements or botched execution.
  • Long-Sequence EHR Automation for Healthcare — A domain-specific computer-use agent in production-relevant healthcare systems.
  • Stronger MLLM Semantic Understanding Means Higher Malicious Image Risk — Capability gains and safety risks are positively correlated.
  • Testing Agent 3D Perception With Multi-View FPS Game Video — Multi-entity reasoning evaluation in rapidly changing environments.
  • Training-Free VLM Output Aggregation With Uncertainty Quantification — A no-training approach to reduce hallucination risk (ICLR).
