AI Research Brief

March 27, 2026

Self-Distillation Strips Out Hesitation, OOD Drops 40%

  • Self-distillation strips out the model's ability to hesitate, not merely redundant reasoning steps. Once epistemic verbalization is suppressed, out-of-distribution (OOD) performance drops by up to 40%, and standard metrics won't catch it.
  • Coding agents produce 2.2x more redundancy than human projects. SlopCodeBench is the first benchmark to quantify tech debt across multi-turn iterations: all 11 models failed every task end-to-end, and prompt tuning doesn't fix the root cause.
  • The bottleneck for desktop agents is demonstration data, not model architecture. CUA-Suite pushes continuous human demo footage from under 20 hours to 55 hours. The best current model still fails about 60% of tasks.
  • Trained DiT models haven't actually converged. Adding one scaling coefficient per block (roughly 100 parameters total) improves generation quality, suggesting current training pipelines are systematically under-calibrated.
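The DiT finding above amounts to a tiny architectural tweak: one extra learnable scalar per block, gating the block's residual-branch contribution. A minimal sketch of the idea, with illustrative names and a plain matrix standing in for a real DiT block (none of this code is from the paper):

```python
import numpy as np

class ScaledBlock:
    """Stand-in for a transformer block with one extra scaling coefficient.

    The single scalar `scale` gates the residual-branch output, so the
    total parameter overhead is one float per block.
    """

    def __init__(self, dim, rng):
        # A random linear map stands in for the block's attention/MLP stack.
        self.w = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        self.scale = 1.0  # the one extra (learnable) parameter

    def __call__(self, x):
        # Residual connection, with the per-block coefficient on the branch.
        return x + self.scale * (x @ self.w)

rng = np.random.default_rng(0)
depth, dim = 28, 64  # depth chosen for illustration; ~one extra param per block
blocks = [ScaledBlock(dim, rng) for _ in range(depth)]

x = rng.standard_normal((2, dim))
for block in blocks:
    x = block(x)

extra_params = len(blocks)  # total added parameters = number of blocks
print(x.shape, extra_params)
```

With `scale` initialized to 1.0 the network is identical to the uncalibrated baseline, so the coefficients can be fit post hoc on a frozen model; any quality gain then reflects miscalibration in the original training, which is the paper's claim.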

Also Notable

  • Self-Evolving Mobile GUI Agents From Failed Trajectories — Rejection fine-tuning plus credit assignment lets the model improve online through iterative self-play.
  • Only 9% of Agents Use Automated Iterative Optimization — The bottleneck isn't algorithms; it's implicit design decisions engineers must guess at.
  • VLMs Convert Raster Screenshots Back to Editable SVG — An automated solution for recovering design assets when source files are lost.
  • Microsoft Composer 2, Purpose-Built for Agentic Coding — Trained from scratch with emphasis on long-term planning over single-pass generation.
  • Detecting Intentional Violations in Agent Execution Traces — Not just failures: cases where the model knowingly deviates from instructions.
  • Fine-Grained Decomposition of Code Agent Failures — Finally distinguishing whether agents misunderstood requirements or botched execution.
  • Long-Sequence EHR Automation for Healthcare — A domain-specific computer-use agent in production-relevant healthcare systems.
  • Stronger MLLM Semantic Understanding Means Higher Malicious Image Risk — Capability gains and safety risks are positively correlated.
  • Testing Agent 3D Perception With Multi-View FPS Game Video — Multi-entity reasoning evaluation in rapidly changing environments.
  • Training-Free VLM Output Aggregation With Uncertainty Quantification — A no-training approach to reduce hallucination risk (ICLR).
