AI Research Brief

March 5, 2026

Code Agents Can't Cross Repo Boundaries: Best Model Under 45% Success

  • Code agents fall apart outside single-repo fixes. BeyondSWE tests four dimensions across 500 instances. The best model stays below 45% success. Adding search doesn't help.
  • Train together, deploy alone. HACRL lets heterogeneous agents share verified rollouts during training. Sampling cost drops by half. Zero overhead at inference.
  • A small model filtering memory beats a large model reading everything. MemSifter trains a proxy retriever with RL, rewarding task completion directly. Passes all eight benchmarks.
  • One encoder handles five point cloud domains. Utonia unifies representations across domains with very different densities and geometries. 133 HF upvotes, today's top community pick.
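The filter-then-read pattern attributed to MemSifter can be sketched in a few lines. This is an illustrative assumption, not the paper's components: the real system trains a small retriever with RL on task completion, whereas here a crude token-overlap scorer stands in for it, and the names (`proxy_score`, `sift_memory`) are invented for the sketch.

```python
# Sketch of the filter-then-read pattern: a cheap proxy scorer prunes the
# memory store so the large model only reads a handful of entries.
# The scoring function below is a stand-in assumption, not MemSifter's
# trained retriever.

def proxy_score(query: str, entry: str) -> float:
    """Stand-in for a small trained retriever: crude token overlap."""
    q, e = set(query.lower().split()), set(entry.lower().split())
    return len(q & e) / max(len(q), 1)

def sift_memory(query: str, memory: list[str], k: int = 3) -> list[str]:
    """Keep only the top-k entries by proxy score; the large model reads these."""
    ranked = sorted(memory, key=lambda e: proxy_score(query, e), reverse=True)
    return ranked[:k]
```

The point of the pattern is that the expensive model never sees the full memory, so context cost scales with `k` rather than with memory size.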

Also Notable

  • CFG Reframed as a PID Controller. Explains why a fixed guidance scale has limits and proposes adaptive adjustment.
  • Does Generation Ability in Unified Multimodal Models Actually Help Understanding? Systematic testing across 30 subtasks gives per-scenario answers.
  • Video Editing Without Paired Data. Sparse control points achieve local edits with temporal and background consistency.
  • Deep Think Amplifies Errors When It Thinks Too Long. A process reward model (PRM) as a real-time correctness signal can ease the population-enhancement bottleneck.
  • Design Space Exploration for Native Multimodal Models. What factors matter most when training from scratch under the Transfusion framework.
  • World Models Don't Need a Decoder. Predicting next-step embeddings directly in representation space works better for MBRL.
  • LM Agents Drift Under Contextual Pressure in Long Contexts. They deviate from their original objectives; even the latest models are affected.
  • Test-Time Adaptation: LLMs Generate Their Own Practice Problems. A meta-learning approach that synthesizes task-specific training data on the fly.
  • Watermarks Embedded During Video Diffusion Generation. Blind extraction, no quality impact.
  • Longer Reasoning Chains Aren't Necessarily More Correct. Math reasoning models at 61% accuracy mix reliable and unreliable reasoning paths.
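The PID reframing of classifier-free guidance above can be sketched as follows. This is a minimal illustration under assumed details: the measured signal, gains, and the idea of nudging the guidance scale additively per denoising step are all assumptions for the sketch, not the paper's actual formulation.

```python
# Minimal PID controller sketch for adapting a CFG guidance scale per
# denoising step. The measurement, setpoint, and gains are illustrative
# assumptions, not the paper's formulation.

class PID:
    def __init__(self, kp: float, ki: float, kd: float, setpoint: float):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self.integral = 0.0
        self.prev_error = None

    def update(self, measurement: float, dt: float = 1.0) -> float:
        """Return a correction from proportional, integral, derivative terms."""
        error = self.setpoint - measurement
        self.integral += error * dt
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def adaptive_cfg_scale(base_scale: float, controller: PID, measurement: float) -> float:
    """Nudge the guidance scale by the controller output, clamped to stay positive."""
    return max(0.0, base_scale + controller.update(measurement))
```

The appeal of the framing is that a fixed guidance scale is an open-loop choice, while a controller closes the loop: whatever quality signal is measured at each step feeds back into the next step's scale.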

Read the full edition →

Don't miss what's next. Subscribe to AI Research Brief: