Code Agents Can't Cross Repo Boundaries: Best Model Under 45% Success
- Code agents fall apart outside single-repo fixes. BeyondSWE tests four dimensions across 500 instances. The best model stays below 45% success. Adding search doesn't help.
- Train together, deploy alone. HACRL lets heterogeneous agents share verified rollouts during training. Sampling cost drops by half. Zero overhead at inference.
- A small model filtering memory beats a large model reading everything. MemSifter trains a proxy retriever with RL, rewarding task completion directly. Passes all eight benchmarks.
- One encoder handles five point cloud domains. Utonia unifies representations across five domains with very different densities and geometries. 133 HF upvotes, today's top community pick.
Also Notable
- CFG Reframed as a PID Controller. Explains why a fixed guidance scale has limits, proposes adaptive adjustment.
- Does Generation Ability in Unified Multimodal Models Actually Help Understanding? Systematic testing across 30 subtasks gives per-scenario answers.
- Video Editing Without Paired Data. Sparse control points achieve local edits with temporal and background consistency.
- Deep Think Amplifies Errors When It Thinks Too Long. Using a PRM as a real-time correctness signal can ease the population-enhancement bottleneck.
- Design Space Exploration for Native Multimodal Models. What factors matter most when training from scratch under the Transfusion framework.
- World Models Don't Need a Decoder. Predicting next-step embeddings directly in representation space works better for MBRL.
- LM Agents Drift Under Contextual Pressure in Long Contexts. They deviate from their original objectives; even the latest models are affected.
- Test-Time Adaptation: LLMs Generate Their Own Practice Problems. A meta-learning approach that synthesizes task-specific training data on the fly.
- Watermarks Embedded During Video Diffusion Generation. Blind extraction, no quality impact.
- Longer Reasoning Chains Aren't Necessarily More Correct. Math reasoning models at 61% accuracy mix reliable and unreliable reasoning paths.
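The CFG-as-PID-controller item above can be sketched in a few lines. This is a minimal illustration of the framing, not the paper's method: a fixed guidance scale acts like an open-loop constant, while a PID controller can adapt it per denoising step from some error signal. The error signal, gains, and class names here are all hypothetical assumptions for illustration.

```python
# Hypothetical sketch of PID-adapted classifier-free guidance.
# The setpoint, error signal, and gains are illustrative assumptions,
# not taken from the paper.

class PIDController:
    """Textbook discrete PID controller."""

    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint          # target value for the measured signal
        self.integral = 0.0               # accumulated error
        self.prev_error = None            # for the derivative term

    def update(self, measurement, dt=1.0):
        error = self.setpoint - measurement
        self.integral += error * dt
        deriv = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * deriv


def guided_noise(eps_uncond, eps_cond, scale):
    """Standard CFG combination: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Illustrative denoising loop: instead of a fixed scale, nudge the guidance
# scale each step by the controller's output on some per-step quality
# measurement (here a placeholder function).
def sample_step(eps_uncond, eps_cond, scale, pid, measurement):
    scale = max(0.0, scale + pid.update(measurement))
    return guided_noise(eps_uncond, eps_cond, scale), scale
```

With all gains but `kp` at zero this reduces to proportional-only control, which is one way to read a fixed guidance scale; the integral and derivative terms are what the adaptive variant adds.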
Don't miss what's next. Subscribe to AI Research Brief.