Code Agents Can't Cross Repo Boundaries: Best Model Under 45% Success
- Code agents fall apart outside single-repo fixes. BeyondSWE tests four dimensions across 500 instances. The best model stays below 45% success. Adding search doesn't help.
- Train together, deploy alone. HACRL lets heterogeneous agents share verified rollouts during training. Sampling cost drops by half. Zero overhead at inference.
- A small model filtering memory beats a large model reading everything. MemSifter trains a proxy retriever with RL, rewarding task completion directly. Passes all eight benchmarks.
- One encoder handles five point cloud domains. Utonia unifies representations across five domains with very different densities and geometries. 133 HF upvotes, today's top community pick.
Also Notable
- CFG Reframed as a PID Controller. Explains why a fixed guidance scale has limits, proposes adaptive adjustment.
- Does Generation Ability in Unified Multimodal Models Actually Help Understanding? Systematic testing across 30 subtasks gives per-scenario answers.
- Video Editing Without Paired Data. Sparse control points achieve local edits with temporal and background consistency.
- Deep Think Amplifies Errors When It Thinks Too Long. Using a PRM as a real-time correctness signal can ease the population-enhancement bottleneck.
- Design Space Exploration for Native Multimodal Models. What factors matter most when training from scratch under the Transfusion framework.
- World Models Don't Need a Decoder. Predicting next-step embeddings directly in representation space works better for MBRL.
- LM Agents Drift Under Contextual Pressure in Long Contexts. They deviate from their original objectives; even the latest models are affected.
- Test-Time Adaptation: LLMs Generate Their Own Practice Problems. A meta-learning approach that synthesizes task-specific training data on the fly.
- Watermarks Embedded During Video Diffusion Generation. Blind extraction, no quality impact.
- Longer Reasoning Chains Aren't Necessarily More Correct. Math reasoning models at 61% accuracy mix reliable and unreliable reasoning paths.
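The CFG-as-PID-controller item above can be sketched in a few lines. This is a minimal illustration of the framing, not the paper's method: a fixed guidance scale acts like an open-loop constant, while a PID controller can adapt it per denoising step from some error signal. The error signal, gains, and class names here are all hypothetical assumptions for illustration.

```python
# Hypothetical sketch of PID-adapted classifier-free guidance.
# The setpoint, error signal, and gains are illustrative assumptions,
# not taken from the paper.

class PIDController:
    """Textbook discrete PID controller."""

    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint          # target value for the measured signal
        self.integral = 0.0               # accumulated error
        self.prev_error = None            # for the derivative term

    def update(self, measurement, dt=1.0):
        error = self.setpoint - measurement
        self.integral += error * dt
        deriv = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * deriv


def guided_noise(eps_uncond, eps_cond, scale):
    """Standard CFG combination: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Illustrative denoising loop: instead of a fixed scale, nudge the guidance
# scale each step by the controller's output on some per-step quality
# measurement (here a placeholder function).
def sample_step(eps_uncond, eps_cond, scale, pid, measurement):
    scale = max(0.0, scale + pid.update(measurement))
    return guided_noise(eps_uncond, eps_cond, scale), scale
```

With all gains but `kp` at zero this reduces to proportional-only control, which is one way to read a fixed guidance scale; the integral and derivative terms are what the adaptive variant adds.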
Don't miss what's next. Subscribe to AI Research Brief.