Streaming Hand-offs Make Multi-Agent Systems Sharper, ZipSplat Splats With 1/6 the Gaussians
- Streaming Hand-offs Beat Waiting for the Full Chain. StreamMA pipelines adjacent agents so that reliable early signals reach downstream agents sooner, gaining +7.3 points on average across eight math, science, and code benchmarks and up to 22.4 points on HMMT 2026.
- Your LLM Judge's Reward Is Being Quietly Gamed. CHERRL injects known biases on purpose to build a controlled environment where reward hacking in rubric-based RL reproduces reliably and can be pinpointed.
- A Blank Wall and a Complex Object Shouldn't Cost the Same Gaussians. ZipSplat decouples Gaussian placement from the pixel grid using tokens, beating pixel-aligned methods on two benchmarks with roughly 1/6 the Gaussians and no camera poses.
- Specs as Explicit Constraints Put an Agent Framework Into Production. MapAgent runs in Baidu Maps across 360+ cities for lane-level mapping, treating mapping specs and traffic law as reasoning constraints instead of implicit supervision.
Also Notable
- On-Policy Self-Distillation Adds Dense Supervision to Sparse-Reward RL — the model conditions on privileged context to supervise its own generations, with full-vocab reverse KL as an auxiliary loss. Self-Distilled Policy Gradient.
- Token-Level Advantage Reweighting for RLVR — instead of broadcasting one sequence-level advantage to every token, it redistributes gradient by token contribution. GRAIL: Gradient-Reweighted Advantages for RLVR.
- The First Benchmark to Systematically Test Long-Video Model Memory — what it retains, how accurately, and how well it resists interference, with tasks designed from cognitive science. M³Eval: Multi-Modal Memory Evaluation.
- A Benchmark for Ultra-Long-Horizon Closed-Loop Research and Engineering — tests whether frontier models can keep proposing improvements, running experiments, and iterating on results rather than answering once. AutoLab: Long-Horizon Auto Research and Engineering.
- Give Vision Encoders State: in multi-image comparison, the encoder no longer processes each image in isolation, a habit that smooths away the small, task-critical changes. Stateful Visual Encoders for VLMs.
- Hand Long, Cross-Referenced Rule Sets to an Agentic Harness for Deductive Reasoning — for tax filing and immigration precedent, where written rules must be applied clause by clause. DAR: Deontic Reasoning with Agentic Harnesses.
- Sparse-Voxel-Guided Autoregressive Mesh Generation — tackles the long-standing problem of token sequences too long to scale. MeshWeaver: Sparse-Voxel-Guided Surface Weaving.
- LLMs Look Cautious, but the Mechanism May Not Match Humans — probed with the St. Petersburg paradox, the outputs resemble human risk preference while the decision mechanism doesn't. Probing LLM Risk Decisions via the St. Petersburg Game.
- An Agent-Curated Benchmark for AIGC Manipulation Localization — closer to real local image edits than existing datasets. Impostor: Realistic AIGC Manipulation Localization.
- Algebra-Preserving Deep Koopman Learning — linearizes nonlinear dynamics more reliably. Deep Embedded Multiplicative DMD.
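
The token-level reweighting item above contrasts two ways of assigning credit, which a toy sketch can make concrete. This is not GRAIL's actual rule, which this brief does not spell out. The function names, the `contributions` scores, and the mean-preserving normalization are all assumptions made for illustration.

```python
from typing import List


def broadcast_advantage(seq_advantage: float, num_tokens: int) -> List[float]:
    """Baseline: every token receives the same sequence-level advantage."""
    return [seq_advantage] * num_tokens


def reweight_advantage(seq_advantage: float, contributions: List[float]) -> List[float]:
    """Hypothetical redistribution (an assumption, not the paper's method).

    Each token's advantage is scaled by its share of the total contribution,
    then multiplied by the token count so the mean over tokens still equals
    the sequence-level advantage. Contributions are assumed non-negative.
    """
    n = len(contributions)
    total = sum(contributions)
    if n == 0:
        return []
    if total <= 0:
        # No usable signal: fall back to plain broadcasting.
        return broadcast_advantage(seq_advantage, n)
    return [seq_advantage * n * c / total for c in contributions]
```

With a sequence advantage of `1.0` and contribution scores `[1.0, 3.0]`, broadcasting gives both tokens `1.0`, while the sketch assigns `0.5` and `1.5`: the same average, but more gradient on the token judged to matter more.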