T²PO Stabilizes Multi-Turn RL; MotionCache Cuts Video Steps 6x
- Multi-Turn Agent RL Collapse May Not Be a Credit Assignment Problem. T²PO uses the model's own uncertainty to trigger thinking and resampling. Both training stability and final performance improve on WebShop, ALFWorld, and Search QA. Accepted at ICML.
- Factuality's Bottleneck Is Metacognition, Not Knowledge Volume. A position paper argues models still don't know what they don't know. Calibrated uncertainty is the hidden control layer in any agent reliability stack.
- A Better Scorecard for Putting Medical Agents to Work. PhysicianBench drops 100 real consultations into a commercial EHR environment. Each task averages 27 tool calls. The best agent reaches 46% pass@1; the best open-source entry reaches only 19%.
- Pixel-Level Fix for Video Generation Caching. MotionCache assigns denoising steps per pixel using frame differences. SkyReels-V2 sees 6.28x speedup; MAGI-1 only 1.64x. Transfer depends heavily on the base model.
- What If Attention Is Just a Parameter-Prediction MLP. WeightFormer rewrites attention math as an MLP whose parameters are predicted from input. The linearization design goal shifts from "approximate softmax" to "predict good parameters."
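To make the WeightFormer item concrete: single-query softmax attention can be read as a two-layer MLP whose first-layer weights are the keys and whose second-layer weights are the values, with softmax as the activation. The sketch below only checks that textbook identity on toy numbers. It is not the paper's formulation; the function names (`attention`, `mlp`, `attention_as_mlp`) are hypothetical, and the usual 1/sqrt(d) scaling is omitted as a simplifying assumption.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a plain list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, keys, values):
    # Standard single-query softmax attention (scaling omitted, assumption).
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


def mlp(x, w1, w2, activation):
    # Bias-free two-layer MLP: hidden = activation(w1 @ x), out = hidden @ w2.
    hidden = activation([sum(wi * xi for wi, xi in zip(row, x)) for row in w1])
    dim = len(w2[0])
    return [sum(h * row[j] for h, row in zip(hidden, w2)) for j in range(dim)]


def attention_as_mlp(q, keys, values):
    # The MLP "parameters" are the keys and values, so they depend on the input.
    return mlp(q, w1=keys, w2=values, activation=softmax)


q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(attention(q, keys, values))
print(attention_as_mlp(q, keys, values))
```

Both calls print the same vector. Under this reading, a linear-attention variant is free to swap the softmax activation or the way the weights are produced, which is the sense in which the design goal becomes "predict good parameters."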
Also Notable
- Students Picked 80 Real-Coursework Questions Agents Can't Solve — bilingual benchmark closer to real user failures than researcher-designed tests.
- Make a Model Count Repeated Symbols Until It Breaks — quantifiable minimal reliability test for the boundary between memorization and rule execution.
- Treat Agentic Systems as Token-Allocation Economies — position paper reframes the stack into four economic layers and argues for token-economy evaluation over text generation.
- 26.7M Spatial Proteomics Patches + H&E + Clinical Trimodal Contrastive Learning — Haiku actually delivered at scale, laying a multimodal foundation model floor for spatial biology.
- Brain MRI Foundation Model SAEs Collapse in Deep Layers — authors stabilize SAEs with geometric priors, adding an interpretability tool for medical imaging foundation models.
- Game-Engine Synthetic Data Still Has a Visible Sim2Real Gap — even with ray tracing, real images differ visibly; the hybrid approach narrows the gap for synthetic-data training pipelines.
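The repeated-symbol counting item describes a test that is easy to prototype. The sketch below is one possible harness, not the authors' protocol: it grows the repetition count until the answer is wrong and reports that length. The prompt wording, `first_failure`, and `toy_model` (a stand-in that counts correctly only up to a cap) are all hypothetical; a real run would replace `toy_model` with a call to the model under test.

```python
def make_prompt(symbol, n):
    # Build a counting prompt with the symbol repeated n times (wording is an assumption).
    return f"How many times does '{symbol}' appear? " + " ".join([symbol] * n)


def first_failure(model, symbol="x", max_n=64):
    # Return the smallest n at which the model's count is wrong, or None if it never fails.
    for n in range(1, max_n + 1):
        if model(make_prompt(symbol, n)) != n:
            return n
    return None


def toy_model(prompt, cap=20):
    # Stand-in "model": counts the symbols after the question, but saturates at `cap`.
    body = prompt.split("? ", 1)[1]
    return min(body.split().count("x"), cap)


print(first_failure(toy_model))
```

With the toy stand-in the harness reports the first length past the cap. The breaking length gives a single number to compare across models or prompt formats.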