Recursive MAS Cuts Tokens 35%, T2I Repaints Instead of Editing
- Recursive Scaling Moves From Single Models to Multi-Agent Systems. RecursiveMAS casts the entire multi-agent setup as one latent-space recursive computation, posting +8.3% average accuracy across 9 benchmarks while cutting token usage by 34.6%-75.6% and speeding up inference 1.2-2.4x.
- T2I Refinement Works Better When You Repaint the Whole Image. Editing-based pipelines compress the modification space too aggressively; regenerating the full image instead lifts the UniGenBench++ score from 61.53 to 77.41, calling the default "local refinement" path into question.
- Audio-Video Joint Training: Train Solo, Then Couple. Mutual Forcing uses two-stage training plus self-distillation, matching a 50-step baseline at 4-8 steps and dropping the external teacher model entirely.
- Asymmetric Debate Synthesizes Custom Guardrail Data. BARRED needs only a task description and a few unlabeled samples to outperform closed-source LLMs and dedicated guardrail models. The recipe ports to any fuzzy-boundary classification task.
Also Notable
- DV-World Pulls Data Visualization Agent Eval Back to Real Workflows. 260 tasks across three domains cover spreadsheet-native operations, cross-platform evolution, and intent alignment, pushing current DV agents beyond single-language, creation-only setups.
- Skill Graphs Synthesize Training Tasks for Terminal Agents. Eases the persistent shortage of high-quality execution trajectories and adds a training-data channel for command-line agents.
- FAMA: Failure-Aware Meta-Agentic Framework (ACL). Lets open-source LLMs in conversational tool-use benchmarks correct themselves from their own failure modes.
- A Systematic Post-Training Pipeline From Pretraining to Deployment for Video Diffusion. Targets prompt sensitivity and temporal degradation with a full RLHF/GRPO training framework.
- LVLM Hallucination Mitigation Moves From Decoding to Prefill Time. Steering vectors no longer fire during decoding. The intervention happens during prefill.
- CORAL: Multilingual RAG Needs an Adaptive Retrieval Loop (ACL). The retrieval space shouldn't be fixed to query/document translation or multilingual embeddings; cultural-alignment queries should expand on the fly.
- A Taxonomy of "Wrong Rewards" in Policy Gradient. Princeton argues that imperfect proxy rewards aren't all bad. Some types actually help training.
- Probing LLMs for Embodied Cognition Using This/That, 这/那. 6,400 native-speaker responses across languages test whether LLMs absorb spatial deixis and cultural variation from text alone.
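The prefill-vs-decode distinction in the hallucination item above can be abstracted in a few lines. This is a generic sketch of a steering-vector intervention gated on the forward-pass phase, not the paper's implementation; the function name, shapes, and `alpha` scale are all illustrative.

```python
import numpy as np

def apply_steering(hidden, steer, phase, alpha=1.0):
    """Add a steering vector to hidden states, but only during prefill.

    hidden: (seq_len, d) hidden states at some layer
    steer:  (d,) steering direction extracted offline
    phase:  "prefill" (prompt encoding) or "decode" (token-by-token generation)
    """
    if phase == "prefill":
        return hidden + alpha * steer  # shift every prompt position once
    return hidden                      # decode steps pass through untouched

# toy illustration with zero hidden states
d = 4
steered = apply_steering(np.zeros((3, d)), np.ones(d), "prefill", alpha=0.5)
untouched = apply_steering(np.zeros((1, d)), np.ones(d), "decode")
```

The appeal of the prefill-only variant is that the shifted representations are baked into the KV cache once, so decoding runs at full speed with no per-token intervention.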
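One concrete instance of a harmless "wrong reward" (my illustration, not the paper's taxonomy): shifting every reward by a constant leaves the baselined policy gradient unchanged, because the advantage r - b absorbs the shift. A minimal REINFORCE sketch on a softmax bandit shows this:

```python
import numpy as np

def reinforce_grad(logits, actions, rewards):
    """Baselined REINFORCE gradient for a softmax bandit policy:
    mean over samples of (r - baseline) * (one_hot(a) - pi)."""
    pi = np.exp(logits) / np.exp(logits).sum()
    baseline = rewards.mean()
    grad = np.zeros_like(logits)
    for a, r in zip(actions, rewards):
        grad += (r - baseline) * (np.eye(len(logits))[a] - pi)
    return grad / len(actions)

rng = np.random.default_rng(0)
logits = np.array([0.2, -0.1, 0.4])
actions = rng.integers(0, 3, size=32)
rewards = rng.normal(size=32)

g_true = reinforce_grad(logits, actions, rewards)
g_proxy = reinforce_grad(logits, actions, rewards + 10.0)  # constant-offset "wrong" reward

# the baseline cancels the offset, so the gradients coincide
assert np.allclose(g_true, g_proxy)
```

A constant offset is the benign extreme; a taxonomy is needed precisely because other distortions (sign flips, state-dependent noise) do change the gradient direction.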
Don't miss what's next. Subscribe to AI Research Brief.