Agents Start Improving Themselves, and Reaching for Fewer Tools
- A Chinese MoE puts "self-evolution" on the roadmap. MiniMax-M2 has 230B total parameters with only 9.8B active, is built end-to-end for agent work, and its latest checkpoint can already debug its own training and rewrite its own scaffold.
- The biggest waste in parallel reasoning is branches thinking in isolation. CPT lets thinking branches share intermediate findings in real time, with no additional training, and improves the accuracy-latency trade-off on competition math.
- RL-trained agents drift into over-calling tools. AKBE teaches a model when to look something up versus trust its own knowledge: 18% fewer tool calls, higher accuracy, 25% better tool efficiency.
- A skill shouldn't be a throwaway script. MUSE-Autoskill gives agent skills a full lifecycle so they carry experience across tasks and fix themselves through unit tests.
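
To put the MiniMax-M2 sparsity figures in perspective, here is a minimal sketch of the arithmetic behind "230B params with only 9.8B active". The helper name `active_fraction` is hypothetical and used only for illustration; the two parameter counts are the ones quoted above, and nothing here reflects how the model itself is implemented.

```python
def active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Return the share of parameters that are active, given counts in billions."""
    if total_params_b <= 0:
        raise ValueError("total parameter count must be positive")
    if not 0 <= active_params_b <= total_params_b:
        raise ValueError("active count must lie between 0 and the total")
    return active_params_b / total_params_b


# Figures quoted in the digest: 230B total, 9.8B active.
frac = active_fraction(230.0, 9.8)
print(f"active share: {frac:.1%}")        # active share: 4.3%
print(f"total / active: {1 / frac:.1f}x")  # total / active: 23.5x
```

In other words, roughly one parameter in 23 is active at a time, which is why a model of this size can be practical for agent workloads that make many calls.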
Also Notable
- Benchmarks Stop Asking "Can It Replace Humans" and Start Asking "What Do People Want Agents to Do" — JobBench covers 130 real office tasks across 35 occupations, and even the strongest, Claude Opus 4.7, hits only 45.9%, deliberately reframing the goal from replacement to augmentation.
- Let a VLM Play Werewolf and Half Its Accusations Are Made Up — QUACK checks agent statements sentence by sentence against the true trajectory, and the best model still hallucinates 15.1% of spatial descriptions, with half of its accusations unsupported by evidence.
- Can an Agent Remember Your Preferences? Long-Term Interaction Exposes the Gap — VitaBench 2.0 turns tasks into time-ordered user sequences with preferences buried in everyday fragments, requiring the agent to keep extracting and updating, and frontier models still fall well short.
- Minute-Long Audio-Video Generation, and Nobody Tested Where It Breaks Over Time — LongAV-Compass uses 284 cases across text, image, and video conditions, comparing 11 models on 20-plus dimensions from identity consistency to narrative coherence.
- Multi-View 3D Reconstruction Falls Apart on Degraded Inputs — GARD runs diffusion denoising directly in the reconstruction model's feature space, restoring geometry and high-resolution RGB images together.
- Scientific Simulation Wants Fast and Accurate, and RecFM Claims 20x Speedup With Better Accuracy — recursive flow matching uses cross-scale self-consistency to approach multi-step solvers in 2-4 steps, cutting error by over 15%.
- That Unremarkable Scaling Vector in the Norm Layer — Delete It and the Model Won't Train — its parameter share is negligible, yet it improves optimization through a "self-amplifying preconditioning" effect, and the paper offers three lightweight improvements.
- "LLMs Can Introspect" May Be a Premature Conclusion — a reality check argues the so-called self-state recognition looks more like generic anomaly detection and pattern matching, dropping to near-random once you control for confounds.
- Unlearning Requests Keep Coming, and Fine-Tuning Each One Costs Too Much — ICCU leaves parameters untouched, deriving readable refusal rules from the unlearning data and applying them at inference, where the rules compose without interfering.