AI Research Brief

May 30, 2026

Agents Start Improving Themselves, and Reaching for Fewer Tools

  • A Chinese MoE puts "self-evolution" on the roadmap. MiniMax-M2 runs 230B params with only 9.8B active, built end-to-end for agent work, and its latest checkpoint can already debug its own training and rewrite its own scaffold.
  • The biggest waste in parallel reasoning is branches thinking in isolation. CPT lets thinking branches share intermediate findings in real time, training-free, and pushes the accuracy-latency curve forward on competition math.
  • RL-trained agents drift into over-calling tools. AKBE teaches a model when to look something up versus trust its own knowledge: 18% fewer tool calls, higher accuracy, 25% better tool efficiency.
  • A skill shouldn't be a throwaway script. MUSE-Autoskill gives agent skills a full lifecycle so they carry experience across tasks and fix themselves through unit tests.

Also Notable

  • Benchmarks Stop Asking "Can It Replace Humans" and Start Asking "What Do People Want Agents to Do" — JobBench covers 130 real office tasks across 35 occupations, and even the strongest, Claude Opus 4.7, hits only 45.9%, deliberately reframing the goal from replacement to augmentation.
  • Let a VLM Play Werewolf and Half Its Accusations Are Made Up — QUACK checks agent statements sentence by sentence against the true trajectory, and the best model still hallucinates 15.1% of spatial descriptions, with half of its accusations unsupported by evidence.
  • Can an Agent Remember Your Preferences? Long-Term Interaction Exposes the Gap — VitaBench 2.0 turns tasks into time-ordered user sequences with preferences buried in everyday fragments, requiring the agent to keep extracting and updating, and frontier models still fall well short.
  • Minute-Long Audio-Video Generation, and Nobody Tested Where It Breaks Over Time — LongAV-Compass uses 284 cases across text, image, and video conditions, comparing 11 models on 20-plus dimensions from identity consistency to narrative coherence.
  • Multi-View 3D Reconstruction Falls Apart on Degraded Inputs — GARD runs diffusion denoising directly in the reconstruction model's feature space, restoring geometry and high-resolution RGB images together.
  • Scientific Simulation Wants Fast and Accurate, and RecFM Claims 20x Speedup With Better Accuracy — recursive flow matching uses cross-scale self-consistency to approach multi-step solvers in 2-4 steps, cutting error by over 15%.
  • That Unremarkable Scaling Vector in the Norm Layer — Delete It and the Model Won't Train — its parameter share is negligible, yet it improves optimization through a "self-amplifying preconditioning" effect, and the paper offers three lightweight improvements.
  • "LLMs Can Introspect" May Be a Premature Conclusion — a reality check argues the so-called self-state recognition looks more like generic anomaly detection and pattern matching, dropping to near-random once you control for confounds.
  • Unlearning Requests Keep Coming, and Fine-Tuning Each One Costs Too Much — ICCU leaves parameters untouched, deriving readable refusal rules from the unlearning data and applying them at inference, where the rules compose without interfering.

