A 20B Search Agent Ties the Frontier by Offloading Its Bookkeeping
- The benefit of deleting stale observations to save context follows an inverted U, not a straight line. In a sweep across models from 4B to 284B and three retrievers, strong retrievers paired with mid-size models gain the most, while an already-strong model loses accuracy when masking deletes evidence it still needs.
- Move the bookkeeping out of the policy and into the environment, and a 20B searcher hits 0.730 average recall. That's 11.4 points over the next-best open searcher, with the largest gains on held-out transfer benchmarks.
- Stuffing charts into a report is easy. Getting them factually right is the part nobody checks. TVIR uses 100 expert-curated multimodal research tasks and scores visual reliability and text alignment as their own dimension.
- Teach a model to infer intent with zero labels. MindZero turns a planner's behavior explainability into a self-supervised reward, trains with heavy reasoning, and distills to a single forward pass at deployment, beating slow model-based methods in gridworld and home scenarios.
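For readers who want a concrete picture of the observation masking in the first item, here is a minimal sketch. It is not the paper's implementation: the `Turn` type, the `PLACEHOLDER` string, and the `keep_last` window are all hypothetical names chosen for illustration. It assumes masking means replacing older tool outputs with a short placeholder while leaving the agent's own actions intact.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    role: str      # "action" (agent output) or "observation" (tool/retriever output)
    content: str


# Hypothetical placeholder text; a real system may use a different marker.
PLACEHOLDER = "[observation masked]"


def mask_stale_observations(history, keep_last=2):
    """Return a new history in which every observation except the newest
    `keep_last` ones is replaced by a placeholder. Actions are untouched."""
    obs_indices = [i for i, t in enumerate(history) if t.role == "observation"]
    # Guard keep_last == 0: a slice of [-0:] would keep everything.
    keep = set(obs_indices[-keep_last:]) if keep_last > 0 else set()
    return [
        Turn(t.role, PLACEHOLDER)
        if t.role == "observation" and i not in keep
        else t
        for i, t in enumerate(history)
    ]
```

A small `keep_last` saves the most context but is exactly the setting where evidence the model still needs can disappear, which is the trade-off behind the inverted U.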
Also Notable
- Scaling Test-Time Compute for Agentic Search Hits a Calibration Trap When Correct Answers Are Sparse. FineVerify breaks a question into verifiable sub-questions and checks each candidate piece by piece, moving "judging correctness" out of the policy as well. It is a third take on today's masking/externalizing theme.
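The piece-by-piece checking idea can be sketched in a few lines. This is an illustration only, not FineVerify's actual procedure: `score_candidate`, `pick_best`, and the caller-supplied `check` function are hypothetical, and the sketch assumes the sub-questions have already been produced and that each check returns a simple pass or fail.

```python
from typing import Callable, Sequence


def score_candidate(
    candidate: str,
    sub_questions: Sequence[str],
    check: Callable[[str, str], bool],
) -> float:
    """Fraction of sub-questions the candidate passes.
    `check(sub_question, candidate)` is a hypothetical verifier supplied by the caller."""
    if not sub_questions:
        return 0.0
    passed = sum(1 for q in sub_questions if check(q, candidate))
    return passed / len(sub_questions)


def pick_best(
    candidates: Sequence[str],
    sub_questions: Sequence[str],
    check: Callable[[str, str], bool],
) -> str:
    """Return the candidate with the highest sub-question pass rate
    (the earliest one wins on ties)."""
    return max(candidates, key=lambda c: score_candidate(c, sub_questions, check))
```

Scoring each candidate against several small checks gives a graded signal rather than one all-or-nothing judgment, which is one plausible reason the approach would help when correct answers are sparse.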