AI Research Brief

March 14, 2026

Document Agents Navigate by Luck, Prefill Speeds Up 1.82x

  • Document Agents' Reasoning Is Overestimated. The MADQA benchmark, designed with classical test theory, shows the best multimodal agents match human accuracy but navigate more like random search than strategic reasoning. A nearly 20% gap to the Oracle setting remains.
  • Understanding 3D Space Doesn't Need Longer Context Windows. Spatial-TTT updates model parameters at test time, learning spatial structure on the fly. Major gains on long-video tasks. (A minimal test-time-update loop is sketched after this list.)
  • Sparse Attention's Indexer Became the New Bottleneck. IndexCache reuses indices across layers by exploiting high overlap in adjacent-layer attention patterns. 75% of indexer compute eliminated. 1.82x prefill speedup on a 30B model with near-zero quality loss. (A toy index-reuse sketch follows this list.)
  • Reward Model Hallucinations Are the Hidden Bottleneck in RL-Optimized Image Generation. FIRM trains an 8B critic on 600K+ purpose-built samples with a Base-and-Bonus strategy that keeps any single metric from misguiding the reward. Fully open-sourced. (A hedged sketch of the reward combination follows this list.)
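
For readers who want the shape of the test-time-training idea behind Spatial-TTT, here is a minimal sketch. Everything below is an assumption for illustration: the hooks self_supervised_loss and answer, the optimizer, and the step count are not the paper's API.

    import copy
    import torch

    def answer_with_ttt(model, frames, question, n_steps=5, lr=1e-4):
        # Adapt a throwaway copy of the model on the test video itself,
        # then answer with the adapted weights. `frames` is a (T, C, H, W)
        # tensor; the inner loss stands in for whatever self-supervised
        # spatial objective the method actually optimizes.
        adapted = copy.deepcopy(model)        # leave shared weights untouched
        adapted.train()
        opt = torch.optim.SGD(adapted.parameters(), lr=lr)
        for _ in range(n_steps):
            loss = adapted.self_supervised_loss(frames)   # assumed hook
            opt.zero_grad()
            loss.backward()
            opt.step()
        adapted.eval()
        with torch.no_grad():
            return adapted.answer(frames, question)       # assumed hook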
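
The index-reuse trick in IndexCache can be pictured in a few lines. This is a toy sketch under stated assumptions: the real system presumably decides reuse from profiled overlap statistics, whereas here a cached overlap score gates reuse directly, and all names are illustrative. The cache starts as {"indices": {}, "overlap": {}}.

    import torch

    def topk_indices(scores, k):
        # scores: (q_len, kv_len) cheap indexer scores -> (q_len, k) KV indices
        return scores.topk(k, dim=-1).indices

    def mean_overlap(a, b):
        # Mean fraction of shared indices between two (q_len, k) selections.
        shared = sum(len(set(ra.tolist()) & set(rb.tolist()))
                     for ra, rb in zip(a, b))
        return shared / a.numel()

    def indices_for_layer(layer_id, scores, cache, k, threshold=0.8):
        # Reuse the previous layer's selection when the two layers have
        # historically overlapped enough; otherwise run the indexer and
        # refresh the overlap statistic.
        prev = cache["indices"].get(layer_id - 1)
        if prev is not None and cache["overlap"].get(layer_id, 0.0) >= threshold:
            cache["indices"][layer_id] = prev     # indexer compute skipped
            return prev
        idx = topk_indices(scores, k)             # fall back: run the indexer
        if prev is not None:
            cache["overlap"][layer_id] = mean_overlap(prev, idx)
        cache["indices"][layer_id] = idx
        return idx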
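
And a hedged guess at what a "Base-and-Bonus" combination might look like: a base quality score must clear a gate before auxiliary metrics contribute, so no single metric can steer the reward on its own. The metric names, gate, and weights are all assumptions, not FIRM's actual recipe.

    def base_and_bonus_reward(scores, base_key="overall_quality",
                              base_gate=0.5, bonus_weight=0.2):
        # scores: dict of metric name -> value in [0, 1]
        base = scores[base_key]
        if base < base_gate:
            return base                        # bonuses only for viable samples
        aux = [v for k, v in scores.items() if k != base_key]
        bonus = sum(aux) / max(1, len(aux))    # averaged, so one auxiliary
        return base + bonus_weight * bonus     # metric cannot dominate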

Also Notable

  • Allocating Equal Tokens to Static and Dynamic Segments Is Wasteful — EVATok adaptively assigns token lengths based on content complexity. CVPR. (See the token-budget sketch after this list.)
  • Chain-of-Thought Reasoning Inside the Diffusion Model, Not a Single-Step MLLM Encoder — guidance updates dynamically as reasoning depth increases during generation.
  • Extracting Both Experiences and Skills as Reusable Knowledge From Interaction Trajectories — continuously improves agent tool calling without parameter updates.
  • Camera Motion Control for Text-Driven Multi-Shot Video — learns the joint distribution of captions, trajectories, and video in a data-driven way.
  • Task-Expert Solutions Cluster Densely Around Pretrained Weights — large models can find them by random sampling, no gradient descent needed.
  • First Framework to Deterministically Convert a Video Diffusion Model Into a Single-Pass Depth Regressor — eliminates stochastic geometric hallucinations from generative approaches.
  • How to Allocate Sampling Compute in LLM RL Post-Training — CMU provides optimal ratios under iso-compute curves.
  • AI-Generated Content Contaminating Training Data Causes Model Collapse — proportional real-data replay effectively delays degradation. (A replay sketch follows this list.)
  • Stanford Dissects Deployment Reliability of Learned Robot Policies — distribution shift, error accumulation, and task dependency chains as three failure dimensions.
  • MoE+LoRA Dynamic Routing's Actual Inference Cost Far Exceeds Theoretical FLOPs — AdaFuse uses token-level pre-gating and fused kernels to close this gap.
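
The adaptive allocation idea from the EVATok item reduces to complexity-weighted budgeting. In this toy sketch the complexity proxy (mean inter-frame difference) and every parameter are assumptions for illustration:

    import numpy as np

    def allocate_tokens(segments, total_budget=2048, min_tokens=16):
        # segments: list of (T, H, W) frame arrays. Static segments score
        # low on the motion proxy and get few tokens; dynamic ones get more.
        complexity = np.array([
            np.abs(np.diff(seg.astype(np.float32), axis=0)).mean() + 1e-6
            for seg in segments
        ])
        weights = complexity / complexity.sum()
        budgets = np.maximum(min_tokens, (weights * total_budget).astype(int))
        # Clamping can overshoot the budget slightly; a real scheme
        # would renormalize.
        return budgets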
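
A proportional-replay scheme like the one in the model-collapse item is simple to sketch. The fraction and sampling scheme below are illustrative, not the paper's settings:

    import random

    def replay_batch(fresh_pool, real_pool, batch_size=64, real_fraction=0.3):
        # Mix a fixed share of held-out real samples into every batch drawn
        # from the (possibly AI-contaminated) fresh pool.
        n_real = int(batch_size * real_fraction)
        batch = random.sample(real_pool, n_real)
        batch += random.sample(fresh_pool, batch_size - n_real)
        random.shuffle(batch)
        return batch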

Read the full edition →
