Speculative Execution Hits Agent Loops, Up to 3.35x Faster
- Speculative Execution Comes to Agent Loops, Up to 3.35x Speedup. SpecEyes borrows CPU branch prediction for multimodal agents: a small model predicts trajectories, and vision tool calls launch in parallel. Accuracy holds or improves.
- VLM Speedup Without Dropping Visual Tokens. VISOR replaces dense self-attention with sparse cross-attention, letting the language model query vision on demand. Full visual information retained, compute cost cut sharply. (CVPR)
- World Model Datasets Need Structure, Not Scale. WildWorld provides 108M frames with explicit action-state-observation decoupling, exposing the design flaw of coupling actions directly to pixels.
- RL Training Across Text and Image Generation Now Has a Unified Framework. UniGRPO models autoregressive text and flow-matching images as a single MDP, giving mixed-architecture post-training a reusable baseline.
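To make the speculative-execution item concrete, here is a minimal Python sketch of the general pattern: a cheap draft guesses upcoming tool calls, they are launched in parallel, and each result is kept only if the authoritative model later asks for that same call. This is not SpecEyes' actual implementation. `run_tool`, `draft_predict`, and `target_decide` are hypothetical stand-ins, and the tool names and plans are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def run_tool(call):
    # Hypothetical stand-in for a slow vision tool (crop, OCR, ...).
    name, arg = call
    return f"{name}({arg}) -> result"

def draft_predict(state):
    # Hypothetical small model: guesses the tool calls the agent will make.
    return [("crop", "region_a"), ("ocr", "region_a")]

def target_decide(state, step):
    # Hypothetical large model: the authoritative tool call for each step.
    plan = [("crop", "region_a"), ("ocr", "region_b")]
    return plan[step]

def speculative_loop(state, steps=2):
    guesses = draft_predict(state)
    with ThreadPoolExecutor() as pool:
        # Launch every guessed call up front instead of one call per step.
        futures = {call: pool.submit(run_tool, call) for call in guesses}
        results, hits = [], 0
        for step in range(steps):
            call = target_decide(state, step)
            if call in futures:
                # Correct guess: the result is already computed or in flight.
                results.append(futures[call].result())
                hits += 1
            else:
                # Wrong guess: fall back to running the call normally.
                results.append(run_tool(call))
    return results, hits

results, hits = speculative_loop(state={})
```

In this toy run one of the two guesses matches, so one tool call is served from the speculative pool and the other falls back to the normal path. A wrong guess costs wasted compute but does not change the final results, which is the property that lets accuracy hold.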
Also Notable
- GRPO Trains Video Agents to Select Frames Adaptively — No more brute-force full-frame processing; RL teaches the agent which frames are worth looking at. EVA
- Token-Level Analysis Exposes Blind Spots in Multimodal CoT — Visual grounding tokens and reasoning tokens need very different optimization pressure. Uniform updates hurt both. Rethinking Token-Level Policy Optimization
- Diffusion Intermediate Representations Carry Built-In Degradation Awareness — Optical flow estimation that finally handles blur, noise, and compression artifacts. DA-Flow
- MLLM Decomposes Static Meshes Into Articulable Assets in One Step — Shortens the data production pipeline for embodied AI. SIMART
- 3D Engine Controls the Scene, Video Diffusion Adds Realistic Lighting — A fresh approach to the sim-to-real gap. RealMaster
- Sort RL Rollouts by Generation Length — Reduces padding waste. One simple scheduling trick gives meaningful training throughput gains. SortedRL
- Conditions for Synthetic Data to Break the RAG Ceiling — Not more data, but a hybrid training strategy. Synthetic Mixed Training
- Over-Fragmentation in Video Object Segmentation Gets a Clean Fix — Start from few coarse slots, refine progressively with reconstruction-guided curriculum. Reconstruction-Guided Slot Curriculum
- Multi-Model Routing Goes From Offline Selection to Online Bandit Learning — Dynamically balances quality and diversity. DAK-UCB
- Overlay Temporal Markers Directly on Video Frames as Visual Prompts — VideoLLMs understand temporal relations without dense sampling. ViKey
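The rollout-sorting item above rests on simple arithmetic: a batch is padded to its longest sequence, so mixing short and long generations wastes slots. The sketch below only illustrates that effect and is not the SortedRL scheduler itself. The `padding_waste` helper and the rollout lengths are made up for the example.

```python
def padding_waste(lengths, batch_size):
    # Each batch is padded to its longest sequence; count the padded slots.
    waste = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        waste += sum(max(batch) - n for n in batch)
    return waste

# Made-up rollout lengths in tokens, mixing short and long generations.
lengths = [12, 480, 35, 510, 20, 495, 40, 500]

unsorted_waste = padding_waste(lengths, batch_size=4)        # 1948 padded tokens
sorted_waste = padding_waste(sorted(lengths), batch_size=4)  # 108 padded tokens
```

With these numbers, grouping rollouts of similar length cuts padding from 1948 to 108 tokens across the two batches.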