Diffusion OCR Decodes 3.2x Faster, Single-Stream AV in 2 Seconds
- Diffusion Decoding Replaces Autoregressive OCR, Going From Serial to Parallel. MinerU-Diffusion reframes document parsing as inverse rendering, using block-wise diffusion to generate structured source in parallel. 3.2x faster decoding, open-source.
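A toy step-count comparison of the serial-vs-parallel claim. The block size and denoise-step count below are illustrative assumptions, not MinerU-Diffusion's actual settings:

```python
import math

def ar_steps(n_tokens):
    # Autoregressive decoding: one forward pass per output token.
    return n_tokens

def blockwise_diffusion_steps(n_tokens, block_size, denoise_steps):
    # Block-wise diffusion: every token in a block is denoised jointly,
    # so cost scales with (number of blocks) x (denoise steps) instead
    # of with the token count.
    n_blocks = math.ceil(n_tokens / block_size)
    return n_blocks * denoise_steps

serial = ar_steps(1024)                               # 1024 passes
parallel = blockwise_diffusion_steps(1024, 128, 16)   # 8 blocks x 16 steps
```

With these toy numbers the parallel scheme needs 128 passes versus 1024, which is where a multi-x decoding speedup can come from in practice.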
- RLVR Update Direction Matters More Than Magnitude. The sign of token-level Δlog p pinpoints sparse, reasoning-critical updates more precisely than magnitude metrics. Two resulting methods need no architecture changes.
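A minimal sketch of the sign-over-magnitude idea: flag each token by the direction its log-probability moved between policy snapshots, ignoring how far it moved. The function name and inputs are illustrative, not the paper's API:

```python
import numpy as np

def sign_update_mask(logp_new, logp_old):
    """Per-token sign of the log-prob change between two policy
    snapshots: +1 (up), -1 (down), 0 (unchanged). A hypothetical
    illustration of direction-based token selection."""
    delta = logp_new - logp_old
    return np.sign(delta)

old = np.log(np.array([0.2, 0.5, 0.1]))
new = np.log(np.array([0.3, 0.4, 0.1]))
mask = sign_update_mask(new, old)  # one label per token
```

The point of the direction view: a tiny positive delta and a huge positive delta get the same label, so sparse but consistently signed tokens are not drowned out by a few large-magnitude updates.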
- Multi-Task SFT Wastes Compute You Don't See. Sub-datasets overfit at wildly different rates. mSFT iteratively drops the earliest overfitters, cutting FLOPs and improving results under low budgets.
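The iterative-dropping idea can be sketched as follows; the overfitting signal here (validation loss rising above its best value) is a simple proxy of my choosing, not necessarily mSFT's exact criterion:

```python
def prune_overfitters(val_losses):
    """Given per-sub-dataset validation loss histories, return the
    names of sub-datasets still worth training on. A sub-dataset is
    dropped once its latest validation loss exceeds its best so far
    (an illustrative overfitting proxy)."""
    keep = []
    for name, history in val_losses.items():
        if history[-1] <= min(history):
            keep.append(name)
    return keep

losses = {
    "math": [1.2, 1.0, 0.9],  # still improving -> keep
    "code": [1.5, 1.1, 1.3],  # loss rising again -> drop
}
active = prune_overfitters(losses)
```

Rechecking this after every epoch gives the iterative behavior: the earliest overfitters leave the mixture first, and their FLOPs go to sub-datasets that still benefit.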
- Video GRPO Instability Traced to Off-Manifold Exploration Noise. The ODE-to-SDE switch pushes sampling trajectories off the pretrained data manifold. SAGE-GRPO fixes this with manifold-projected exploration and dual trust regions, validated on HunyuanVideo.
- Joint Audio-Video Generation Doesn't Need a Multi-Stream Architecture. Text, video, and audio tokens live in a single sequence processed with plain self-attention. Generates a 5-second video in 2 seconds on one H100; full model stack open-sourced.
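A minimal sketch of the single-sequence layout: modality tokens are concatenated into one flat sequence with a per-token modality tag, so one self-attention stack sees everything. The tag ids (0/1/2) are illustrative assumptions, not the model's actual vocabulary:

```python
import numpy as np

def build_joint_sequence(text_tokens, video_tokens, audio_tokens):
    """Flatten three modalities into one token sequence plus a
    parallel modality-tag array (0=text, 1=video, 2=audio), ready
    for a single plain self-attention stack."""
    ids = np.concatenate([text_tokens, video_tokens, audio_tokens])
    tags = np.concatenate([
        np.zeros(len(text_tokens), dtype=int),
        np.ones(len(video_tokens), dtype=int),
        np.full(len(audio_tokens), 2, dtype=int),
    ])
    return ids, tags

ids, tags = build_joint_sequence([5, 6], [100, 101, 102], [300])
```

Because everything sits in one sequence, cross-modal conditioning is just ordinary attention between positions, with no fusion modules or separate per-modality streams to synchronize.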
Also Notable
- World Model Evaluation Shifts From Visual Fidelity to 4D Interaction — A new evaluation paradigm centered on physics consistency and controllability. Omni-WorldBench
- LLM Agent Workflows: Static Templates to Dynamic Runtime Graphs — Systematic survey organized by "when is structure determined," directly useful for architecture decisions. From Static Templates to Dynamic Runtime Graphs
- Inject 3D Spatial Awareness Without Touching the Vision Encoder — Language-guided reasoning extracts overlooked spatial understanding from 2D pretrained representations. SpatialBoost
- Repurpose Geometric Foundation Model Features as Diffusion Latent Space — Multi-view geometric consistency built in, not post-processed. Repurposing Geometric Foundation Models
- A New Fix for Recursive Self-Improvement Drift — Symbolic verification as anchors stabilizes reasoning chain quality across DPO iterations. Symbolic Recursive Self-Alignment
- Unified Spatiotemporal Token Compression for Video LLMs — Maintains performance at ultra-low retention rates, more efficient than staged pruning. Unified Spatiotemporal Token Compression
- Teach Speech Models to Respect Duration Constraints — A hard requirement for voice assistant deployment; MIT open-sources a post-training approach. TiCo
- Continual Unlearning for Multimodal LLMs — Selectively refuse under sequential deletion requests without destroying shared representations. Continual Unlearning for LVLMs
- Unify Three Hand-Object Interaction Tracks Into One Sim-to-Real Framework — Pose, appearance, and motion generation in a single pipeline. PAM
- Test-Time Scaling for Image Restoration — Adapt flow matching models to degradation types at inference without modifying pretrained weights. Tuning Real-World IR at Inference