Diffusion OCR Decodes 3.2x Faster, Single-Stream AV in 2 Seconds
- Diffusion Decoding Replaces Autoregressive OCR, Going From Serial to Parallel. MinerU-Diffusion reframes document parsing as inverse rendering, using block-wise diffusion to generate structured source in parallel. 3.2x faster decoding, open-source.
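A toy step-count comparison of the serial-vs-parallel claim. The block size and denoise-step count below are illustrative assumptions, not MinerU-Diffusion's actual settings:

```python
import math

def ar_steps(n_tokens):
    # Autoregressive decoding: one forward pass per output token.
    return n_tokens

def blockwise_diffusion_steps(n_tokens, block_size, denoise_steps):
    # Block-wise diffusion: every token in a block is denoised jointly,
    # so cost scales with (number of blocks) x (denoise steps) instead
    # of with the token count.
    n_blocks = math.ceil(n_tokens / block_size)
    return n_blocks * denoise_steps

serial = ar_steps(1024)                               # 1024 passes
parallel = blockwise_diffusion_steps(1024, 128, 16)   # 8 blocks x 16 steps
```

With these toy numbers the parallel scheme needs 128 passes versus 1024, which is where a multi-x decoding speedup can come from in practice.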
- RLVR Update Direction Matters More Than Magnitude. The sign of token-level Δlog p pinpoints sparse, reasoning-critical updates more precisely than magnitude metrics. Two resulting methods need no architecture changes.
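A minimal sketch of the sign-over-magnitude idea: flag each token by the direction its log-probability moved between policy snapshots, ignoring how far it moved. The function name and inputs are illustrative, not the paper's API:

```python
import numpy as np

def sign_update_mask(logp_new, logp_old):
    """Per-token sign of the log-prob change between two policy
    snapshots: +1 (up), -1 (down), 0 (unchanged). A hypothetical
    illustration of direction-based token selection."""
    delta = logp_new - logp_old
    return np.sign(delta)

old = np.log(np.array([0.2, 0.5, 0.1]))
new = np.log(np.array([0.3, 0.4, 0.1]))
mask = sign_update_mask(new, old)  # one label per token
```

The point of the direction view: a tiny positive delta and a huge positive delta get the same label, so sparse but consistently signed tokens are not drowned out by a few large-magnitude updates.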
- Multi-Task SFT Wastes Compute You Don't See. Sub-datasets overfit at wildly different rates. mSFT iteratively drops the earliest overfitters, cutting FLOPs and improving results under low budgets.
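The iterative-dropping idea can be sketched as follows; the overfitting signal here (validation loss rising above its best value) is a simple proxy of my choosing, not necessarily mSFT's exact criterion:

```python
def prune_overfitters(val_losses):
    """Given per-sub-dataset validation loss histories, return the
    names of sub-datasets still worth training on. A sub-dataset is
    dropped once its latest validation loss exceeds its best so far
    (an illustrative overfitting proxy)."""
    keep = []
    for name, history in val_losses.items():
        if history[-1] <= min(history):
            keep.append(name)
    return keep

losses = {
    "math": [1.2, 1.0, 0.9],  # still improving -> keep
    "code": [1.5, 1.1, 1.3],  # loss rising again -> drop
}
active = prune_overfitters(losses)
```

Rechecking this after every epoch gives the iterative behavior: the earliest overfitters leave the mixture first, and their FLOPs go to sub-datasets that still benefit.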
- Video GRPO Instability Traced to Off-Manifold Exploration Noise. The ODE-to-SDE switch pushes sampling trajectories off the pretrained data manifold. SAGE-GRPO fixes this with manifold-projected exploration and dual trust regions, validated on HunyuanVideo.
- Joint Audio-Video Generation Doesn't Need a Multi-Stream Architecture. Text, video, and audio tokens live in a single sequence processed with plain self-attention. Generates a 5-second video in 2 seconds on one H100; full model stack open-sourced.
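A minimal sketch of the single-sequence layout: modality tokens are concatenated into one flat sequence with a per-token modality tag, so one self-attention stack sees everything. The tag ids (0/1/2) are illustrative assumptions, not the model's actual vocabulary:

```python
import numpy as np

def build_joint_sequence(text_tokens, video_tokens, audio_tokens):
    """Flatten three modalities into one token sequence plus a
    parallel modality-tag array (0=text, 1=video, 2=audio), ready
    for a single plain self-attention stack."""
    ids = np.concatenate([text_tokens, video_tokens, audio_tokens])
    tags = np.concatenate([
        np.zeros(len(text_tokens), dtype=int),
        np.ones(len(video_tokens), dtype=int),
        np.full(len(audio_tokens), 2, dtype=int),
    ])
    return ids, tags

ids, tags = build_joint_sequence([5, 6], [100, 101, 102], [300])
```

Because everything sits in one sequence, cross-modal conditioning is just ordinary attention between positions, with no fusion modules or separate per-modality streams to synchronize.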
Also Notable
- World Model Evaluation Shifts From Visual Fidelity to 4D Interaction — A new evaluation paradigm centered on physics consistency and controllability. Omni-WorldBench
- LLM Agent Workflows: Static Templates to Dynamic Runtime Graphs — Systematic survey organized by "when is structure determined," directly useful for architecture decisions. From Static Templates to Dynamic Runtime Graphs
- Inject 3D Spatial Awareness Without Touching the Vision Encoder — Language-guided reasoning extracts overlooked spatial understanding from 2D pretrained representations. SpatialBoost
- Repurpose Geometric Foundation Model Features as Diffusion Latent Space — Multi-view geometric consistency built in, not post-processed. Repurposing Geometric Foundation Models
- A New Fix for Recursive Self-Improvement Drift — Symbolic verification as anchors stabilizes reasoning chain quality across DPO iterations. Symbolic Recursive Self-Alignment
- Unified Spatiotemporal Token Compression for Video LLMs — Maintains performance at ultra-low retention rates, more efficient than staged pruning. Unified Spatiotemporal Token Compression
- Teach Speech Models to Respect Duration Constraints — A hard requirement for voice assistant deployment; MIT open-sources a post-training approach. TiCo
- Continual Unlearning for Multimodal LLMs — Selectively refuse under sequential deletion requests without destroying shared representations. Continual Unlearning for LVLMs
- Unify Three Hand-Object Interaction Tracks Into One Sim-to-Real Framework — Pose, appearance, and motion generation in a single pipeline. PAM
- Test-Time Scaling for Image Restoration — Adapt flow matching models to degradation types at inference without modifying pretrained weights. Tuning Real-World IR at Inference