3D at 0.1% Tokens, Video Fine-Tuning's Hidden Spatial Cost

March 20, 2026


Misaligned experience replay is a silent bottleneck in agent RL. Complementary RL lets the experience extractor adapt based on policy performance, enabling co-evolution instead of static accumulation.

Video-SFT's temporal gains come at the cost of spatial understanding. Systematic experiments across architectures and scales confirm this is a structural trade-off, not a model-specific bug.

Video generation as auxiliary supervision for robot policies, disabled at deployment. GigaWorld-Policy's decoupled design runs 9x faster than Motus with a 7% higher success rate.

3D tokenization shifts from geometric to semantic hierarchy. LoST achieves better reconstruction quality with 0.1% of prior methods' token count.

Token pruning across ViT and LLM finally unified. STTS is end-to-end trainable, with efficiency gains that scale as you sample more frames for long videos.

Also Notable

Discrete Audio Tokens at 12.5fps for Autoregressive Speech Generation. A complete, deployable open-source foundation model pipeline. MOSS-TTS
Joint Appearance and Stereo Geometry from Pure RGB. End-to-end stereo video generation without depth maps. StereoWorld
MLLMs Expose Systematic Hallucination on Fine-Grained Negation. Passable coarse-grained scores mask failure modes at finer granularity. FINER
Hierarchical Grids Compress Long Video Navigation to Log-Scale Compute. Operates directly on raw frames, no caption preprocessing needed. VideoAtlas
Majority Voting in Label-Free RL Collapses Output Diversity. Co-evolving generator and verifier breaks the consensus trap. CoVerRL
RL-Trained Code Search Agent for Precise Large-Repo Localization. The upstream bottleneck for coding agents: get localization wrong and everything downstream fails. CodeScout
Integrated Gradients Guide Layer-Wise Mixed-Precision LVLM Quantization. Quantization sensitivity visualization directly cuts deployment cost. QAIG
GQA-to-MLA via Low-Rank Factorization. Drops KV-cache overhead without retraining. CARE
A New Fix for DPO's Squeezing Effect. Sharpness-aware optimization in logit space balances alignment and generalization. LogitSAM
Adaptive Zoom and Instruction Refinement for Small GUI Elements. A practical GRPO-trained approach. AdaZoom-GUI

Read the full edition →
