3D at 0.1% Tokens, Video Fine-Tuning's Hidden Spatial Cost
- Misaligned experience replay is a silent bottleneck in agent RL. Complementary RL lets the experience extractor adapt based on policy performance, enabling co-evolution instead of static accumulation.
- Video-SFT's temporal gains come at the cost of spatial understanding. Systematic experiments across architectures and scales confirm this is a structural trade-off, not a model-specific bug.
- Video generation as auxiliary supervision for robot policies, disabled at deployment. GigaWorld-Policy's decoupled design runs 9x faster than Motus with a 7% higher success rate.
- 3D tokenization shifts from geometric to semantic hierarchy. LoST achieves better reconstruction quality with 0.1% of prior methods' token count.
- Token pruning across ViT and LLM finally unified. STTS is end-to-end trainable, with efficiency gains that scale as you sample more frames for long videos.
Also Notable
- Discrete Audio Tokens at 12.5fps for Autoregressive Speech Generation. A complete, deployable open-source foundation model pipeline. MOSS-TTS
- Joint Appearance and Stereo Geometry from Pure RGB. End-to-end stereo video generation without depth maps. StereoWorld
- MLLMs Expose Systematic Hallucination on Fine-Grained Negation. Passable coarse-grained scores mask failure modes at finer granularity. FINER
- Hierarchical Grids Compress Long Video Navigation to Log-Scale Compute. Operates directly on raw frames, no caption preprocessing needed. VideoAtlas
- Majority Voting in Label-Free RL Collapses Output Diversity. Co-evolving generator and verifier breaks the consensus trap. CoVerRL
- RL-Trained Code Search Agent for Precise Large-Repo Localization. The upstream bottleneck for coding agents: get localization wrong and everything downstream fails. CodeScout
- Integrated Gradients Guide Layer-Wise Mixed-Precision LVLM Quantization. Quantization sensitivity visualization directly cuts deployment cost. QAIG
- GQA-to-MLA via Low-Rank Factorization. Drops KV-cache overhead without retraining. CARE
- A New Fix for DPO's Squeezing Effect. Sharpness-aware optimization in logit space balances alignment and generalization. LogitSAM
- Adaptive Zoom and Instruction Refinement for Small GUI Elements. A practical GRPO-trained approach. AdaZoom-GUI
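To make the CoVerRL item above concrete, here is a minimal sketch of the generic majority-vote pseudo-labeling scheme that label-free RL setups commonly use. It is not CoVerRL's implementation; the function name `majority_vote_rewards` and the sample answers are hypothetical, chosen only for illustration.

```python
from collections import Counter

def majority_vote_rewards(answers):
    """Treat the most frequent sampled answer as the pseudo-label.

    Returns (pseudo_label, rewards), where each sample gets reward 1.0
    if it agrees with the majority and 0.0 otherwise. This is an
    illustrative assumption about the reward shape, not a paper's API.
    """
    counts = Counter(answers)
    pseudo_label, _ = counts.most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards

# Hypothetical final answers sampled from one prompt.
samples = ["42", "42", "41", "42", "7"]
label, rewards = majority_vote_rewards(samples)
print(label, rewards)
```

Because only answers that match the consensus are rewarded, minority answers are pushed down whether or not they are correct, which is one way to see why output diversity can shrink under this kind of signal.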