3B Params Win Three Olympiad Golds, 768-D Discrete Tokens Work
- Cascade RL plus multi-domain distillation lets 3B active parameters win three olympiad golds. NVIDIA open-sourced the full training recipe. Small-model reasoning ceilings just moved.
- Video diffusion models already encode full 3D spatial priors internally. No 3D annotations or geometry modules needed. Extract intermediate features and you get depth and scene flow prediction.
- 768-dimensional discrete tokens serve both understanding and generation. CubiD uses fine-grained masked diffusion to sidestep the combinatorial explosion of high-dimensional discrete token spaces. One fewer barrier to unified multimodal architectures.
- Reaction latency, not trajectory smoothness, is the real VLA deployment bottleneck. FASTER provides an explicit latency formula and compresses reactive denoising by roughly 10x.
- Agents that build and iterate on their own skills outperform external skill injection. But percentage gains on extremely low baselines deserve a sober second look.
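The "extract intermediate features, get depth for free" recipe above follows a standard probing pattern: freeze the backbone, read out a feature map, and fit a lightweight head against a geometric target. Here is a toy sketch of that pattern, not the paper's code: the "diffusion features" are faked with a fixed random projection, the depth target is synthetic, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

H, W, C = 16, 16, 8                                   # spatial size, (fake) feature channels
n_frames = 32

frames = rng.normal(size=(n_frames, H, W, 3))         # stand-in video frames
proj = rng.normal(size=(3, C))                        # frozen "backbone" (random projection here)
feats = frames @ proj                                 # (n, H, W, C) "intermediate features"

# Synthetic ground-truth depth that is linear in the features (plus noise),
# so a linear probe can recover it; real backbones only approximate this.
w_true = rng.normal(size=(C,))
depth = feats @ w_true + 0.01 * rng.normal(size=(n_frames, H, W))

X = feats.reshape(-1, C)                              # one row per pixel
y = depth.reshape(-1)

# Ridge-regularized linear probe, closed form: w = (X^T X + lam*I)^{-1} X^T y
lam = 1e-3
w_hat = np.linalg.solve(X.T @ X + lam * np.eye(C), X.T @ y)

pred = X @ w_hat
r2 = 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
print(f"probe R^2 on held-in pixels: {r2:.3f}")
```

The point of the probing framing: if a frozen, task-agnostic feature map supports a near-linear depth readout, the geometric prior lives in the backbone, not in the head.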
Also Notable
- Semantic Editing and Motion Preservation No Longer Fight Each Other. SAMA decouples the two objectives into independent optimization paths without external priors.
- 3DreamBooth Uses Multi-View 3D Representations for Subject-Driven Video Generation. View consistency stops being luck-dependent; objects are no longer treated as 2D.
- Long Video + Audio Cross-Modal Understanding Gets a Systematic Benchmark. Current OmniLLMs collapse on cross-modal tasks beyond 10 minutes.
- Diffusion for Discrete Motion Tokens. Handles semantic conditioning and kinematic constraints simultaneously, merging two previously incompatible motion generation paradigms.
- Video Diffusion Denoising Steps Vary Wildly in Precision Sensitivity. Step-level adaptive quantization pushes models down to 6-bit.
- RL Alignment for Diffusion Language Models Requires Full Diffusion Probability per Step. Meta uses trajectory reduction to slash the overhead.
- Procedural Diagnostic Environments Isolate Reasoning-Action Coupling in Tool-Augmented LLMs. Eliminates memorization and data contamination confounds. From CMU.
- When Should a Generalist Model Split into Domain Experts? EPFL provides an optimal splitting strategy that outperforms one-size-fits-all fine-tuning.
- From Cross-Domain Video Demonstrations to Executable Code. Neuro-symbolic counterfactual reasoning auto-adapts to perceptual differences across physical environments.
- Single-Image Reconstruction of Articulated 3D Objects. Progressive structural reasoning decomposes geometry, parts, and motion parameters layer by layer.
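The step-adaptive quantization item above rests on a generic fact worth seeing concretely: uniform quantization error shrinks as bit width grows, so a sensitivity-aware schedule can spend fewer bits on tolerant denoising steps. This is a minimal, generic sketch of that trade-off, not the paper's quantization scheme.

```python
import numpy as np

def quantize(x, bits):
    """Uniform symmetric quantization of x to the given bit width."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.round(x / scale) * scale

rng = np.random.default_rng(1)
x = rng.normal(size=10000)                # stand-in activation tensor

# Mean absolute error at each bit width; a step-adaptive scheme would
# assign low widths like 6-bit only to steps where this error is benign.
errs = {b: np.abs(x - quantize(x, b)).mean() for b in (4, 6, 8)}
print(errs)
```

Roughly, each extra bit quarters the per-element rounding error, which is why pushing only the insensitive steps to 6-bit can be nearly free.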