LLM-Initialized Vision Encoders Outperform Larger Models at 2B
- An LLM-Initialized Vision Encoder at 2B Beats Larger Models on Multiple Benchmarks. Contrastive pretraining optimizes for coarse-grained matching; VLMs need fine-grained understanding. Changing the starting point beats adding parameters. (Initialization sketch after this list.)
- Skip the Search, Use Block Statistics to Locate Sparse Attention Patterns. 27x speedup at 256K sequence length, 1.7x even at 4K. Code is open-sourced. (Block-selection sketch after this list.)
- A Physics Simulator Embedded in the Diffusion Loop Guides Video Generation With Simulated Trajectories. This competes directly with RealWonder's "bypass physics" approach. CVPR accepted. (Guidance-style sketch after this list.)
- Model Merging Breaks Down Because Task Vectors Drift in Direction. DC-Merge restores directional consistency via energy balancing and orthogonal projection. Works for both full fine-tuning and LoRA. CVPR accepted. (Merging sketch after this list.)
- DiT Decides Where to Allocate More Tokens and Where to Compress. Adaptive token allocation across both spatial and temporal dimensions, fine-tunable from existing checkpoints. (Allocation sketch after this list.)
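The sketches below are illustrative readings of the items above, not the authors' code. For the LLM-initialized encoder, here is a minimal sketch of the initialization idea, assuming the LLM's transformer blocks are reused with matching hidden size, accept plain (batch, tokens, dim) inputs, and can run without a causal mask; `LLMInitViT` and its wiring are hypothetical.

```python
import torch
import torch.nn as nn

class LLMInitViT(nn.Module):
    """Vision encoder whose transformer stack is copied from a pretrained LLM
    instead of being randomly initialized (illustrative sketch)."""
    def __init__(self, llm_blocks: nn.ModuleList, hidden_dim: int,
                 patch_size: int = 14, image_size: int = 224):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Vision-specific modules are new and trained from scratch.
        self.patch_embed = nn.Conv2d(3, hidden_dim, kernel_size=patch_size,
                                     stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, hidden_dim))
        # The starting point changes here: blocks come from the LLM.
        self.blocks = llm_blocks
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        x = self.patch_embed(images).flatten(2).transpose(1, 2)  # (B, N, D)
        x = x + self.pos_embed
        for block in self.blocks:
            x = block(x)  # assumes each block runs without a causal mask
        return self.norm(x)  # patch features for a VLM connector
```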
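For the block-statistics item, a minimal sketch of the general recipe: score query blocks against block-mean keys, then attend only to the top-scoring key blocks. Mean pooling as the block statistic and all names are assumptions; the open-sourced kernel will differ substantially, and this loop form is for clarity, not speed.

```python
import torch
import torch.nn.functional as F

def block_sparse_attention(q, k, v, block=128, keep=8):
    """q, k, v: (B, H, N, D), N divisible by `block`."""
    B, H, N, D = q.shape
    nb = N // block
    keep = min(keep, nb)
    qs = q.view(B, H, nb, block, D)
    ks = k.view(B, H, nb, block, D)
    vs = v.view(B, H, nb, block, D)
    # Block statistics: one mean vector per block stands in for its tokens,
    # so locating the sparse pattern costs (N/block)^2 scores, not N^2.
    approx = (qs.mean(3) @ ks.mean(3).transpose(-1, -2)) / D**0.5  # (B,H,nb,nb)
    top = approx.topk(keep, dim=-1).indices                        # (B,H,nb,keep)

    out = torch.zeros_like(q).view(B, H, nb, block, D)
    for i in range(nb):  # dense attention only over the selected key blocks
        idx = top[:, :, i, :, None, None].expand(-1, -1, -1, block, D)
        sel_k = torch.gather(ks, 2, idx).reshape(B, H, keep * block, D)
        sel_v = torch.gather(vs, 2, idx).reshape(B, H, keep * block, D)
        attn = F.softmax(qs[:, :, i] @ sel_k.transpose(-1, -2) / D**0.5, dim=-1)
        out[:, :, i] = attn @ sel_v
    return out.view(B, H, N, D)
```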
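For the simulator-in-the-loop item, a generic guidance-style sketch under the assumption that the simulator steers denoising through a differentiable penalty; `denoise`, `decode_state`, and `sim_trajectory` are hypothetical stand-ins, and the paper's actual coupling may be quite different.

```python
import torch

def physics_guided_step(denoise, decode_state, sim_trajectory, x_t, t, scale=0.1):
    """One reverse-diffusion step nudged toward a simulated trajectory.
    `denoise` is one ordinary sampler step; `decode_state` maps a latent to
    physical state (e.g. object positions); both must be differentiable."""
    x_t = x_t.detach().requires_grad_(True)
    x_prev = denoise(x_t, t)
    # Penalty: the decoded motion should track the simulator's state at step t.
    loss = ((decode_state(x_prev) - sim_trajectory[t]) ** 2).mean()
    grad, = torch.autograd.grad(loss, x_t)
    return (x_prev - scale * grad).detach()
```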
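For DC-Merge, the summary names energy balancing and orthogonal projection but not their exact form, so this sketch guesses one plausible reading: project out the component of each task vector that opposes the consensus direction, then rescale all vectors to a common norm before averaging.

```python
import torch

def directional_merge(base, finetuned, alpha=1.0):
    """base: dict of parameter tensors; finetuned: list of such dicts."""
    merged = {}
    for name, w0 in base.items():
        taus = [ft[name] - w0 for ft in finetuned]      # task vectors
        u = torch.stack(taus).mean(0)
        u = u / (u.norm() + 1e-8)                        # consensus direction
        fixed = []
        for tau in taus:
            coef = (tau * u).sum()
            if coef < 0:
                # Orthogonal projection: drop the component opposing consensus.
                tau = tau - coef * u
            fixed.append(tau)
        # "Energy balancing" guess: rescale every vector to the mean norm.
        target = torch.stack([t.norm() for t in fixed]).mean()
        fixed = [t * (target / (t.norm() + 1e-8)) for t in fixed]
        merged[name] = w0 + alpha * torch.stack(fixed).mean(0)
    return merged
```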
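For the DiT item, a minimal sketch of learned token allocation, assuming a linear router scores flattened spatio-temporal tokens and the low-scoring remainder is compressed into a summary token; the router, the keep ratio, and the pooling scheme are all assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class TokenAllocator(nn.Module):
    def __init__(self, dim: int, keep_ratio: float = 0.5):
        super().__init__()
        self.score = nn.Linear(dim, 1)   # learned importance per token
        self.keep_ratio = keep_ratio

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (B, N, D) flattened spatio-temporal video tokens."""
        B, N, D = x.shape
        k = max(1, int(N * self.keep_ratio))
        s = self.score(x).squeeze(-1)                          # (B, N)
        keep = s.topk(k, dim=1).indices                        # tokens kept intact
        kept = torch.gather(x, 1, keep[..., None].expand(-1, -1, D))
        # Compress everything else into one mean-pooled summary token.
        mask = torch.ones(B, N, dtype=torch.bool, device=x.device)
        mask.scatter_(1, keep, False)
        denom = mask.sum(1)[:, None, None].clamp(min=1)
        rest = (x * mask[..., None]).sum(1, keepdim=True) / denom
        return torch.cat([kept, rest], dim=1)                  # (B, k+1, D)
```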
Also Notable
- Complete 3D Indoor Scene Mesh From a Single RGB Image — One forward pass, no post-processing optimization.
- Black-Box Backdoor Detection for T2I Models — Via instruction-response deviation, not image similarity.
- Understanding as Intrinsic Reward to Improve Generation Quality — A new training signal for unified multimodal models.
- Training-Free Diffusion Segmenters Scale With the Underlying Generator — Stronger generation means more accurate segmentation.
- Domain-Label-Free Multimodal Summarization — Decomposes video structure through event chains.
- Cross-Modal CoT Reasoning for Tumor Analysis in Medical Imaging — Each step traceable to specific imaging evidence.
- Signal Processing View of SGD Momentum — Finds exploitable frequency structure in gradients. (Worked filter example after this list.)
- Feed-Forward 360° 3D Scene From a Single Panorama — Compositional generation, no iterative layout optimization.
- Change Descriptions That Model the Process, Not Just the Outcome — Captures intermediate change dynamics, not only final differences.
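For the momentum item, the signal-processing reading itself is standard: heavy-ball momentum v_t = β·v_{t-1} + g_t is a one-pole IIR low-pass filter on the gradient sequence, with transfer function H(ω) = 1 / (1 − β·e^{−iω}), so low-frequency gradient structure is amplified while high-frequency noise is attenuated. A few lines confirm the gains (for β = 0.9: DC gain 10, Nyquist gain about 0.53):

```python
import numpy as np

beta = 0.9
w = np.linspace(1e-3, np.pi, 512)           # normalized frequency (rad/step)
H = 1.0 / (1.0 - beta * np.exp(-1j * w))    # momentum's transfer function

print(f"DC gain:      {abs(1 / (1 - beta)):.1f}")   # low frequencies amplified 10x
print(f"Nyquist gain: {abs(H[-1]):.2f}")            # high-frequency noise damped
```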