Drop 90% of Vision Tokens, Keep the Performance
- Spatial relationships in image generation can now be optimized, not just hoped for. SpatialScore trains a reward model that outperforms GPT-4V on spatial evaluation, then uses it to RL-fine-tune generators. CVPR accepted, dataset open-sourced.
- Masked image generation gets a 4x speedup with no quality loss. Learning feature dynamics replaces static caching, recovering semantic information that discrete sampling throws away.
- VLM quantization can't be one-size-fits-all. Vision and language tokens have different distributions. MoE-style dynamic error compensation routes each token type through separate repair paths. Works from 2B to 70B, CVPR accepted.
- 90% of vision tokens are droppable. HiDrop finds that early layers handle feature alignment and shouldn't be pruned; the key is a layer-aware strategy matched to each layer's actual role. ICLR accepted.
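The HiDrop bullet's core idea can be illustrated with a minimal sketch. This is not the paper's implementation — the function name, the `align_layers` cutoff, and scoring by attention received are all assumptions — but it shows the layer-aware pattern: leave early alignment layers untouched, then keep only the top ~10% of vision tokens in deeper layers.

```python
import torch

def layer_aware_prune(tokens: torch.Tensor, attn_scores: torch.Tensor,
                      layer_idx: int, align_layers: int = 4,
                      keep_ratio: float = 0.1) -> torch.Tensor:
    """Hypothetical sketch of layer-aware vision-token pruning.

    tokens:      (B, N, D) vision tokens at the current layer
    attn_scores: (B, N) importance proxy (e.g. attention each token receives)
    """
    if layer_idx < align_layers:
        # Early layers do feature alignment -- never prune here.
        return tokens
    k = max(1, int(tokens.shape[1] * keep_ratio))
    topk = attn_scores.topk(k, dim=1).indices                  # (B, k)
    idx = topk.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])  # (B, k, D)
    return tokens.gather(1, idx)                               # (B, k, D)
```

With the defaults above, a 100-token sequence passes through layers 0–3 intact and shrinks to 10 tokens from layer 4 onward.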
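The VLM-quantization bullet's "separate repair paths per token type" can also be sketched. Again an illustration, not the paper's method — the class name, the low-rank experts, and the boolean routing mask are assumptions — but it captures the MoE-style idea: a quantized layer's output gets a learned correction, routed by whether each token is vision or text.

```python
import torch
import torch.nn as nn

class TypeRoutedCompensator(nn.Module):
    """Hypothetical sketch: per-token-type error compensation for a
    quantized layer. Vision and language tokens follow different
    distributions, so each type gets its own low-rank repair expert."""

    def __init__(self, dim: int, rank: int = 8):
        super().__init__()
        self.experts = nn.ModuleDict({
            "vision": nn.Sequential(nn.Linear(dim, rank, bias=False),
                                    nn.Linear(rank, dim, bias=False)),
            "text":   nn.Sequential(nn.Linear(dim, rank, bias=False),
                                    nn.Linear(rank, dim, bias=False)),
        })

    def forward(self, quant_out: torch.Tensor, hidden: torch.Tensor,
                is_vision: torch.Tensor) -> torch.Tensor:
        # is_vision: (B, N) bool mask, True where the token is a vision token.
        correction = torch.where(is_vision.unsqueeze(-1),
                                 self.experts["vision"](hidden),
                                 self.experts["text"](hidden))
        return quant_out + correction
```

Because the routing is a static mask rather than a learned gate, the extra cost is just two low-rank projections, which is what makes this plausible from 2B up to 70B scale.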
Also Notable
- Automated Caching for Diffusion Model Inference — sensitivity analysis decides which steps can be reused, turning cache strategy from hand-tuning to data-driven.
- Real-Time Multimodal Interaction With Simultaneous Speech and Vision Output — targets embodied agent scenarios where current systems handle only single-modality output.
- NVIDIA Bridges Neural Reconstruction and Photorealistic Simulation — online diffusion enhancement brings neural-reconstruction-based simulators closer to real sensor quality for autonomous driving.
- Zero-Shot Scene-Aware Object Replacement Without Per-Object Fine-Tuning — initial noise perturbation handles object swaps while preserving scene coherence.
- Static Benchmarks Can't Keep Up With Model Evolution — agent-driven dynamic evaluation protocol evolves test items alongside model capabilities.
- Reflective RL for Emotional Reasoning in MLLMs — addresses poor generalization of SFT on affective understanding tasks through chain-of-thought reflection.
- Unified Token Pruning for Visual Tracking — prunes both template and search region tokens simultaneously for real-time deployment.
- Interpretable Debiasing for VLMs via Reasoning Chain Intervention — moves bias correction from black-box post-processing to transparent, auditable reasoning-level fixes.
- Dynamic Retrieval and Topological Constraints for Dataset Distillation — breaks the diversity bottleneck of static anchor points, improving synthetic dataset representativeness.
- First Benchmark for Small-Object Editing in Instruction-Based Models — fills a blind spot in existing evaluations that miss fine-grained editing capability.
Don't miss what's next. Subscribe to AI Research Brief.