Drop 90% of Vision Tokens, Keep the Performance
- Spatial relationships in image generation can now be optimized, not just hoped for. SpatialScore trains a reward model that outperforms GPT-4V on spatial evaluation, then uses it to RL-fine-tune generators. CVPR accepted, dataset open-sourced.
- Masked image generation gets a 4x speedup with no quality loss. Learning feature dynamics replaces static caching, recovering semantic information that discrete sampling throws away.
- VLM quantization can't be one-size-fits-all. Vision and language tokens have different distributions. MoE-style dynamic error compensation routes each token type through separate repair paths. Works from 2B to 70B, CVPR accepted.
- 90% of vision tokens are droppable. HiDrop finds that early layers handle feature alignment and shouldn't be pruned; the key is a layer-aware strategy matched to each layer's actual role. ICLR accepted.
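The HiDrop bullet's core idea can be illustrated with a minimal sketch. This is not the paper's implementation — the function name, the `align_layers` cutoff, and scoring by attention received are all assumptions — but it shows the layer-aware pattern: leave early alignment layers untouched, then keep only the top ~10% of vision tokens in deeper layers.

```python
import torch

def layer_aware_prune(tokens: torch.Tensor, attn_scores: torch.Tensor,
                      layer_idx: int, align_layers: int = 4,
                      keep_ratio: float = 0.1) -> torch.Tensor:
    """Hypothetical sketch of layer-aware vision-token pruning.

    tokens:      (B, N, D) vision tokens at the current layer
    attn_scores: (B, N) importance proxy (e.g. attention each token receives)
    """
    if layer_idx < align_layers:
        # Early layers do feature alignment -- never prune here.
        return tokens
    k = max(1, int(tokens.shape[1] * keep_ratio))
    topk = attn_scores.topk(k, dim=1).indices                  # (B, k)
    idx = topk.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])  # (B, k, D)
    return tokens.gather(1, idx)                               # (B, k, D)
```

With the defaults above, a 100-token sequence passes through layers 0–3 intact and shrinks to 10 tokens from layer 4 onward.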
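The VLM-quantization bullet's "separate repair paths per token type" can also be sketched. Again an illustration, not the paper's method — the class name, the low-rank experts, and the boolean routing mask are assumptions — but it captures the MoE-style idea: a quantized layer's output gets a learned correction, routed by whether each token is vision or text.

```python
import torch
import torch.nn as nn

class TypeRoutedCompensator(nn.Module):
    """Hypothetical sketch: per-token-type error compensation for a
    quantized layer. Vision and language tokens follow different
    distributions, so each type gets its own low-rank repair expert."""

    def __init__(self, dim: int, rank: int = 8):
        super().__init__()
        self.experts = nn.ModuleDict({
            "vision": nn.Sequential(nn.Linear(dim, rank, bias=False),
                                    nn.Linear(rank, dim, bias=False)),
            "text":   nn.Sequential(nn.Linear(dim, rank, bias=False),
                                    nn.Linear(rank, dim, bias=False)),
        })

    def forward(self, quant_out: torch.Tensor, hidden: torch.Tensor,
                is_vision: torch.Tensor) -> torch.Tensor:
        # is_vision: (B, N) bool mask, True where the token is a vision token.
        correction = torch.where(is_vision.unsqueeze(-1),
                                 self.experts["vision"](hidden),
                                 self.experts["text"](hidden))
        return quant_out + correction
```

Because the routing is a static mask rather than a learned gate, the extra cost is just two low-rank projections, which is what makes this plausible from 2B up to 70B scale.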
Also Notable
- Automated Caching for Diffusion Model Inference — sensitivity analysis decides which steps can be reused, turning cache strategy from hand-tuning to data-driven.
- Real-Time Multimodal Interaction With Simultaneous Speech and Vision Output — targets embodied agent scenarios where current systems handle only single-modality output.
- NVIDIA Bridges Neural Reconstruction and Photorealistic Simulation — online diffusion enhancement brings neural-reconstruction-based simulators closer to real sensor quality for autonomous driving.
- Zero-Shot Scene-Aware Object Replacement Without Per-Object Fine-Tuning — initial noise perturbation handles object swaps while preserving scene coherence.
- Static Benchmarks Can't Keep Up With Model Evolution — agent-driven dynamic evaluation protocol evolves test items alongside model capabilities.
- Reflective RL for Emotional Reasoning in MLLMs — addresses poor generalization of SFT on affective understanding tasks through chain-of-thought reflection.
- Unified Token Pruning for Visual Tracking — prunes both template and search region tokens simultaneously for real-time deployment.
- Interpretable Debiasing for VLMs via Reasoning Chain Intervention — moves bias correction from black-box post-processing to transparent, auditable reasoning-level fixes.
- Dynamic Retrieval and Topological Constraints for Dataset Distillation — breaks the diversity bottleneck of static anchor points, improving synthetic dataset representativeness.
- First Benchmark for Small-Object Editing in Instruction-Based Models — fills a blind spot in existing evaluations that miss fine-grained editing capability.
Don't miss what's next. Subscribe to AI Research Brief.