AI Research Brief

March 2, 2026

Drop 90% of Vision Tokens, Keep the Performance

  • Spatial relationships in image generation can now be optimized, not just hoped for. SpatialScore trains a reward model that outperforms GPT-4V on spatial evaluation, then uses it to RL-fine-tune generators. CVPR accepted, dataset open-sourced.
  • Masked image generation gets a 4x speedup with no quality loss. Learning feature dynamics replaces static caching, recovering semantic information that discrete sampling throws away.
  • VLM quantization can't be one-size-fits-all. Vision and language tokens have different distributions. MoE-style dynamic error compensation routes each token type through separate repair paths. Works from 2B to 70B, CVPR accepted.
  • 90% of vision tokens are compressible. HiDrop finds that early layers handle feature alignment and shouldn't be pruned; the key is a layer-aware strategy matched to each layer's actual role. ICLR accepted.
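The HiDrop bullet can be made concrete with a toy sketch. All names, ratios, and the scoring rule below are hypothetical illustrations of the layer-aware idea, not the paper's actual method: early layers keep every vision token so feature alignment can happen, while later layers keep only the highest-scoring tokens.

```python
# Hypothetical sketch of layer-aware vision token pruning (HiDrop-style idea):
# alignment layers are untouched, later layers prune aggressively.

def prune_schedule(num_layers, align_layers, late_keep_ratio):
    """Per-layer keep ratios: 1.0 for early alignment layers, then late_keep_ratio."""
    return [1.0 if i < align_layers else late_keep_ratio
            for i in range(num_layers)]

def prune_tokens(scored_tokens, keep_ratio):
    """Keep the top keep_ratio fraction of (token, score) pairs by score,
    preserving the original token order."""
    k = max(1, int(len(scored_tokens) * keep_ratio))
    keep = set(sorted(range(len(scored_tokens)),
                      key=lambda i: scored_tokens[i][1], reverse=True)[:k])
    return [pair for i, pair in enumerate(scored_tokens) if i in keep]

# Toy run: 6 vision tokens with attention-derived scores, 4 layers,
# first 2 layers untouched, then keep 50% at each later layer.
tokens = [("t0", 0.9), ("t1", 0.1), ("t2", 0.8),
          ("t3", 0.2), ("t4", 0.7), ("t5", 0.3)]
for ratio in prune_schedule(num_layers=4, align_layers=2, late_keep_ratio=0.5):
    tokens = prune_tokens(tokens, ratio)
print([name for name, _ in tokens])
```

Because the late-layer ratio compounds across layers, a modest per-layer keep fraction quickly reaches the ~90% overall compression the paper reports.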

Also Notable

  • Automated Caching for Diffusion Model Inference — sensitivity analysis decides which steps can be reused, turning cache strategy from hand-tuning to data-driven.
  • Real-Time Multimodal Interaction With Simultaneous Speech and Vision Output — targets embodied agent scenarios where current systems handle only single-modality output.
  • NVIDIA Bridges Neural Reconstruction and Photorealistic Simulation — online diffusion enhancement brings neural-reconstruction-based simulators closer to real sensor quality for autonomous driving.
  • Zero-Shot Scene-Aware Object Replacement Without Per-Object Fine-Tuning — initial noise perturbation handles object swaps while preserving scene coherence.
  • Static Benchmarks Can't Keep Up With Model Evolution — agent-driven dynamic evaluation protocol evolves test items alongside model capabilities.
  • Reflective RL for Emotional Reasoning in MLLMs — addresses poor generalization of SFT on affective understanding tasks through chain-of-thought reflection.
  • Unified Token Pruning for Visual Tracking — prunes both template and search region tokens simultaneously for real-time deployment.
  • Interpretable Debiasing for VLMs via Reasoning Chain Intervention — moves bias correction from black-box post-processing to transparent, auditable reasoning-level fixes.
  • Dynamic Retrieval and Topological Constraints for Dataset Distillation — breaks the diversity bottleneck of static anchor points, improving synthetic dataset representativeness.
  • First Benchmark for Small-Object Editing in Instruction-Based Models — fills a blind spot in existing evaluations that miss fine-grained editing capability.
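The automated-caching item above turns on a simple idea: measure which denoising steps barely change the latent, then reuse cached features at those steps. A minimal sketch, with hypothetical helper names and an L1 change metric standing in for whatever sensitivity measure the paper actually uses:

```python
# Hedged sketch of data-driven cache scheduling for diffusion inference.
# A calibration run records the latent at every denoising step; steps whose
# update falls below a threshold are marked cacheable.

def sensitivity(latents):
    """Per-step L1 change of the latent trajectory from a calibration run."""
    return [sum(abs(a - b) for a, b in zip(prev, cur))
            for prev, cur in zip(latents, latents[1:])]

def cache_plan(latents, threshold):
    """True for steps whose change is small enough to reuse the cached result."""
    return [delta < threshold for delta in sensitivity(latents)]

# Toy calibration trajectory: big updates early, tiny updates late.
trajectory = [[0.0, 0.0], [1.0, 1.0], [1.8, 1.7], [1.9, 1.75], [1.92, 1.76]]
plan = cache_plan(trajectory, threshold=0.2)
print(plan)
```

The payoff is that the cache schedule falls out of calibration data rather than hand-tuning, which is exactly the shift the bullet describes.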

Read the full edition →
