AI Research Brief

March 9, 2026

Drop CLIP, Gain Performance: VLMs Work Better Without It

  • Contrastive Pretraining Actively Hurts VLMs. CLIP optimizes for category discrimination, not fine-grained understanding. Tencent's Penguin-VL instead initializes its vision encoder from a text-only LLM, beating CLIP/SigLIP alternatives at both the 2B and 8B scales.
  • Sparse Attention's Bottleneck Shifts from "How to Sparsify" to "How to Discover." FlashPrefill shows that attention sparsity patterns can be identified at near-zero cost, delivering a 28x speedup on 256K-token sequences with no degradation at 4K.
  • Model Merging Failures Now Have a Quantifiable Diagnostic. DC-Merge finds that the directional deviation of task vectors in singular space directly predicts knowledge loss, and that enforcing directional consistency systematically improves merge quality.
  • Diffusion Models Learn to Allocate Compute by Information Density. DC-DiT gives high-detail regions more tokens and compresses low-information areas. The allocation strategy adapts across denoising stages and warm-starts from existing DiT checkpoints.
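
The DC-Merge diagnostic lends itself to a short sketch. The following is a minimal illustration under our own assumptions, not the paper's actual metric: it treats a task vector as the fine-tuned-minus-base weight delta and measures how far the top-k left singular subspaces of two task vectors diverge via principal angles. The function names and the choice of k are hypothetical.

```python
import numpy as np

def task_vector(finetuned, base):
    # Task vector: the parameter delta induced by fine-tuning.
    return finetuned - base

def singular_direction_deviation(tv_a, tv_b, k=8):
    # Compare the principal directions of two task vectors in
    # singular space: take each one's top-k left singular vectors,
    # then measure the principal angles between the two subspaces.
    Ua, _, _ = np.linalg.svd(tv_a, full_matrices=False)
    Ub, _, _ = np.linalg.svd(tv_b, full_matrices=False)
    # The cosines of the principal angles are the singular values
    # of Ua_k^T Ub_k (all in [0, 1], invariant to column signs).
    cos = np.linalg.svd(Ua[:, :k].T @ Ub[:, :k], compute_uv=False)
    # 0 = identical directions, 1 = mutually orthogonal subspaces.
    return 1.0 - float(cos.mean())
```

Under the brief's claim, a deviation near 0 would indicate direction-consistent task vectors that merge cleanly, while a large value would predict knowledge loss after merging.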

Also Notable

  • PSIVG Embeds a Physics Simulator Directly into the Diffusion Loop. The opposite of RealWonder's "physics engine and video model work separately" approach from a few days ago. Two competing architectures for the same problem.
  • Single RGB Image to Full 3D Indoor Scene Mesh, Autoregressively. Skips SDF intermediate representation and post-optimization. Direct mesh output.
  • Black-Box Backdoor Detection for T2I Models. Measures the deviation between instruction and response rather than comparing the similarity of generated images.
  • Understanding Ability Bootstraps Generation Quality. In unified multimodal models, the understanding module provides intrinsic reward signals to guide T2I generation.
  • Diffusion Models Do Semantic Segmentation with Zero Training. Segmentation ability scales up alongside generation ability.
  • Training-Free Multimodal Summarization via Event Chain Fusion. Avoids information loss from implicit fusion by structuring through event chains.
  • Multimodal Chain-of-Thought Reasoning for Tumor Analysis. Outputs traceable reasoning chains, not just diagnostic conclusions.
  • Rethinking SGD Momentum Through Signal Processing. Frequency-domain properties of gradients reveal why certain momentum settings work.
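
The signal-processing framing in the last item can be made concrete. As a hedged sketch (the paper's actual analysis may differ), the EMA form of momentum, m_t = beta*m_{t-1} + (1-beta)*g_t, is a first-order IIR low-pass filter on the gradient sequence, so its frequency response shows which gradient frequencies a given beta suppresses:

```python
import numpy as np

def momentum_frequency_response(beta, n=512):
    """Magnitude response of EMA momentum m_t = beta*m_{t-1} + (1-beta)*g_t,
    viewed as a first-order IIR low-pass filter with transfer function
    H(z) = (1 - beta) / (1 - beta * z^{-1})."""
    w = np.linspace(0.0, np.pi, n)  # normalized frequency in [0, pi]
    H = (1.0 - beta) / (1.0 - beta * np.exp(-1j * w))
    return w, np.abs(H)

w, mag = momentum_frequency_response(beta=0.9)
# Slowly varying gradient components (w near 0) pass at unit gain, while
# step-to-step gradient noise (w near pi) is attenuated to (1-beta)/(1+beta).
```

Note that common implementations (e.g. PyTorch's SGD) use the unnormalized form m_t = beta*m_{t-1} + g_t, which rescales this response by 1/(1-beta) but does not change its shape.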

Read the full edition →
