AI Research Brief

March 9, 2026

Drop CLIP, Gain Performance: VLMs Work Better Without It

  • Contrastive Pretraining Actively Hurts VLMs. CLIP optimizes for category discrimination, not fine-grained understanding. Tencent's Penguin-VL instead initializes its vision encoder from a text-only LLM, beating CLIP/SigLIP alternatives at both the 2B and 8B scales.
  • Sparse Attention's Bottleneck Shifts from "How to Sparsify" to "How to Discover." FlashPrefill shows that attention sparsity patterns can be identified at near-zero cost, delivering a 28x speedup on 256K-token sequences with no degradation at 4K.
  • Model Merging Failures Now Have a Quantifiable Diagnostic. DC-Merge finds that the directional deviation of task vectors in singular space directly predicts knowledge loss, and that enforcing directional consistency systematically improves merge quality.
  • Diffusion Models Learn to Allocate Compute by Information Density. DC-DiT gives high-detail regions more tokens and compresses low-information areas. The allocation strategy adapts across denoising stages and warm-starts from existing DiT checkpoints.
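
The DC-Merge diagnostic lends itself to a short sketch. The following is a minimal illustration under our own assumptions, not the paper's actual metric: it treats a task vector as the fine-tuned-minus-base weight delta and measures how far the top-k left singular subspaces of two task vectors diverge via principal angles. The function names and the choice of k are hypothetical.

```python
import numpy as np

def task_vector(finetuned, base):
    # Task vector: the parameter delta induced by fine-tuning.
    return finetuned - base

def singular_direction_deviation(tv_a, tv_b, k=8):
    # Compare the principal directions of two task vectors in
    # singular space: take each one's top-k left singular vectors,
    # then measure the principal angles between the two subspaces.
    Ua, _, _ = np.linalg.svd(tv_a, full_matrices=False)
    Ub, _, _ = np.linalg.svd(tv_b, full_matrices=False)
    # The cosines of the principal angles are the singular values
    # of Ua_k^T Ub_k (all in [0, 1], invariant to column signs).
    cos = np.linalg.svd(Ua[:, :k].T @ Ub[:, :k], compute_uv=False)
    # 0 = identical directions, 1 = mutually orthogonal subspaces.
    return 1.0 - float(cos.mean())
```

Under the brief's claim, a deviation near 0 would indicate direction-consistent task vectors that merge cleanly, while a large value would predict knowledge loss after merging.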

Also Notable

  • PSIVG Embeds a Physics Simulator Directly into the Diffusion Loop. The opposite of RealWonder's "physics engine and video model work separately" approach from a few days ago. Two competing architectures for the same problem.
  • Single RGB Image to Full 3D Indoor Scene Mesh, Autoregressively. Skips SDF intermediate representation and post-optimization. Direct mesh output.
  • Black-Box Backdoor Detection for T2I Models. Measures the deviation between instruction and response rather than comparing the similarity of generated images.
  • Understanding Ability Bootstraps Generation Quality. In unified multimodal models, the understanding module provides intrinsic reward signals to guide T2I generation.
  • Diffusion Models Do Semantic Segmentation with Zero Training. Segmentation ability scales up alongside generation ability.
  • Training-Free Multimodal Summarization via Event Chain Fusion. Avoids information loss from implicit fusion by structuring through event chains.
  • Multimodal Chain-of-Thought Reasoning for Tumor Analysis. Outputs traceable reasoning chains, not just diagnostic conclusions.
  • Rethinking SGD Momentum Through Signal Processing. Frequency-domain properties of gradients reveal why certain momentum settings work.
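
The signal-processing framing in the last item can be made concrete. As a hedged sketch (the paper's actual analysis may differ), the EMA form of momentum, m_t = beta*m_{t-1} + (1-beta)*g_t, is a first-order IIR low-pass filter on the gradient sequence, so its frequency response shows which gradient frequencies a given beta suppresses:

```python
import numpy as np

def momentum_frequency_response(beta, n=512):
    """Magnitude response of EMA momentum m_t = beta*m_{t-1} + (1-beta)*g_t,
    viewed as a first-order IIR low-pass filter with transfer function
    H(z) = (1 - beta) / (1 - beta * z^{-1})."""
    w = np.linspace(0.0, np.pi, n)  # normalized frequency in [0, pi]
    H = (1.0 - beta) / (1.0 - beta * np.exp(-1j * w))
    return w, np.abs(H)

w, mag = momentum_frequency_response(beta=0.9)
# Slowly varying gradient components (w near 0) pass at unit gain, while
# step-to-step gradient noise (w near pi) is attenuated to (1-beta)/(1+beta).
```

Note that common implementations (e.g. PyTorch's SGD) use the unnormalized form m_t = beta*m_{t-1} + g_t, which rescales this response by 1/(1-beta) but does not change its shape.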

Read the full edition →
