AI Research Brief

May 4, 2026

ViT Pre-Trains Like an LLM, Skips the CLIP Stage

  • GenLIP Pre-Trains ViT With an LM Objective Directly: by dropping CLIP's contrastive stage and text decoder, it matches larger-data baselines on multimodal benchmarks with 8B samples, and multi-resolution continuation lifts OCR and chart understanding.
  • UniVidX Runs Multiple Pixel-Aligned Video Tasks Off One VDM Prior: SCM plus per-modality Gated LoRA route intrinsic decomposition and RGBA layering through the same framework, matching dedicated methods with fewer than 1,000 videos.
  • Themis Adds Multi-Criteria, Multi-Language Scoring to Code RMs: profiling shows existing RMs fail at almost everything outside functional correctness, and 350K+ preference pairs train an open 600M-to-32B series.
  • Image Jailbreaks Hit VLMs at 40.9%, Text Versions Only 10.7%: four image-encoded attack patterns work as drop-in red-team scripts, though the durability of these encoding bypasses depends on re-testing against visual moderation.
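The per-modality Gated LoRA routing in the UniVidX item can be sketched generically: a frozen base weight plus one low-rank adapter per modality, each scaled by a learned gate. This is a minimal illustrative sketch, not the paper's implementation; all names, shapes, and the sigmoid gate are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

class GatedLoRALinear:
    """Frozen linear layer with per-modality gated low-rank adapters (hypothetical sketch)."""

    def __init__(self, d_in, d_out, rank, modalities):
        self.W = rng.standard_normal((d_out, d_in)) * 0.02  # frozen base weight
        self.adapters = {
            m: {
                "A": rng.standard_normal((rank, d_in)) * 0.02,  # down-projection
                "B": np.zeros((d_out, rank)),                   # up-projection, zero-init
                "gate": 0.0,                                    # raw gate logit; sigmoid(0) = 0.5
            }
            for m in modalities
        }

    def __call__(self, x, modality):
        a = self.adapters[modality]
        gate = 1.0 / (1.0 + np.exp(-a["gate"]))  # sigmoid gate in (0, 1)
        base = x @ self.W.T                      # frozen shared path
        delta = (x @ a["A"].T) @ a["B"].T        # per-modality low-rank update
        return base + gate * delta

# Route two pixel-aligned tasks through one shared backbone layer
layer = GatedLoRALinear(d_in=16, d_out=16, rank=4, modalities=["intrinsic", "rgba"])
x = rng.standard_normal((2, 16))
y = layer(x, "intrinsic")
```

Because `B` is zero-initialized, each adapter starts as a no-op and the gate can learn how strongly its modality perturbs the shared prior, which is one way such few-shot adaptation stays data-efficient.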

Also Notable

  • Tokenizer No Longer Trains Independently — supervised end-to-end by generation loss, rewriting the autoregressive image modeling pipeline.
  • RLVR's Over-Incentive on Positive Rewards Collapses Diversity — negative-sample projection residuals compensate.
  • LLM Mode Collapse Reinterpreted Through Dynamical Systems — geometric regularization gives a lightweight fix.
  • GUI Agent Accessibility Trees Are Redundant and Unstructured — observation refactor cuts token cost directly.
  • Text-to-3D World Generation Uses Segment Maps as Layout Conditions — bypasses grid layout and cross-object scale inconsistency.
  • Multi-Agent MCTS Joint Action Space Explodes — surrogate-guided exploration keeps the search budget feasible.
  • Mesh Physics Topology and Metric Structures Modeled Separately — port-Hamiltonian gives a structure-preserving neural implementation.
  • Bayesian Methods Are Costly, Ensembles High-Variance — possibility theory adds a third option for epistemic uncertainty.
  • Pathology Federated Learning Heterogeneity Comes From MIL Architecture and Feature Extractor Mismatch — Gaussian mixture feature alignment plus curriculum integration.

