NVIDIA Packs Five Modalities Into One Set of Weights

June 5, 2026

NVIDIA Crams Language, Image, Video, Audio, and Action Into One Set of Weights. Cosmos 3 bets that a single mixture-of-transformers can handle every modality, and third parties rated it the best open model in text-to-image, image-to-video, and robot policy.

The Same KV Quantization Looks Fine in Prefill and Falls Apart in Long Decoding. KVarN shows that the error compounds across timesteps, uses variance normalization to tame outlier token-scales, and sets a new state of the art for 2-bit KV quantization. It is calibration-free and comes with a vLLM implementation.

Writing What You Learned In Context Back Into the Weights. "Language models need sleep" drops the metaphor: the mechanism is distillation plus self-rehearsal on synthetic data. But the abstract dodges the two hard questions: what to write back, and how to avoid forgetting.

Sampling Budget Goes From Hand-Tuned Threshold to Learned Policy. Framing "how many samples to draw" as an MDP, an RL-trained controller small enough to run on CPU beats strong baselines on the "fewer samples, no accuracy drop" tradeoff.

Also Notable

A Second KV-Cache Line the Same Day: Evict Instead of Quantize. Finds a few value states with abnormally large magnitude that can't be dropped, confirming outlier token-scale as the shared pain of long reasoning.
NVIDIA OmniDreams Runs Autonomous-Driving Closed-Loop Sim With a Real-Time Generative World Model. Targets the long-tail scenarios reconstruction-based simulators can't reach.
World Models and MLLMs Are Complementary, So Learn the Tradeoff Instead of Asking Which Wins. Judges when a visual rollout is trustworthy and when to discard it.
OVO-S-Bench Does Online Spatial Reasoning From a Continuous First-Person Stream. A layered benchmark that often needs evidence beyond the current field of view.
VSTAT Moves Video Understanding From Recognizing Isolated Moments to Tracking Entities and States. Aimed straight at the weak spot in MLLMs.
Wide-Baseline Matching as a Test Bed for Spatial Reasoning. Layered by viewpoint shift and matching granularity, it forces MLLMs to handle geometry and occlusion.
PaddleOCR-VL-1.6 Refines the Last Generation's Weak Regions Instead of Blindly Scaling Data. Applies region-aware refinement where the previous model underperformed.
Economy of Minds Uses Hayek's Decentralized Coordination to Let Agents Self-Organize by Bidding. Stronger collective intelligence emerges without central control.
AUDITFLOW Builds an Executable Symbolic Environment for Financial-Report Auditing. Lets agents link facts to taxonomic concepts, recompute expected values, then decide.
SynCred-Bench: AI Can Now Generate Images With Realistic Text and Layout, Creating a "Synthetic Credibility" Threat. A new kind of visual deception.
