Tri-Modal Training From Scratch; Agentic RL Gets a Stability Fix
- Apple trains a tri-modal masked diffusion model from scratch. Systematic testing of scaling laws, modality mixing, and noise schedules makes this directly actionable for teams working on multi-modal diffusion. Masked diffusion is emerging as a real alternative to autoregressive generation.
- Agentic RL training collapse now has a diagnostic framework. ARLArena decomposes policy gradients into four design dimensions and ablates each one to find instability sources. More effective than swapping algorithms blindly.
- SkyReels-V4 generates video and audio together via dual-stream MMDiT. Text-to-video, inpainting, and editing all collapse into a single interface. Unified architectures are absorbing standalone modality pipelines.
- Adding CoT reasoning to GUI agents actually hurts grounding. GUI-Libra identifies the root causes: action token dilution and incomplete step-level verification. Targeted fixes follow.
- World models go multi-player and multi-viewpoint. Solaris achieves consistent multi-agent simulation in Minecraft. The automated data collection system may outlast the model itself.
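The masked-diffusion objective behind the Apple item above can be sketched in a few lines. This is a generic illustration of the standard recipe, not the paper's exact formulation: at noise level t, each token is independently replaced with a mask token, and the model's reconstruction loss over masked positions is weighted by 1/t. The names `MASK`, `mask_tokens`, and `diffusion_loss` are illustrative, not from the paper.

```python
import random

MASK = -1  # hypothetical mask-token id

def mask_tokens(tokens, t, rng):
    """Forward process of masked diffusion: independently replace each
    token with MASK with probability t (the noise level)."""
    noised, targets = [], []
    for tok in tokens:
        if rng.random() < t:
            noised.append(MASK)
            targets.append(tok)   # model must reconstruct this position
        else:
            noised.append(tok)
            targets.append(None)  # no loss at unmasked positions
    return noised, targets

def diffusion_loss(logprobs, targets, t):
    """Average negative log-likelihood over masked positions, scaled by
    1/t -- the usual masked-diffusion ELBO weighting for a linear schedule."""
    terms = [-lp for lp, tgt in zip(logprobs, targets) if tgt is not None]
    if not terms:
        return 0.0
    return sum(terms) / len(terms) / t
```

At t close to 1 nearly everything is masked and the task approaches unconditional generation; at small t only a few tokens are hidden, so the 1/t weight upweights those sparse-but-informative steps.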
Also Notable
- Conditional Guidance Scheduling Fixes Multi-GPU Diffusion Artifacts — A hybrid data-pipeline parallel framework that improves multi-GPU scaling without sacrificing generation quality.
- Image Editing Learns Physical Causality — Latent transition priors model refraction, deformation, and other dynamics so edits follow physics, not just pixel manipulation.
- VLA World Models Don't Need to Predict Pixels — Maps future observations to a compact conditional space instead of full frames. Preserves fine-grained information at lower compute cost.
- Multi-Modal LM Generates SVG Glyphs Autoregressively — Bypasses the traditional rasterize-then-vectorize pipeline for end-to-end high-quality font generation.
- Dissecting the Text Ranking Behind Deep Research — Search APIs aren't black boxes. Ranking component effectiveness and failure modes directly affect research quality.
- Register Tokens Can't Fix ViT Artifacts — The fix needs to come from the attention mechanism itself. CVPR paper.
- Fully Feed-Forward 3D Editing Without Per-Scene Optimization — A Rectified Voxel Flow approach on top of TRELLIS. CVPR paper.
- Gene Expression Prediction Benefits More From Multi-Modal Signals Than Longer Input Sequences — Adding histone modifications outperforms simply concatenating longer DNA sequences. ICLR paper.
- Machine Unlearning Is Unreliable Under Biased Data — Models learn shortcuts that render "forgetting" operations ineffective. AAAI paper.
- Fixing RNN Recurrent Poles in Place Works Better — Not training the poles makes online learning more stable. Another case where fewer trained parameters wins.
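The fixed-pole idea in the last item can be sketched with a scalar diagonal linear RNN. This is a minimal illustration of the general technique, not the paper's architecture: the recurrent pole `lam` is frozen inside the unit circle, so the recurrence cannot drift unstable while the input/output weights (`w_in`, `w_out`, both hypothetical names) are the only parameters that would receive gradients during online learning.

```python
class FixedPoleRNNCell:
    """Diagonal linear RNN: h_t = lam * h_{t-1} + w_in * x_t, y_t = w_out * h_t.
    The pole lam is fixed (never trained); only w_in and w_out would be
    updated, which keeps the recurrence itself provably stable."""

    def __init__(self, lam=0.9, w_in=1.0, w_out=1.0):
        assert abs(lam) < 1.0, "fixed pole must lie inside the unit circle"
        self.lam = lam      # frozen recurrent pole
        self.w_in = w_in    # trainable input weight
        self.w_out = w_out  # trainable output weight

    def run(self, xs):
        """Roll the recurrence over a sequence of scalar inputs."""
        h, ys = 0.0, []
        for x in xs:
            h = self.lam * h + self.w_in * x
            ys.append(self.w_out * h)
        return ys
```

Because `lam` never changes, an impulse always decays geometrically, so online gradient steps on `w_in`/`w_out` cannot push the state into a blow-up regime, which is the stability argument for training fewer parameters.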
Don't miss what's next. Subscribe to AI Research Brief.