Tri-Modal Training From Scratch; Agentic RL Gets a Stability Fix
- Apple trains a tri-modal masked diffusion model from scratch. Systematic testing of scaling laws, modality mixing, and noise schedules makes this directly actionable for teams working on multi-modal diffusion. Masked diffusion is emerging as a real alternative to autoregressive generation.
- Agentic RL training collapse now has a diagnostic framework. ARLArena decomposes policy gradients into four design dimensions and ablates each one to find instability sources. More effective than swapping algorithms blindly.
- SkyReels-V4 generates video and audio together via dual-stream MMDiT. Text-to-video, inpainting, and editing all collapse into a single interface. Unified architectures are absorbing standalone modality pipelines.
- Adding CoT reasoning to GUI agents actually hurts grounding. GUI-Libra identifies the root causes: action token dilution and incomplete step-level verification. Targeted fixes follow.
- World models go multi-player and multi-viewpoint. Solaris achieves consistent multi-agent simulation in Minecraft. The automated data collection system may outlast the model itself.
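The masked-diffusion objective behind the Apple item above can be sketched in a few lines. This is a generic illustration of the standard recipe, not the paper's exact formulation: at noise level t, each token is independently replaced with a mask token, and the model's reconstruction loss over masked positions is weighted by 1/t. The names `MASK`, `mask_tokens`, and `diffusion_loss` are illustrative, not from the paper.

```python
import random

MASK = -1  # hypothetical mask-token id

def mask_tokens(tokens, t, rng):
    """Forward process of masked diffusion: independently replace each
    token with MASK with probability t (the noise level)."""
    noised, targets = [], []
    for tok in tokens:
        if rng.random() < t:
            noised.append(MASK)
            targets.append(tok)   # model must reconstruct this position
        else:
            noised.append(tok)
            targets.append(None)  # no loss at unmasked positions
    return noised, targets

def diffusion_loss(logprobs, targets, t):
    """Average negative log-likelihood over masked positions, scaled by
    1/t -- the usual masked-diffusion ELBO weighting for a linear schedule."""
    terms = [-lp for lp, tgt in zip(logprobs, targets) if tgt is not None]
    if not terms:
        return 0.0
    return sum(terms) / len(terms) / t
```

At t close to 1 nearly everything is masked and the task approaches unconditional generation; at small t only a few tokens are hidden, so the 1/t weight upweights those sparse-but-informative steps.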
Also Notable
- Conditional Guidance Scheduling Fixes Multi-GPU Diffusion Artifacts — A hybrid data-pipeline parallel framework that improves multi-GPU scaling without sacrificing generation quality.
- Image Editing Learns Physical Causality — Latent transition priors model refraction, deformation, and other dynamics so edits follow physics, not just pixel manipulation.
- VLA World Models Don't Need to Predict Pixels — Maps future observations to a compact conditional space instead of full frames. Preserves fine-grained information at lower compute cost.
- Multi-Modal LM Generates SVG Glyphs Autoregressively — Bypasses the traditional rasterize-then-vectorize pipeline for end-to-end high-quality font generation.
- Dissecting the Text Ranking Behind Deep Research — Search APIs aren't black boxes. Ranking component effectiveness and failure modes directly affect research quality.
- Register Tokens Can't Fix ViT Artifacts — The fix needs to come from the attention mechanism itself. CVPR paper.
- Fully Feed-Forward 3D Editing Without Per-Scene Optimization — A Rectified Voxel Flow approach on top of TRELLIS. CVPR paper.
- Gene Expression Prediction Benefits More From Multi-Modal Signals Than Longer Input Sequences — Adding histone modifications outperforms simply concatenating longer DNA sequences. ICLR paper.
- Machine Unlearning Is Unreliable Under Biased Data — Models learn shortcuts that render "forgetting" operations ineffective. AAAI paper.
- Fixing RNN Recurrent Poles in Place Works Better — Not training the poles makes online learning more stable. Another case where fewer trained parameters wins.
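The fixed-pole idea in the last item can be sketched with a scalar diagonal linear RNN. This is a minimal illustration of the general technique, not the paper's architecture: the recurrent pole `lam` is frozen inside the unit circle, so the recurrence cannot drift unstable while the input/output weights (`w_in`, `w_out`, both hypothetical names) are the only parameters that would receive gradients during online learning.

```python
class FixedPoleRNNCell:
    """Diagonal linear RNN: h_t = lam * h_{t-1} + w_in * x_t, y_t = w_out * h_t.
    The pole lam is fixed (never trained); only w_in and w_out would be
    updated, which keeps the recurrence itself provably stable."""

    def __init__(self, lam=0.9, w_in=1.0, w_out=1.0):
        assert abs(lam) < 1.0, "fixed pole must lie inside the unit circle"
        self.lam = lam      # frozen recurrent pole
        self.w_in = w_in    # trainable input weight
        self.w_out = w_out  # trainable output weight

    def run(self, xs):
        """Roll the recurrence over a sequence of scalar inputs."""
        h, ys = 0.0, []
        for x in xs:
            h = self.lam * h + self.w_in * x
            ys.append(self.w_out * h)
        return ys
```

Because `lam` never changes, an impulse always decays geometrically, so online gradient steps on `w_in`/`w_out` cannot push the state into a blow-up regime, which is the stability argument for training fewer parameters.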
Don't miss what's next. Subscribe to AI Research Brief.