AI Research Brief

May 1, 2026

Cross-Architecture Distillation Shrinks dLLMs to 0.6B

  • Cross-Architecture Distillation Shrinks dLLMs From 8B to 0.6B. TIDE is the first dLLM distillation framework in which teacher and student differ in architecture, attention mechanism, and tokenizer all at once. HumanEval jumps from 32.3 to 48.78, with an average gain of 1.53 points across 8 benchmarks.
  • Agent Training Data Synthesis Is Becoming the New Infrastructure Layer. ClawGym ships 13.5K persona-driven tasks, simulated workspaces, and hybrid verification. With 43 Hugging Face upvotes, it tops every highlight in today's edition.
  • Speculative Decoding Becomes a Lossless Speedup Primitive for RL Rollouts. 1.8x throughput at 8B, 2.5x simulated end-to-end at 235B with an async pipeline. On-policy purity stays intact.
  • Asynchronous Denoising Lets Action and Video Run on Different Cadences in One Diffusion. X-WAM pretrains on 5,800 hours of robot data. RoboCasa hits 79.2% success, RoboTwin 2.0 hits 90.7%.
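The speculative-decoding item above rests on a simple invariant: a cheap draft model proposes several tokens, the expensive target model verifies them in one pass, and the emitted sequence is identical to what the target alone would produce. A minimal greedy-verification sketch of that invariant, using toy deterministic stand-ins for both models (not the paper's code or models):

```python
def target_next(ctx):
    # Stand-in for an expensive target model: deterministic toy rule.
    return (sum(ctx) * 31 + 7) % 100

def draft_next(ctx):
    # Stand-in for a cheap draft model that often, but not always, agrees.
    t = target_next(ctx)
    return t if sum(ctx) % 4 else (t + 1) % 100

def speculative_decode(ctx, n_tokens, k=4):
    out = list(ctx)
    while len(out) - len(ctx) < n_tokens:
        # 1) Draft proposes k tokens autoregressively.
        proposal, tmp = [], list(out)
        for _ in range(k):
            t = draft_next(tmp)
            proposal.append(t)
            tmp.append(t)
        # 2) Target verifies the k positions (one batched pass in practice);
        #    keep the longest prefix where draft and target agree.
        tmp = list(out)
        for t in proposal:
            if target_next(tmp) != t:
                break
            tmp.append(t)
        out = tmp
        # 3) Emit one guaranteed-correct target token past the accepted prefix.
        out.append(target_next(out))
    return out[len(ctx):len(ctx) + n_tokens]

def greedy_decode(ctx, n_tokens):
    out = list(ctx)
    for _ in range(n_tokens):
        out.append(target_next(out))
    return out[len(ctx):]

assert speculative_decode([1, 2, 3], 16) == greedy_decode([1, 2, 3], 16)
```

Every emitted token is exactly the target's greedy choice, so rollouts stay strictly on-policy; the draft's acceptance rate, not the verification, determines the speedup.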

Also Notable

  • Storing Long-Horizon Agent Trajectories as Images for OCR Recall. Sidesteps text context budgets and adds a non-text memory channel for 100+ turn interaction histories.
  • AAAI Empirically Challenges a Popular Assumption About Neuro-Symbolic AI. Splitting grounding from compositionality reveals that the former doesn't imply the latter.
  • DiT Feature Cache Forecast Swaps Hand-Tuned Formulas for Learned Linear Predictors. Aggressive step-skipping keeps quality intact; fixed formulas can't adapt to shifting feature distributions.
  • Virtual Persona Dialogue Eval Tests Strategic Memory Use, Not Just Recall. StratMem-Bench separates "remembered" from "used well." Long-conversation product teams should pay attention.
  • 3D Gaussian Splatting Interaction Solved Through Semantic Decomposition. Semantic Foam unifies spatial and semantic scene decomposition, filling a gap for interactive graphics applications.
  • Causal Bases Constrain VFMs for Single-Source Domain Generalization. Avoids two typical confounders: illumination and co-occurrence. Detector transfer from source to target domains becomes more stable.
  • Weakly Supervised Action Segmentation Uses HOI-Aware Adaptive Networks. AdaAct stops labeling every frame with one fixed network and adapts its parameters to human-object-interaction (HOI) context instead.
  • Federated Domain Generalization Re-ID: Semantic Anchoring and Style Diversification Feed Each Other. CO-EVO links two paths that used to run independently, so FedDG-ReID no longer has to pick one or the other.
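To make the "learned linear predictor" idea in the DiT caching item concrete: instead of a hand-tuned rule for when to reuse cached features, fit a small linear model that extrapolates the next step's features from the previous two, and skip the expensive block whenever the forecast is trusted. A toy single-channel sketch with synthetic smooth features (an illustration of the general idea, not the paper's method):

```python
import math

# Stand-in for cached DiT block features along a denoising trajectory:
# one channel that varies smoothly with the step index.
feats = [math.sin(0.3 * t + 0.5) for t in range(20)]
calib, target = feats[:-1], feats[-1]   # hold out the final step

# Fit (a, b) in f[t+1] ~= a*f[t] + b*f[t-1] by least squares over the
# calibration steps, replacing a hand-tuned caching formula with a
# learned linear predictor. Solve the 2x2 normal equations directly.
x1 = calib[1:-1]   # f[t]
x2 = calib[:-2]    # f[t-1]
y = calib[2:]      # f[t+1]
s11 = sum(v * v for v in x1)
s12 = sum(u * v for u, v in zip(x1, x2))
s22 = sum(v * v for v in x2)
b1 = sum(u * v for u, v in zip(x1, y))
b2 = sum(u * v for u, v in zip(x2, y))
det = s11 * s22 - s12 * s12
a = (b1 * s22 - b2 * s12) / det
b = (s11 * b2 - s12 * b1) / det

# Forecast the held-out step's features instead of recomputing the block.
pred = a * calib[-1] + b * calib[-2]
err = abs(pred - target)   # near zero for this smooth toy trajectory
```

In a real pipeline the predictor would be fit offline per model and schedule; at inference, steps whose forecast features are accepted skip the block entirely, which is where the speedup comes from.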

Read the full edition →
