FD as Loss: One-Step Generation Hits 0.72 FID
- Heterogeneous scientific foundation model collaboration: Eywa pulls LLMs back from "general solver" to coordinator, handing protein structure and physics simulation tasks to domain-specialized predictors.
- FD estimation decoupled from gradient batches: Fréchet Distance, stuck as an evaluation metric for years, becomes a real training loss. One-step generation hits 0.72 FID on ImageNet 256 in post-training.
- Ambiguous instructions plus an interactive action space: InteractWeb-Bench makes "actively clarify intent" a required capability, and under that requirement frontier multimodal agents default to blind execution instead of asking.
- Production-grade "world as the agent sees it": Synthetic Computers at Scale builds 1000 user-specific computers with 8+ hour simulations, shifting the long-horizon training bottleneck from trajectory generation to environment synthesis.
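For context on the FD-as-loss item: between two Gaussians fitted to feature statistics, the Fréchet Distance has the standard closed form ||mu1 - mu2||^2 + Tr(S1 + S2 - 2(S1 S2)^{1/2}), which is what FID evaluates. A minimal NumPy/SciPy sketch of that textbook formula (the paper's contribution, decoupling its estimation from gradient batches so it works as a training loss, is not reproduced here):

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(mu1, sigma1, mu2, sigma2):
    # Closed-form FD between N(mu1, sigma1) and N(mu2, sigma2):
    # ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 (sigma1 sigma2)^{1/2})
    diff = mu1 - mu2
    covmean = sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):
        # sqrtm can return tiny imaginary parts from numerical noise
        covmean = covmean.real
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```

In FID-style evaluation, `mu` and `sigma` come from feature embeddings of real vs. generated images, e.g. `mu = feats.mean(axis=0)` and `sigma = np.cov(feats, rowvar=False)`.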
Also Notable
- Five-Level Visual Generation Taxonomy Charts the Path From Atomic Appearance Mapping to Agentic World Modeling — a framework, not a new model; its value is in redrawing the lanes.
- Research Infrastructure Upgrades From Citation Graphs to Explicit Method Evolution Graphs — Intern-Atlas is built as backbone for AI scientist systems.
- Skeleton-Agnostic End-to-End Mocap Skips Non-Differentiable IK — MoCapAnything V2 predicts joint rotations directly, so noisy video-to-pose estimates aren't gated by a brittle intermediate stage.
- 3D Semantic Occupancy Turns Real Scenes Into Structured Minecraft Environments — run VLN and other embodied tasks with the game engine as the simulator.
- Continuous, Interpretable Physics Priors for Video Diffusion — targets non-drifting objects and more honest collisions. A concrete patch on the PhyWorld line.
- GRPO Moves Into Latent Space — first attempt at running RL over implicit reasoning chains.
- High-Concurrency Code Sandbox for LLM Code RL Training and Evaluation — ScaleBox prioritizes high-fidelity verification over "it ran."
- Causal Intervention Cuts Reward Models' Dependency on Response Length — more systematic than length normalization.
- OpenAI Releases Evaluation Set From Real Clinician ChatGPT Conversations — medical LLM evaluation moves from mock questions to real workflow scenarios.
- 1084 Expert-Curated Scientific Experiment Figures, 4264 QAs — SPUR targets fine-grained panel-level perception and reasoning.
- Forensic Benchmark for AI-Generated Academic Figures, 7 Categories and 39 Subclasses — AEGIS pushes academic fraud detection into fine-grained evaluation.
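On the GRPO item above: what distinguishes vanilla GRPO is its critic-free, group-relative advantage, normalizing each sampled response's reward against its group's mean and standard deviation. A minimal sketch of that standard baseline computation (whether the latent-space variant keeps exactly this normalization is a detail of the paper, not shown here):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    # Group-relative baseline: each reward is normalized by the
    # group mean and std, replacing a learned value function.
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```

Running this over the rewards of a group of sampled completions yields per-sample advantages centered at zero, which then weight the policy-gradient update.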