FD as Loss: One-Step Generation Hits 0.72 FID
- Heterogeneous scientific foundation model collaboration: Eywa pulls LLMs back from "general solver" to coordinator, handing protein structure and physics simulation tasks to domain-specialized predictors.
- FD estimation decoupled from gradient batches: Fréchet Distance, stuck as an evaluation metric for years, becomes a real training loss. One-step generation hits 0.72 FID on ImageNet 256 in post-training.
- Ambiguous instructions plus an interactive action space: InteractWeb-Bench makes "actively clarify intent" a required capability, and under that requirement frontier multimodal agents default to blind execution instead of asking.
- Production-grade "world as the agent sees it": Synthetic Computers at Scale builds 1000 user-specific computers with 8+ hour simulations, shifting the long-horizon training bottleneck from trajectory generation to environment synthesis.
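For context on the FD-as-loss item: between two Gaussians fitted to feature statistics, the Fréchet Distance has the standard closed form ||mu1 - mu2||^2 + Tr(S1 + S2 - 2(S1 S2)^{1/2}), which is what FID evaluates. A minimal NumPy/SciPy sketch of that textbook formula (the paper's contribution, decoupling its estimation from gradient batches so it works as a training loss, is not reproduced here):

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(mu1, sigma1, mu2, sigma2):
    # Closed-form FD between N(mu1, sigma1) and N(mu2, sigma2):
    # ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 (sigma1 sigma2)^{1/2})
    diff = mu1 - mu2
    covmean = sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):
        # sqrtm can return tiny imaginary parts from numerical noise
        covmean = covmean.real
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```

In FID-style evaluation, `mu` and `sigma` come from feature embeddings of real vs. generated images, e.g. `mu = feats.mean(axis=0)` and `sigma = np.cov(feats, rowvar=False)`.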
Also Notable
- Five-Level Visual Generation Taxonomy Charts the Path From Atomic Appearance Mapping to Agentic World Modeling — a framework, not a new model; its value is in redrawing the lanes.
- Research Infrastructure Upgrades From Citation Graphs to Explicit Method Evolution Graphs — Intern-Atlas is built as backbone for AI scientist systems.
- Skeleton-Agnostic End-to-End Mocap Skips Non-Differentiable IK — MoCapAnything V2 predicts joint rotations directly, so noisy video-to-pose estimates aren't gated by a brittle intermediate stage.
- 3D Semantic Occupancy Turns Real Scenes Into Structured Minecraft Environments — run VLN and other embodied tasks with the game engine as the simulator.
- Continuous, Interpretable Physics Priors for Video Diffusion — targets non-drifting objects and more honest collisions. A concrete patch on the PhyWorld line.
- GRPO Moves Into Latent Space — first attempt at running RL over implicit reasoning chains.
- High-Concurrency Code Sandbox for LLM Code RL Training and Evaluation — ScaleBox prioritizes high-fidelity verification over "it ran."
- Causal Intervention Cuts Reward Models' Dependency on Response Length — more systematic than length normalization.
- OpenAI Releases Evaluation Set From Real Clinician ChatGPT Conversations — medical LLM evaluation moves from mock questions to real workflow scenarios.
- 1084 Expert-Curated Scientific Experiment Figures, 4264 QAs — SPUR targets fine-grained panel-level perception and reasoning.
- Forensic Benchmark for AI-Generated Academic Figures, 7 Categories and 39 Subclasses — AEGIS pushes academic fraud detection into fine-grained evaluation.
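On the GRPO item above: what distinguishes vanilla GRPO is its critic-free, group-relative advantage, normalizing each sampled response's reward against its group's mean and standard deviation. A minimal sketch of that standard baseline computation (whether the latent-space variant keeps exactly this normalization is a detail of the paper, not shown here):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    # Group-relative baseline: each reward is normalized by the
    # group mean and std, replacing a learned value function.
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```

Running this over the rewards of a group of sampled completions yields per-sample advantages centered at zero, which then weight the policy-gradient update.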