AI Research Brief

May 2, 2026

FD as Loss: One-Step Generation Hits 0.72 FID

  • Heterogeneous scientific foundation model collaboration: Eywa pulls LLMs back from "general solver" to coordinator, handing protein structure and physics simulation tasks to domain-specialized predictors.
  • FD estimation decoupled from gradient batches: Fréchet Distance, stuck as an evaluation metric for years, becomes a real training loss. One-step generation hits 0.72 FID on ImageNet 256 in post-training.
  • Ambiguous instructions plus an interactive action space: InteractWeb-Bench makes "actively clarify intent" a required capability, and under it frontier multimodal agents default to blind execution instead of asking.
  • Production-grade "world as the agent sees it": Synthetic Computers at Scale builds 1000 user-specific computers with 8+ hour simulations, shifting the long-horizon training bottleneck from trajectory generation to environment synthesis.
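The FD-as-loss item builds on the standard Fréchet distance between two Gaussian fits to feature sets — the quantity behind FID. As a refresher, a minimal NumPy sketch of that metric in its evaluation form (not the paper's differentiable training estimator, whose details aren't covered in this brief):

```python
import numpy as np

def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Fréchet distance between Gaussian fits to two feature sets.

    FID applies this to Inception features of real vs. generated images;
    here it is just the metric on raw (n_samples, n_dims) arrays.
    """
    # Fit a Gaussian (mean, covariance) to each feature set.
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # tr(sqrtm(cov_a @ cov_b)) equals the sum of square roots of the
    # eigenvalues of cov_a @ cov_b (nonnegative real for PSD factors),
    # which avoids an explicit matrix square root.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)
```

Identical feature sets score ~0; a pure mean shift of `c` in every one of `d` dimensions scores `d * c**2`, since the covariance terms cancel.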

Also Notable

  • Five-Level Visual Generation Taxonomy Pushes Direction From Atomic Appearance Mapping to Agentic World Modeling — a framework, not a new model. Value is in redrawing the lanes.
  • Research Infrastructure Upgrades From Citation Graphs to Explicit Method Evolution Graphs — Intern-Atlas is built as backbone for AI scientist systems.
  • Skeleton-Agnostic End-to-End Mocap Skips Non-Differentiable IK — MoCapAnything V2 predicts joint rotations directly, so noisy video-to-pose isn't gated by a middle layer.
  • 3D Semantic Occupancy Turns Real Scenes Into Structured Minecraft Environments — run VLN and other embodied tasks with the game engine as the simulator.
  • Continuous, Interpretable Physics Priors for Video Diffusion — targets non-drifting objects and more honest collisions. A concrete patch on the PhyWorld line.
  • GRPO Moves Into Latent Space — first attempt at running RL over implicit reasoning chains.
  • High-Concurrency Code Sandbox for LLM Code RL Training and Evaluation — ScaleBox prioritizes high-fidelity verification over "it ran."
  • Causal Intervention Cuts Reward Models' Dependency on Response Length — more systematic than length normalization.
  • OpenAI Releases Evaluation Set From Real Clinician ChatGPT Conversations — medical LLM evaluation moves from mock questions to real workflow scenarios.
  • 1084 Expert-Curated Scientific Experiment Figures, 4264 QAs — SPUR targets fine-grained panel-level perception and reasoning.
  • Forensic Benchmark for AI-Generated Academic Figures, 7 Categories and 39 Subclasses — AEGIS pushes academic fraud detection into fine-grained evaluation.

Read the full edition →
