Flow-OPD Lifts GenEval From 63 to 92
- Image Generation Alignment and LLM Post-Training Now Share One Toolbox. Flow-OPD ports On-Policy Distillation to flow matching. SD 3.5 Medium hits GenEval 92 (up from 63) and OCR 94 (up from 59), about 10 points over plain GRPO.
- Test-Time Scaling Strategies Can Be Searched, Not Tuned. AutoTTS lifts researcher work up one level: build a discovery environment instead of designing a strategy. 160 minutes and $39.9 yield policies that transfer across benchmarks and model sizes.
- Agent Latency Bottlenecks Are Often Serialized Parallel Opportunities. HyperEyes turns independent sub-retrievals within a single round into parallel atomic actions. The 30B version gains 9.9% accuracy with 5.3× fewer tool-call rounds.
- Physical Interaction Data Finally Hits Million-Hour Scale. HumanNet ships 1M hours of human activity video with first- and third-person views. 1000 hours of first-person video beats 100 hours of real robot data for continued training.
- One LoRA Adapter Serves Cloud and Edge. MatryoshkaLoRA reorganizes rank as a nested hierarchy. Pick a tier per device at deploy time. More stable than DyLoRA at the high-rank end.
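
To make the nested-rank idea in the MatryoshkaLoRA item concrete, here is a minimal sketch. It assumes that "nested" means the first r components of a maximum-rank adapter already form a usable rank-r adapter, so a device tier only needs a slice. The matrix layout, the `TIERS` table, and all function names are hypothetical illustrations, not the paper's code or training method.

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]


def matmul(x: Matrix, y: Matrix) -> Matrix:
    """Plain-Python matrix product, enough for a toy example."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def truncate_adapter(down: Matrix, up: Matrix, rank: int) -> Tuple[Matrix, Matrix]:
    """Slice a max-rank adapter to its first `rank` components.

    Assumed layout: `down` is (r_max x d_in), `up` is (d_out x r_max).
    """
    if not 1 <= rank <= len(down):
        raise ValueError(f"rank must be in 1..{len(down)}, got {rank}")
    return down[:rank], [row[:rank] for row in up]


def delta_weight(down: Matrix, up: Matrix, rank: int) -> Matrix:
    """Weight update (d_out x d_in) contributed by the first `rank` components."""
    down_r, up_r = truncate_adapter(down, up, rank)
    return matmul(up_r, down_r)


# Hypothetical deployment tiers: one stored adapter, different slices per device.
TIERS: Dict[str, int] = {"edge": 1, "cloud": 2}

# Toy max-rank-2 adapter for a 2x2 weight matrix.
DOWN: Matrix = [[1.0, 0.0], [0.0, 1.0]]
UP: Matrix = [[2.0, 3.0], [4.0, 5.0]]

edge_update = delta_weight(DOWN, UP, TIERS["edge"])
cloud_update = delta_weight(DOWN, UP, TIERS["cloud"])
```

Under this assumption, the edge device loads only the rank-1 slice while the cloud deployment uses the full adapter, and both come from the same stored weights.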
Also Notable
- A²RD Closes the Long-Video Synthesis Loop With Retrieve-Synthesize-Refine-Update. Yale uses agentic diffusion to suppress semantic drift and narrative collapse over long horizons.
- SCOPE Handles Complex Composition Through Structured Decomposition Plus Conditional Skill Orchestration. Introduces "semantic commitment" to explain why multi-constraint image generation drops elements.
- Agent Tool-Selection Mistakes Are Already Visible in Hidden State. Imperial College finds tool selection is linearly readable and steerable across 12 instruction-tuned models.
- IntentGrasp Fills the "Did the LLM Actually Understand" Eval Gap. An intent-understanding benchmark spanning 49 open corpora and 12 domains.
- ModelLens Tackles "How to Pick Among Hundreds of Thousands of Open Models." Skips exhaustive forward passes. Targets new-dataset plus new-model scenarios with no prior records.
- InterLV-Search Frees Visual Evidence From Inputs and Answers. Interleaved language-vision agentic search benchmark with three difficulty tiers, 2061 cases total.
- BalCapRL Adds Balance to GRPO Training for MLLM Image Captioning. Tackles the conflict between detailed and accurate rewards.
- PACEvolve++ Frees Evolutionary Search Agent Policies From Prompt-Elicited Stasis. Improves test-time learning.
- Amazon's AGWM Adds Affordance Grounding to World Models. Handles spurious causality from action-result co-occurrence in training data.