Flow-OPD Lifts GenEval From 63 to 92

May 13, 2026

Image Generation Alignment and LLM Post-Training Now Share One Toolbox. Flow-OPD ports On-Policy Distillation to flow matching. SD 3.5 Medium hits GenEval 92 (up from 63) and OCR 94 (up from 59), about 10 points over plain GRPO.

Test-Time Scaling Strategies Can Be Searched, Not Tuned. AutoTTS lifts researcher work up one level: build a discovery environment instead of designing a strategy. 160 minutes and $39.9 yield policies that transfer across benchmarks and model sizes.

Agent Latency Bottlenecks Are Often Serialized Parallel Opportunities. HyperEyes turns independent sub-retrievals within a single round into parallel atomic actions. The 30B version gains 9.9% accuracy with 5.3× fewer tool-call rounds.
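
To make the round-count idea concrete, here is a minimal sketch of issuing independent sub-retrievals in one round instead of one per round. It is not HyperEyes' implementation: `retrieve`, `run_serial`, and `run_parallel` are hypothetical names, and the tool call is a stand-in that only sleeps briefly.

```python
import asyncio


async def retrieve(query: str) -> str:
    # Hypothetical stand-in for a real tool call (search, lookup, crop).
    await asyncio.sleep(0.01)
    return f"result for {query}"


async def run_serial(queries):
    # Baseline: one tool call per round, each awaited before the next starts.
    results = []
    for q in queries:
        results.append(await retrieve(q))
    return results, len(queries)  # (results, number of rounds)


async def run_parallel(queries):
    # Independent sub-retrievals issued together count as a single round.
    results = await asyncio.gather(*(retrieve(q) for q in queries))
    return list(results), 1


def compare(queries):
    serial_results, serial_rounds = asyncio.run(run_serial(queries))
    parallel_results, parallel_rounds = asyncio.run(run_parallel(queries))
    return serial_results, serial_rounds, parallel_results, parallel_rounds
```

Both paths return the same results in the same order, since `asyncio.gather` preserves input order. Only the number of rounds changes, and this only applies when the sub-retrievals do not depend on each other's output.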

Physical Interaction Data Finally Hits Million-Hour Scale. HumanNet ships 1M hours of human activity video with first- and third-person views. 1000 hours of first-person video beats 100 hours of real robot data for continued training.

One LoRA Adapter Serves Cloud and Edge. MatryoshkaLoRA reorganizes rank as a nested hierarchy. Pick a tier per device at deploy time. More stable than DyLoRA at the high-rank end.
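
A rough sketch of what picking a tier at deploy time could look like, assuming each tier is simply a prefix of the adapter's rank dimension. That prefix assumption, the `TIERS` table, and the function names are illustrative and not taken from MatryoshkaLoRA itself. Plain Python lists are used so the example has no dependencies.

```python
# Hypothetical tier table: device class -> number of leading rank components used.
TIERS = {"edge": 1, "laptop": 2, "cloud": 4}


def lora_delta(B, A, rank):
    """Compute B[:, :rank] @ A[:rank, :] for list-of-lists matrices.

    B has shape (d_out, r) and A has shape (r, d_in), as in standard LoRA,
    where the weight update is delta_W = B @ A.
    """
    d_out, d_in = len(B), len(A[0])
    return [
        [sum(B[i][k] * A[k][j] for k in range(rank)) for j in range(d_in)]
        for i in range(d_out)
    ]


def delta_for_device(B, A, device):
    # Assumption: a lower tier reuses the first `rank` components of the same adapter.
    return lora_delta(B, A, TIERS[device])


# Toy adapter with full rank r = 4, d_out = 2, d_in = 3 (integer values for clarity).
A = [[1, 0, 2], [0, 1, 1], [3, 1, 0], [1, 1, 1]]
B = [[1, 2, 0, 1], [0, 1, 1, 2]]

edge_delta = delta_for_device(B, A, "edge")
cloud_delta = delta_for_device(B, A, "cloud")
```

Under this assumption every tier has the same output shape, so the base model code does not change per device. Only the number of rank components summed differs.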

Also Notable

A²RD Closes the Long-Video Synthesis Loop With Retrieve-Synthesize-Refine-Update. Yale uses agentic diffusion to suppress semantic drift and narrative collapse over long horizons.
SCOPE Handles Complex Composition Through Structured Decomposition Plus Conditional Skill Orchestration. Introduces "semantic commitment" to explain why multi-constraint image generation drops elements.
Agent Tool-Selection Mistakes Are Already Visible in Hidden State. Imperial College finds tool selection is linearly readable and steerable across 12 instruction-tuned models.
IntentGrasp Fills the "Did the LLM Actually Understand" Eval Gap. 49 open corpora, 12 domains, intent-understanding benchmark.
ModelLens Tackles "How to Pick Among Hundreds of Thousands of Open Models." Skips exhaustive forward passes. Targets new-dataset plus new-model scenarios with no prior records.
InterLV-Search Frees Visual Evidence From Inputs and Answers. Interleaved language-vision agentic search benchmark with three difficulty tiers, 2061 cases total.
BalCapRL Adds Balance to GRPO Training for MLLM Image Captioning. Tackles the conflict between detailed and accurate rewards.
PACEvolve++ Frees Evolutionary Search Agent Policies From Prompt-Elicited Stasis. Improves test-time learning.
Amazon's AGWM Adds Affordance Grounding to World Models. Handles spurious causality from action-result co-occurrence in training data.
