AI Research Brief

May 10, 2026

Lorem Ipsum Rescues GRPO's Wasted Hard Samples

  • Skill1 Unifies Skill Retrieval, Use, and Distillation in One Policy. A single task reward co-trains all three, avoiding interference between competing reward signals. SkillOS attacks the same problem from a different angle in the same week. Agent continual learning's bottleneck is shifting from single-step inference to skill library operations.
  • DCI Lets Agents Grep the Raw Corpus Directly. Skip embeddings, vector indexes, and retrieval APIs. Beats sparse, dense, and reranking baselines on BRIGHT, BEIR subsets, and BrowseComp-Plus. The retrieval bottleneck moves from algorithm to interface.
  • LoPE Glues Lorem Ipsum Onto the Prompt. From 1.7B to 7B, prepending random Latin beats resampling the original prompt for rescuing GRPO's zero-advantage hard samples. Moving RL exploration from output to input was a path almost nobody had tried seriously.
  • CDM Brings DMD into Continuous Time. Trajectory density and distribution matching, two competing schools, collapse into one framework. 1-4 step generation no longer leans on GAN or reward patches.
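The LoPE bullet above is concrete enough to sketch: instead of resampling the same hard prompt and getting an identical zero-advantage group, prepend a fresh run of random Latin to each rollout so exploration happens on the input side. This is a hypothetical illustration of the idea, not the paper's implementation; `perturb_prompt`, the word list, and the prefix length are all assumptions.

```python
import random

# Toy word pool standing in for "random Latin" (assumption, not from the paper).
LOREM = ("lorem ipsum dolor sit amet consectetur adipiscing elit "
         "sed do eiusmod tempor incididunt ut labore").split()

def perturb_prompt(prompt, n_words=8, seed=None):
    """Prepend n_words of random lorem ipsum to a prompt (hypothetical helper)."""
    rng = random.Random(seed)
    prefix = " ".join(rng.choices(LOREM, k=n_words))
    return f"{prefix}\n{prompt}"

# A GRPO-style group of rollouts for one "hard" sample: each rollout now sees
# a distinct input, rather than the same prompt resampled four times.
rollouts = [perturb_prompt("Prove that 2+2=4.", seed=i) for i in range(4)]
```

The point of the sketch is only the shape of the intervention: the reward, policy, and advantage computation are untouched; only the input distribution per rollout changes.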

Also Notable

  • The Other Skill-Library Paper From the Same Day. SkillOS treats "which skills are worth keeping" as a trainable decision, focusing on learning the curation operator.
  • Trajectory-Level Strategy Sampling for Agentic RL. Improves exploration and credit assignment for reactive policies on long-horizon tasks.
  • "Auto-Research" as a Closed Loop With External Metrics. Specialized agents collaborate to produce auditable trial trajectories rather than a single checkpoint.
  • Multi-Reward Balancing for Diffusion RL Fine-Tuning. MARBLE drops multi-expert and fixed-weight setups for an end-to-end approach.
  • Video Reward Model Decouples Reasoning From Scoring. Think first, then score. The next step for aligning generated video to human preference.
  • Cola DLM Builds a Hierarchical Latent Diffusion Language Model. A complete generation attempt for non-AR text. Worth a glance for anyone tracking AR alternatives.
  • Long-Context Understanding by Different Means. MiA-Signature approximates global activations' downstream impact with a compact representation, sidestepping full attention's O(N²).
  • TIDE Questions "Token Index Injected Once at the Embedding Layer." Re-injects token identity at every layer to address rare-token and long-range degradation.
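TIDE's complaint in the last bullet, that token identity injected once at the embedding layer gets diluted with depth, can be illustrated with a toy numeric model. This is a loose sketch, not the paper's architecture: `mix` is a stand-in for a transformer layer that smears information across positions, and re-adding `tok_emb` each layer stands in for TIDE's per-layer re-injection.

```python
def mix(h):
    # Toy "layer": average each position with its circular neighbors,
    # standing in for deep-layer dilution of per-token identity.
    n = len(h)
    return [(h[(i - 1) % n] + h[i] + h[(i + 1) % n]) / 3 for i in range(n)]

def run(tok_emb, depth, reinject):
    h = list(tok_emb)
    for _ in range(depth):
        h = mix(h)
        if reinject:  # TIDE-style: add token identity back at every layer
            h = [a + b for a, b in zip(h, tok_emb)]
    return h

emb = [1.0, 0.0, 0.0, 0.0]                 # a "rare token" spike at position 0
once = run(emb, depth=12, reinject=False)  # inject-once: spike diffuses away
tide = run(emb, depth=12, reinject=True)   # position 0 stays distinguishable
```

With inject-once, twelve mixing layers wash the spike out toward the uniform mean; with re-injection, position 0 remains clearly separated from its neighbors at the output, which is the degradation-vs-retention contrast the bullet describes.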

