Not Enough RL Training Data? Just Compose Easy Problems into Hard Ones

        February 15, 2026

Combine solved easy problems into new hard ones. Composition-RL turns wasted RLVR training samples into effective composite challenges, with consistent gains from 4B to 30B models.

5B parameters doing the job of 80B. DeepGen 1.0 beats rivals more than ten times its size in both image generation and editing, with code and weights fully open-sourced.

Students can surpass their teachers. ExOPD breaks the distillation performance ceiling through "reward extrapolation," and multi-domain expert knowledge can be merged back into a single small model.

1M-token context on a single A6000D. MiniCPM-SALA's hybrid of sparse and linear attention cuts long-context inference cost to one third of the original.

