Not enough RL training data? Compose easy problems into hard ones
- Combine solved easy problems into new hard ones. Composition-RL composes problems the model already answers correctly into harder composite problems, substantially raising the effective utilization of RLVR training data, with consistent gains from 4B to 30B models.
- 5B parameters doing the job of 80B. DeepGen 1.0 beats competitors more than 10x its size in both image generation and editing, with code and weights fully open-sourced.
- Students can not only learn from their teachers but surpass them. ExOPD breaks the distillation performance ceiling through "reward extrapolation," and multi-domain expert knowledge can be merged back into a single small model.
- 1M-token context on a single A6000D. MiniCPM-SALA's hybrid sparse + linear attention architecture cuts long-context inference cost to one third of the original.