Readable Rules Shouldn't Be Learned into LLM Weights


            
May 15, 2026
    
    


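Readable dynamics should not be learned into the weights: Enterprise World Models uses CascadeBench to show that business rules which drift across tenants become more brittle the better the model learns them. With 58 upvotes, the paper is redrawing the boundary between RAG/tool calling and the model's internal knowledge.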


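AlphaGRPO lets UMMs skip the cold start: it decomposes multimodal rewards into atomic, verifiable questions (DVReward), and GRPO directly unlocks self-reflective refinement. GEdit scores rose too, even though no editing task was trained.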
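To make the reward idea concrete, here is a minimal sketch, not the paper's implementation. It assumes each sample is scored by the fraction of atomic yes/no checks it passes (a hypothetical stand-in for DVReward) and that advantages are normalized within the sampled group, as in standard GRPO. The names `atomic_reward` and `group_relative_advantages` and the example checks are invented for illustration.

```python
from statistics import mean, pstdev


def atomic_reward(checks):
    """Score one sample as the fraction of atomic yes/no checks it passes.

    Hypothetical stand-in for a decomposed, verifiable reward; the real
    DVReward questions and their aggregation are not specified here.
    """
    return sum(checks) / len(checks)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group (GRPO-style advantage).

    Uses the population standard deviation; implementations differ on this.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four samples for the same prompt, each judged on four atomic questions.
group_checks = [
    [True, True, True, False],
    [True, False, False, False],
    [True, True, False, False],
    [True, True, True, True],
]
rewards = [atomic_reward(c) for c in group_checks]
advantages = group_relative_advantages(rewards)
```

Because the advantage is relative to the group, a sample is pushed up only if it passes more checks than its siblings, which is why no separate value model is needed.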


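ToolCUA shifts the training objective from single-step actions to path orchestration, lifting OSWorld-MCP from the baseline's 28% to 46.85%, which is also 3.9% above the pure-GUI setting. CUAs fail on the path, not on the single step.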


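L2P drops the VAE in favor of large patch tokens: it freezes a pretrained LDM as a prior extractor and completes the transfer with 8 GPUs and purely synthetic data, at native 4K. The cost is that GenEval reaches only 93%.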


In asynchronous RL, the importance ratio is quietly being computed wrong: training-inference discrepancy and policy staleness get mixed together and trigger a silent semantic mismatch. PPO-EWMA is a low-cost fix.
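The sketch below illustrates one way to keep the two effects apart; it is not taken from the paper. It assumes PPO-EWMA refers to the decoupled PPO objective, in which the clipping ratio is measured against a proximal policy (an exponentially weighted moving average of recent weights) and a separate weight corrects for the behavior policy that actually sampled the token. The function names, the `max_behav_weight` truncation, and all default values are assumptions for illustration.

```python
import math


def decoupled_token_loss(logp_new, logp_prox, logp_behav, advantage,
                         clip_eps=0.2, max_behav_weight=2.0):
    """Per-token decoupled PPO loss with two separate ratios.

    logp_new:   log-prob under the policy being optimized
    logp_prox:  log-prob under the proximal policy (e.g. EWMA weights)
    logp_behav: log-prob recorded by the sampler (stale and/or from a
                different inference engine)
    """
    # Ratio that controls the trust region: current vs. proximal policy.
    policy_ratio = math.exp(logp_new - logp_prox)
    # Off-policy correction: proximal vs. behavior policy. Truncation is an
    # assumption added here to bound variance.
    behav_weight = min(math.exp(logp_prox - logp_behav), max_behav_weight)
    clipped = max(min(policy_ratio, 1.0 + clip_eps), 1.0 - clip_eps)
    surrogate = min(policy_ratio * advantage, clipped * advantage)
    return -behav_weight * surrogate


def ewma_update(prox_params, new_params, beta=0.9):
    """Move the proximal parameters toward the latest policy parameters."""
    return [beta * p + (1.0 - beta) * q for p, q in zip(prox_params, new_params)]


# On-policy sanity case: all three log-probs agree, so the loss is -advantage.
on_policy_loss = decoupled_token_loss(-1.0, -1.0, -1.0, advantage=2.0)
# Large policy step with positive advantage: the ratio is clipped at 1 + eps.
clipped_loss = decoupled_token_loss(0.0, -1.0, -1.0, advantage=1.0)
```

A naive single ratio, `exp(logp_new - logp_behav)`, would fold the sampler mismatch into the quantity being clipped, which is the kind of mixing the summary above describes.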


Also worth watching

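CHAS tackles the other side of CUA, the scarcity of long-tail interaction data: released the same day as ToolCUA, it offers a synthesis method and a benchmark for complex, low-frequency GUI interactions. (Covering Human Action Space)
An image-editing benchmark and a reward-model benchmark released together: targeting the evaluation ceiling of current frontier models, Edit-Compass + EditReward-Compass form a unified framework. (Edit-Compass)
Splitting thoughts/inputs/outputs into parallel streams: challenges the default assumption that an agent must follow a single message sequence. (Multi-Stream LLMs)
Unsafe behavior in tool-using agents happens at the trajectory level, not in the final response: trajectory-level on-policy self-evolution sidesteps the traditional safety-utility tradeoff. (On-Policy Self-Evolution)
Converting a pretrained LLM into a looped latent refinement model: test-time compute scaling does not require training a recurrent model from scratch and can directly reuse an off-the-shelf LLM. (LoopUS)
World prediction and action generation are coupled: DAWN challenges the serial "predict-then-act" assumption, with maneuver and scene evolution conditioning each other. (DAWN)
Long-horizon agents switch to "map-then-act": build a map of the environment first and then execute, rather than reactively inferring constraints along the way. (MAP)
A black-box DoS attack that induces LRMs to overthink: a hierarchical genetic algorithm triggers overthinking, making the compute availability of reasoning models a new attack surface. (Inducing Overthink)
A speculative inference framework for diffusion-based VLAs: most steps skip full inference, making real-time deployment of dVLAs workable. (Realtime-VLA FLASH)
Planner and simulator co-evolve to address the scarcity of manipulation data: RoboEvolve sidesteps the semantic-spatial misalignment of VLMs/VGMs. (RoboEvolve)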

    
