Online RL for Web Agents Breaks 70%, and Reward Models Just Need to Think in the Other Direction
- Online RL finally works for web navigation agents. OpAgent reaches a 71.6% success rate on WebArena, more than double that of every previous monolithic model.
- Reward models don't have to judge in the forward direction. FLIP works backward, inferring the instruction instead, and a small 7–9B model beats LLM-as-Judge by 79.6%.
- RL isn't just for aligning generative models anymore; it can also train the reasoning chains of embedding models. Embed-RL teaches cross-modal retrieval to "think before matching."