Web Agent Online RL Breaks 70%; Reward Models Just Need to Think in Reverse

February 17, 2026

Online RL finally works for web navigation agents. OpAgent reaches a 71.6% success rate on WebArena, more than doubling every previous monolithic baseline.

Reward models don't have to judge forward. FLIP works in reverse, inferring the instruction instead, and a 7–9B model beats LLM-as-Judge by 79.6%.

RL isn't just for alignment anymore: it can also train the reasoning chains of embedding models. Embed-RL teaches cross-modal retrieval to "think before matching."

