Agents Went from 80 to 90, but Failure Modes Didn't Change

February 20, 2026

Agent Accuracy Rose from 80 to 90, but Failure Modes Barely Changed. Tests on 14 models show that capability gains did not bring matching reliability gains. Demo-to-production decisions should hinge on failure conditions, not average accuracy.

VLM + Sim RL Bypasses the Demonstration Data Bottleneck. HERO lets humanoid robots manipulate never-seen objects zero-shot, cutting end-effector tracking error by 3.2x.

The Fast-Weight Long-Context Bottleneck Is the Training Objective, Not the Architecture. Switching to next-sequence prediction combined with RL makes fixed-memory models practically competitive on long-context tasks for the first time.

Cold Start and Preference Drift, Solved in One Framework. Princeton's PAHF uses continual learning with dual feedback channels so agents keep up with shifting user preferences.
