Swap the Arm Without Retraining; VLMs See Both the Duck and the Rabbit
- Swap a robot arm and the whole skill set breaks — the fix is rewiring, not retraining. RECENT writes skills as executable code and locally refactors only the execution bindings that shift with body or environment, letting a small model handle grounding on-device and matching the large-model version's task performance.
- Robust-U1 makes the model repair the image before answering, turning robustness into an observable intermediate. A three-stage self-recovery path handles blur, noise, and occlusion — the visual corruption that only shows up in production — at the cost of an extra reconstruction step.
- VLMs actually "see" both readings of a duck-rabbit image. Probes find that for 72% of bistable images, the vision side activates features for both interpretations; the bottleneck for steering sits downstream in language, not in the vision tower.
- Atmospheric compensation in standoff infrared imaging, long shelved, gets a set-based treatment. The work jointly inverts multiple radiance measurements of one scene as an unordered set; what transfers is the modeling stance, not the LWIR setting itself.
Also Notable
- Multiple Teaching Agents Each Propose a Reasonable Plan, but the Student Gets One Answer — a voting protocol coordinates multi-agent collaboration, treating disagreement as a governance problem rather than a capability gap.
- A Map for Spending More Compute at Inference Time in Multimodal Models — a systematic survey of test-time scaling across generation and reasoning in multimodal foundation models.