Swap the Arm Without Retraining; VLMs See Both the Duck and the Rabbit

June 9, 2026

Swap a robot arm and the whole skill set breaks; the fix is rewiring, not retraining. RECENT writes skills as executable code and locally refactors only the execution bindings that shift with body or environment. That lets a small model handle grounding on-device while matching the task performance of the large-model version.
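
As a rough illustration of the idea (not RECENT's actual code), a skill can be written against a small set of execution bindings, so that swapping the arm means swapping the bindings and leaving the skill logic alone. Every name below (Bindings, pick_and_place, the two arms and their command strings) is hypothetical and chosen only for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Pose = Tuple[float, float, float]


@dataclass
class Bindings:
    """Body-specific execution bindings (hypothetical names)."""
    move_to: Callable[[Pose], str]
    grip: Callable[[bool], str]


def pick_and_place(b: Bindings, src: Pose, dst: Pose) -> List[str]:
    """Skill logic: body-agnostic, depends only on the bindings."""
    return [b.move_to(src), b.grip(True), b.move_to(dst), b.grip(False)]


# Two made-up arms that expose different low-level commands.
arm_a = Bindings(
    move_to=lambda p: f"A.movej{p}",
    grip=lambda close: "A.close" if close else "A.open",
)
arm_b = Bindings(
    move_to=lambda p: f"B.goto{p}",
    grip=lambda close: f"B.gripper({int(close)})",
)

# The same skill runs on both bodies; only the bindings differ.
log_a = pick_and_place(arm_a, (0.1, 0.2, 0.3), (0.4, 0.5, 0.3))
log_b = pick_and_place(arm_b, (0.1, 0.2, 0.3), (0.4, 0.5, 0.3))
```

In this toy version the "rewiring" is done by hand; the summary above describes a small model doing that grounding step on-device.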

Robust-U1 makes the model repair the image before answering, turning robustness into an observable intermediate step. A three-stage self-recovery path handles blur, noise, and occlusion, the kinds of visual corruption that show up only in production, at the cost of an extra reconstruction step.

VLMs actually "see" both readings of a duck-rabbit image. Probes find that 72% of bistable images activate features for both interpretations on the vision side; the bottleneck for steering sits downstream in language, not in the vision tower.

Atmospheric compensation in standoff infrared imaging, long shelved, gets a set-based treatment. The work jointly inverts multiple radiance measurements of one scene as an unordered set; what transfers is the modeling stance, not the LWIR setting itself.

Also Notable

Multiple Teaching Agents Each Propose a Reasonable Plan, but the Student Gets One Answer — a voting protocol coordinates multi-agent collaboration, treating disagreement as a governance problem rather than a capability gap.
A Map for Spending More Compute at Inference Time in Multimodal Models — a systematic survey of test-time scaling across generation and reasoning in multimodal foundation models.
