Move to See: Top Model Reaches the Target Just 12% of the Time
- Spatial intelligence is shifting from passive understanding to active perception. TVR asks an agent to turn and step through a 3D room until its view matches a target photo. The strongest closed model succeeds only 12% of the time, but vision-action supervised fine-tuning (SFT) lifts a 9B open model's success rate from single digits to above 50%.
- Long-context compression can preserve code reasoning. LongAttnComp trains a single lightweight scoring layer once and reuses it across three model families; after compression, it matches full-context performance on code debugging.
- VLMs writing code to build 3D models fail in specific ways. 3DCodeBench drops 12 VLMs into real modeling software. Most failures come from wrong API calls and disconnected geometry; multi-turn iteration with execution feedback lets the models recover.
- The frontier in skill adaptation moves to attribution granularity. SkillAdaptor pushes failure blame down from the whole trajectory to the specific step. Backbones stay frozen, no training needed, and every skill edit is auditable — even if each gain is only +1.5 points.
Also Notable
- Does VLM Document Understanding Transfer Across Languages? HakushoBench builds a Japanese chart and table VQA benchmark from government white papers, targeting the blind spot of non-English document understanding.
- Legal and Humanities Citations Hide in Footnotes. Existing extraction tools are built for the structured end-of-paper references of the natural sciences; FOSSIL provides a dataset and pipeline for footnote citations interwoven with commentary.
- Updating Parameters Only at a Few Moments Can Still Be Near-Optimal. An algorithm for linear contextual bandits under a constraint of very few parameter updates: observation and action selection stay online, but reward feedback is merged in only at select moments, a setting close to real engineering limits (accepted at ICML).
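
For readers who want a feel for the few-updates bandit setting, here is a minimal sketch. It is not the paper's algorithm: it swaps in a well-known trick from rarely-switching linear bandits, where the acting parameters are refreshed only when the determinant of the design matrix has doubled. The class name `RareUpdateLinUCB`, the `simulate` helper, and all parameter values are hypothetical choices for illustration.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

class RareUpdateLinUCB:
    """Illustrative LinUCB variant: rewards are merged only when det(V) grows by `growth`."""

    def __init__(self, dim, alpha=1.0, growth=2.0):
        self.dim, self.alpha, self.log_growth = dim, alpha, math.log(growth)
        # Live design-matrix statistics depend only on chosen contexts, so they stay online.
        self.V_inv = [[float(i == j) for j in range(dim)] for i in range(dim)]
        self.log_det = 0.0  # log det(V), with V initialised to the identity
        self.pending = []   # (context, reward) pairs waiting for the next merge
        self.b = [0.0] * dim
        # Frozen snapshot that action selection actually uses.
        self.frozen_V_inv = [row[:] for row in self.V_inv]
        self.frozen_theta = [0.0] * dim
        self.frozen_log_det = 0.0
        self.num_updates = 0

    def select(self, arms):
        """Pick the arm with the highest upper confidence bound under the frozen snapshot."""
        def ucb(x):
            width = math.sqrt(max(0.0, dot(x, matvec(self.frozen_V_inv, x))))
            return dot(self.frozen_theta, x) + self.alpha * width
        return max(range(len(arms)), key=lambda i: ucb(arms[i]))

    def observe(self, x, reward):
        """Record the round online; merge buffered rewards only if det(V) grew enough."""
        Vx = matvec(self.V_inv, x)
        denom = 1.0 + dot(x, Vx)
        for i in range(self.dim):  # Sherman-Morrison rank-one update of V^{-1}
            for j in range(self.dim):
                self.V_inv[i][j] -= Vx[i] * Vx[j] / denom
        self.log_det += math.log(denom)  # matrix determinant lemma
        self.pending.append((x, reward))
        if self.log_det - self.frozen_log_det >= self.log_growth:
            self._merge()

    def _merge(self):
        """Fold buffered rewards into b and refresh the frozen parameters."""
        for x, r in self.pending:
            for i in range(self.dim):
                self.b[i] += r * x[i]
        self.pending.clear()
        self.frozen_V_inv = [row[:] for row in self.V_inv]
        self.frozen_theta = matvec(self.frozen_V_inv, self.b)
        self.frozen_log_det = self.log_det
        self.num_updates += 1

def simulate(rounds=500, dim=3, n_arms=5, seed=0):
    """Toy environment (an assumption): unit-norm random arms, linear reward plus noise."""
    rng = random.Random(seed)
    true_theta = [rng.uniform(-1, 1) for _ in range(dim)]
    agent = RareUpdateLinUCB(dim)
    for _ in range(rounds):
        arms = []
        for _ in range(n_arms):
            raw = [rng.gauss(0, 1) for _ in range(dim)]
            norm = math.sqrt(dot(raw, raw))
            arms.append([v / norm for v in raw])
        choice = agent.select(arms)
        reward = dot(true_theta, arms[choice]) + rng.gauss(0, 0.1)
        agent.observe(arms[choice], reward)
    return agent

if __name__ == "__main__":
    agent = simulate()
    print("parameter updates:", agent.num_updates, "out of 500 rounds")
```

With unit-norm contexts, the determinant can double only a logarithmic number of times, so in this toy setup the agent acts every round but refreshes its parameters only a handful of times.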