4M Game Frames Train Generative Rendering; Internalized Skills Beat Retrieval
- Discrete Tokens Are LLMs' Architectural Ceiling, Not an Optimization Target. A survey traces four technical threads that show core computation migrating from token sequences to continuous latent space.
- Agent Skills Work Better Internalized via RL Than Retrieved at Runtime. SKILL0's progressive withdrawal curriculum improves ALFWorld performance by 9.7% while using fewer than 500 tokens per step at inference.
- AAA Game Engines Are a Hidden Data Goldmine for Generative Rendering. 4 million synchronized RGB + G-buffer frames produce models that clearly outperform existing solutions on cross-dataset generalization.
- Visual Features Can Be Steered in Real Time with Text Prompts. Injecting cross-attention inside ViT encoder layers enables zero-shot generalization on anomaly detection without degrading general capabilities.
Also Notable
- Cross-Modal Reasoning in Latent Space — avoids the information loss of translating visual content to text. LatentUM
- Multiple LLM Agents Autonomously Explore, Reflect, and Collaborate on Open Problems — no more hardcoded search rules. CORAL
- Near-Identity Distractors Remove Background Dependency from Visual Encoders — identity representations that actually focus on the subject. NearID
- Video Inpainting Beyond Filling Gaps — when the removed object has physical interactions, the entire scene's causal chain needs re-reasoning. VOID
- Autonomous Driving VLAs Can't Do Spatial Awareness and Semantic Reasoning at Once — an attempt to unify both in a single framework. UniDriveVLA
- 3D Textures as an Adversarial Attack Surface — closer to real deployment than 2D patches, a warning sign for VLA model robustness. Tex3D
- Bridging 3D Data Scarcity with 2D Generation — a foundation model unifying text-to-2D and text-to-3D generation. Omni123
- Graph-Based Synthesis of Cross-Modal Multi-Hop Reasoning Data — addressing the single-image limitation of existing multimodal benchmarks. CRIT
- Arbitrary-Resolution Images in a Single Forward Pass — freeing ViTs from pretrained resolution constraints on dense prediction tasks. SPAR
- Visual Riddles Test Visual Reasoning — when images are clues rather than answers, current models' cognitive abilities drop off a cliff. RebusBench