Scrambled Media Boosts Reasoning; 6B Model Tops GPT-4o
- Agent Skills Should Self-Evolve From User Populations. SkillClaw turns multi-user interaction traces into skill evolution signals. One user's correction auto-syncs to everyone, giving agent systems organizational memory.
- Smart Compression Beats Brute-Force Context Windows. Tempo uses a 6B model to select key frames per query. Under an 8K token budget, it outperforms GPT-4o and Gemini 1.5 Pro.
- Lighting Becomes a First-Class Citizen in Video Generation. LiVER decouples lighting, layout, and camera motion through a physics renderer. Accepted at CVPR, targeting professional production workflows.
- Scramble Audio and Video, Let the Model Reassemble. OmniJigsaw uses zero-annotation temporal reordering as a proxy task, forcing models to integrate audiovisual signals. Validated across 15 benchmarks.
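The reordering proxy task above needs no labels because the supervision comes from the shuffle itself. A minimal sketch of how such a training example can be constructed — the segment count, feature format, and function name are illustrative assumptions, not OmniJigsaw's actual pipeline:

```python
import random

def make_reordering_example(av_clips, num_segments=4, seed=None):
    """Build a zero-annotation reordering example from paired
    audio-video segments: shuffle them, and keep the permutation
    that restores the original order as the supervision target."""
    rng = random.Random(seed)
    assert len(av_clips) == num_segments
    order = list(range(num_segments))
    rng.shuffle(order)                       # e.g. [2, 0, 3, 1]
    shuffled = [av_clips[i] for i in order]
    # Label: the position of each original segment inside the
    # shuffled sequence (the inverse permutation).
    label = [order.index(i) for i in range(num_segments)]
    return shuffled, label

clips = [("a0", "v0"), ("a1", "v1"), ("a2", "v2"), ("a3", "v3")]
shuffled, label = make_reordering_example(clips, seed=0)
# A model consumes `shuffled` (both modalities jointly) and is
# trained to predict `label`, which forces it to align audio cues
# with video cues rather than rely on either stream alone.
```

Because the target is derived mechanically from the shuffle, any unlabeled audiovisual corpus becomes training data for free.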
Also Notable
- 170K Style Descriptions + 400K Prompts Build a Scalable Data Pipeline. Uses generative models' own style consistency to solve the data bottleneck for style transfer.
- RLVR Improves Accuracy but Degrades Reasoning Chains. The chain-of-thought decouples from the visual evidence: correct answers don't guarantee correct reasoning.
- Virtual Try-On Starts Caring About Fit. The first try-on dataset with precise sizing annotations, judging whether a garment actually fits rather than just whether it looks good overlaid.
- Gradient-Signal-Driven Adaptive Layer Sampling. Achieves near-full-parameter fine-tuning results with half the memory (ACL).
- Stronger LLMs Cooperate Less Under Zero-Cost Collaboration. Cooperation failure in multi-agent systems is a real risk (ICLR).
- Agent Reward Models Can't Just Evaluate Single Steps. They need to assess entire planning trajectories (ACL).
- Annotation-Free Medical Visual Reasoning. Agentic RL lets models autonomously locate visual evidence before making judgments (ICLR).
- More Training Data Isn't Always Better for Search Agents. A hierarchical experience framework filters high-value trajectories from random exploration.
- Testing VLM Long-Horizon Interaction in a Pokémon 3D Environment. Closer to real agent deployment than static image-text benchmarks.
- Continuously Editing VLM Knowledge Without Forgetting. A subspace alignment method preserves old concepts during updates (CVPR).
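The adaptive layer sampling bullet above is the one item that reduces to a concrete selection rule. A minimal sketch of one plausible criterion — ranking layers by recent gradient norm and updating only a fixed budget of them per step; the function name, the top-k rule, and the 0.5 budget are assumptions for illustration, not the paper's published method:

```python
def select_layers(grad_norms, budget=0.5):
    """Pick the layers with the largest recent gradient norms,
    up to `budget` (fraction of layers updated per step).
    Unselected layers stay frozen and carry no optimizer state,
    which is where the memory savings come from."""
    k = max(1, int(len(grad_norms) * budget))
    ranked = sorted(range(len(grad_norms)),
                    key=lambda i: grad_norms[i], reverse=True)
    return sorted(ranked[:k])

# Per-layer gradient norms from the previous step (toy values).
norms = [0.02, 0.31, 0.07, 0.54, 0.11, 0.03, 0.48, 0.09]
active = select_layers(norms, budget=0.5)
# → [1, 3, 4, 6]: only these layers receive updates this step.
```

Recomputing the selection as gradient signals drift lets the schedule adapt over training instead of fixing a layer subset up front.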
Don't miss what's next. Subscribe to AI Research Brief.