8% of Tokens Decide the Reasoning Gap
- "Unlearnable" Samples in RLVR. A set of hard examples never gets learned during training, even though rollouts produce correct answers for them. The reward curve climbs anyway, because the easier subset does the work.
- The Reasoning Advantage Is Sparse. The gap between base and reasoning models concentrates in about 8% of tokens, enriched at early planning decisions.
- Single-Model Red-Teaming Isn't Real Protection. Query a set of frontier models concurrently and any weak link delivers harmful output. Success rates reach 100%.
- WOW-Seg Skips the Text Prompt. Meta's Mask2Token aligns masks directly to VLLM feature space. 1/8 the parameters, beats prior SOTA on LVIS.
- 3D Reconstruction Adds Hallucination Score Maps to Diffusion Priors. HAD uses a feedforward novel-view network for cross-validation. Unreliable regions get masked at pixel resolution.
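A rough way to picture the sparsity claim in the reasoning-gap item above: score the same token sequence with both models, take the per-token log-probability difference, and ask how small a slice of tokens covers most of it. The sketch below is an illustration only. The function name `gap_concentration`, the 90% `share` threshold, and the toy log-probabilities are assumptions made for this example, not the paper's metric or data.

```python
from typing import Sequence


def gap_concentration(
    base_logprobs: Sequence[float],
    reasoning_logprobs: Sequence[float],
    share: float = 0.9,
) -> float:
    """Return the smallest fraction of tokens covering `share` of the total gap.

    Hypothetical metric for illustration; not taken from the paper.
    """
    if len(base_logprobs) != len(reasoning_logprobs):
        raise ValueError("both sequences must score the same tokens")
    if not base_logprobs:
        raise ValueError("need at least one token")

    # Per-token gap: how much more likely the reasoning model finds each token.
    # Negative differences are clipped so only the reasoning advantage counts.
    gaps = [max(r - b, 0.0) for b, r in zip(base_logprobs, reasoning_logprobs)]
    total = sum(gaps)
    if total == 0.0:
        return 0.0  # no advantage anywhere, so nothing to concentrate

    # Walk tokens from largest gap to smallest until `share` is covered.
    covered = 0.0
    for count, gap in enumerate(sorted(gaps, reverse=True), start=1):
        covered += gap
        if covered >= share * total:
            return count / len(gaps)
    return 1.0


# Toy data (made up): the two models disagree on a single token out of ten.
toy_base = [-0.1, -0.2, -3.0, -0.1, -0.1, -0.2, -0.1, -0.1, -0.1, -0.1]
toy_reasoning = [-0.1, -0.2, -0.2, -0.1, -0.1, -0.2, -0.1, -0.1, -0.1, -0.1]
print(gap_concentration(toy_base, toy_reasoning))
```

A low return value means the advantage sits in a few tokens; a value near 1.0 means it is spread evenly across the sequence.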
Also Notable
- D²Evo Pairs Two-Level Difficulty Estimation With "Medium Samples Drifting During Training." Read alongside today's RLVR Unlearnability paper. Together they cover both ends of curriculum recalibration: cut the unlearnable, chase the medium.
- GUI Agent Self-Evolution Writes Past Episodes Into Retrievable Memory Instead of Context. Sidesteps two old problems with multi-step tasks: context window limits and static policies that cannot adapt.
- TRACE Does Evidence Grounding Across Multiple Videos. Video agents handling long heterogeneous corpora no longer get capped by context budget. It locates and attributes evidence scattered across the videos.
- Geometric Theory for SSL Projection Heads. Models the head as a trainable Riemannian metric. Explains the collapse and invariance behavior observed in engineering practice.
- PluRule: Same Content, Different Community Rules, Different Compliance Calls. Pluralistic governance pushes content moderation models into compositional stress tests, not single rulebooks.
- Modality-Missing Sentiment Analysis Drops Feature Completion for Decision Drift. Modality loss and quality imbalance are the real-data norm. Generative completion has its own costs.
- Contamination Robustness for Multi-Task Linear Regression. Theoretical, but back-solves an upper bound on outlier-task tolerance for real multi-task training.
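The "cut the unlearnable, chase the medium" reading of the D²Evo item above can be sketched as a triage over per-prompt pass rates tracked across checkpoints. This is a hypothetical heuristic, not either paper's method: the function name `triage`, the `low` and `high` thresholds, and the rule for flagging a prompt as stalled are all assumptions chosen for illustration.

```python
from typing import Dict, List


def triage(
    history: Dict[str, List[float]],
    low: float = 0.2,
    high: float = 0.8,
) -> Dict[str, List[str]]:
    """Sort prompts into stalled / medium / other from pass-rate histories.

    `history` maps a prompt id to its rollout pass rate at each checkpoint,
    oldest first. Thresholds are illustrative assumptions.
    """
    buckets: Dict[str, List[str]] = {"stalled": [], "medium": [], "other": []}
    for prompt_id, rates in history.items():
        if not rates:
            continue  # no rollouts recorded for this prompt yet
        first, last = rates[0], rates[-1]
        solved_sometimes = any(rate > 0.0 for rate in rates)

        if solved_sometimes and last < low and last <= first:
            # Correct rollouts exist, yet the pass rate never improved:
            # a candidate for the "unlearnable" set that could be cut.
            buckets["stalled"].append(prompt_id)
        elif low <= last <= high:
            # Currently medium difficulty: the samples worth chasing.
            buckets["medium"].append(prompt_id)
        else:
            # Already easy, or never solved at all.
            buckets["other"].append(prompt_id)
    return buckets


# Toy histories (made up) for four prompts over three checkpoints.
toy_history = {
    "a": [0.1, 0.1, 0.05],
    "b": [0.3, 0.5, 0.6],
    "c": [0.7, 0.9, 1.0],
    "d": [0.0, 0.0, 0.0],
}
print(triage(toy_history))
```

Because the medium bucket depends on the latest checkpoint, rerunning the triage during training lets the set move as samples drift in difficulty.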