\"Be Concise\" Halves Tokens, Lifts Accuracy by 16 Points
- "Be Concise" Self-Distillation Halves Tokens and Raises Accuracy. Qwen3 on MATH-500: 57% fewer reasoning tokens, 16-point accuracy gain. Redundant reasoning doesn't just waste compute — it actively introduces errors.
- Policy Already Knows Where It Fails. RoboPocket projects the model's uncertainty via AR onto a phone screen, letting data collectors target weak spots. Collection efficiency doubles.
- Semantic Match Scores High, but the Retrieved Function Crashes. DARE encodes data distribution features into retrieval vectors. NDCG@10 hits 93.47% — 17 points above the best open-source embedding model.
- Video Models Can't Do Physics, So Don't Teach Them. RealWonder offloads physical interactions to a simulator, then translates the result into photorealistic video. 13.2 FPS at 480×832 resolution.
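A minimal sketch of the "be concise" self-distillation loop, assuming a Hugging Face-style chat model. The checkpoint name, sampling settings, and answer check are illustrative, not the paper's recipe:

```python
# Sketch of "be concise" self-distillation (illustrative, not the paper's code).
# Idea: sample concise reasoning traces from the model itself, keep only the
# correct ones, then fine-tune the same model on them with standard SFT.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen3-8B"  # assumption: any chat-tuned Qwen3 checkpoint
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, device_map="auto")

def concise_trace(question: str) -> str:
    """Sample one reasoning trace under a 'be concise' instruction."""
    messages = [
        {"role": "system", "content": "Be concise. Keep reasoning brief."},
        {"role": "user", "content": question},
    ]
    inputs = tok.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    out = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
    return tok.decode(out[0, inputs.shape[1]:], skip_special_tokens=True)

def build_sft_set(problems):
    """problems: [(question, gold_answer), ...]. Keep only correct traces."""
    kept = []
    for question, gold in problems:
        trace = concise_trace(question)
        if str(gold) in trace[-100:]:  # crude final-answer check, illustrative
            kept.append({"prompt": question, "completion": trace})
    return kept  # fine-tune the same model on `kept` with any standard SFT recipe
```

The filtering step is what makes this distillation rather than mere prompting: the model is trained on its own short, correct traces, so the concision survives without the instruction at inference time.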
Also Notable
- SmoothQuant Hits Two Snags on Multimodal LLMs. Vision and language tokens can't share smoothing coefficients, and cross-modal computation invariance doesn't hold. MASQuant processes each modality separately (see the smoothing sketch after this list). CVPR.
- Static Benchmark Leaderboards Are Losing Signal. A new evaluation has models actively interact under budget constraints to gather information before answering. It tests "knowing what to ask," not just "knowing the answer."
- 1.58-Bit BitNet Is a Natural Fit for 2:4 Sparse Pruning. Ternary weights already contain many zeros, so pruning barely hurts accuracy. Quantization and sparsity stack better than expected (see the pruning sketch after this list).
- Plug-in Module Turns ViT Classifiers into Strong Segmenters. No retraining needed. ICLR.
- 181 Hours of Real-Life Video Across Days, Weeks, and Months. Existing long-video models fail badly on sparse events over extended time spans.
- RL-Trained Enterprise Search Agent. Covers six scenarios, including constrained search, cross-document synthesis, and table reasoning. Outperforms general-purpose LLM agents by a wide margin.
- 8.3B MoE Foundation Model for Time Series. 11.5K context length. Time series foundation models finally match language model scale.
- Diffusion Language Models Held Back by Scattered Accept Strategy. Decoding from the longest stable prefix improves both coherence and speed (see the prefix-accept sketch after this list). ICLR.
- Knowledge Graphs Track Manipulative Communication in Long Conversations. Detects gaslighting, guilt-tripping, and similar patterns. Compensates for LLM context window limits. Microsoft.
- Test-Time Adaptation Without Backpropagation on Low-End Devices. Forward-only inference optimizes prompts to handle distribution shift (see the forward-only sketch after this list). CVPR.
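Code Sketches

On the MASQuant item: a sketch in the spirit of SmoothQuant with one smoothing vector per modality, to make the shared-coefficient problem concrete. The function names and the per-modality weight copies are assumptions, not MASQuant's published algorithm:

```python
# Sketch of per-modality activation smoothing in the spirit of SmoothQuant
# (illustrative). SmoothQuant picks a per-channel factor
# s_j = max|X_j|^a / max|W_j|^(1-a) and rewrites Y = (X / s) @ (diag(s) W),
# moving activation outliers into the weights. With mixed modalities, vision
# and language tokens have different outlier statistics, so one shared s fits
# neither; here each modality gets its own factor.
import torch

def smoothing_factors(x_absmax: torch.Tensor, w_absmax: torch.Tensor,
                      alpha: float = 0.5) -> torch.Tensor:
    """Per-input-channel SmoothQuant factor from calibration statistics."""
    return (x_absmax.pow(alpha) / w_absmax.pow(1 - alpha)).clamp(min=1e-5)

def smooth_mixed_modal(x: torch.Tensor, w: torch.Tensor,
                       is_vision: torch.Tensor, alpha: float = 0.5):
    """x: [tokens, d_in], w: [d_in, d_out], is_vision: [tokens] bool mask.
    Returns smoothed activations plus one rescaled weight copy per modality."""
    w_absmax = w.abs().amax(dim=1)                       # [d_in]
    out = torch.empty_like(x)
    weights = {}
    for name, mask in (("vision", is_vision), ("language", ~is_vision)):
        s = smoothing_factors(x[mask].abs().amax(dim=0), w_absmax, alpha)
        out[mask] = x[mask] / s                          # easier to quantize
        weights[name] = w * s.unsqueeze(1)               # folds s into weights
    return out, weights  # quantize out and weights[name] per modality
```

The sketch also exposes the cost: folding each factor into the weights yields one weight copy per modality, so a real implementation has to absorb the factors elsewhere, which is presumably where the engineering in the paper lives.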
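On the BitNet item: a sketch of why 2:4 pruning is nearly free on ternary weights. Shapes and the tie-breaking rule are illustrative:

```python
# Sketch: enforcing 2:4 structured sparsity on 1.58-bit (ternary) weights
# (illustrative). Ternary weights take values in {-1, 0, +1}, so many groups
# of 4 already contain two or more zeros and need no change; only groups with
# 3 or 4 nonzeros lose entries, which is why accuracy barely moves.
import torch

def prune_ternary_2_4(w: torch.Tensor) -> torch.Tensor:
    """Zero all but the 2 largest-magnitude entries in every group of 4.
    w: ternary tensor whose element count is a multiple of 4."""
    groups = w.reshape(-1, 4)
    # Indices of the two largest |w| per group. Ties are broken arbitrarily,
    # which is fine for ternary weights: all nonzeros have equal magnitude.
    keep = groups.abs().topk(2, dim=1).indices
    mask = torch.zeros_like(groups, dtype=torch.bool).scatter_(1, keep, True)
    return (groups * mask).reshape(w.shape)

w = torch.randint(-1, 2, (8, 16)).float()             # toy ternary matrix
sparse = prune_ternary_2_4(w)
assert (sparse.reshape(-1, 4) != 0).sum(1).max() <= 2  # 2:4 pattern holds
```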
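On the diffusion-LM item: a toy comparison of scattered acceptance versus longest-stable-prefix acceptance. The confidence threshold and tensor layout are assumptions:

```python
# Sketch of prefix-based acceptance for parallel diffusion-LM decoding
# (illustrative). A "scattered" rule accepts every position whose confidence
# clears a threshold, leaving holes in the sequence; the prefix rule commits
# only the longest run of consecutive confident tokens from the start, so
# committed text is always left-to-right coherent.
import torch

def scattered_accept(conf: torch.Tensor, tau: float = 0.9) -> torch.Tensor:
    """Accept any position above the threshold (holes allowed)."""
    return conf >= tau

def prefix_accept(conf: torch.Tensor, tau: float = 0.9) -> torch.Tensor:
    """Accept only the longest prefix whose tokens all clear the threshold."""
    ok = conf >= tau
    # cumprod turns [1,1,0,1,...] into [1,1,0,0,...]: stop at first failure
    return ok.long().cumprod(dim=-1).bool()

conf = torch.tensor([0.97, 0.95, 0.60, 0.99, 0.92])
print(scattered_accept(conf))  # tensor([ True,  True, False,  True,  True])
print(prefix_accept(conf))     # tensor([ True,  True, False, False, False])
```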
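On the test-time adaptation item: a generic forward-only prompt search scored by prediction entropy. This is one plausible instantiation, not the paper's algorithm, and the model(prompt, batch) interface is hypothetical:

```python
# Sketch of forward-only test-time prompt adaptation (generic recipe,
# illustrative). Candidate prompts are scored purely by forward passes on the
# shifted test batch, here by mean prediction entropy, and the lowest-entropy
# prompt wins. No backprop, so it fits low-end devices.
import torch

@torch.no_grad()
def adapt_prompt(model, candidates, test_batch):
    """model(prompt, batch) -> logits [batch, classes]; returns best prompt."""
    def entropy(logits):
        p = logits.softmax(dim=-1)
        return -(p * p.clamp_min(1e-9).log()).sum(-1).mean()
    scores = [entropy(model(c, test_batch)) for c in candidates]
    return candidates[int(torch.stack(scores).argmin())]
```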