\"Be Concise\" Halves Tokens, Lifts Accuracy by 16 Points
- "Be Concise" Self-Distillation Halves Tokens and Raises Accuracy. Qwen3 on MATH-500: 57% fewer reasoning tokens, 16-point accuracy gain. Redundant reasoning doesn't just waste compute — it actively introduces errors.
- Policy Already Knows Where It Fails. RoboPocket projects the model's uncertainty via AR onto a phone screen, letting data collectors target weak spots. Collection efficiency doubles.
- Semantic Match Scores High, but the Retrieved Function Crashes. DARE encodes data distribution features into retrieval vectors. NDCG@10 hits 93.47% — 17 points above the best open-source embedding model.
- Video Models Can't Do Physics, So Don't Teach Them. RealWonder offloads physical interactions to a simulator, then translates the result into photorealistic video. 13.2 FPS at 480×832 resolution.
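A minimal sketch of the "be concise" self-distillation loop, assuming a Hugging Face-style chat model. The checkpoint name, sampling settings, and answer check are illustrative, not the paper's recipe:

```python
# Sketch of "be concise" self-distillation (illustrative, not the paper's code).
# Idea: sample concise reasoning traces from the model itself, keep only the
# correct ones, then fine-tune the same model on them with standard SFT.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen3-8B"  # assumption: any chat-tuned Qwen3 checkpoint
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, device_map="auto")

def concise_trace(question: str) -> str:
    """Sample one reasoning trace under a 'be concise' instruction."""
    messages = [
        {"role": "system", "content": "Be concise. Keep reasoning brief."},
        {"role": "user", "content": question},
    ]
    inputs = tok.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    out = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
    return tok.decode(out[0, inputs.shape[1]:], skip_special_tokens=True)

def build_sft_set(problems):
    """problems: [(question, gold_answer), ...]. Keep only correct traces."""
    kept = []
    for question, gold in problems:
        trace = concise_trace(question)
        if str(gold) in trace[-100:]:  # crude final-answer check, illustrative
            kept.append({"prompt": question, "completion": trace})
    return kept  # fine-tune the same model on `kept` with any standard SFT recipe
```

The filtering step is what makes this distillation rather than mere prompting: the model is trained on its own short, correct traces, so the concision survives without the instruction at inference time.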
Also Notable
- SmoothQuant Hits Two Snags on Multimodal LLMs. Vision and language tokens can't share smoothing coefficients, and cross-modal computation invariance doesn't hold. MASQuant processes each modality separately (see the smoothing sketch after this list). CVPR.
- Static Benchmark Leaderboards Are Losing Signal. A new evaluation has models actively interact under budget constraints to gather information before answering. It tests "knowing what to ask," not just "knowing the answer."
- 1.58-Bit BitNet Is a Natural Fit for 2:4 Sparse Pruning. Ternary weights already contain many zeros, so pruning barely hurts accuracy. Quantization and sparsity stack better than expected (see the pruning sketch after this list).
- Plug-in Module Turns ViT Classifiers into Strong Segmenters. No retraining needed. ICLR.
- 181 Hours of Real-Life Video Across Days, Weeks, and Months. Existing long-video models fail badly on sparse events over extended time spans.
- RL-Trained Enterprise Search Agent. Covers six scenarios, including constrained search, cross-document synthesis, and table reasoning. Outperforms general-purpose LLM agents by a wide margin.
- 8.3B MoE Foundation Model for Time Series. 11.5K context length. Time series foundation models finally match language model scale.
- Diffusion Language Models Held Back by Scattered Accept Strategy. Decoding from the longest stable prefix improves both coherence and speed (see the prefix-accept sketch after this list). ICLR.
- Knowledge Graphs Track Manipulative Communication in Long Conversations. Detects gaslighting, guilt-tripping, and similar patterns. Compensates for LLM context window limits. Microsoft.
- Test-Time Adaptation Without Backpropagation on Low-End Devices. Forward-only inference optimizes prompts to handle distribution shift (see the forward-only sketch after this list). CVPR.
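Code Sketches

On the MASQuant item: a sketch in the spirit of SmoothQuant with one smoothing vector per modality, to make the shared-coefficient problem concrete. The function names and the per-modality weight copies are assumptions, not MASQuant's published algorithm:

```python
# Sketch of per-modality activation smoothing in the spirit of SmoothQuant
# (illustrative). SmoothQuant picks a per-channel factor
# s_j = max|X_j|^a / max|W_j|^(1-a) and rewrites Y = (X / s) @ (diag(s) W),
# moving activation outliers into the weights. With mixed modalities, vision
# and language tokens have different outlier statistics, so one shared s fits
# neither; here each modality gets its own factor.
import torch

def smoothing_factors(x_absmax: torch.Tensor, w_absmax: torch.Tensor,
                      alpha: float = 0.5) -> torch.Tensor:
    """Per-input-channel SmoothQuant factor from calibration statistics."""
    return (x_absmax.pow(alpha) / w_absmax.pow(1 - alpha)).clamp(min=1e-5)

def smooth_mixed_modal(x: torch.Tensor, w: torch.Tensor,
                       is_vision: torch.Tensor, alpha: float = 0.5):
    """x: [tokens, d_in], w: [d_in, d_out], is_vision: [tokens] bool mask.
    Returns smoothed activations plus one rescaled weight copy per modality."""
    w_absmax = w.abs().amax(dim=1)                       # [d_in]
    out = torch.empty_like(x)
    weights = {}
    for name, mask in (("vision", is_vision), ("language", ~is_vision)):
        s = smoothing_factors(x[mask].abs().amax(dim=0), w_absmax, alpha)
        out[mask] = x[mask] / s                          # easier to quantize
        weights[name] = w * s.unsqueeze(1)               # folds s into weights
    return out, weights  # quantize out and weights[name] per modality
```

The sketch also exposes the cost: folding each factor into the weights yields one weight copy per modality, so a real implementation has to absorb the factors elsewhere, which is presumably where the engineering in the paper lives.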
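On the BitNet item: a sketch of why 2:4 pruning is nearly free on ternary weights. Shapes and the tie-breaking rule are illustrative:

```python
# Sketch: enforcing 2:4 structured sparsity on 1.58-bit (ternary) weights
# (illustrative). Ternary weights take values in {-1, 0, +1}, so many groups
# of 4 already contain two or more zeros and need no change; only groups with
# 3 or 4 nonzeros lose entries, which is why accuracy barely moves.
import torch

def prune_ternary_2_4(w: torch.Tensor) -> torch.Tensor:
    """Zero all but the 2 largest-magnitude entries in every group of 4.
    w: ternary tensor whose element count is a multiple of 4."""
    groups = w.reshape(-1, 4)
    # Indices of the two largest |w| per group. Ties are broken arbitrarily,
    # which is fine for ternary weights: all nonzeros have equal magnitude.
    keep = groups.abs().topk(2, dim=1).indices
    mask = torch.zeros_like(groups, dtype=torch.bool).scatter_(1, keep, True)
    return (groups * mask).reshape(w.shape)

w = torch.randint(-1, 2, (8, 16)).float()             # toy ternary matrix
sparse = prune_ternary_2_4(w)
assert (sparse.reshape(-1, 4) != 0).sum(1).max() <= 2  # 2:4 pattern holds
```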
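On the diffusion-LM item: a toy comparison of scattered acceptance versus longest-stable-prefix acceptance. The confidence threshold and tensor layout are assumptions:

```python
# Sketch of prefix-based acceptance for parallel diffusion-LM decoding
# (illustrative). A "scattered" rule accepts every position whose confidence
# clears a threshold, leaving holes in the sequence; the prefix rule commits
# only the longest run of consecutive confident tokens from the start, so
# committed text is always left-to-right coherent.
import torch

def scattered_accept(conf: torch.Tensor, tau: float = 0.9) -> torch.Tensor:
    """Accept any position above the threshold (holes allowed)."""
    return conf >= tau

def prefix_accept(conf: torch.Tensor, tau: float = 0.9) -> torch.Tensor:
    """Accept only the longest prefix whose tokens all clear the threshold."""
    ok = conf >= tau
    # cumprod turns [1,1,0,1,...] into [1,1,0,0,...]: stop at first failure
    return ok.long().cumprod(dim=-1).bool()

conf = torch.tensor([0.97, 0.95, 0.60, 0.99, 0.92])
print(scattered_accept(conf))  # tensor([ True,  True, False,  True,  True])
print(prefix_accept(conf))     # tensor([ True,  True, False, False, False])
```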
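On the test-time adaptation item: a generic forward-only prompt search scored by prediction entropy. This is one plausible instantiation, not the paper's algorithm, and the model(prompt, batch) interface is hypothetical:

```python
# Sketch of forward-only test-time prompt adaptation (generic recipe,
# illustrative). Candidate prompts are scored purely by forward passes on the
# shifted test batch, here by mean prediction entropy, and the lowest-entropy
# prompt wins. No backprop, so it fits low-end devices.
import torch

@torch.no_grad()
def adapt_prompt(model, candidates, test_batch):
    """model(prompt, batch) -> logits [batch, classes]; returns best prompt."""
    def entropy(logits):
        p = logits.softmax(dim=-1)
        return -(p * p.clamp_min(1e-9).log()).sum(-1).mean()
    scores = [entropy(model(c, test_batch)) for c in candidates]
    return candidates[int(torch.stack(scores).argmin())]
```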