AI Research Brief

March 3, 2026

9K Samples Rival R1, Most RL Gains Trace Back to SFT

  • A 4B reasoning model trained on 9K curated samples approaches DeepSeek-R1. CHIMERA shows the real bottleneck in reasoning training is domain coverage and data curation, not scale.
  • Attention steering is finally production-ready. SEKA edits key embeddings in the frequency domain, bypassing FlashAttention compatibility issues. Training-free, negligible latency. Accepted at ICLR.
  • Foundation vision models carry geometric priors strong enough to replace sensor calibration. VGGT-Det beats prior best by 4–8 mAP on calibration-free 3D detection. Accepted at CVPR.
  • RL post-training mostly sharpens output distributions rather than expanding capability boundaries. Controlled experiments show that SFT's coverage is the real prerequisite for performance gains.
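One way to see the "sharpening, not expanding" claim is through the standard unbiased pass@k estimator: RL can raise pass@1 on problems the base model already sometimes solves, while pass@k at large k barely moves. A minimal sketch; the sample counts below are illustrative, not figures from the paper:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn (without replacement) from n attempts, c of them
    correct, is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical problem the base model solves 20/100 times; RL raises it to 60/100.
print(pass_at_k(100, 20, 1))    # base pass@1, ~0.2
print(pass_at_k(100, 60, 1))    # post-RL pass@1, ~0.6: the distribution sharpened
print(pass_at_k(100, 20, 50))   # base pass@50 is already near 1.0
# On problems where the base model has c = 0, RL has nothing to sharpen,
# which is why SFT coverage is the prerequisite.
```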

Also Notable

  • Mixture of Diffusion Decouples Text Understanding and Visual Generation While Sharing a Backbone. LLaDA-o uses masked diffusion for text and continuous diffusion for images, reducing redundant computation.
  • When RL Can't Sample Correct Solutions, Reference Solutions Guide Exploration but Can't Be Imitated. Human proofs fall outside the model's distribution. SFT can't learn them, but they serve as directional anchors for RL search.
  • GRPO's Advantage Signal Vanishes on Both Too-Hard and Too-Easy Problems. DIVA-GRPO restores gradient signal with difficulty-adaptive advantage computation. Accepted at ICLR.
  • Visual Encoder Aligned to SONAR's 1,500-Language Space. V-SONAR directly reuses existing multilingual infrastructure for cross-modal retrieval without retraining the text side.
  • Multi-Agent Communication Topology Shouldn't Be Fixed. CARD dynamically generates optimal topology based on task conditions, outperforming fixed fully-connected or chain topologies.
  • Fine-Tuning Safety Risks Operate at Token Level, Not Sample Level. Per-token filtering is more precise than dropping entire samples, preserving more useful training signal.
  • LLM Embedding Spaces Have Lattice Structure. Unifies the linear representation hypothesis and formal concept analysis under a single mathematical framework.
  • Visual Programming Framework for 3D Spatial Reasoning. pySpatial works zero-shot without 3D training data, replacing end-to-end learning with code generation.
  • Unlearning Without Gradient Ascent or Retraining. Directly smoothing attention weights stably erases memories with fewer side effects than existing methods.
  • LUT Plus Spatial Shifting for Image Restoration. ShiftLUT expands the receptive field without increasing storage or compute, suitable for on-device deployment.
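On the GRPO point above: the group-normalized advantage collapses to zero whenever every rollout in a group gets the same reward, which is exactly the too-easy (all correct) and too-hard (all wrong) cases. A minimal sketch of vanilla GRPO's advantage computation to show the collapse; the function name and epsilon are illustrative, and this is not DIVA-GRPO's adaptive variant:

```python
import statistics

def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantages as in vanilla GRPO: (r - mean) / (std + eps)."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Mixed-difficulty group: nonzero advantages, so the policy gradient has signal.
print(grpo_advantages([1, 0, 1, 0]))
# Too-easy (all correct) or too-hard (all wrong): every advantage is exactly 0,
# and the gradient for that group vanishes.
print(grpo_advantages([1, 1, 1, 1]))  # [0.0, 0.0, 0.0, 0.0]
print(grpo_advantages([0, 0, 0, 0]))  # [0.0, 0.0, 0.0, 0.0]
```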

Read the full edition →
