Recalibrating the Critic Lifts Reasoning Models 18 Points
- Self-Trained Reasoning Models Stall Because the Critic Drifts. TEMPO recalibrates the critic against a small labeled set. OLMO3-7B jumps from 33% to 51% on AIME 2024, Qwen3-14B from 42% to 66%. Diversity holds.
- 8M–30M Micro LMs Write the First 4–8 Words On-Device. A cloud model continues asynchronously. From the user's perspective, latency disappears; the device-vs-cloud question stops being either-or.
- LoRA's "Locality" Is a Diagnostic Axis Worth Isolating. ShadowPEFT moves adaptation from weight space to layer space using a centralized shadow network. Same architectural signal as the B-matrix symmetry paper from two days ago.
- What Gives Away an AI Shopping Video Isn't Picture Quality. It's hand and face anomalies plus fingers clipping through products. CoInteract bakes spatial structure into generation through dual-stream training, with the auxiliary stream removed at inference, so generation cost stays flat.
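TEMPO's exact recalibration procedure isn't spelled out in the blurb above; the following is a minimal, generic sketch of the idea of recalibrating a drifted critic against a small labeled set, using Platt-style temperature-and-bias fitting on the critic's raw scores. The function names and the gradient-descent fit are illustrative assumptions, not TEMPO's method.

```python
import math

def recalibrate(scores, labels, lr=0.1, steps=500):
    """Fit a temperature t and bias b so that sigmoid(t*s + b)
    matches the labels, via gradient descent on binary cross-entropy.

    scores: raw (possibly drifted) critic scores
    labels: 0/1 ground truth from a small labeled set
    """
    t, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        gt = gb = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(t * s + b)))
            gt += (p - y) * s / n  # d(BCE)/dt
            gb += (p - y) / n      # d(BCE)/db
        t -= lr * gt
        b -= lr * gb
    return t, b

def calibrated(score, t, b):
    """Map a raw critic score to a calibrated probability."""
    return 1.0 / (1.0 + math.exp(-(t * score + b)))
```

If the critic has drifted upward (inflating all scores), the fitted bias comes out negative, pulling borderline scores back below the decision threshold without retraining the critic itself.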
Also Notable
- AnyRecon Treats Video Diffusion as a Universal 3D Reconstruction Prior — Feeds any number of unordered inputs straight in, sidestepping the geometric consistency problem under sparse views.
- Tstars-Tryon 1.0 Publishes Engineering Trade-offs for Production Virtual Try-On — Stability under extreme pose, lighting, and motion blur, plus serving latency, with real deployment detail.
- SmartPhotoCrafter Couples Reasoning, Generation, and Optimization Into End-to-End Photo Editing — Lowers the entry barrier for non-experts who can't write aesthetic instructions.

- Chat2Workflow Is the First Benchmark for LLMs Generating Executable Visual Workflows from Natural Language — Moves the direction from ad-hoc engineering experiments to quantifiable comparison.
- 15 LLMs Across 8 Tasks Show Zero-Shot Ability Explains Only Part of the Variance in Final Optimized Performance — Where the rest comes from is worth digging into.
- CityRAG Turns City Generation Into a Controllable Simulation Environment for Autonomous Driving — Supports arbitrary weather and dynamic-object configuration.
- DASH-KV Accelerates Long-Context Inference With Asymmetric KV Cache Hashing — Sidesteps the generation-quality trade-off in standard KV compression.
- GRASPrune Jointly Prunes FFN Channels and KV Head Groups Post-Pretraining — Structured pruning under a unified budget.
- Treats Evaluation, Not Models, as the Real Bottleneck for Scientific Discovery — A perspective-flipped diagnostic.
- RARE Moves RAG Evaluation Past the "Documents Are Distinct" Assumption, Onto Earnings Reports, Legal Filings, and Patents — Redundancy-aware evaluation is the next gap in RAG benchmarking.