AI Research Brief

February 25, 2026

Token Probabilities as Zero-Shot Rewards Hit 0.95 Correlation

  • An LLM builds an internal "world model" of kernel behavior to plan optimization paths. On complex kernels like MoE, it runs 14x faster than evolutionary search, turning operator tuning from random trial-and-error into guided exploration.
  • VLM token probabilities double as reward signals. Pretrained model logits encode task-progress information. Zero-shot correlation hits 0.947 across 130+ real robot tasks.
  • Agent memory evaluation has structural flaws. Benchmarks saturate, metrics disconnect from semantic utility, and swapping the backbone model flips conclusions. A checklist for teams building agent systems.
  • Inverse distillation moves from continuous to discrete domains for diffusion language models. Solving uniqueness and gradient-stability issues unlocks 4–64x step compression.
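The token-probability-as-reward idea above can be sketched in a few lines: ask the VLM a yes/no question about task progress and read the reward off the next-token logits. This is a minimal illustration, not the paper's exact method; the prompt wording, the two-token contrast, and the function name are assumptions.

```python
import math

def yes_probability_reward(logits, yes_id, no_id):
    """Binary-contrast reward from pretrained logits.

    `logits` is the model's next-token logit vector for a prompt like
    "Is the task progressing? Answer yes or no." The reward is the
    softmax probability of "yes" restricted to the yes/no pair.
    (Prompt, token ids, and this restriction are illustrative choices.)
    """
    y, n = logits[yes_id], logits[no_id]
    m = max(y, n)  # subtract the max for numerical stability
    ey, en = math.exp(y - m), math.exp(n - m)
    return ey / (ey + en)

# Toy logit vector: index 0 = "yes", index 1 = "no".
reward = yes_probability_reward([2.0, 0.0], yes_id=0, no_id=1)
```

Because the reward is a deterministic function of a frozen model's logits, it needs no task-specific training, which is what makes a zero-shot correlation measurement possible in the first place.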


Also Notable

  • Spectral evolution patterns guide diffusion model caching, replacing raw feature distance for reuse decisions.
  • One-step text-to-image models gain editing capability through low-energy transport paths instead of brute-force vector arithmetic.
  • Cycle-consistent masks build object-level correspondence between first-person and third-person views, enabling viewpoint-invariant representations.
  • Google proposes a multimodal personalization benchmark using simulated digital footprints to test VLMs' ability to infer user preferences from history.
  • Test-time scaling for general LLM agents: more inference-time compute helps in some scenarios and is pure waste in others.
  • Agent failures stem from path drift, not capability gaps. Canonical path deviation provides a causal explanation for reliability breakdowns.
  • Isotropic Gaussian representations stabilize deep RL training under non-stationary targets, with provable advantages.
  • MIT answers why ReLU works from a computational complexity angle. Training is NP-complete in bit models; real-valued models with ReLU remain tractable.
  • CLIP prompt tuning boosts accuracy but breaks calibration. Two regularization terms restore confidence reliability.
  • Proactive reconstruction attacks detect whether specific data appeared in an LLM's training set by fine-tuning probe models against it.
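The calibration breakage in the CLIP prompt-tuning item above is usually quantified with expected calibration error (ECE): bin predictions by confidence and compare each bin's average confidence to its empirical accuracy. A minimal sketch (equal-width bins; the function name and binning scheme are illustrative, not the paper's):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins.

    `confidences` are predicted max-class probabilities in (0, 1];
    `correct` marks whether each prediction was right (0/1).
    Each bin contributes |avg confidence - accuracy|, weighted by
    its share of the predictions.
    """
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences) if lo < c <= hi]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        accuracy = sum(correct[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(avg_conf - accuracy)
    return ece
```

Prompt tuning that boosts accuracy while inflating confidences shows up here directly: the 0.9-to-1.0 bin fills with predictions whose accuracy lags their stated confidence, so ECE rises even as top-1 accuracy improves.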

Read the full edition →
