SFT Convergence Hides Failures, Attention Hijacking Hits 94%
- SFT loss convergence doesn't mean the model learned everything. Five systematic failure modes, reproduced across three model families, show that aggregate metrics can hide persistently unlearned subsets (first sketch after this list).
- Reward models don't need CoT reasoning for every score. E-GRM uses generation consistency to estimate uncertainty, skips deep reasoning on easy samples, and improves accuracy while cutting cost (second sketch after this list).
- Coding agent rankings reshuffle under credit budgets. Frontier agents can't find the optimal accuracy-cost tradeoff under resource constraints, and their behavior is highly path-dependent.
- Hijacking attention weights makes models "blind" to safety instructions, with a 94.4% jailbreak success rate. The attack doesn't force the model to violate its rules; it prevents the model from retrieving them during generation (third sketch after this list).
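The SFT finding only surfaces if you track loss per subset rather than in aggregate. Here is a minimal sketch of that bookkeeping, assuming each eval example carries a subset tag; `loss_fn`, the tag field, and the 2x flagging threshold are illustrative choices, not details from the paper.

```python
# Minimal sketch: aggregate SFT loss can converge while a tagged
# subset stays unlearned. Subset tags and thresholds are illustrative.
from collections import defaultdict

def per_subset_loss(examples, loss_fn):
    """Average loss per subset tag alongside the aggregate average."""
    totals, counts = defaultdict(float), defaultdict(int)
    for ex in examples:
        loss = loss_fn(ex)            # per-example SFT loss
        totals[ex["subset"]] += loss
        counts[ex["subset"]] += 1
    aggregate = sum(totals.values()) / sum(counts.values())
    return aggregate, {k: totals[k] / counts[k] for k in totals}

# Usage: a comfortable aggregate can hide a subset stuck near its
# initial loss.
# aggregate, by_subset = per_subset_loss(eval_set, model_loss)
# stuck = {k: v for k, v in by_subset.items() if v > 2 * aggregate}
```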
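For the E-GRM item, the core idea is a consistency gate: if several cheap judgments agree, deep reasoning adds little. A hedged sketch of that gating logic follows; `quick_score`, `cot_score`, and the agreement threshold are hypothetical stand-ins, and the paper's actual sampling scheme and uncertainty estimate may differ.

```python
# Consistency-gated reward scoring in the spirit of E-GRM (sketch).
from collections import Counter

def gated_reward(prompt, response, quick_score, cot_score, k=4, agree=0.75):
    """Skip expensive CoT judging when k cheap samples agree."""
    votes = [quick_score(prompt, response) for _ in range(k)]  # cheap, no CoT
    label, n = Counter(votes).most_common(1)[0]
    if n / k >= agree:      # high consistency => low estimated uncertainty
        return label        # accept the cheap verdict
    return cot_score(prompt, response)  # uncertain: pay for full reasoning
```

The gate is where the cost savings come from: easy samples terminate after k cheap calls, and only the uncertain tail pays for chain-of-thought.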
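For the attention-hijacking item, a toy single-head attention step shows why the model goes "blind": positions with suppressed logits get zero softmax weight, so the safety tokens' values never reach the output. This illustrates the mechanism only, not the paper's actual attack implementation.

```python
# Toy illustration: suppressing attention logits to safety-instruction
# positions removes their content from the attention output entirely.
import numpy as np

def attend(q, K, V, blocked=()):
    """Single-query softmax attention; `blocked` positions get -inf logits."""
    logits = K @ q / np.sqrt(len(q))
    logits[list(blocked)] = -np.inf   # model becomes "blind" to these tokens
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ V                      # blocked rows contribute zero weight

# With safety-instruction tokens at, say, positions 0..7 blocked, the
# output mixes only the remaining (attacker-controlled) token values.
```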
Also Notable
- LLMs Score Well on ToM Benchmarks but Fail in Practice — Causal intervention methods attempt to align theory-of-mind capabilities at the internal representation level.
- Text-to-CAD Code Generation Needs Assembly Hierarchy — Hierarchical graph representations significantly outperform direct seq2seq approaches.
- Explicit Sentence Boundaries Beat Random Dummy Tokens — Natural language sentence structure itself is a useful inductive bias.
- AI-Generated Classical Poetry Detection Still Lags Human Judgment — AI detection in literary domains is far from solved.