SFT Convergence Hides Failures, Attention Hijacking Hits 94%
- SFT loss convergence doesn't mean the model learned everything. Five systematic failure modes, reproduced across three model families, show that aggregate metrics can hide persistently unlearned subsets (first sketch after this list).
- Reward models don't need CoT reasoning for every score. E-GRM uses generation consistency to estimate uncertainty, skips deep reasoning on easy samples, and improves accuracy while cutting cost (second sketch after this list).
- Coding agent rankings reshuffle under credit budgets. Frontier agents can't find the optimal accuracy-cost tradeoff under resource constraints, and their behavior is highly path-dependent.
- Hijacking attention weights makes models "blind" to safety instructions, with a 94.4% jailbreak success rate. The attack doesn't force the model to violate its rules; it prevents the model from retrieving them during generation (third sketch after this list).
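The SFT finding only surfaces if you track loss per subset rather than in aggregate. Here is a minimal sketch of that bookkeeping, assuming each eval example carries a subset tag; `loss_fn`, the tag field, and the 2x flagging threshold are illustrative choices, not details from the paper.

```python
# Minimal sketch: aggregate SFT loss can converge while a tagged
# subset stays unlearned. Subset tags and thresholds are illustrative.
from collections import defaultdict

def per_subset_loss(examples, loss_fn):
    """Average loss per subset tag alongside the aggregate average."""
    totals, counts = defaultdict(float), defaultdict(int)
    for ex in examples:
        loss = loss_fn(ex)            # per-example SFT loss
        totals[ex["subset"]] += loss
        counts[ex["subset"]] += 1
    aggregate = sum(totals.values()) / sum(counts.values())
    return aggregate, {k: totals[k] / counts[k] for k in totals}

# Usage: a comfortable aggregate can hide a subset stuck near its
# initial loss.
# aggregate, by_subset = per_subset_loss(eval_set, model_loss)
# stuck = {k: v for k, v in by_subset.items() if v > 2 * aggregate}
```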
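For the E-GRM item, the core idea is a consistency gate: if several cheap judgments agree, deep reasoning adds little. A hedged sketch of that gating logic follows; `quick_score`, `cot_score`, and the agreement threshold are hypothetical stand-ins, and the paper's actual sampling scheme and uncertainty estimate may differ.

```python
# Consistency-gated reward scoring in the spirit of E-GRM (sketch).
from collections import Counter

def gated_reward(prompt, response, quick_score, cot_score, k=4, agree=0.75):
    """Skip expensive CoT judging when k cheap samples agree."""
    votes = [quick_score(prompt, response) for _ in range(k)]  # cheap, no CoT
    label, n = Counter(votes).most_common(1)[0]
    if n / k >= agree:      # high consistency => low estimated uncertainty
        return label        # accept the cheap verdict
    return cot_score(prompt, response)  # uncertain: pay for full reasoning
```

The gate is where the cost savings come from: easy samples terminate after k cheap calls, and only the uncertain tail pays for chain-of-thought.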
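For the attention-hijacking item, a toy single-head attention step shows why the model goes "blind": positions with suppressed logits get zero softmax weight, so the safety tokens' values never reach the output. This illustrates the mechanism only, not the paper's actual attack implementation.

```python
# Toy illustration: suppressing attention logits to safety-instruction
# positions removes their content from the attention output entirely.
import numpy as np

def attend(q, K, V, blocked=()):
    """Single-query softmax attention; `blocked` positions get -inf logits."""
    logits = K @ q / np.sqrt(len(q))
    logits[list(blocked)] = -np.inf   # model becomes "blind" to these tokens
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ V                      # blocked rows contribute zero weight

# With safety-instruction tokens at, say, positions 0..7 blocked, the
# output mixes only the remaining (attacker-controlled) token values.
```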
Also Notable
- LLMs Score Well on ToM Benchmarks but Fail in Practice — Causal intervention methods attempt to align theory-of-mind capabilities at the internal representation level.
- Text-to-CAD Code Generation Needs Assembly Hierarchy — Hierarchical graph representations significantly outperform direct seq2seq approaches.
- Explicit Sentence Boundaries Beat Random Dummy Tokens — Natural language sentence structure itself is a useful inductive bias.
- AI-Generated Classical Poetry Detection Still Lags Human Judgment — AI detection in literary domains is far from solved.