AI Research Brief

April 14, 2026

SFT Convergence Hides Failures, Attention Hijacking Hits 94%

  • SFT loss convergence doesn't mean the model learned everything. Five systematic failure modes reproduced across three model families show that aggregate metrics can hide persistently unlearned subsets.
  • Reward models don't need CoT reasoning for every score. E-GRM uses generation consistency to estimate uncertainty, skips deep reasoning on easy samples, and improves accuracy while cutting cost.
  • Coding agent rankings reshuffle under credit budgets. Frontier agents can't find the optimal accuracy-cost tradeoff under resource constraints, and their behavior is highly path-dependent.
  • Hijacking attention weights makes models "blind" to safety instructions, with a 94.4% jailbreak success rate. The attack doesn't force the model to violate rules — it prevents the model from retrieving them during generation.
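The E-GRM item above gates expensive chain-of-thought scoring on how consistently cheap samples agree. A minimal sketch of that idea, assuming hypothetical `quick_score` and `deep_score` callables (the function name, threshold, and sample count are illustrative, not the paper's actual interface):

```python
from collections import Counter

def score_with_consistency_gate(prompt, quick_score, deep_score,
                                n_samples=5, agreement_threshold=0.8):
    """Hypothetical consistency-gated reward scoring.

    quick_score: cheap sampled scorer (no chain-of-thought).
    deep_score:  expensive scorer that reasons step by step.
    """
    # Draw several cheap scores; their agreement proxies for certainty.
    samples = [quick_score(prompt) for _ in range(n_samples)]
    top_score, count = Counter(samples).most_common(1)[0]
    if count / n_samples >= agreement_threshold:
        # High consistency: trust the majority vote and skip deep reasoning.
        return top_score
    # Low consistency: this sample is hard, fall back to deep reasoning.
    return deep_score(prompt)
```

On easy samples the cheap scores agree and the deep scorer is never invoked, which is where the cost savings come from; only inconsistent (uncertain) samples pay for full reasoning.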

Also Notable

  • LLMs Score Well on ToM Benchmarks but Fail in Practice — Causal intervention methods attempt to align theory-of-mind capabilities at the internal representation level.
  • Text-to-CAD Code Generation Needs Assembly Hierarchy — Hierarchical graph representations significantly outperform direct seq2seq approaches.
  • Explicit Sentence Boundaries Beat Random Dummy Tokens — Natural language sentence structure itself is a useful inductive bias.
  • AI-Generated Classical Poetry Detection Still Lags Human Judgment — AI detection in literary domains is far from solved.
