Weekly links, March 5, 2026
A lot has gone on with Anthropic, their new responsible scaling policy, the DoD, etc.
No links about that this week - instead:
https://alignment.anthropic.com/2026/psm/: Deep explanation of the Persona Selection Model, written in part by Chris Olah (a cofounder of Anthropic).
https://ampcode.com/notes/feedback-loopable: This is super cool. Highly recommend.
https://bounded-regret.ghost.io/oversight-assiturning-compute-into-understanding/: There are many oversight tasks where discovery is hard but verifying a discovery is (relatively) cheap, such as finding bugs in code or spotting reward hacking in RL environments. Can we leverage that asymmetry to train highly powerful "oversight assistants", models which are superhuman at helping us oversee other models?
And I made this: