iter.ca update #8
Hi subscribers to my email list!
Life updates
I went to LessOnline and Manifest and had a lot of fun. It was really great seeing so many people there! I’m pretty tired from all of that and still catching up on sleep though.
Research
I did some research about natural language autoencoders, which are really cool!
Activation diff oracles
It would be pretty cool if we had a model that could verbalize the difference between two activations at different points in the residual stream, since this would let us see what each layer is doing. You would also be able to see what a single attention head does by looking at the diff from just that head. The big problem with this is training it; you don’t have supervised labels because you haven’t already solved interp. Some ideas for supervised tasks (either for directly training an AO or for warm-starting an NLA):
NLAs work pretty well on layers close to the one they were trained on. So if you have an NLA trained on layer 20, have it generate explanations for layers 18 and 22 then have an LLM write some text about the difference between these two explanations.
I tried doing this, but the generated diff explanations ended up being pretty bad and only really talked about spurious differences that seemed to come from the AV happening to include slightly different things in the activation. Ideas on how to make this work:
I realize as I write this that I was probably sampling at temperature 1; using greedy sampling would probably reduce variance somewhat.
An NLA trained on layer 20 is good at understanding layer 22 activations because layer 22 activations are pretty similar to layer 20 ones. But the difference between the structure of layer 22 and layer 20 activations is exactly the thing our layer 20 NLA doesn’t understand. So while NLAs can generalize across layers, they aren’t useful here: the thing they fail at is exactly the thing we need them for. Maybe I could fix this by RLing the layer 20 transformer to be able to speak layer 22 for a bit?
Maybe I could train a diff oracle by having it learn on a bunch of supervised tasks and hoping it generalizes (like AOs)? (And then maybe extend it to being an autoencoder by asking it for explanations only, adding AR, and doing RL to reduce reconstruction error?) Task ideas:
What tokens did the attention heads attend to most between these two activations?
What tokens does this make more/less likely (using logit lens)?
How would the output be different if this layer was skipped?
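As a rough illustration of the logit lens task above, here is a minimal sketch of how labels for "what tokens does this make more/less likely" could be computed from two activations. Everything in it is an assumption for illustration: the function names, the toy unembedding matrix and vocabulary, and the choice to skip the final layer norm that a real model would apply before unembedding.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over a list of floats.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def logit_lens(activation, unembed):
    # unembed has one row per vocab token, each of length d_model.
    # Assumption: the final layer norm is skipped for simplicity.
    logits = [sum(w * a for w, a in zip(row, activation)) for row in unembed]
    return log_softmax(logits)

def logit_lens_diff(act_before, act_after, unembed, vocab, k=2):
    # Change in log-probability of each token between the two activations.
    before = logit_lens(act_before, unembed)
    after = logit_lens(act_after, unembed)
    deltas = sorted(
        ((a - b, tok) for a, b, tok in zip(after, before, vocab)),
        reverse=True,
    )
    more_likely = [tok for _, tok in deltas[:k]]
    less_likely = [tok for _, tok in deltas[-k:][::-1]]  # biggest drop first
    return more_likely, less_likely

# Toy example with a made-up 3-token vocabulary and d_model = 2.
toy_vocab = ["yellow", "blue", "the"]
toy_unembed = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
print(logit_lens_diff([1.0, 0.0], [0.0, 1.0], toy_unembed, toy_vocab, k=1))
```

The two returned token lists could then be templated into a text answer and used as a supervised target for the diff oracle.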
NLA steering
If NLAs represent the model’s thoughts, you should be able to modify them to change how the model acts. The NLA paper has a few cases where this is done to modify very specific thoughts, but can we extend this to steer the model in a broad direction by verbalizing an activation at every token, rewriting the thoughts, then reconstructing? I tried this and unfortunately just adding a sentence like “the assistant likes the color yellow” to every paragraph in the NLA explanation doesn’t make the model like yellow more; probably this is because this kind of steering is pretty out-of-distribution for the AR.
What does work is giving Claude the explanation and telling it to rewrite it; this makes the generated text weird but heavily steered (to the extent that it often makes the model only care about the color yellow and ignore the prompt). This is a kinda unfair experiment though, because Claude can infer most of the immediately preceding tokens from quotes in the explanation, choose a reasonable continuation (based on the steering instructions) itself, and rewrite the explanation to push it towards that continuation.
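The verbalize, rewrite, reconstruct loop described in this section can be sketched roughly as follows. This is only a skeleton under stated assumptions: `verbalize`, `rewrite`, and `reconstruct` are hypothetical callables standing in for the AV, the edit step (sentence appending or an LLM rewrite), and the AR; none of these names come from the NLA paper or nanoNLA.

```python
from typing import Callable, List, Sequence

Activation = Sequence[float]

def steer_activations(
    activations: Sequence[Activation],
    verbalize: Callable[[Activation], str],    # hypothetical AV wrapper
    rewrite: Callable[[str], str],             # hypothetical edit step
    reconstruct: Callable[[str], Activation],  # hypothetical AR wrapper
) -> List[Activation]:
    # For every token position: activation -> text -> edited text -> activation.
    steered = []
    for act in activations:
        explanation = verbalize(act)
        steered.append(reconstruct(rewrite(explanation)))
    return steered

def append_sentence(sentence: str) -> Callable[[str], str]:
    # The simple edit: add one sentence to the end of every paragraph.
    def rewrite(explanation: str) -> str:
        paragraphs = explanation.split("\n\n")
        return "\n\n".join(f"{p} {sentence}" for p in paragraphs)
    return rewrite
```

Swapping `append_sentence(...)` for a function that sends the explanation to an LLM with steering instructions gives the second variant, with the rest of the loop unchanged.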
Misc NLA thoughts
I’m pretty confused as to why the warm start and KL term are enough to stop NLAs from steganographically encoding the input activations. They’re clearly helpful, but it’s not clear to me why they’re sufficient. It would be useful to know what pushes NLAs towards useful explanations versus steganography.
I’ve tried using nanoNLA for some experiments and it seems to work well.
In the NLA paper they multiply the injected activation by a constant (150) so that early layers don’t change the residual stream much (since the residual stream is only normalized at the input to blocks and so grows in norm throughout). I tried replacing this with a learned affine transform and this works slightly better.
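To make the two options concrete, here is a minimal sketch contrasting a fixed scale with an affine transform. It assumes an elementwise (per-dimension) weight and bias initialized to match the constant; the actual experiment may have used a different parameterization such as a full matrix, and in practice the parameters would live in a training framework and be updated by the optimizer. The names `scale_injection` and `AffineInjection` are made up for illustration.

```python
def scale_injection(activation, scale=150.0):
    # Baseline: multiply the injected activation by a fixed constant.
    return [scale * a for a in activation]

class AffineInjection:
    # Assumed variant: per-dimension weight and bias, initialized so that
    # the transform starts out identical to the fixed-scale baseline.
    def __init__(self, d_model, init_scale=150.0):
        self.weight = [init_scale] * d_model  # would be trainable
        self.bias = [0.0] * d_model           # would be trainable

    def __call__(self, activation):
        return [w * a + b for w, a, b in zip(self.weight, activation, self.bias)]

# At initialization both give the same injected vector.
act = [1.0, 2.0, 3.0]
print(scale_injection(act), AffineInjection(d_model=3)(act))
```

Initializing at the constant means training starts from the baseline behaviour and only moves away from it if that lowers the loss.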
What’s next?
I will do some more stuff with interpreting activation diffs. Once I solve the warm-starting problem, there are some really cool things I could do (e.g. creating an NLA that identifies each layer’s contribution to the final output). I might also do some things to see how faithful NLA claims are.