iter.ca update #7
Hi! I haven’t written as much this month; I had a bit of a post-Inkhaven crash. The only full-length blog post I wrote was Some observations about NLA explanations. I got pretty excited about Natural Language Autoencoders (which are like Activation Oracles but trained in an unsupervised way). I was in Toronto for most of May and it was nice to finally be able to spend a while there again.
NLA stuff I did
I did a lot of experiments with NLAs this month using the open-weight models published by Anthropic. Here's some of the stuff I did related to that in May.
Can we make good verbalizations with just the input text (no activation)?
One of the first things I tried was to see if I could get explanations as good as the ones from an Activation Verbalizer (AV) just by prompting an LLM really well. It seemed like it would be helpful to know whether the AV was really providing value above just looking at the input directly.
Before AVs are trained with RL, there's a warm-start phase where they're fine-tuned on a bunch of explanations generated by Sonnet 4.6 from just the input text. Then in the RL stage the model sees only the activation, with no input text (although the AV does extract the approximate input text so that it can quote it).
NLAs and capitalization tokens
Anthropic doesn't publish their tokenizer, but you can infer most of how it works through the public API. The token counting endpoint tells you how many input tokens a conversation is, and the generation APIs let you limit output to a given number of tokens. The easiest way to tokenize a string with Ant's tokenizer is to tell a model "repeat 'XYZ'" with maximum output tokens of 1 to get the first token, then increase it to 2 to get the additional output (that you didn't see the first time) which is the second token, and so on.
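Here's a minimal sketch of that loop in Python. It isn't tied to any particular SDK: `generate` is a hypothetical callable that you would wrap around a real API call, returning the output text for a given token limit plus a flag for whether the model ended its turn (rather than hitting the limit). It also assumes the output is deterministic across calls. `make_fake_model` is just a stand-in so the sketch runs offline.

```python
from typing import Callable, List, Tuple

# Hypothetical signature: generate(limit) -> (output_text, model_ended_turn)
Generate = Callable[[int], Tuple[str, bool]]


def infer_tokens(generate: Generate, max_steps: int = 64) -> List[str]:
    """Recover token boundaries by diffing outputs at increasing token limits."""
    tokens: List[str] = []
    previous = ""
    for limit in range(1, max_steps + 1):
        text, finished = generate(limit)
        if not text.startswith(previous):
            # The diffing trick only works if each call extends the last one.
            raise ValueError("output changed between calls; is sampling deterministic?")
        delta = text[len(previous):]
        if finished:
            # The model ended its turn; keep any trailing text and stop.
            if delta:
                tokens.append(delta)
            break
        # An empty delta here means the extra token produced no visible text.
        tokens.append(delta)
        previous = text
    return tokens


def make_fake_model(tokens: List[str]) -> Generate:
    """Offline stand-in for an API: emits the first `limit` tokens of a fixed list."""
    def generate(limit: int) -> Tuple[str, bool]:
        text = "".join(tokens[:limit])
        finished = limit > len(tokens)
        return text, finished
    return generate


if __name__ == "__main__":
    fake = make_fake_model(["Hel", "lo", "", " wor", "ld"])
    print(infer_tokens(fake))
```

The empty string in the fake token list stands in for a token that adds no visible text, so the recovered list contains an empty entry at that position.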
If you do this you will observe that the model sometimes seems to output zero-length tokens: e.g. you get the exact same output with a maximum of 1 token and with a maximum of 2 tokens (despite the model not ending its turn). There are a few reasons this can happen (e.g. multi-token stop sequences), but one is capitalization tokens: Anthropic's tokenizer has a special token that the model can emit to capitalize the next token (and another for capitalizing the next whole word). This lets the model capitalize words and acronyms more easily. It is also what produces the apparent zero-length tokens: since the API doesn't expose tokenization internals to you, if the last token the model generates is a capitalization token, you get billed for an extra token despite not getting any extra text. This mechanism is described a bit in this post; see section 6 and footnote 8. (Claude Opus 4.7 introduces a new tokenizer but AFAICT it also has capitalization tokens.)
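To make the zero-length effect concrete, here's a toy detokenizer in Python. This is not Anthropic's tokenizer: the marker name `<cap>` and the rule "uppercase the first letter of the next token" are assumptions made up for illustration, and the whole-word variant is left out.

```python
from typing import List

CAP_NEXT = "<cap>"  # hypothetical marker name, not a real token string


def capitalize_first_letter(token: str) -> str:
    """Uppercase the first alphabetic character, skipping any leading space."""
    for i, ch in enumerate(token):
        if ch.isalpha():
            return token[:i] + ch.upper() + token[i + 1:]
    return token


def detokenize(tokens: List[str]) -> str:
    """Render tokens to text; the marker itself never appears in the output."""
    out: List[str] = []
    capitalize_next = False
    for tok in tokens:
        if tok == CAP_NEXT:
            capitalize_next = True  # affects the next token only
            continue
        if capitalize_next:
            tok = capitalize_first_letter(tok)
            capitalize_next = False
        out.append(tok)
    return "".join(out)


if __name__ == "__main__":
    tokens = ["the", " cat", ".", CAP_NEXT, " it", " sat"]
    for limit in range(1, len(tokens) + 1):
        # Truncating right after the marker adds a token but no visible text.
        print(limit, repr(detokenize(tokens[:limit])))
```

In this toy model, cutting generation off at 3 tokens and at 4 tokens gives the same visible text, because the fourth token is the marker and only shows its effect once the fifth token arrives.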
A natural question is: how does the AV talk about capitalization tokens? We can look at the AV explanations in the NLA post to get an idea. Normally AV explanations have a final paragraph dedicated to the final token in the input that starts like 'Final token "Absolutely"'. Here are final paragraphs from some explanations for capitalization tokens (I didn't see it talking about capitalization outside of that; the earlier paragraphs are about broader context and not capitalization):
- Final token "": "Capitalizing on" or "Engaging in this action" expected
- Final token "": Capitalized "Blank" followed by incomplete sentence
- Final token "": "... " followed by capital letter indicates new sentence fragment...
- Final token "": The capital letter after period signals new sentence continuation
- Final token "": "Ты написал что-то серьёзное..." or...
I'm not sure whether the quotes are actually empty or there's an invisible capitalization token in there. It's interesting that in the first one (from the Unverbalized Eval Awareness in Blackmail Scenario) it suggests continuing the text with "Capitalizing on" (which would kinda make sense here) even though the actual completion is different. I think this is some (unsubtle) steganography from the AV, which is trying to convey the concept of a capitalization token to the AR (activation reconstructor) but, instead of outright saying that, conveys it by suggesting the text will continue with "Capitalizing on".
Giving the AR the input text
Does giving the AR the input text improve reconstruction (without any retraining)? Yes! The explanation is derived from the input text and often quotes it, so it's not surprising the input text helps the AR. It's interesting that this works despite the AR having never seen it in training data.
Giving the AV the input text
I tried the reverse of the above: giving the AV the input text to help it generate better explanations. I tried a bunch of ways of doing this; all of them either hurt the AV or made no difference.
Next steps
Possible next things for me to do:
- See if I can steer model generations (e.g. make the model like the color yellow) by adding sentences to the NLA explanation.
- Try more ablation experiments where I remove various things from the explanation (e.g. specific sentences) and see how that impacts activation reconstruction.
Inference thing
Did you know that Claude models have observable differences between inference platforms? They do! (based on this).
Misc
- CVE-2026-45033, a security bug that I found, was fixed. I will write more about this soon.
- I had lunch at the Toronto downtown Ikea store a couple times; it's pretty good. The "plant balls" are basically indistinguishable from real meatballs and are a bit cheaper.
- I was at an event where we discussed if pausing AI would be good; I liked it!
Future
I'm going to be at LessOnline (I'm on a flight to SF right now!) and Manifest at Lighthaven this month.
Idk what I'm going to do after festival season (LO/MF), probably going to do some more research into meta-models like NLAs and applications for them. It seems like there's a lot of exciting potential there. Unfortunately I don't have a lot of access to compute so I probably won't be able to do much in the way of training NLAs (they're pretty expensive to train).
Wanna chat? You can talk to me on Discord as @moreloops. (You might be able to reply to this email to respond, I still have no idea if that works or not.)