NLA explanations can be shortened without harming reconstruction

Researchers trained Qwen3-8B natural language autoencoders with varying length penalties and found that explanation length can be significantly reduced without harming reconstruction fidelity, suggesting many tokens in standard explanations are not essential for faithful reconstruction.

Natural language autoencoders (https://transformer-circuits.pub/2026/nla/) are a really cool, mostly unsupervised method for producing free-form text explanations of LLM activations. You should read that paper, or the blog post about them (https://www.anthropic.com/research/natural-language-autoencoders), before reading this.

I trained several Qwen3-8B NLAs with different length penalties: during RL, I subtracted the token count multiplied by the length penalty hyperparameter λ from the RL reward.

Being able to reduce the length so much without impacting FVE (fraction of variance explained) is interesting because it could mean that large parts of NLA explanations aren't actually useful for reconstructing the input activation faithfully. Some of the reduction comes from the length penalty making the model use terser wording to convey the same ideas; I'm not sure how much of these results stem from terser wording vs. omitting unneeded information.

Larger λ values cause the FVE to go below that of the warm-started pre-RL model, which makes sense: AVs (activation verbalizers) trained with high λ values have many fewer tokens to work with, so they have less room than the warm-started model. It's interesting that they're still pretty good relative to λ=0, though.

There are two main reasons the AV writes explanations that are much longer than they need to be with standard NLAs:

The optimal explanation length probably varies a lot by model. Bigger models think about more stuff and might have less unused "slack" in their activation explanations. It would be pretty interesting to see what the curve from my graph above would look like with multiple models, across more penalties, and with more RL steps.
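
To make the length penalty and the FVE metric discussed above concrete, here is a minimal sketch. The function names are hypothetical, `base_reward` is a stand-in for whatever scalar reward the RL setup produces before the penalty, and the FVE formula is the common "one minus residual variance over total variance" definition, which may differ in detail from what the NLA training code computes.

```python
import numpy as np


def fraction_of_variance_explained(acts, recon):
    """Common FVE definition (an assumption, not taken from the NLA code):
    1 - (squared reconstruction error) / (variance around the per-dimension mean).
    `acts` and `recon` are arrays of shape (num_samples, hidden_dim)."""
    acts = np.asarray(acts, dtype=np.float64)
    recon = np.asarray(recon, dtype=np.float64)
    residual = np.sum((acts - recon) ** 2)
    total = np.sum((acts - acts.mean(axis=0, keepdims=True)) ** 2)
    return 1.0 - residual / total


def length_penalized_reward(base_reward, num_tokens, lam):
    """Subtract lam * (explanation token count) from the reward, as described in the post.
    `base_reward` is a hypothetical placeholder for the unpenalized RL reward."""
    return base_reward - lam * num_tokens
```

Under this sketch, a 100-token explanation costs 3.0 reward at λ=0.03 but only 0.1 at λ=0.001, so how hard the penalty bites depends on the scale of the base reward.
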

Note that I did 250 RL steps for all of those penalties, but the RL runs for the higher penalties completed faster, because a large part of the time spent in RL is AV explanation generation, which is faster when you're generating fewer tokens. It might have been fairer to give each RL run the same amount of time rather than the same number of RL steps.

Here's a widget I vibed up to let you see some explanations from the NLAs I trained.

I think it's pretty interesting to look at a few of these. The really high length penalty NLAs mostly end up only including the most important bits of the explanation. The highest length penalty one (λ=0.03) usually only includes a few words that repeat the tail of the input text (or just the final token), and occasionally it gives an empty explanation. This makes sense: the input text itself is very useful for reconstruction, and if you only have a few tokens of explanation, just repeating the final few seems pretty useful.

It's important to note that while NLA training encourages better reconstruction of the activation as a whole, it doesn't necessarily encourage reconstructing the bits that we humans care about the most. Repeating the final token is not interesting for interpretability (we already know that), but it is very useful for reconstructing activations. It's pretty plausible that things like eval awareness only make up a fairly small part of activation space but take up a larger part of the space of things I care about.

This might mean that a large portion of what AVs say isn't actually useful for reconstructing activations, but is instead just there to help satisfy the KL penalty and because there's not a strong pressure to omit useless bits.

It might also be interesting to look more systematically at what kinds of things get dropped with more severe length penalties. It would also be interesting to check more systematically whether steganography is happening here (like they do in the NLA paper).
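
One rough way to quantify the tail-repetition pattern described above is to measure how much of an explanation is copied from the end of the input text. This is only a sketch: the function name and the `tail_words` parameter are hypothetical, and it splits on whitespace rather than using the model's tokenizer, so it only approximates token-level overlap.

```python
def tail_overlap_fraction(input_text, explanation, tail_words=5):
    """Fraction of explanation words that also appear among the last
    `tail_words` whitespace-separated words of the input text.
    Returns 0.0 for an empty explanation."""
    tail = {w.lower() for w in input_text.split()[-tail_words:]}
    words = [w.lower() for w in explanation.split()]
    if not words:
        return 0.0
    return sum(w in tail for w in words) / len(words)
```

Averaging this over many samples for each λ would show whether the high-penalty NLAs really do lean more on repeating the input tail than the low-penalty ones. Empty explanations score 0.0 here, so it may be worth counting them separately.
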

The explanation length didn't plateau during the 250 steps of RL I ran for λ=0.001 or λ=0.002; with more training you could probably get even more token-efficient explanations. It would be interesting to see how the length penalty vs. FVE tradeoff changes with more RL.

Continuing training of the open weight NLAs

Originally, instead of training my own NLAs from scratch, I experimented with continuing RL training on the open weight NLAs from Anthropic using the original open source training code (https://github.com/kitft/natural language autoencoders), and in some smaller tests I got results fairly similar to what I described above for two penalty values. Because the length-penalized NLAs started from the same base NLA, the ones in the samples I looked at tended to write explanations that looked like cut-down versions of the explanations from the base NLA. I switched to nanoNLA and training my NLAs from scratch because it was simpler and avoided a bunch of CUDA and infra problems.

I put all of the models and data I used to create this post on Hugging Face (https://huggingface.co/collections/syvb/nla-length-penalty).

It might be good to have a smallish length penalty when training NLAs to try to get them to be more faithful and avoid spurious claims. It would also be interesting to try this with larger models and more penalty values.

Thanks to Celeste (https://celeste.computer/) and @jim (https://jimfund.com/) for giving me feedback on this research.

I used Celeste's nanoNLA (https://github.com/ceselder/nanoNLA) instead of the original NLA code because it made the infra simpler to manage. nanoNLA isn't an exact reimplementation of the original, but I don't think any of the changes would affect my results much.

The length penalty doesn't change the warm-start SFT phase of training an NLA, so I reused the same warm-started NLA for everything.