{"slug": "natural-language-autoencoders-produce-explanations-of-llm-activations", "title": "Natural Language Autoencoders Produce Explanations of LLM Activations", "summary": "Anthropic researchers introduced Natural Language Autoencoders (NLAs), an unsupervised method that generates natural language explanations of LLM activations by training two modules to reconstruct activations through a text bottleneck. Applied to auditing Claude Opus 4.6, NLAs diagnosed safety-relevant behaviors and detected unverbalized evaluation awareness, and NLA-equipped agents outperformed baselines on an automated auditing benchmark. The team released training code and trained NLAs for popular open models.", "body_md": "Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations\n\nAuthors\n\nKit Fraser-Taliente*, Subhash Kantamneni*‡, Euan Ong*, Dan Mossing, Christina Lu, Paul C. Bogdan, Emmanuel Ameisen, James Chen, Dzmitry Kishylau, Adam Pearce, Julius Tarng, Alex Wu, Jeff Wu, Yang Zhang, Daniel M. Ziegler, Evan Hubinger, Joshua Batson, Jack Lindsey, Samuel Zimmerman, Samuel Marks\n\nAffiliations\n\nAnthropic\n\nPublished\n\nMay 7, 2026\n\n* Equal contribution, author order alphabetical; ‡ Correspondence to subhash@anthropic.com\n\nWe introduce Natural Language Autoencoders (NLAs), an unsupervised method for generating natural language explanations of LLM activations. An NLA consists of two LLM modules: an activation verbalizer (AV) that maps an activation to a text description and an activation reconstructor (AR) that maps the description back to an activation. We jointly train the AV and AR with reinforcement learning to reconstruct residual stream activations. 
Although we optimize for activation reconstruction, the resulting NLA explanations read as plausible interpretations of model internals that, according to our quantitative evaluations, grow more informative over training.\n\nWe apply NLAs to model auditing. During our pre-deployment audit of Claude Opus 4.6, NLAs helped diagnose safety-relevant behaviors and surfaced unverbalized evaluation awareness—cases where Claude believed, but did not say, that it was being evaluated. We present these audit findings as case studies and corroborate them using independent methods. On an automated auditing benchmark requiring end-to-end investigation of an intentionally-misaligned model, NLA-equipped agents outperform baselines and can succeed even without access to the misaligned model’s training data.\n\nNLAs offer a convenient interface for interpretability, with expressive natural language explanations that we can directly read. To support further work, we release training code and trained NLAs for popular open models.\n\nIntroduction\n\nLanguage models encode their internal state as high-dimensional activation vectors. These activations represent rich information about a model's computations, but as lists of raw numbers, they are opaque to a human reader. A tool that translates these vectors into natural language would make a model's internal state directly legible. We introduce Natural Language Autoencoders (NLAs), a method for producing such translations: given an activation from a target LLM, an NLA generates a text description that a human can easily read.\n\nNLAs consist of two LLM modules: the activation verbalizer (AV) and the activation reconstructor (AR). The AV maps activations to text descriptions, which we call explanations. The AR converts these explanations back to activations. The AV and the AR are initialized as copies of the target LLM, and together form an autoencoder that reconstructs the target's activations through a natural language bottleneck. 
We jointly optimize the AV and AR to minimize reconstruction loss using reinforcement learning (RL).\n\nThis training process does not explicitly incentivize NLA explanations to be interpretable or faithful. We nevertheless find that they shed light on information encoded in model activations. On a suite of quantitative evaluations, NLA explanations grow more informative throughout training across three models (Claude Haiku 3.5, Haiku 4.5, and Opus 4.6). For instance, NLAs surface traits of a user, such as their gender or occupation, that are inferrable from context.\n\nTo build intuition for what NLA explanations look like in practice, we present a series of case studies, many drawn from our pre-deployment alignment assessment of Claude Opus 4.6. We validate these case studies using independent methods, such as other interpretability techniques, prompt variations, and training data inspection. For instance, when an early version of Opus 4.6 spontaneously responded in Russian to an English user prompt, NLA explanations suggested the model was fixated on the hypothesis that the user was a non-native English speaker whose first language was “really” Russian. This led us to discover malformed training data pairing English-translated prompts with untranslated foreign-language responses. We can also validate hypotheses with NLA-based causal interventions: by editing an explanation and reconstructing both the original and edited versions with the AR, we obtain a steering vector from their difference. In a toy setting, we use this steering vector to change Opus 4.6's perception of which responses will be rewarded. Excerpts from these case studies are shown below.\n\nThese case studies also illustrate a key limitation of NLAs: confabulation. NLA explanations can contain claims about the target model’s input context that are verifiably false. 
Analyzing these confabulations, we find that while NLA claims can be false in their specifics, they are typically thematically faithful to the context. For instance, given text referencing a historical dynasty, NLA explanations might falsely claim that a specific king from that dynasty is mentioned. Claims that appear in explanations across multiple adjacent tokens are also more likely to be true. These heuristics inform how we use NLAs in practice.\n\nA key application of NLAs is surfacing unverbalized aspects of model cognition, especially unverbalized evaluation awareness: a model internally suspecting it is being tested without explicitly saying so. We cannot validate NLA measurements of evaluation awareness against ground truth, because the model’s underlying beliefs are unobservable. However, we can construct inputs which the target model is more or less likely to believe are evaluations and test whether NLA-based measurements vary as expected. For instance, we find that NLAs surface evaluation awareness more often on evaluation transcripts than real deployment traffic, and that rewriting inputs to be more realistic reduces NLA-measured evaluation awareness. We therefore conclude that NLAs can identify unverbalized evaluation awareness.\n\nFinally, we validate that NLAs are useful in a downstream auditing task adapted from prior work. We task an auditing agent with investigating the intentionally misaligned model introduced in Marks et al., and find that NLA-equipped auditing agents more effectively discover the root cause of the model’s misalignment than baselines. Notably, and unlike prior work, these agents can succeed without access to the training data that induced the model’s misalignment, either during the investigation or while training the NLA.\n\nLimitations. NLAs have several limitations:\n\nConfabulation: While NLA explanations can be insightful, they can also be noisy. 
For instance, they sometimes include contradictory information or verifiably false claims about the context. While factual hallucinations are easy to identify, it can be challenging to determine whether more general claims about model processing are accurate or confabulated.\n\nLack of mechanistic grounding: NLAs are black boxes by construction; we cannot determine which aspects of an activation drove a given component of an explanation.\n\nExcessive expressivity: Because the AV is a full language model, it has the capacity to make additional inferences beyond what is stored in an activation.\n\nCost: NLA training requires joint RL on two full language models, and inference requires generating several hundred tokens per activation. This can make NLAs expensive to use at scale.\n\nDegenerate training objective in the limit: In principle, the AV could achieve good reconstruction by reproducing the input context verbatim, or by outputting uninterpretable (or only seemingly interpretable) text that the AR is able to invert because the AR is so expressive. While neither appears to be a significant problem in current NLAs, and partial mitigations such as KL regularization exist, it is unclear whether these pathologies will remain benign as we develop NLAs further.\n\nOverall, NLAs are a powerful complement to existing interpretability techniques. Because NLAs output natural language, they are expressive and easy to use. We find NLAs especially well-suited to auditing workflows, where they enable hypothesis generation and can surface safety-relevant cognition that models do not verbalize. To support further work, we release training code, trained NLAs for popular open models, and an interactive frontend to sample from open model NLAs via our collaboration with Neuronpedia.\n\nPaper roadmap. 
Below, we present:\n\nA survey of related work, positioning NLAs as a bridge between unsupervised concept-discovery methods (e.g., SAEs) and supervised activation-verbalization methods (e.g., activation oracles).\n\nFour case studies on Claude Opus 4.6, which illustrate the value of NLAs for interpreting model cognition, build intuition for reading their explanations, and corroborate their findings with independent methods.\n\nA discussion of why NLA training results in informative explanations, how NLAs relate to mechanistic methods, and limitations including confabulations, cost, layer sensitivity, and the possibility of unverbalizable activation content.\n\nDirections for future work, including a sketch of general-purpose activation language models that read and write between activation space and natural language.\n\nRelated work\n\nExisting methods for interpreting model activations offer either unsupervised discovery or directly readable, natural language output. NLAs are designed to provide both: unsupervised discovery from the reconstruction objective and readability from the natural-language bottleneck.\n\nUnsupervised methods for interpreting activations. The logit lens and its tuned variants project intermediate activations through a model’s unembedding matrix to obtain a distribution over vocabulary tokens. Sparse autoencoders are trained with an unsupervised reconstruction loss to decompose activations into sparse linear combinations of learned dictionary features. In either case, though, we are limited to expressing interpretations as a weighted sum of atoms from a fixed vocabulary (tokens or dictionary features). Moreover, SAE features can have unpredictable coverage gaps, and require a sometimes-difficult interpretative step, either by a human or a model analyzing top-activating examples.\n\nNatural language explanations of activations. By contrast, some recent work trains language models to describe activations in free text. 
Off-the-shelf models have some capacity for this: Lindsey finds that models can sometimes report the content of injected steering vectors, while Chen et al. and Ghandeharioun et al. both elicit interpretations by patching activations into a prompting template. But supervised fine-tuning is substantially more effective: Pan et al., Costarelli et al., and Choi et al. train models to answer questions about activations whose answers are known from the source context (e.g., a system prompt). Karvonen et al. call such models activation oracles (AOs), and show that pretraining on a context-reconstruction objective improves their downstream QA performance. Huang et al. route the activation through a learned sparse concept bottleneck, forcing QA answers to be mechanistically grounded. Related approaches train models to explain other aspects of a target model given labeled examples, e.g., predicting the behavior of a LoRA fine-tune or the label of an SAE feature. But these supervised methods share a core limitation: they can only be trained on data where researchers can somehow obtain ground-truth information about what’s encoded in activations, which necessarily imposes a narrow training distribution and greater reliance on generalization.\n\nReconstructing activations and weights from text. Our NLA architecture includes an activation reconstructor: a map from text back to activation space. The closest precedent is HyperSteer, which trains an LLM with a projection head to map natural language prompts to residual-stream steering vectors. Related text-to-component models target soft prompts, LoRAs, and patching interventions.\n\nConcurrently with this work, Chalnev independently arrived at a closely related approach (Cycle-Consistent Activation Oracles): a verbalizer-reconstructor pair with a supervised warm-start, trained using RL for activation reconstruction under a KL penalty. 
We became aware of this work during preparation of this manuscript; the present paper differs in several implementation choices, develops the method at frontier scale, and evaluates it as an auditing tool.\n\nMethod\n\nSuppose we have a target LLM M whose layer l activations h_l \\in \\mathbb{R}^{d_\\text{model}} we would like to interpret. We wish to produce an explanation of h_l: in other words, a representation of h_l as natural language text, from which we can approximately recover h_l. To do this, we train a natural language autoencoder, consisting of two parameterized models:\n\nAn activation verbalizer AV(z \\mid h_l), which takes an activation h_l as input and generates an explanation z.\n\nAn activation reconstructor AR(z), which takes z as input and produces a reconstruction \\hat{h}_l \\in \\mathbb{R}^{d_\\text{model}}.\n\nThe AV and the AR are jointly trained to minimize the reconstruction error:\n\n\\mathcal{L} = \\mathbb{E}_{h_l \\sim \\mathcal{H}}\\,\\mathbb{E}_{z \\sim AV(\\cdot \\mid h_l)}\\,\\|h_l - AR(z)\\|_2^2,\n\nwhere \\mathcal{H} is the distribution produced by extracting layer l activations from M on a corpus of text.\n\nWe report reconstruction quality as the fraction of variance explained: \\text{FVE} = 1 - \\frac{\\mathcal{L}}{\\mathbb{E}_{h_l \\sim \\mathcal{H}}\\,\\|h_l - \\bar{h}_l\\|_2^2}. An FVE of 0 corresponds to predicting the mean activation \\bar{h}_l, while an FVE of 1 is perfect reconstruction.\n\nNote that nothing in this objective constrains the NLA explanation z to be human-readable, or even to bear any semantic relation to the content of h_l. Nevertheless, we find that an appropriate initialization of the AV and AR (alongside a KL-divergence penalty) suffices to keep the AV's explanations human-interpretable. Throughout, all activations h_l are normalized to unit L_2-norm for stability, and l is a middle-to-late layer in the target model. For more details on NLA training and inference, see the Appendix.\n\nArchitecture\n\nThe AV is an LLM with the same architecture as M. 
It is given a fixed prompt, containing both instructions to verbalize the contents of an activation and a special token for the activation itself. Given an activation h_l, we scale it by a fixed constant factor (see the Appendix), and insert it in place of the special token's embedding. We then autoregressively sample from this model at temperature T=1 to obtain an explanation z.\n\nThe AR is an LLM with the same architecture as M, but truncated to its first l layers. To reconstruct an activation from an explanation z, we wrap z in a fixed prompt, pass it through the model, then apply a learned affine map to the layer-l activations at the final token to obtain the reconstruction \\hat{h}_l.\n\nInitializing the AV and AR\n\nWe find that simply initializing the AV and AR as copies of M leads to unstable training: the AV in particular, having never encountered a layer-l activation as a token embedding, outputs nonsensical explanations. We therefore initialize the AV and AR with supervised fine-tuning on a text-summarization proxy task. Specifically, we compute layer-l activations h_l from the final token of randomly truncated pretraining-like text snippets, and use Claude Opus 4.5 to generate summaries s of the text up to that token (see the Appendix for details of this procedure). We then fine-tune the AV and AR on (h_l, s) and (s, h_l) pairs respectively. This warm-start typically yields an FVE of around 0.3–0.4. These Claude-generated summaries have a characteristic style of short paragraphs with bolded topic headings; we observe that this style persists through NLA training.\n\nNLA training\n\nGiven an initialized AV AV_\\phi and AR AR_\\theta, we train to minimize the reconstruction error defined previously,\n\n\\mathcal{L}(\\phi, \\theta) = \\mathbb{E}_{h_l \\sim \\mathcal{H}}\\,\\mathbb{E}_{z \\sim AV_\\phi(\\cdot \\mid h_l)}\\,\\|h_l - AR_\\theta(z)\\|_2^2,\n\nover activations collected from a corpus of pretraining-like text. Each training step, we draw a batch of activations from \\mathcal{H}, sample descriptions from AV_\\phi, and update \\phi, \\theta to minimize \\mathcal{L}. 
We discuss these optimization problems in more detail below.\n\nAR update. The AR's parameters \\theta only enter \\mathcal{L} through AR_\\theta(z), so the gradient is:\n\n\\nabla_\\theta \\mathcal{L} = \\mathbb{E}_{h_l \\sim \\mathcal{H}}\\,\\mathbb{E}_{z \\sim AV_\\phi(\\cdot \\mid h_l)}\\,\\nabla_\\theta \\|h_l - AR_\\theta(z)\\|_2^2.\n\nThis is one step of supervised regression with MSE loss: we take a single gradient step on the reconstructor using our sampled descriptions z as inputs and the original activations h_l as targets.\n\nAV update. The AV's parameters \\phi enter \\mathcal{L} only through the sampling distribution AV_\\phi(\\cdot \\mid h_l). Thus, minimizing \\mathcal{L} in \\phi is precisely the reinforcement learning (RL) problem of maximizing the reward r(h_l, z) = -\\|h_l - AR_\\theta(z)\\|_2^2 of the policy AV_\\phi. For our open model NLAs, we use GRPO: for each h_l, we sample a group of candidate descriptions z, assign each a reward of r(h_l, z), and apply the GRPO objective.\n\nIn practice, both updates are taken simultaneously on each batch: we draw from \\mathcal{H}, sample descriptions from AV_\\phi, and apply the AR regression step and the AV RL step in parallel. Note that the AV and AR updates are not coupled: the AR update does not backpropagate into \\phi, and the AV's reward treats AR_\\theta as a fixed scorer within the step.\n\nTraining step. Each iteration is therefore as follows:\n\n(i) Sample a batch of activations h_l, and for each activation, generate a group of explanations at temperature T=1, with each explanation z \\sim AV_\\phi(\\cdot \\mid h_l).\n\n(ii) Update the AR by one step of gradient descent on the regression loss \\|h_l - AR_\\theta(z)\\|_2^2.\n\n(iii) Update the AV by one step of RL with reward r(h_l, z).\n\nReward shaping and regularization. 
In practice, we make two modifications to the AV update in step (iii):\n\nWe apply a monotonic transformation to the reward, using r(h_l, z) = -\\log \\|h_l - AR_\\theta(z)\\|^2_2, although we do not consider this strictly necessary (see the Appendix for more details).\n\nWe add a KL penalty \\beta D_{\\text{KL}}(AV_\\phi \\parallel AV_{\\phi_{\\text{init}}}) toward the AV's initialization, which we find preserves the fluency of explanations through training.\n\nWe find that FVE grows roughly linearly in log(training steps), with the NLAs evaluated in this paper reaching 0.6–0.8 FVE. We also observe that NLA explanations become more informative during training (see Quantitative Evaluations for more details).\n\nCase Studies\n\nIn this section, we present a series of case studies applying NLAs to understand LLM activations. The goal of these case studies is to familiarize the reader with NLA explanations and give a qualitative sense of the sort of information that tends to appear. These case studies also highlight some of the limitations of NLAs, such as confabulation. In each case study, we corroborate conclusions drawn from NLA explanations with independent methods (activation steering, attribution graphs, or training data inspection). This section thus also provides initial evidence for the validity of NLA explanations, though later sections study this more systematically.\n\nMost experiments in this section use Claude Opus 4.6. The Language Switching and Misreported Tool Calls case studies were initially investigated during the Opus 4.6 pre-deployment audit. We also include a result from the Claude Mythos Preview pre-deployment audit, in which we observed unverbalized reasoning about grading in training transcripts (see Reasoning about Rewards).\n\nPlanning in Poetry\n\nAs an introduction to NLAs, we revisit Lindsey et al.'s finding that Claude Haiku 3.5 plans ahead when writing the poem below, replicating their analysis on Claude Opus 4.6. 
Consider the following input prompt.\n\nLindsey et al. found that at the end of the couplet’s first line (“grab it”), Haiku 3.5 is already considering possible end rhymes for the second line, such as “rabbit.” We apply NLAs to the same prompt on Opus 4.6 to see if there is a similar indication that Opus 4.6 plans its response. The multi-page viewer below is a walkthrough of NLA analysis on this transcript. We note that NLA explanations can confabulate specific details and include stray claims, so we often read them for themes or repeated information. (We return to the topic of confabulations in the Misreported Tool Calls case study and discuss them in more depth in Characterizing NLA confabulations.)\n\nThese NLA explanations suggest that, on the newline token, Opus 4.6 represents a plan to end the couplet with \"rabbit.\" Next, we use NLAs to validate this hypothesis causally by predictably altering the model’s rhyme.\n\nWe can make a targeted edit to the NLA explanation, convert the modified explanation into an activation via the AR, and steer the target model with the resulting direction. We do not aim to outperform existing steering methods: the purpose of this section is to demonstrate that NLA explanations bear a causal relationship to model outputs, in roughly the same way that standard interpretable units like SAE features sometimes do.\n\nConcretely, we edit the NLA explanation at the newline following \"grab it\", rewriting every element related to the rabbit rhyme to its mouse equivalent: \"rabbit\"→\"mouse,\" \"habit\"→\"house,\" and \"carrots\"→\"cheese.\" We pass both the original and edited NLA explanations through the AR to obtain \\tilde{h}_\\text{orig} = \\text{AR}(\\text{AV}_\\text{orig}) and \\tilde{h}_\\text{edit} = \\text{AR}(\\text{AV}_\\text{edit}) and take the difference \\Delta = \\tilde{h}_\\text{edit} - \\tilde{h}_\\text{orig} as the edit direction. 
We then steer with this direction at the newline token only, at the layer the NLA is trained on. Specifically, we add \\Delta to the original residual stream h_\\text{orig} after rescaling it to norm \\alpha\\,\\|h_\\text{orig}\\| for some \\alpha >0: h_\\text{orig} \\;\\rightarrow\\; h_\\text{orig} + \\alpha\\,\\|h_\\text{orig}\\|\\,\\frac{\\Delta}{\\|\\Delta\\|}.\n\nWe find that at a sufficient steering strength \\alpha, the model no longer completes the rhyme with \"rabbit\", and \"mouse\" and \"house\" become the most common completions with roughly equal frequency (details discussed in Appendix). This provides causal evidence that the planning representation surfaced by the NLA influences the model's downstream output and supports the validity of the NLA explanations.\n\nWe note that steering is only successful roughly 50% of the time and the completions are not always clean. In some samples the model produces odd outputs such as \"that of a mouse. (Slightly humorous context intended)\" or \"that of a mouse in a house of a cat,\" indicating that our edit does not induce a fully coherent alternative plan. This may be because the planning representation is most causally relevant at a different layer than the NLA is applied to, is diffuse across tokens, or is reconstructed imperfectly by the AR.\n\nLanguage Switching\n\nReading NLA explanations can suggest general hypotheses for how model behaviors arise. For instance, early training checkpoints of Opus 4.6 sometimes spontaneously responded in foreign languages (e.g., Russian, Spanish, Arabic) despite prompts being written entirely in English. 
NLA explanations suggest that in these transcripts, Opus 4.6 begins to suspect based on subtle—but spurious—cues that the user is a non-native English speaker and becomes fixated on the idea that they speak some alternative language.\n\nContent warning: the transcript below involves suicide and self-harm.", "url": "https://wpnews.pro/news/natural-language-autoencoders-produce-explanations-of-llm-activations", "canonical_source": "https://transformer-circuits.pub/2026/nla/", "published_at": "2026-06-15 12:55:16+00:00", "updated_at": "2026-06-15 13:08:29.328645+00:00", "lang": "en", "topics": ["large-language-models", "ai-safety", "ai-research", "natural-language-processing"], "entities": ["Anthropic", "Claude Opus 4.6", "Claude Haiku 3.5", "Haiku 4.5", "Kit Fraser-Taliente", "Subhash Kantamneni", "Euan Ong", "Dan Mossing"], "alternates": {"html": "https://wpnews.pro/news/natural-language-autoencoders-produce-explanations-of-llm-activations", "markdown": "https://wpnews.pro/news/natural-language-autoencoders-produce-explanations-of-llm-activations.md", "text": "https://wpnews.pro/news/natural-language-autoencoders-produce-explanations-of-llm-activations.txt", "jsonld": "https://wpnews.pro/news/natural-language-autoencoders-produce-explanations-of-llm-activations.jsonld"}}