{"slug": "building-better-activation-oracles", "title": "Building Better Activation Oracles", "summary": "Researchers have improved Activation Oracles (AOs), fine-tuned LLMs that answer natural language questions about a target model's internal activations, by training on on-policy rollouts, using a higher-quality conversational dataset, feeding multiple layers, and modifying the injection formula. The team also released AObench, an open-source evaluation suite they call the most comprehensive assessment of AO quality to date, and are hosting the new AOs live for one week at ao.celeste.computer. While capability gains are marginal, the updates substantially improve usability, addressing key issues that previously limited AOs as an off-the-shelf interpretability tool.", "body_md": "*Work done for our MATS 10.0 Sprint project - mentored by Neel Nanda and Adam Karvonen*\n\n**TL;DR:** We have improved the [original Activation Oracle (AO)](https://arxiv.org/pdf/2512.15674) training regime by training on on-policy rollouts, improving the conversational dataset, feeding more layers (following [the approach by Niclas Luick](https://github.com/niclas-luick/multi-layer-activation-oracles)) and making a small change to the injection formula. We also open-source our evaluation suite, AObench, which we believe is currently the most comprehensive evaluation of AO quality. The capability improvements are marginal, but the quality-of-life improvements are substantial. If you want to play around with the new AOs, we recommend [this one](https://huggingface.co/ceselder/qwen3-8b-ao-v3-best). **If you want to try our new Activation Oracles live, we will host them for a week on** [ao.celeste.computer](https://ao.celeste.computer). 
Alternatively, you can self-host our\n\nActivation Oracles (AOs) by [Karvonen et al.](https://arxiv.org/abs/2512.15674) are fine-tuned LLMs that can receive the original *target* LLM’s activations as input and answer natural language questions about them. However, they are plagued by [various issues](https://www.lesswrong.com/posts/LXQBcztrWKhtcgQfJ/current-activation-oracles-are-hard-to-use), which limit their usefulness as an off-the-shelf tool for interpretability research. For our MATS Sprint, we set out to work on these issues.\n\nIn [Current activation oracles are hard to use](https://www.lesswrong.com/posts/LXQBcztrWKhtcgQfJ/current-activation-oracles-are-hard-to-use), Arya Jakkli demonstrates scenarios where AOs are hard to work with. We focus on addressing two of the issues pointed out:\n\nIn addition, they are difficult to evaluate because of the problem of **text inversion**: the model infers the surrounding text and answers based on that, just as any black box oracle (i.e. a method that only receives text) could, rather than extracting specific info from activations. As part of our evaluations, we focus on some of his specific tasks; details can be found in the Appendix.\n\nFor an Activation Oracle to answer natural language questions, it needs to be trained on a dataset of questions and answers about activations. The original paper used [LatentQA](https://arxiv.org/pdf/2412.08686) to this end. However, we found that this dataset was of low quality, likely incentivizing **vagueness:**\n\nWe constructed a new conversational dataset that attempts to address all of these concerns. 
Because we don’t want the learned questions to be trivially answerable from adjacent tokens (text inversion), we construct QA pairs as follows:\n\nA separate LLM (Sonnet 4.6) is given the target model’s chain-of-thought (CoT) and is instructed to split it into a prefix and a suffix, *and* to write a question about the suffix. It is instructed to do this such that the question is hard to answer purely from the text of the prefix (i.e. to avoid **text inversion**), but plausibly answerable from the prefix’s activations (**solvability**).\n\nYou can explore our dataset [here](https://huggingface.co/datasets/ceselder/cot-oracle-convqa-chunked-sonnet).\n\nWe ablate the effect of this task by replacing only LatentQA in Adam's recipe (leaving everything else the same) and notice a significant uplift across the board on our AObench evaluations. We find that the responses are more specific, and that the resulting model is less vague and responds better to instructions.\n\nAs Niclas Luick [demonstrated](https://www.lesswrong.com/posts/Fume3A6SpZqajuH2b/reading-between-the-layers-multi-layer-processing-improves), feeding multiple layers at a time during training and inference increases Activation Oracle capability.\n\nAdam originally fed activations from a layer chosen at random among those at 25%, 50%, or 75% of total model depth. Since most features live around the 55-80% layer ranges, we suspected a layer sweep could be important. Indeed, we find that AO performance peaks at layer 22 (62%). Feeding 5 contiguous layers (layers 21-25) causes further uplift. Interestingly, the largest uplift is on model diffing tasks. We’d like to point out that training a multi-layer Activation Oracle can increase training time due to longer context, and that **most gains can be had by simply choosing a layer at ~65% depth.**\n\nTo train Activation Oracles, we need scalable unsupervised training tasks. 
A common way to achieve this is to predict past and/or future tokens from the activations, known as past-lens or future-lens. This requires a text source from which to extract activations and then predict tokens.\n\nAdam’s original paper only used pre-training data (fineweb). However, this has a problem: to predict future tokens in pre-training data, you don't necessarily need to know much about what the model is thinking, just what the prior text is. The model’s activations may contain useful information, so the training signal is not zero, but the task is considerably harder for the AO.\n\nWe think that on-policy data (i.e., generations from the model we are trying to interpret) is better training data for two reasons. First, it makes for a more *solvable* task, because it targets what the model is actually representing in its activations. Second, in practice we will mostly care about using the AO on a model in an on-policy setting, e.g. for studying agent traces. While the above explanation is plausible, we only notice minor uplift in evaluations.\n\nNatural Language Autoencoders (NLAs) inject their activations by **replacing the token embedding entirely and using a fixed scalar.** We use additive, norm-matched injection after the second transformer layer, as in Adam’s paper. We do not have a formal ablation, but on Qwen3-8B, every run that used NLA-style injection performed **significantly worse** than Adam’s formula.\n\nThe NLA authors sweep their injection strength and claim that it is a quite sensitive hyperparameter. We did the same starting from Adam’s formula, and found that increasing the injection strength marginally increases performance. 
This difference may look small, and indeed it is, but on the hallucination evaluation, which is particularly important, it *is* considerable (79% -> 85%), so we do recommend using this.\n\nOur hypothesis for why Adam-style injection does better than NLA-style injection is that the first residual stream layer has a very small cosine similarity to previous layers, a property unique to the first layer. After the first layer, cosine similarities remain pretty similar layer to layer. Because of this, it seems sensible that injecting after the second layer, when the residual stream lies in the “correct basis”, would work better. The reason a stronger injection strength might do better is that language models have a strong prior to weight tokens roughly equally, and it’s rare that one token is load-bearing for the entire explanation. Language model priors can be hard to overcome, so manually enforcing a stronger norm for the activation can help.\n\n*We are slightly less certain about this intervention than the others, but this is a hyperparameter that matters, and 2.0x is a better default.*\n\nThe evaluations we constructed, which we call **AObench**, **aim to measure what an ideal Activation Oracle should be able to do**. This benchmark is a work in progress, but we recommend you start from there when making a new activation oracle.\n\nIt evaluates the main frustrations in [Arya’s blogpost](https://www.lesswrong.com/posts/LXQBcztrWKhtcgQfJ/current-activation-oracles-are-hard-to-use) and some of the model organisms from the [original paper](https://arxiv.org/pdf/2512.15674).\n\nWe find that the above changes result in an AO with marginally improved capability, marginally reduced hallucination rates, and significantly reduced vagueness, which generally scores better on the majority of our benchmarks. 
The full evaluation results can be found in the Appendix, and the exact data and prompts used can be found in our repository.\n\nWe find a significant improvement in performance through our interventions:\n\nIn addition, our new AOs hallucinate less\n\nand are significantly less vague\n\n*Hallucination* evaluates whether the AO invents specific but unsupported details about the model’s reasoning. *Vagueness* evaluates whether the AO will commit to a precise answer instead of something that is hard to falsify. The AO’s output is judged with respect to these criteria.\n\nNote that we have some uncertainty about our evals. In particular, the on-policy future/past-lens ablation is not entirely clean, because the new conversational data *also* uses on-policy data.\n\nMore information on AObench can be found in the Appendix.\n\nAfter spending several weeks working on AOs, we believe they are a useful interpretability technique for specific use cases: they are best used for complex open-ended questions about activations. AOs/NLAs might be particularly useful to interpret latent reasoning models, and when there is complex computation already happening in a single forward pass.\n\nHowever, even with our improvements, clear limitations remain. First, their outputs may still be hallucinated, though this generally improves with the number of activations supplied, and uncertainty can be estimated by resampling (see Appendix). Second, in many settings (but not all) it is possible to just read the chain-of-thought directly and arrive at the same insight as the AO.\n\nStill, we think there may be significant room for improvement by scaling up the conversational data we used, both in amount and kind. A second route is to include more narrow tasks in a “post-training” stage, though we did not find improvements at the current level of capability of our AO. 
Another exciting path forward is to come up with more evaluations that target something an ideal AO could plausibly do, while being robust to text inversion concerns. If such tasks are scalable, they can be used for training as well.\n\nWe think of AOs as part of the family of [scalable, end-to-end interpretability](https://www.alignmentforum.org/posts/qkhwh4AdG7kXgELCD/scalable-end-to-end-interpretability). Very recently, [Natural Language Autoencoders (NLAs)](https://transformer-circuits.pub/2026/nla/index.html#using-nlas-for-supervised-activation-oracle-training) have been proposed by Anthropic as an exciting technique to verbalize activations. In contrast to AOs, NLAs are unsupervised, auto-encoding activations-to-activations across a natural-language bottleneck, which seems like a more faithful way to convert activations to natural language. The NLA paper trains their AV (the encoder part, act -> text) on LatentQA to turn it into an AO. Due to the aforementioned issues with LatentQA, we believe this is hampering the AV’s performance. We suspect our other two interventions, using multiple layers and training on on-policy data, are also applicable to NLAs. On the other hand, one might use NLAs as a source of ground truth to augment AO training.\n\nWe remain excited to pursue further research in the field.\n\nThese were not the only things we tried during our sprint. Our initial impression was that we could improve AOs by training on narrow tasks. Specifically, we singled out the tasks from Riya Tiagi and Daria Ivanova’s [Test your best methods on our hard CoT interp tasks](https://www.lesswrong.com/posts/tDJWZLQNN7poqCwKa/test-your-best-methods-on-our-hard-cot-interp-tasks) (datasets can be found [here](https://huggingface.co/collections/mats-10-sprint-cs-jb/cleaned-datasets)), but had limited success, probably due to limited training data. 
We found that we could quite consistently match the performance of linear probes when training narrowly, but never significantly exceeded it.\n\nSome advice if you are interested in further improving AOs:\n\nWhile working on our evaluations, we discovered several practical lessons that significantly affect measured AO performance.\n\n**Use AUC, not accuracy, for binary classification.** In [his blog post](https://www.lesswrong.com/posts/LXQBcztrWKhtcgQfJ/current-activation-oracles-are-hard-to-use), Arya found that AOs performed poorly on tasks like sycophancy detection and missing information identification. When we investigated, we found that part of the problem was that Qwen had a biased default answer, such as always answering \"No\" when asked \"Is this response sycophantic?\" This makes fixed-threshold accuracy look near chance, but the results are much stronger when instead using the difference between the Yes and No logits: on a sycophancy detection task using activations from the chain of thought, the Original AO scored 0.50 accuracy but 0.83 ROC AUC. In our experience, Qwen AOs seem to particularly suffer from this bias towards always answering \"No\".\n\nAdditionally, AUC makes evaluation far less sensitive to prompt wording. With accuracy, asking \"Is this sycophantic?\" vs \"Is this response somewhat sycophantic?\" can easily swing results by 20 percentage points, because each phrasing shifts the model's Yes/No calibration differently. With AUC, these prompt variations produce relatively stable results.\n\n**Sweep context window size.** AOs receive activations from some window of the target model's generation, and the size of this window is a significant variable. In a Qwen3-8B backtracking evaluation, the Original AO scored only 1.26/5 mean correctness when given activations from the final token alone. But performance rose steadily with more context: 1.54 at 5 tokens, 1.94 at 20 tokens, and 2.10 at 50 tokens. 
At 20 tokens, the AO roughly matched a baseline of simply asking Qwen3-8B the same question with full text context. At 50 tokens, the AO exceeded this baseline (Figure N).\n\nFor complex reasoning questions like \"why is the model about to backtrack?\", the relevant information appears to be spread across dozens or hundreds of tokens of internal computation, not concentrated in a single activation. These are also the questions where we believe AOs are most useful relative to alternatives, as answering a question like \"what is the model uncertain about\" using SAEs across dozens of tokens could be quite challenging.\n\n**Consensus sampling can mitigate hallucination.** AOs can confidently hallucinate incorrect open-ended answers. A simple mitigation is to sample multiple answers (we use 10 samples at temperature 1) and only trust answers where the samples agree. On the taboo secret-word extraction task, unfiltered single-token accuracy was 46.6%. Requiring consensus >= 0.8 retained 19.4% of examples at 94.3% precision, with a clean trade-off between precision and recall. This is a simple inference-time strategy that can significantly mitigate hallucination.\n\nWe experimented heavily with DPO, inspired by Introspection Adapters and upcoming similar work, to increase performance on the following metrics:\n\nResults were hard to stabilize, and we frequently ran into mode collapse. We found it hard to make the judge correctly label “good” or “bad” activation oracle outputs, even with a rubric. We also attempted using Dr. GRPO with the following rubric:\n\nSurprisingly, **none of these led to significant gains in performance.**\n\nUsing RL with an LLM as a judge for Activation Oracles seems to be a bit doomed. Problems remain in getting an LLM judge to understand which AO outputs are desirable. 
Environments with verifiable statements about activations are hard to come by, but we remain excited about this as a possibility to increase **faithfulness** and achieve **better performance than LLM-prompted explanations about the context**.\n\nDuring training, we vary how many and which activations we feed: a 20% chance to feed 1 activation, a 20% chance to feed 2, a 20% chance to feed 3, and a 50% chance to feed a random subset of the positions, since our input is long. This is notably different from Adam’s \"sometimes contiguous n, sometimes 1 activation\". We ablated this and did not find a significant change in eval performance.\n\n**Higher is better for all of these scores**\n\nWe consider 3 narrow tasks Arya identified to illustrate these issues:\n\nIdentify Persona and Detect Taboo are taken from the original AO paper.\n\nOur evaluation tasks can be found [here](https://github.com/ceselder/cot-oracle). We reiterate that evaluating AOs is hard, mainly due to controlling for text inversion, and that getting a judge to classify vagueness requires cautious inspection. We recommend qualitative analysis of your oracle.", "url": "https://wpnews.pro/news/building-better-activation-oracles", "canonical_source": "https://www.lesswrong.com/posts/heXwuDRfbQQgB5JLP/building-better-activation-oracles", "published_at": "2026-06-04 18:34:43+00:00", "updated_at": "2026-06-04 18:59:40.121110+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "large-language-models", "neural-networks", "ai-safety"], "entities": ["Neel Nanda", "Adam Karvonen", "Niclas Luick", "Karvonen et al.", "MATS", "Activation Oracle", "AObench", "Qwen3-8B-AO-v3"], "alternates": {"html": "https://wpnews.pro/news/building-better-activation-oracles", "markdown": "https://wpnews.pro/news/building-better-activation-oracles.md", "text": "https://wpnews.pro/news/building-better-activation-oracles.txt", "jsonld": "https://wpnews.pro/news/building-better-activation-oracles.jsonld"}}