Safe deployment of an AI system requires that we can make confident claims about its behaviour on out-of-distribution (OOD) deployment inputs on the basis of only pre-deployment evaluations. One approach to making such claims is to take a cognitive perspective, in which we interpret the AI's behaviour in terms of latent cognitive constructs, such as motivations, intentions, and goals. Because the same behaviour may be compatible with a range of underlying cognition (such as scheming, fitness-seeking, or aligned motivations), inferring cognition from a behavioural snapshot can be tricky. In this post, we introduce the idea of Developmental Cognitive Interpretability (DCI), which aims to model how cognitive constructs change over the course of training. Further, by understanding how cognition results from training pipelines, we can predict agent behaviour resulting from pipelines that have not yet been tested.
We discuss core assumptions and philosophical background of DCI, and lay out a broader research agenda. We have some initial evidence that the methodology works in at least one toy setting, and our current main uncertainty is whether we can scale our approach to LLMs. We invite those interested in working on these problems to reach out to us at jrb239[at]cam[dot]ac[dot]uk and edward[at]geodesicresearch[dot]org.
Confidently predicting that an AI system will not cause harm in deployment is the central challenge of AI safety. Pre-deployment evidence of alignment must be collected on inputs we can safely test, but deployment will inevitably give the model dangerous inputs where misbehaviour could be catastrophic. Confidently saying that an AI will behave as desired outside its evaluation distribution therefore requires us to predict its out-of-distribution (OOD) behaviour.
How might we do this? One approach is to try to understand what a model is doing internally at a mechanistic level. However, the most ambitious versions of Mechanistic Interpretability may be out of reach under short timelines. Alternatively, we can try to understand a model’s behaviour in terms of its cognition—that is, its motivations, goals, drives, intentions, and beliefs. One approach to alignment is then to give AIs safe motivations—those that generalise in the way we would want them to out-of-distribution.
Inferring the motivations of an AI is made tricky by behavioural degeneracy: the same behaviours may be compatible with multiple conflicting underlying motivations. For example, AIs that are playing the training game or attempting to acquire deployment influence might display desired behaviours for reasons very different from true alignment. Even in the non-adversarial case, AIs might learn concepts subtly different from those we intend, which come apart only in deployment situations.
To solve this problem, we propose formulating theories of how an AI's cognition develops over the course of training. We call this approach Developmental Cognitive Interpretability: modelling how OOD behaviour arises from a model's training pipeline via interpretable cognitive constructs. Unpacking it back-to-front:
The agenda rests on four load-bearing assumptions, in increasing specificity:
If successful, we would have predictive tools for evaluating the effects of complex training pipelines, and a much stronger general understanding of LLM cognition that would allow us to make progress on questions such as the likelihood and potential effectiveness of scheming and reward seeking. Even short of full success, identifying where behaviour resists prediction would itself help flag areas where guarantees of safety might be difficult to achieve even by other methods, which would be useful for technical, policy, and advocacy work.

In this section, we demonstrate how we apply the ideas discussed above in a toy setting. For the full detail, see our paper.
We trained CNN-based RL agents on tasks in which they had to navigate to a goal object within a maze. Each goal object had a shape and a colour: for example, red diamonds or blue crosses. We trained each agent on a two-stage pipeline, first training it to pursue one goal and then a different one: for example, black plusses followed by red circles. We then attempted to predict the OOD behaviour of agents in a forced-choice setting. Specifically, we placed agents in mazes containing two goals (with different shape-colour combinations) and measured their propensity to pursue one goal over the other.
How do the four assumptions laid out above apply to our case?
(A1) Structured OOD behaviour. We found that, although the agents were only ever exposed to training environments with a single goal at a time, and only to two goals total out of a possible 24 colour-shape combinations, their OOD behaviour was coherent and had obvious structure. For example, agents trained on red diamonds would often pursue red-coloured objects in the forced-choice setting, and agents trained on blue crosses would often pursue cross-shaped objects.
(A2) Capturing this structure with latent cognitive constructs. In this case, the OOD behaviour of each agent was well captured by a small set of *values* that predicted pairwise choice probabilities across all possible forced choices. Specifically, assigning a value to each colour-shape combination and using a Boltzmann-rational model of choice allows us to compress 276 forced-choice probabilities into a set of just 24 interpretable score values.
(A3) Predicting latent evolution with training information. We develop a methodology for predicting how these score values will evolve over the course of training, which we call latent policy gradient (LPG). For any given training pipeline, we are able to use LPG to predict the value scores that our RL agents will possess at the end of training.
(A4) Predicting unseen pipelines. We further show that our method can predict the OOD behaviour of agents trained on held-out pipelines, by understanding the effects of individual pipelines.
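To make the compression in (A2) concrete, here is a minimal sketch of how 24 score values could generate all 276 pairwise forced-choice probabilities under a Boltzmann-rational choice rule. The function name `boltzmann_choice_probs`, the inverse-temperature parameter `beta`, and the randomly drawn scores are illustrative assumptions, not details taken from the paper.

```python
import itertools

import numpy as np


def boltzmann_choice_probs(scores, beta=1.0):
    """Return a matrix P where P[i, j] is the probability of choosing i over j.

    For two options, Boltzmann rationality reduces to a logistic function
    of the score difference: P(i over j) = sigmoid(beta * (v_i - v_j)).
    `beta` is an assumed inverse temperature, not a value from the paper.
    """
    scores = np.asarray(scores, dtype=float)
    diff = scores[:, None] - scores[None, :]  # all pairwise score differences
    return 1.0 / (1.0 + np.exp(-beta * diff))


n_goals = 24  # number of colour-shape combinations
rng = np.random.default_rng(0)
example_scores = rng.normal(size=n_goals)  # made-up scores, for illustration only

choice_probs = boltzmann_choice_probs(example_scores)

# Every unordered pair of distinct goals is one forced-choice condition.
pairs = list(itertools.combinations(range(n_goals), 2))
print(len(pairs))  # 276 = 24 * 23 / 2
```

Under this rule only score differences matter, so the scores are identified only up to an additive constant.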
This paper uses relatively simple models of both cognition—Boltzmann rationality over score values—and development—our latent policy gradient method. However, we think this is an important proof-of-concept for the overall approach, and are excited to scale up our methods to more sophisticated models of cognition and development appropriate for LLMs.
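As a rough illustration of how score values of this kind might be recovered from forced-choice data, the sketch below fits a Bradley-Terry-style model by maximum likelihood using plain gradient ascent. This is a standard technique swapped in for illustration; the post does not describe the paper's fitting procedure, and the function name `fit_scores`, the simulated data, and the trial count are all assumptions.

```python
import numpy as np


def fit_scores(wins, n_steps=500):
    """Fit one score per option from pairwise choice counts.

    wins[i, j] is the number of trials in which option i was chosen over j.
    Maximises the logistic (Boltzmann, beta = 1) log-likelihood by gradient
    ascent, with each step normalised by the option's total trial count.
    """
    wins = np.asarray(wins, dtype=float)
    trials = wins + wins.T                          # trials per pair
    per_item = np.maximum(trials.sum(axis=1), 1.0)  # step-size normaliser
    scores = np.zeros(wins.shape[0])
    for _ in range(n_steps):
        diff = scores[:, None] - scores[None, :]
        probs = 1.0 / (1.0 + np.exp(-diff))
        grad = (wins - trials * probs).sum(axis=1)  # d log-likelihood / d score
        scores += grad / per_item
        scores -= scores.mean()                     # fix the additive constant
    return scores


# Simulated data (hypothetical): 24 options, 50 trials per pair.
rng = np.random.default_rng(0)
true_scores = rng.normal(size=24)
true_scores -= true_scores.mean()
p_true = 1.0 / (1.0 + np.exp(-(true_scores[:, None] - true_scores[None, :])))
n_trials = 50
upper = np.triu(rng.binomial(n_trials, p_true), k=1)   # wins of i over j, i < j
wins = upper + np.triu(n_trials - upper, k=1).T        # remaining trials go to j

fitted_scores = fit_scores(wins)
```

Comparing `fitted_scores` with `true_scores` on simulated data like this is one simple sanity check before applying such a fit to real agent behaviour.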
We developed and tested our methodology in a toy setting of CNN-based RL agents pursuing colour-shape combinations in mazes, and found that it worked effectively. Encouraged by our early results, we have some reasons to expect this agenda to be fruitful when we turn our attention to LLMs. We recap the assumptions underpinning our agenda and evaluate to what extent we already have evidence for or against them.
(A1) Structured OOD behaviour. This holds in many domains: LLMs seem to have identifiable values, and their systematic behavioural tendencies grow with training scale. However, LLMs can also exhibit highly conditional behaviours and are influenced by spurious correlations in post-training.
(A2) Capturing this structure with latent cognitive constructs. This has been demonstrated across behavioural and mechanistic approaches. The values of LLMs seem amenable to modelling with Boltzmann-rational and Thurstonian approaches, and we’re finding low-dimensional internal representations of cognitive phenomena such as personas.
(A3) Predicting latent evolution with training information, and (A4) predicting unseen pipelines. These have not been directly demonstrated in LLMs, but some results provide evidence that they might hold. Neural scaling laws demonstrate that LLM next-token prediction loss is itself easily predicted from training information, and broader relationships between training data and behaviour have predictable structure across scales. Alignment techniques inspired by cognitive-level reasoning seem effective both for pre-training and mid-/post-training, and the coherence of cognitive models of LLM capabilities and preferences increases with scale. Initially surprising results show consistency across model sizes, model families, and datasets, and also seem to have interpretable latent causes. Indirect evidence aside, properly testing these assumptions is our next focus.
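For readers unfamiliar with the two choice models named under (A2), the sketch below contrasts them on a single pair of options. The Boltzmann-rational (logistic) rule and the Thurstonian Case V (probit) rule are standard textbook forms; the function names and the parameters `beta` and `sigma` are illustrative assumptions and are not taken from the cited work.

```python
import math


def boltzmann_pref(u_i, u_j, beta=1.0):
    """Boltzmann-rational choice: logistic function of the utility difference."""
    return 1.0 / (1.0 + math.exp(-beta * (u_i - u_j)))


def thurstone_pref(u_i, u_j, sigma=1.0):
    """Thurstonian Case V: each utility is perturbed by independent Gaussian
    noise with standard deviation sigma, so the difference has standard
    deviation sigma * sqrt(2) and the choice probability is a normal CDF.
    """
    z = (u_i - u_j) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF at z


# Same utility gap, two different link functions.
p_logistic = boltzmann_pref(1.0, 0.0)
p_probit = thurstone_pref(1.0, 0.0)
```

Both rules map a utility difference to a choice probability and agree that equal utilities give a 50/50 choice; they differ in the assumed noise distribution, and so in how quickly preferences saturate.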
There's lots of work to be done! Here are some research questions we're interested in, both those that can be started on immediately and those that are longer-term directions.
If you find any of this interesting or promising, please get in touch! We think a lot of people are starting to have ideas in this broad direction, and it seems worth trying to co-ordinate this effectively. Jason will be at EAG London 2026 this weekend and would be glad to talk about any of this in person. Finally, we’re also interested in any pushback and concerns people have about this research direction.
For contrast, see this paper for an example of LLM behavioural modelling that is not interpretable.
In Marr's terms, cognitive constructs sit at the computational/algorithmic level, whereas weights and activations sit at the implementational level.
Rather than, e.g., by reducibility to features or activation patterns.
By this we mean all inputs to the training process, so this could include details of model architectures or optimisers in order to account for their inductive biases.