Developmental Cognitive Interpretability: A Research Agenda for Modelling Generalisation and Predicting Agent Behaviour

Researchers at the University of Cambridge and Geodesic Research have proposed a new framework called Developmental Cognitive Interpretability (DCI) that models how an AI system's motivations, goals, and intentions change over the course of training in order to predict its behaviour on untested inputs. The approach aims to address a core AI safety challenge: confidently predicting that a model will not cause harm during deployment, where it will inevitably encounter dangerous out-of-distribution inputs that cannot be safely tested beforehand. The team has demonstrated initial evidence in a toy setting and is now seeking collaborators to scale the methodology to large language models.

Safe deployment of an AI system requires that we can make confident claims about its behaviour on out-of-distribution deployment inputs on the basis of pre-deployment evaluations alone. One approach to making such claims is to take a cognitive perspective https://www.lesswrong.com/posts/FeaJcWkC6fuRAMsfp/the-behavioral-selection-model-for-predicting-ai-motivations-1 , in which we interpret the AI's behaviour in terms of latent cognitive constructs, such as motivations https://www.lesswrong.com/posts/rhFXyfFSRKp3cX4Y9/shaping-the-exploration-of-the-motivation-space-matters-for , intentions https://www.lesswrong.com/posts/DTDoyDTtC8R3bCiTx/from-personas-to-intentions-towards-a-science-of-motivations , and goals https://www.lesswrong.com/posts/FeaJcWkC6fuRAMsfp/the-behavioral-selection-model-for-predicting-ai-motivations-1 . Because the same behaviour may be compatible with a range of underlying cognition (such as scheming https://arxiv.org/pdf/2311.08379 , fitness-seeking https://www.lesswrong.com/posts/bhtYqD4FdK6AqhFDF/fitness-seekers-generalizing-the-reward-seeking-threat-model , or aligned motivations), inferring cognition from a single behavioural snapshot is difficult. In this post, we introduce the idea of Developmental Cognitive Interpretability (DCI), which aims to model how cognitive constructs change over the course of training. Further, by understanding how cognition results from training pipelines, we can predict agent behaviour resulting from pipelines that have not yet been tested. We discuss the core assumptions and philosophical background of DCI, and lay out a broader research agenda. We have some initial evidence https://arxiv.org/abs/2605.23565 that the methodology works in at least one toy setting, and our main current uncertainty is whether we can scale our approach to LLMs. We invite those interested in working on these problems to reach out to us at jrb239 at cam dot ac dot uk and edward at geodesicresearch dot org.
Confidently predicting that an AI system will not cause harm in deployment is the central challenge of AI safety. Pre-deployment evidence of alignment must be collected on inputs we can safely test, but deployment will inevitably give the model dangerous inputs https://www.lesswrong.com/posts/tK8vqHDxaRGcysNJQ/the-safe-to-dangerous-shift-is-a-fundamental-problem-for-1 where misbehaviour could be catastrophic https://www.lesswrong.com/posts/fbrz9xhKpEeTKw5zL/irretrievability-or-murphy-s-curse-of-oneshotness-upon-asi . Being able to say with confidence that an AI will behave as desired outside its evaluation distribution requires us to predict its out-of-distribution (OOD) behaviour.

How might we do this? One approach is to try to understand what a model is doing internally https://www.anthropic.com/research/mapping-mind-language-model at a mechanistic level. However, the most ambitious versions of Mechanistic Interpretability may be out of reach https://www.lesswrong.com/posts/StENzDcD3kpfGJssR/a-pragmatic-vision-for-interpretability under short timelines. Alternatively, we can try to understand a model's behaviour in terms of its cognition: its motivations, goals, drives, intentions, and beliefs. One approach to alignment is then to give AIs safe motivations https://joecarlsmith.com/2025/08/18/giving-ais-safe-motivations/ , that is, motivations that generalise out-of-distribution in the way we would want them to.

Inferring the motivations of an AI is made difficult by behavioural degeneracy: the same behaviours may be compatible with multiple conflicting underlying motivations. For example, AIs that are playing the training game https://arxiv.org/abs/2311.08379 or attempting to acquire deployment influence https://blog.redwoodresearch.org/p/fitness-seekers-generalizing-the might display desired behaviours for reasons very different from true alignment.
Even in the non-adversarial case, AIs might learn concepts subtly different https://arxiv.org/abs/2210.01790 from those we intend, with the difference becoming apparent only in deployment situations.

To solve this problem, we propose formulating theories of how an AI's cognition develops over the course of training. We call this approach Developmental Cognitive Interpretability: modelling how OOD behaviour arises from a model's training pipeline via interpretable cognitive constructs. Unpacking it back-to-front:

The agenda rests on four load-bearing assumptions, in increasing specificity: (A1) structured OOD behaviour, (A2) capturing this structure with latent cognitive constructs, (A3) predicting latent evolution with training information, and (A4) predicting unseen pipelines.

If successful, we would have predictive tools for evaluating the effects of complex training pipelines, and a much stronger general understanding of LLM cognition that would allow us to make progress on questions such as the likelihood and potential effectiveness of scheming and reward seeking. Even short of full success, identifying where behaviour resists prediction would itself help flag areas where guarantees of safety might be difficult to achieve even by other methods, which would be useful for technical, policy, and advocacy work.

In this section, we demonstrate how we apply the ideas discussed above in a toy setting. For the full detail, see our paper https://arxiv.org/abs/2605.23565 .

We trained CNN-based RL agents on tasks in which they had to navigate to a goal object within a maze. Each goal object had a shape and a colour, for example red diamonds or blue crosses. We trained each agent on a pipeline in which it was first trained to pursue one goal and then a different one: for example, black plusses followed by red circles. We then attempted to predict the OOD behaviour of agents in a forced-choice setting. Specifically, we placed agents in mazes in which two goals with different shape-colour combinations were present, and measured their propensity to pursue one goal over the other.

How do the four assumptions laid out above apply to our case?

A1 Structured OOD behaviour.
We found that, although the agents were only ever exposed to training environments with a single goal at a time, and only to two goals in total out of a possible 24 colour-shape combinations, their OOD behaviour was coherent and had obvious structure. For example, agents trained on red diamonds would often pursue red-coloured objects in the forced-choice setting, and agents trained on blue crosses would often pursue cross-shaped objects.

A2 Capturing this structure with latent cognitive constructs. In this case, the OOD behaviour of each agent was well captured by a small set of score values that predicted pairwise choice probabilities across all possible forced choices. Specifically, assigning a score value to each colour-shape combination and using a Boltzmann-rational model of choice allows us to compress 276 forced-choice probabilities (one for each unordered pair of the 24 combinations) into a set of just 24 interpretable score values.

A3 Predicting latent evolution with training information. We developed a methodology for predicting how these score values evolve over the course of training, which we call latent policy gradient (LPG). For any given training pipeline, we can use LPG to predict the score values that our RL agents will have at the end of training.

A4 Predicting unseen pipelines. We further show that, by understanding the effects of individual pipelines, our method can predict the OOD behaviour of agents trained on held-out pipelines.

This paper uses relatively simple models of both cognition (Boltzmann rationality over score values) and development (our latent policy gradient method). However, we think this is an important proof of concept for the overall approach, and we are excited to scale up our methods to more sophisticated models of cognition and development appropriate for LLMs.

We developed and tested our methodology in a toy setting of CNN-based RL agents pursuing colour-shape combinations in mazes, and found that it worked effectively.
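As a minimal sketch of the compression described under A2 above, the snippet below expands one score per goal into a choice probability for every unordered pair of goals. It assumes the standard two-option Boltzmann (softmax) choice rule with an inverse-temperature parameter `beta`; the exact parameterisation used in the paper may differ, and the goal labels, example scores, and function names are hypothetical placeholders rather than values or code from the experiments.

```python
import itertools
import math


def boltzmann_choice_prob(score_a, score_b, beta=1.0):
    """P(choose a over b) for a Boltzmann-rational chooser (assumed form).

    exp(beta*a) / (exp(beta*a) + exp(beta*b)) simplifies to a logistic
    function of the score difference. This simple version is fine for
    moderate scores; very large differences could overflow math.exp.
    """
    return 1.0 / (1.0 + math.exp(-beta * (score_a - score_b)))


def pairwise_choice_table(scores, beta=1.0):
    """Expand one score per goal into a probability for every unordered pair."""
    return {
        (a, b): boltzmann_choice_prob(scores[a], scores[b], beta)
        for a, b in itertools.combinations(sorted(scores), 2)
    }


# Hypothetical placeholder labels standing in for the 24 colour-shape goals.
goals = [f"goal_{i:02d}" for i in range(24)]

# Hypothetical scores: two goals are preferred, the rest are neutral.
scores = {goal: 0.0 for goal in goals}
scores["goal_00"] = 2.0
scores["goal_01"] = 1.0

table = pairwise_choice_table(scores)
print(len(table))                               # 276 unordered pairs
print(round(table[("goal_00", "goal_01")], 3))  # 0.731, i.e. sigmoid(1.0)
```

With 24 goals there are 24 × 23 / 2 = 276 unordered pairs, which is where the 276-to-24 compression comes from. Under this rule only score differences matter, so the scores are identified only up to an additive constant.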
Encouraged by our early results, we have some reasons to expect this agenda to be fruitful when we turn our attention to LLMs. We recap the assumptions underpinning our agenda and evaluate to what extent existing evidence supports or contradicts them.

A1 Structured OOD behaviour. This holds in many domains: LLMs seem to have identifiable values https://arxiv.org/abs/2504.15236 , and their systematic behavioural tendencies grow with training scale https://arxiv.org/abs/2212.09251 . However, LLMs can also exhibit highly conditional behaviours https://arxiv.org/abs/2604.25891 and are influenced by spurious correlations in post-training https://arxiv.org/abs/2602.05910 .

A2 Capturing this structure with latent cognitive constructs. This has been demonstrated across behavioural and mechanistic approaches. The values of LLMs seem amenable to modelling with Boltzmann-rational https://www.lesswrong.com/posts/k6HKzwqCY4wKncRkM/brief-explorations-in-llm-value-rankings and Thurstonian https://arxiv.org/abs/2605.13339 approaches, and low-dimensional internal representations are being found for cognitive phenomena such as personas https://www.anthropic.com/research/assistant-axis .

A3 Predicting latent evolution with training information, and A4 Predicting unseen pipelines. These have not been directly demonstrated in LLMs, but some results provide evidence that they might hold. Neural scaling laws https://arxiv.org/abs/2001.08361/1000 demonstrate that LLM next-token prediction loss is itself readily predicted from training information, and broader relationships between training data and behaviour have predictable structure across scales https://arxiv.org/abs/2308.03296 .
Alignment techniques inspired by cognitive-level reasoning seem effective both for pre-training https://arxiv.org/abs/2601.10160 and mid-/post-training https://alignment.anthropic.com/2026/teaching-claude-why/ , and the coherence of cognitive models of LLM capabilities and preferences increases with scale https://arxiv.org/abs/2502.08640 . Initially surprising results https://arxiv.org/abs/2502.17424 show consistency across model sizes, model families, and datasets https://arxiv.org/abs/2506.11613 , and also seem to have interpretable latent causes https://arxiv.org/abs/2506.19823 .

Indirect evidence aside, properly testing these assumptions is our next focus.

There's lots of work to be done. Here are some research questions we're interested in, both ones that can be started on immediately and ones that are more long-term directions.

If you find any of this interesting or promising, please get in touch. We think a lot of people are starting to have ideas in this broad direction, and it seems worth trying to co-ordinate this effectively. Jason will be at EAG London 2026 this weekend and would be glad to talk about any of this in person. Finally, we're also interested in any pushback and concerns people have about this research direction.

For contrast, see this paper https://arxiv.org/abs/2405.10938 for an example of LLM behavioural modelling that is not interpretable.

In Marr's terms https://arxiv.org/abs/2004.05107v1 , cognitive constructs sit at the computational/algorithmic level, whereas weights and activations sit at the implementational level.

Rather than, e.g., by reducibility to features https://arxiv.org/abs/2506.19823 or activation patterns https://arxiv.org/abs/2507.21509 .

By this we mean all inputs to the training process, so this could include details of model architectures or optimisers in order to account for their inductive biases.