What did "scheming" and "mech interp" mean pre-2023?

The meanings of AI safety terms 'scheming' and 'mechanistic interpretability' shifted after 2023. 'Scheming' originally referred to training-gaming for out-of-context goals (now 'alignment faking'), but now denotes in-context goal pursuit during deployment. 'Mech interp' originally meant reverse-engineering neural network internals (now 'ambitious mech interp'), but now broadly covers any internal analysis for understanding behavior.

This was too long to be a short-form, but it should really be a short-form. This notice is useful for people who've recently got into AI safety and want to engage with the ancient texts (i.e. pre-2024). If you were around before 2023, then you probably don't need this.

A few phrases have changed their meaning over time. Two examples that came to mind recently are scheming and mech interp. In both cases, I think the change of terminology was reasonable. There are probably a bunch of other examples; feel free to mention them in the comments.

Scheming

This used to mean "training-gaming in pursuit of out-of-context goals". For example, Carlsmith (Nov 2023, https://arxiv.org/abs/2311.08379) starts with:

> This report examines whether advanced AIs that perform well in training will be doing so in order to gain power later -- a behavior I call "scheming" (also sometimes called "deceptive alignment").

Then Apollo came out with "Frontier Models are Capable of In-context Scheming" (Dec 2024, https://arxiv.org/abs/2412.04984):

> We study whether models have the capability to scheme in pursuit of a goal that we provide in-context and instruct the model to strongly follow.

So the difference here is (1) the AI isn't in training (it's in testing or deployment) and (2) the goals are acquired in-context (rather than being preserved between instances).

What used to be called "scheming" is now called "alignment faking". If you read ancient texts, and they say something like "bla kinds of models will not be capable of scheming", then they might mean the Carlsmith concept rather than the Apollo concept. In general, the Carlsmith concept is probably more worrying, harder to catch, and requires more capable models.

I think what we now call "scheming" would probably have been called "instrumental convergence of power-seeking and deception".

Mech interp

Originally, this meant reverse-engineering the internal representations and mechanisms in a neural network.
For example, here's Neel Nanda's Grokking paper (Jan 2023, https://arxiv.org/abs/2301.05217):

> We argue that progress measures can be found via mechanistic interpretability: reverse-engineering learned behaviors into their individual components.

Or here's Redwood's paper on Indirect Object Identification (Nov 2022, https://arxiv.org/abs/2211.00593):

> Work in mechanistic interpretability aims to discover, understand, and verify the algorithms that model weights implement by reverse engineering model computation into human-understandable components.

I think mech interp now means any technique that involves looking at the internals (weights or activations) of a model in order to understand/explain/predict its behaviour. For example, see A Pragmatic Vision for Interpretability (https://www.lesswrong.com/posts/StENzDcD3kpfGJssR/a-pragmatic-vision-for-interpretability):

> There's no field consensus on what mechanistic interpretability actually is, but we've found this definition useful:
>
> - Mechanistic: about model internals (weights and activations)
> - Interpretability: about understanding or explaining a model's behavior
>   - This could be a particular instance of behaviour, to more general questions about how the model is likely to behave on some distribution
> - Mechanistic interpretability: the intersection, i.e. using model internals to understand or explain behavior

What used to be called "mech interp" is now called "ambitious mech interp". If you read the ancient texts, and they say something like "mech interp would allow us to detect deceptive alignment", then they might be talking about "reverse engineering model computation into human-understandable components", not "using model internals to understand or explain behavior".

I think what we now call "mech interp" used to be called "transparency techniques".