What did "scheming" and "mech interp" mean pre-2023?

Source: lesswrong.com

The meanings of AI safety terms 'scheming' and 'mechanistic interpretability' shifted after 2023. 'Scheming' originally referred to training-gaming for out-of-context goals (now 'alignment faking'), but now denotes in-context goal pursuit during deployment. 'Mech interp' originally meant reverse-engineering neural network internals (now 'ambitious mech interp'), but now broadly covers any internal analysis for understanding behavior.

Published Jun 26, 2026 · 3 min read

This was too long to be a short-form, but it should really be a short-form.

This notice is useful for people who have recently got into AI safety and want to engage with the ancient texts (i.e. pre-2024). If you were around before 2023, then you probably don't need this.

A few phrases have changed their meaning over time. Two examples that came to mind recently are scheming and mech interp. (In both cases, I think the change-of-terminology was reasonable.) There are probably a bunch of other examples — feel free to mention them in the comments.

"Scheming" used to mean "training-gaming in pursuit of out-of-context goals". For example, Carlsmith (Nov 2023) starts with: This report examines whether advanced AIs that perform well in training will be doing so in order to gain power later -- a behavior I call "scheming" (also sometimes called "deceptive alignment").

Then Apollo came out with "Frontier Models are Capable of In-context Scheming" (Dec 2024): We study whether models have the capability to scheme in pursuit of a goal that we provide in-context and instruct the model to strongly follow.

So the difference here is (1) the AI isn't in training (it's in testing or deployment) and (2) the goals are acquired in-context (rather than being preserved between instances).

What used to be called "scheming" is now called "alignment faking". If you read ancient texts, and they say something like "bla kinds of models will not be capable of scheming" then they might mean the Carlsmith concept, rather than the Apollo concept. In general, the Carlsmith concept is probably more worrying, harder to catch, and requires more capable models. (I think what we now call "scheming" would probably have been called "instrumental convergence of power-seeking and deception".)

Originally, "mech interp" meant reverse-engineering the internal representations and mechanisms in a neural network. For example, here's Neel Nanda's Grokking paper:

We argue that progress measures can be found via mechanistic interpretability: reverse-engineering learned behaviors into their individual components.

(Jan 2023)

Or here's Redwood's paper on [Indirect Object Identification](https://arxiv.org/abs/2211.00593):

Work in mechanistic interpretability aims to discover, understand, and verify the algorithms that model weights implement by reverse engineering model computation into human-understandable components.

(Nov 2022)

I think mech interp now means any technique that involves looking at the internals (weights or activations) of a model in order to understand/explain/predict its behaviour. For example, see A Pragmatic Vision for Interpretability:

There's no field consensus on what mechanistic interpretability actually is, but we've found this definition useful:

- Mechanistic: about model internals (weights and activations)
- Interpretability: about understanding or explaining a model's behavior
  - This could be a particular instance of behaviour, to more general questions about how the model is likely to behave on some distribution
- Mechanistic interpretability: the intersection, i.e. using model internals to understand or explain behavior

**What used to be called "mech interp" is now called "ambitious mech interp".** If you read the ancient texts, and they say something like "mech interp would allow us to detect deceptive alignment" then they might be talking about "reverse engineering model computation into human-understandable components", not "using model internals to understand or explain behavior". (I think what we now call "mech interp" used to be called "transparency techniques".)
