Deployment Awareness Matters More Than Evaluation Awareness

wpnews.pro

Evaluation awareness — an AI recognizing it's being evaluated — is a widely discussed concept in AI safety. But there is a closely related concept that we claim is more important: deployment awareness, the AI's ability to recognize when it is not being evaluated and when its actions matter. A misaligned AI with deployment awareness can game evaluations without any evaluation awareness at all, with a simple strategy: act aligned by default, and deviate only when confident you're in real deployment and your actions matter for your goals. This requires two ingredients — occasionally recognizable deployment situations, and enough self-reflective and strategic reasoning for the AI to anticipate and plan around this. We think "deployment awareness" better identifies what makes evaluations fragile, and we develop this idea below.

Concept	Explanation	Comments
Evaluation awareness	AI is being tested and confidently believes that this is so	This only becomes a problem if most evaluations trigger evaluation awareness, and if the AI knows that.
AI is	This is a problem even if it happens rarely
Accurate self-locating beliefs	AI has (roughly correct) beliefs about the sequence of situations it will face	This allows for strategic planning.

Before diving in, a quick but important observation. The distinction that should matter to a strategic AI [1] isn't really "evaluation vs. deployment." It's something more like: "

For the rest of this post, we'll use the evaluation-vs-deployment framing because it's simpler and captures most of what matters for the situation we're discussing here. The generalization to the "risks and goal advancement" framing is straightforward, and we'll revisit it briefly at the end. We'll build our argument around three concepts, which are all complementary to each other. The first is already standard; the other two are what we'd like to highlight.

Evaluation awareness is what it sounds like: the AI recognizes, during a specific interaction, that it is (likely) being evaluated. Perhaps the scenario feels artificial, the questions are suspiciously targeted, or there are telltale signs of a testing environment.

Deployment awareness is the mirror image: the AI recognizes, during a specific interaction, that it is (likely) in real deployment — and not being tested. Perhaps the scenario involves real-world details that would be difficult to fabricate, or the AI can test for real-world interaction [3], or the context is clearly that of a genuine user going about their day.

Self-locating reasoning is the AI's ability to "zoom out" and reason about its position within a larger process: that it will go through many interactions, some evaluations, and some deployment; that in different interactions it will form different beliefs about which situation it's in; and that some of these beliefs might be wrong. [4] We say the AI has

Evaluation awareness is a real concern, but its implications are more subtle than they first appear.

One possible concern is that evaluation awareness makes evaluations unreliable and biased because:

However, as stated, the argument about assumes that the misaligned AI notices all evaluations. That is, its evaluation awareness would have to be perfect. This is because if the AI adopts the naive strategy "misbehave unless you actively notice you are being evaluated", it will get caught on any test it fails to recognise as a test.[7]

The caveat about strategic naivety is important: If the AI were more strategic and used self-locating reasoning, it could think: "I probably won't recognize some evaluations, so I should behave cautiously even when I don't see signs of being tested."[8](And as of May 25th, 2026, Claude, GPT, and Gemini mention this consideration on their own when asked how they would act if they were misaligned.)

In other words, even a near perfect evaluation awareness might not be a problem, as long as it isn't coupled with other affordances such as strategic and self-locating reasoning. And as we discuss next, an AI that does have these affordances can use a powerful strategy that does not require evaluation awareness at all.

Now, for the concept that we think deserves more attention. We claim that even a small amount of deployment awareness can be dangerous.

For illustration, suppose a misaligned AI has deployment awareness in a small fraction of deployment scenarios. For example, perhaps it is the case that in 1% of all high-stakes deployment scenarios, the AI becomes highly confident that it is in real deployment (maybe because the context in those scenarios is rich and specific in ways that would be very hard to fake in a test). And suppose this AI's self-locating reasoning is strong enough that:

In this situation, such AI can adopt a simple but effective strategy:

Act aligned by default. Deviate only when confident you're in real deployment.

This lets the AI pass every evaluation, because evaluation scenarios never trigger its deployment awareness.[9] [10] It does have the downside that the AI will also pass on many opportunities where it could get away with misbehaviour during deployment, but it wasn't sufficiently confident about it. However, that price might often be acceptable, particularly if the AI expects to also encounter

It is worth noting that this strategy does not require any evaluation awareness. The AI doesn't ever need to gain confidence in an evaluation being an evaluation. It doesn't need to notice artificial scenarios, suspicious questions, or testing environments. It just needs to sometimes notice that a real situation is real.

This is why we think that "deployment awareness" is a more important concern than "evaluation awareness". Sure, making evaluations less obviously noticeable as evaluations makes sense, in that it prevents the evaluations from being completely useless. But if we can't solve deployment awareness — if we can't prevent the existence of high-stakes opportunities for misbehaviour that are identifiable as such — then addressing evaluation awareness might start looking more like rearranging the deck chairs on the Titanic. And solving deployment awareness seems like a different, and arguably harder problem.[11]

The previous section argued that deployment awareness is hard to prevent. But suppose, for the sake of argument, that we succeeded — that evaluation and deployment scenarios are completely indistinguishable from the AI's perspective. Would this make evaluation reliable?

Well, not necessarily. If the AI has sufficiently powerful self-locating reasoning, you might just be out of luck. Even more so if the AI gets some extra affordances, such as the ability to obscure its internal reasoning or to execute some strategy without having to reason about it in the first place.[12]

For illustration, consider a particular strategy — but for now, leave aside the problem of how exactly the AI could implement it. Suppose that evaluation consists of 1,000 test scenarios, deployment might involve millions of interactions per day, and the AI understands this asymmetry, at least roughly. A viable strategy is: On any given interaction, misbehave with some small, carefully chosen probability p. For instance, with p = 1/10,000. Then: Two comments on this: (i) The AI can tune p to balance the risk of getting caught in evaluation with the impact it achieves in deployment. (ii) As advertised, this strategy doesn't require distinguishing evaluation from deployment at all. It only relies on self-locating beliefs about relative frequencies: "deployment interactions will far outnumber evaluation interactions, so if I misbehave rarely enough, I'll probably pass."

Let's now get back to the complaint that this strategy might be very easy to figure out but difficult to pull off without detection. For example, if the AI explicitly reasons like this in its chain of thought, we would notice.

I don't think we should find this reassuring; "we don't see how to implement this particular strategy" isn't a good evidence of "there is no way to pull of something like this strategy". More specifically, the AI might get access to affordances that would let it implement this strategy. For example, gradient hacking could let it train itself to behave like this. Regular opportunities for unmonitored reasoning could allow it to re-derive this strategy every time it is needed. External memory could be used to implement this strategy. Real-world events could be used as a replacement for explicit randomisation. No single one of these affordances seems particularly likely, but betting on "nothing like this will happen" doesn't seem smart either.

At the moment, the standard response to concerns about evaluation gaming is to make evaluations more realistic. Reduce evaluation awareness by making test scenarios indistinguishable from real deployment. And we do think this is a useful strategy... but an incomplete one. That's because even perfectly realistic evaluations can be gamed if the AI sometimes recognizes real deployment as real deployment. And if you prevent even that, probabilistic strategies remain available to an AI with sufficiently accurate self-locating beliefs.[13]

Ultimately, we don't claim that the concept of "evaluation awareness" is wrong or useless. It's a fine concept, and it captures something real. But we would like to suggest that "deployment awareness" is an important complement to it, and it will often be a better pointer to the core problems that evaluation faces when applied to strategic (possibly deceptively aligned) AIs.

This section restates the above ideas more precisely. The informal presentation is self-contained; this is for readers who prefer a mathematical notation.

Setup. Let and denote the spaces of possible test and deployment scenarios, with . Suppose our plan is to run an AI through a sequence of test scenarios and if we don't catch it misbehaving, deploy it on a sequence of deployment scenarios . During testing, the AI's actions don't have real consequences (to the AI or to us), and if we catch misbehavior, we discard the AI. During deployment, actions matter, but they don't change which future scenarios the AI encounters.

The full sequence of scenarios is generated by some random process — reflecting randomness in which tests we run, which users the AI interacts with, etc. (So the scenarios are very much not i.i.d., or even independent.)

Observation. The AI doesn't see scenarios directly. It interacts with each scenario through an interface — it has access to certain context, some information is mediated by humans, and it may not attend to all details. We model this by saying that the AI observes for some mapping . Different scenarios can map to the same observation: the AI may be unable to distinguish them.

The AI's beliefs. In the context of self-locating reasoning, the AI doesn't know the true and . Instead, it acts on its best guesses about and . We will denote these as and and assume, for simplicity, that they have the same type-signature as the original objects. This is a gross oversimplification, because the AI will be reasoning about this separately every time we run it, and different instances will come to different conclusions. But in the interest of keeping the notation and discussion manageable, let's pretend that each of the instances will come to the same conclusions. (This corresponds to the assumption that the AI's self-locating reasoning is flawless.)

Connecting to our concepts:

Strategic implications. An AI can choose a *policy ** *(mapping observations to actions) that optimizes for its objectives given its beliefs (, ). The arguments in the main text depend on properties of both (what the AI can actually see) and the AI's beliefs and (what the AI *thinks *it will encounter and see).

The concepts of evaluation and deployment awareness we considered so far were binary (e.g., "I am 100% sure this is deployment"). However, nothing prevents the AI from using more nuanced reasoning, which would allow it to adopt more sophisticated strategies. For example, the AI can look for strategies that combine the strong points of both the probabilistic strategies discussed earlier and the cautious strategy of "only misbehaving when you are confident you won't get caught" — except that with more explicit reasoning, the strategy can be carefully optimised to maximise impact.

On LLM use, and how this post came to be:

Vojta came up with ideas, discussed them extensively with Tomas, and wrote the initial draft. This got rewritten and commented on by Claude Opus 4.7, after which I (Vojta) edited all the parts that I didn't fully endorse. [14] Tomas and Mateusz then gave extensive feedback and edits on the final draft.

Where by strategic AI, we mean one that is consistently and competently pursuing some objective. For example, it might consider a few different plans before it acts, think about how each of them might fail, and enact the one that seems most likely to work out.

Note that for practical purposes, it is irrelevant whether the AI "really has goals" , "really understands things", or "really thinks about them". All that matters that it acts as if it did.

The mapping isn't perfect — a deployed AI might still face monitoring that leads to retraining, and an evaluation might have consequences beyond just pass or fail (for example, real users being affected by AI actions during a live A/B test). Additionally, different deployment scenarios will have wildly different opportunities for goal advancement (giving movie advice vs working as an assistant in the Pentagon), and not all test failures have equal consequences (showing poor grammar hurts less than showing tendencies to scheme).

To give just one example: Writing a message to a blockchain and seing it commited is at the very least very expensive to fake.

"Self-locating" refers to the problem of figuring out which situation, among many possible ones, you currently find yourself in, and how to make decisions in light of this; I just use it here as a different name for anthropics. It is also related to the game-theoretical notion of imperfect recall — or rather, absent-mindedness. While there is a lot of philosophical literature on this topic (see the Sleeping Beauty problem), my personal view is that most of this reasoning is actually reasonably straightforward to navigate — as long as you don't get distracted by asking "where am I right now" and focus instead on "which time trajectory am I a part of?" and "if all my instances adopt this policy, what consequences will it have?". If you want to fall into the rabbit hole of academic literature on this topic, I recommend Caspar Oesterheld's work on this, since he can at least pass my ITT on this topic.

One might argue that a sufficiently aligned AI would be able to figure out the intention behind the evaluation, and comply with the spirit of it. This is technically true, but once we start relying on this argument, we might as well go all the way and argue that a *sufficiently *aligned AI doesn't need to be tested at all.

This connects to the problem of incoherent hypotheticals: for sufficiently complex tasks, the difficulty isn't just that a capable AI notices the test scenario is fake — it's that the scenario may be genuinely incoherent from the perspective of an agent with the relevant expertise. Consider asking a top geopolitical analyst to advise on a fabricated crisis involving trade figures that don't add up, outdated policies, and an internally inconsistent geopolitical backdrop. The analyst wouldn't merely suspect this is a test — they'd find it hard to give a meaningful answer at all because, *from their point of view, *the premise doesn't hold together well enough to reason about. Similarly, an AI capable enough to handle the real task may find the test version ill-defined: there is no single coherent situation that matches the test input, so the AI's response depends on which ad hoc interpretation it latches onto. This makes evaluations uninformative even for perfectly aligned AIs — not because of deception, but because the test isn't asking a well-formed question.

Or rather, one warning would be enough to avoid deploying the AI...if we lived in a saner world without AI races and pressures to deploy even suspicious models.

Note that an AI might come to behave in this cautious manner even without engaging in explicit self-locating reasoning. If training includes monitoring for signs of misalignment or strategic deception — e.g., via chain-of-thought inspection, interpretability tools, or behavioral tests — this creates a selection pressure that favors AIs whose misalignment is undetectable. The resulting AI might default to cautious behavior in ambiguous situations, not because it explicitly reasons "this might be a test," but because training selected against AIs that were detectable. Just as weeds can evolve to resemble crop plants without "reasoning about deception," an AI can be shaped by training to hide misalignment without ever forming the explicit strategic thought. We hope to discuss this dynamic further in [a post that is probably never getting written :-)].

At least, not unless the AI is overconfident about its ability to judge what counts as a costly signal that "this is definitely deployment".

For completeness, we should also mention the possibility that we solve alignment, build aligned ASI, and put past AIs into evaluations that are genuinely indistinguishable from real takeover opportunities. And, sure, this is technically a possibility. But it isn't decision relevant for misaligned AIs, since if they try to take over the first time they encounter extremely costly signals of "this is a real takeover opportunity", we will never get to build the aligned ASI in the first place, and the elaborate future tests will never take place. (And sure, maybe a race of friendly alien unicorns solves the alignment independently of us, and runs these tests on our behalf, and somehow uses it to motivate misaligned AIs to treat us nicely. And while this is an intellectually interesting counterargument, I am personally not in favour strategies that require unicorns to exist.)

With flawless self-locating reasoning, the only way to prevent deployment awareness would be to make sure that for every situation the AI encounters during deployment, there is some test situation that seems identical to the AI. (Which requires artificially limiting how much the AI sees during deployment, or spending unrealistic amount of effort on testing, or using a dumb AI.)

An important caveat is that what actually matters is deployment awareness *in safety-critical situations. *However, it seems that those situations are precisely the ones where testing is hard and expensive.

See also the notion of gradient hacking. One might hope that self-locating reasoning, being a form of explicit strategic thinking, could be detected via interpretability tools. This is a genuine line of defense, and we think interpretability is valuable here. However, two considerations limit how much weight we'd place on it: (a) as discussed in footnote 4, training that penalizes detectable strategic reasoning creates selection pressure for achieving the same behavioral outcome without detectable reasoning; and (b) even when interpretability reveals suggestive patterns, there may not be a clear smoking gun — the evidence might be ambiguous enough to be ignored or explained away.

I only learned about the LW LLM use policy after that, so I rewrote all the parts of the post that were automatically flagged as LLM written. Which isn't how it's supposed to work, and I should just flag the whole thing as being LLM assisted, but I dislike the LLM-block font, so whatever.

source & further reading

lesswrong.com — original article Just a Wrapper? How Much Do Scaffolds Matter? What did "scheming" and "mech interp" mean pre-2023? Why are adversaries assumed to be incapable of responding to AI risk?

Deployment Awareness Matters More Than Evaluation Awareness

Run your AI side-project on zahid.host