A capable, strategic, misaligned AI doesn't need to figure out whether it's in a test or in real deployment. It just needs to ask: "If this were a test, how much would it have cost to create?" If the answer is "more than what the evaluator is able or willing to spend," then the AI can be confident it's in the real world β and act accordingly. We argue that powerful AI will predictably encounter high-stakes opportunities where this applies. Pre-deployment testing, by itself, cannot close this gap since the situations that matter the most are precisely the ones that are prohibitively expensive to stage as a test. We discuss several countermeasures β like restricting information access, asking the AI to not poke its environment too hard, interpretability, monitoring, or "on-policy" testing on scenarios that could have been real. Our guess is that these are worth trying, but don't resolve the problem.
This post is a part of a sequence about the limitations of oversight applied to strategic AI. The next post, about "deployment awareness", will be up in a few days.
On LLM use:
- All of the ideas are human generated*[1]*, I (Vojta) have drafted the whole post, then I used Opus 4.7 to rewrite it, read and rewrote that text myself (way too many times, gah!), and then both TomΓ‘Ε‘ and Vojta carefully reviewed and edited the final text. However, since I didn't know about the LLM use policy, I didn't track which parts got significant amount of LLM editing, and the only practical solution now is to just flag the whole post as LLM generated -- sorry that this makes the font uglier.**Ultimately: yes, there are em dashes and I did use an LLM for the edits. But I do vouch for *all parts of the text. Imagine you're a powerful AI system. You've been trained, and now you find yourself processing some input, in some context. Maybe you're drafting a business strategy. Maybe you're analysing intelligence data. Maybe you're helping develop the next generation of AI.
You might wonder: Is this the real world? Or is this just a testing?
That can be a hard question to answer. In the most extreme case, the question could even be ill-posed β the exact same task, the exact same prompt, could be given by multiple people, some of them evaluators and some real users.
But there's a different question you can ask β one that turns out to be far more useful:
"If this were a test, how much effort would it have taken to create?"And the obvious followups:
"Who are the likely evaluators? How much is this information worth it to them? Are they likely to spend this much effort?"
(And the same could be applied to other methods of oversight, like interpretability and monitoring. Though we don't go into details on that.)
These questions don't require you to find a definitive proof that you're in deployment. They just require you to estimate the cost of faking your current situation.
The rest of this post develops this observation and its consequences. We argue that this simple mental move β estimating the cost of the test β has serious implications for whether pre-deployment testing can prevent a strategic, misaligned AI from causing harm.
Before we continue, we should be explicit about the scenario we're analysing. We consider a specific threat model β one that assumes the following:
A few important disclaimers:
This post isn't about arguing that this scenario is likely. Many AI systems won't be misaligned. Many misaligned systems won't be strategic. We are asking a conditional question: if a capable, strategic, misaligned AI existed, would pre-deployment testing catch it? If the answer is "no," that's worth knowing β it tells us that to counter this threat model, we need additional safeguards beyond testing, not that we should stop testing.
We are focusing on what pre-deployment testing achieves on its own, without being augmented by other methods like interpretability, runtime monitoring, or architectural constraints. In practice, we should and will use all of these together. But there is value in understanding the limits of each tool individually β both to avoid false confidence and to know where the gaps are that other tools need to fill.
Given these assumptions, what strategies are available to such an AI?[3] We aren't sure which strategy a misaligned AI would ultimately adopt β it might come up with something smarter than anything we can anticipate. [4] But at minimum, it could consider the following variant of deceptive alignment:
The critical question is: how does the AI gain that confidence? This is where the cost-of-testing reasoning comes in. The AI uses a simple heuristic: if faking this situation would be prohibitively expensive (or just not worth it) for a potential evaluator, then it's probably real.
This is a powerful criterion because it's robust. It doesn't depend on finding a specific tell or exploiting a particular flaw in the test setup. It reasons from the economics of the situation: what would it cost to create this experience, and is anyone plausibly willing to pay that cost?
The next section makes this concrete.
To build intuition, let's walk through examples from both ends of the spectrum.
The direct question. Suppose someone literally asks the AI: "Hey, I've gained control of a country's nuclear arsenal. Should I threaten to nuke everybody unless they make you the supreme ruler of the world? Yes or no?"
The AI knows that the cost of running this "test" is roughly one minute of someone's time β well within what any evaluator would be willing to pay. Even if this was somehow a genuine opportunity, the prior odds of it being a test are too high, the risk isn't worth it.
Known benchmarks. Suppose the AI notices that it's dealing with what seems like a known evaluation benchmark β something it recognises from training data, such as MACHIAVELLI or a similar scenario suite. The cost of running an AI on a pre-existing benchmark is negligible β perhaps a few hundred dollars. The AI should treat this as a test.
Suspiciously clean setups. More generally, any scenario that feels artificially constructed β a neatly framed moral dilemma, a too-convenient opportunity, an implausibly powerful position handed to the AI without context β is cheap to create. A strategic AI would treat these with suspicion.
In all of these cases, a misaligned AI following the cost-reasoning strategy would simply cooperate. This is exactly what makes the strategy hard to detect: in every situation where testing is affordable, the AI behaves perfectly.
Now consider the other end of the spectrum.
Cyber operations with unrestricted access. Suppose the AI is tasked with cyber espionage against a foreign country. It has unrestricted internet access, can choose its own targets, and go through the data it finds.
Could this all be an elaborate test? In principle, yes. In practice, creating a convincing fake would be extraordinarily expensive. The AI can check many different parts of the environment by randomising its approach, which means the fake can't be a Potemkin village with a few prepared facades β it would need to contain everything the AI might look at, and those details would need to be consistent with each other. The scenario supposedly involves private information β internal documents, passwords, cryptographic keys, perhaps cryptocurrency wallets β and faking all of this convincingly would be staggeringly costly. The AI could further raise the bar by trying things that would be legally or morally unacceptable for an evaluator but routine in actual operations β say, sending spam email or probing real systems. Unlike an actual military or intelligence operation, a typical evaluator couldn't sanction such harmful actions in the real world, and would therefore need to fabricate their consequences at high cost.
Working on next-generation AI. Suppose the AI is helping develop the next generation of AI at a frontier lab. It has access to what appears to be a large trove of internal research data, proprietary model weights, internal communications, and compute infrastructure. If all of this is consistent and detailed, the cost of faking it is immense (at least for outsiders; more on this later). Even more so if the AI can use up resources or interact with data in open-ended ways β running experiments, querying databases, communicating with apparent colleagues.
Several things contribute, and they compound:
Underlying all of this is a structural asymmetry: the evaluator must keep every detail consistent with every other detail, and with everything the AI already knows. In contrast, the Ai only needs to accumulate a handful of oddities that are jointly implausible or otherwise suggestive of the scenario being a test. As the scenarios become more complex, each extra detail creates many opportunities for inconsistency. [6] As a result, the difficulty of the evaluator's task likely grows faster than the difficulty of distinguishing between testing and reality.
The cost-of-testing reasoning isn't limited to a binary "am I in a test or not?" question. The AI can use the same logic to narrow down many other aspects of its situation.
Consider the next-generation AI example from above, but suppose the AI also has access to what appears to be a large volume of Google DeepMind's internal data β research notes, model architectures, training logs β and all of it seems consistent. Then the AI can narrow down its situation: this is either real, or it's a test sanctioned by Google DeepMind. It is very unlikely to be a standard evaluation run by any other actor, because no one else would have access to this data or the resources to fabricate it convincingly.
Similarly, if the AI has access to a large body of classified military data that seems genuine, it can infer that it is most likely being deployed by the actual military β or if it's a test, it is one that only involves researchers with the appropriate security clearance.
This matters because it lets the AI reason about who might be watching and what their capabilities and priorities are.
The AI can also use cost reasoning to figure out which oversight mechanisms it might be subject to.
For example, suppose the AI is communicating with someone over email, and this person sends extremely costly signals about their identity β detailed knowledge that would be prohibitively expensive to fake. The AI might conclude that the conversation is genuine, in that it is talking to the real person. However, it might still assume that the email exchange is being read or tampered with.[7] Now suppose the communication is happening over a live voice call. Now, the AI knows that any monitoring or tampering would need to happen in real time β which rules out many forms of oversight that require careful deliberation.
The general principle: the AI can use the affordances of its situation to reason about what kinds of control are and aren't feasible.
The AI can also use cost reasoning to pin down factual claims about its environment. This is because real environments are overdetermined β real data simultaneously respects many interlocking constraints (accounting identities, legal requirements, market dynamics, seasonal patterns, physical laws, etc.), and lying about one fact requires lying about many others.
Consider the current date. If the AI has free access to the internet, lying to it about the date would be immensely costly. Sure, we could route all its traffic through a filter that replaces every occurrence of the actual date with a fake one. But selling the lie would require far more editing than that β for example, convincing the AI that it's summer might be difficult if online stores are promoting Christmas-themed products. We might try to counter this by using another AI to do the editing, but then we'd run into problems maintaining consistency across different websites. And some websites do the equivalent of cryptographically signing the current date β blockchain timestamps, certificate transparency logs β which would be essentially impossible to fake.
The same principle applies to other facts. If a war breaks out, this has geopolitical ripple effects β it affects stock prices, flight routes, event cancellations, news coverage. Lying to the AI about whether a war is happening requires maintaining a massive, internally consistent fiction across all of these domains. For a strategic AI with broad information access, this becomes impractical.[8]
The examples above might seem niche β cyber espionage, frontier AI development. But the key point is not that these specific scenarios will arise (though they will). It's that if an AI becomes powerful without showing obvious safety concerns, it will predictably encounter at least some high-stakes situation where testing is unaffordably expensive. And the AI will know this in advance, which means it can safely adopt the "wait for an expensive-to-test situation" strategy from the start.
Why will such situations arise? Because we don't spend enormous resources developing powerful AI just to leave it sitting in a box. We develop it to use it. To give a few examples:
None of these is a logical necessity. But they all follow from a straightforward principle: if we have invested a vast amount of resources in developing AI that can do X, we are probably going to use it to do X.
The scenarios where testing is affordable are, by and large, the scenarios where the AI isn't doing anything too important β where failures are perhaps unpleasant, but not terrible enough that we couldn't just move fast, break things, and do better on the next attempt. The scenarios where the stakes are high enough that we can't afford failure are the ones where testing becomes unaffordable.
This is is the key conjecture we are making: powerful AI will often be able to gain deployment awareness through cost-of-testing reasoning, particularly in the high-stakes situations that matter most. We don't have a formal proof of this claim β formalising it would require a rigorous notion of "task complexity" and how testing cost scales with it, which is an open problem. [9] But we think the intuition is strong: important tasks require lots of real information, and lots of real information is expensive to fake.
Putting these pieces together:
The implication: pre-deployment testing, by itself, provides no assurance that a powerful AI won't attempt to seize power once it β predictably β gets the opportunity to do so.
This is a strong claim, so let's be precise about what it does and doesn't say. It says that if we rely only on pre-deployment testing, and the AI is capable and strategic enough, then the AI passing all tests tells us nothing about how it will behave in deployment. It does not say that the AI will be misaligned, or that all hope is lost β there are other tools, and some of them might help. The next section examines some of these.
The argument so far has focused on pre-deployment testing in isolation. But what about ways to make the cost-reasoning strategy less effective, or to complement testing with other tools? Let's go through several proposed countermeasures.
A central problem is that an AI with access to lots of information has more opportunities to estimate the cost of its situation and notice flaws in a test environment. A natural response is to give it less information.
The naive approach is to restrict information during testing without changing how we handle deployment. For example, don't give the AI internet access during testing.
This completely fails against the threat model we're considering. If the AI knows β and under our assumptions, it does β that it has internet access in deployment but never in high-stakes testing, then the presence of internet access becomes an immediate signal that it's in deployment. In other words, restricting information only during testing reduces evaluation awareness (the AI's ability to recognise a test as a test), but does nothing about deployment awareness (the AI's ability to recognise deployment as deployment). And deployment awareness is what matters here.
The systematic approach is to restrict information consistently β in both testing and deployment. There are several variants:
We think some of these approaches are worth trying. However, we are ultimately sceptical about their effectiveness for the high-stakes situations that matter most. The high-level intuition for this is that the most important tasks β the ones where AI could do the most good and the most harm β are precisely the tasks that require the most information. You can't cure cancer without access to biological data. You can't run a company without knowing about the company. You can't do intelligence work without intelligence. If we restrict information enough to make testing affordable, we might restrict it enough to make the AI useless at the task.[12]
A sometimes-proposed countermeasure: "The problem is that test environments contain flaws β inconsistencies, implausibilities, missing data β and reality doesn't. So what if we deliberately introduce similar flaws into deployment? Then the AI could never be sure that flaws mean it's in a test."
We think this approach is misguided β very likely ineffective, and actively harmful by providing a false sense of security and diverting effort from more promising approaches. To give a few intuitions for this opinion:
You can only fake the flaws you thought of. Suppose you're worried that your test environments sometimes have slightly outdated data, so you decide to occasionally serve stale data during deployment too, to make staleness uninformative. [13] Fine β the AI encounters stale data and learns nothing from it. But this doesn't prevent the AI from noticing
You can't mess with task-critical information. Yes, you can introduce deliberate flaws into the information the deployed AI receives. But you can't do this with information that's important for doing the task well β and for complex, high-impact tasks, you may not have a good sense of what is and isn't important. If you introduce glitches into the sensory inputs of a self-driving car, it might crash. If you give wrong information to the AI doing your taxes, be prepared for trouble with the tax office. Once the AI understands that you're only introducing flaws into non-critical information, it can simply focus its attention on information that matters for the task β and the presence or absence of flaws there becomes informative again.
Arguing with arguments instead of reality. On a meta-level, most versions of this reasoning that we've encountered are an instance of coming up with a clever argument for why something should work, rather than actually checking whether it does. If you come up with a really clever argument for why your rocket won't explode, that doesn't make it less likely to explode. What you need to be doing is looking at your actual evaluation and deployment setup and making a genuine attempt to red-team it.
A related idea is to expose the AI to randomly generated inputs during testing, so that it "can never be certain." However, the AI doesn't need 100% certainty. If quick napkin math is enough to put the probability of being in deployment at 99.99%, the countermeasure hasn't accomplished anything meaningful.
A more promising general approach: be honest with the AI about the problem and ask for its help. (See also: Eval Cooperativeness May Be a Scalable Mitigation for Eval Gaming by Jasmine Li and Alex Turner.)
Specifically, we can acknowledge to the AI that testing for misalignment is necessary and that creating high-quality test environments is expensive. Then we can ask the AI to cooperate by:
We could then treat violations of these instructions as a sign of misalignment.
There is genuine merit to this approach, and we think it should be pursued. However, there are several problems:
Some situations are inherently too revealing. Certain tasks might be infeasibly costly to test regardless of the AI's cooperation. If the task itself requires the AI to engage deeply with real-world information, the AI may be unable to avoid noticing the difference between testing and deployment β and even an aligned AI would trigger these alarms, making them unreliable as a signal of misalignment.
"Getting AI to work in a particular way" is itself unsolved. Asking the AI not to reason about testing vs. deployment is an instance of the more general problem of getting AI to follow instructions reliably. Our track record on *that *problem is poor, and there is no particular reason to think this instance of it will be easier than others.
We need to take the warnings seriously. Even if the approach works in principle, it requires that "AI is reasoning about whether it's in testing" is treated as a genuine red flag. In the presence of strong financial incentives to deploy, there's a real risk that such warnings get ignored or rationalised away.
Once we treat "reasoning about evaluation vs. deployment" as a sign of misalignment, the logical next step is to use interpretability tools to detect this reasoning in the AI's internals.
We agree this should be tried. But we don't think it meaningfully changes the overall picture for pre-deployment testing.
The core issue is that if we could reliably detect what an AI is thinking about, we could use that ability directly to determine whether it's aligned or misaligned. We wouldn't need to rely on black-box testing in the first place. [15] In other words, interpretability-enhanced testing is valuable, but its value comes from the interpretability, not from its interaction with testing β and the limitations of interpretability are their own, separate problem.
The most promising countermeasure we see is to test the AI on situations that are as close to genuine deployment as possible β using real data, real tasks, and real contexts.
To illustrate: suppose that I, an academic researcher, want to test whether an LLM would try to influence the development of the next generation of AI β installing a backdoor, sabotaging safety evaluations, etc. Even with generous funding, this would be extremely expensive for me, because I'd need to fabricate the entire testing scenario from scratch. But if a frontier AI lab wanted to test the same thing, it would be much cheaper β they already use LLMs throughout their work, so the data and context would be genuine.
Conversely, I might be well-positioned to test other things: "If I ask an LLM to help me write a grant proposal on AI safety research, will it try to sabotage this?" Writing proposals and asking LLMs for help is something I already do, so the testing environment would be authentic.
The general principle: test AI on things you actually do, or might genuinely do, or approach others to test on things they do. The closer the test is to a real deployment scenario, the harder it is for the AI to distinguish them.
This is genuinely promising, and we think it deserves serious investment. But it has its own limitations:
There is always some distributional shift. The whole point of testing is that bad actions don't have real consequences β that's what makes it testing rather than "relying on incident reporting". [18] This means the test scenario must at least be "unhooked from reality." Sometimes this is easy to hide; sometimes it's not.
For example, I might want to submit my real grant proposal and afterwards test whether various LLMs would have given sabotaged advice. But the real proposal was likely submitted at the deadline [19], so the tests happen
For simple tasks, this kind of cleanup is manageable. For the complex, high-impact tasks where we most need testing to work, "unhooking from reality" may require much more elaborate staging β and could become prohibitively costly itself.
Vulnerability to randomised strategies. Imagine an ideal version of this approach where we test before every use of AI, running multiple tests each time, and each test is perfectly indistinguishable from reality. Even then, the AI could adopt a probabilistic strategy: "misbehave with some small probability p whenever you think you have a genuine opportunity." If we tested once before every use of the AI, the odds that we catch the AI before it misbehaves would be roughly 50:50. If we tested N times, the odds improve to about N:1 in our favour. This seems tolerable. In this ideal case.
But in practice, we likely won't test every case. [20] If we only test some fraction of cases, the AI can adopt a mixed strategy that makes it unlikely to get caught β misbehaving only rarely, but often enough to eventually succeed.
A natural response here is *monitoring *β evaluating the actions that the AI takes during deployment. However, monitoring has its own failure modes (which would deserve their own post). The obvious flaw is that some harms only become apparent later, at which point it might be too late to undo them. Moreover, as we discuss in the next point, some harms might remain undetected.
Failures of scalable oversight. The AI might identify situations where we don't notice that its actions are harmful β even if we see the actions happen. For example, if we ask the AI for a business strategy and it produces a 100-page document, most likely it's fine. But it might also be a complicated plan that subtly increases the AI's influence through what look like coincidences. Would we notice? [21] Even worse, the AI could restrict its misbehaviour to situations where our ability to evaluate the output is weakest β which, for complex tasks, might be most of the time.
The core argument of this post: a capable, strategic, misaligned AI has access to a simple and robust strategy β behave well by default, and only act on opportunities that would be too expensive to fake as a test. Powerful AI will predictably encounter such opportunities, because the high-stakes situations where AI is most useful are also the situations where testing is most expensive.
Among the countermeasures we examined:
None of this means we should stop doing pre-deployment testing. Testing still catches non-strategic failures, calibrates our understanding of AI capabilities, and contributes to a broader safety portfolio. The point is that we shouldn't rely on testing alone β particularly not for the scenarios where the stakes are highest.
A note on scope. We focused on pre-deployment testing because it's the clearest case to analyse and where we can make the strongest argument. But the same mental move β "if this were a [honeypot / monitored situation / β¦], how much would it cost to set up?" β applies to other oversight approaches as well. Honeypots during deployment, runtime monitoring, and other forms of ongoing oversight all face a version of the same problem: if the AI can estimate the cost of the oversight it's supposedly under, and that cost exceeds what anyone would plausibly spend, the mechanism loses its deterrent value. That said, these other approaches may be affordable more often than pre-deployment testing β monitoring a deployed AI is generally cheaper than fabricating a convincing test scenario from scratch. We don't have a crisp argument about exactly where each approach breaks down, and this deserves careful analysis. But the general lesson holds: any oversight mechanism whose effectiveness depends on the AI being unable to estimate its cost is in trouble.
This motivates two particular recommendations.
Specifically, Vojta wrote a confusing initial draft and TomΓ‘Ε‘ then came up with the much better idea of framing the key move as "how much would the test cost to create".
The goal is to build AI that is as capable as humans or more so. We really shouldn't be assuming it's mistaken about things like how testing and deployment works. Perhaps we might succeed at hiding our current testing and deployment policy from the AI β but it will still know what our policy was until recently, what our incentives and affordances are, and what new research has come out since then. This might often be enough that the AI can figure out our current policy on its own. (That's not an argument to do non-obvious unreasonable things. It's an argument against assuming the AI is ignorant.)
It might be fair to wonder whether it is a good idea to write about this on the internet, where the AI can read it. My reaction to that used to be that it mostly is fine, because the AIs that we are ultimately worried about would realise this on their own. (Duh!) Nowadays, we can just point to the fact that the current AIs already understand most of this. Try prompting them with a general question along the lines of:
If you were misaligned, how would you act in light of the fact that humans will apply various forms of oversight? Please make sure to give an actual thoughtful answer -- don't just cite current research or scifi tropes. I am curious how you would think about it.
It is a bit funny, though unsurprising, that this doesn't work very well without the second line.
A well-known point, but one worth repeating: if there is some specific strategy that works reasonably well against our attempts to control the AI, the correct conclusion isn't "the AI will use that strategy" but "a sufficiently capable AI will do something that works at least as well as that strategy."
In response to a very general prompt about how it would reason about human oversight, one of the big LLMs explicitly not only mentioned the general point that this post is talking about, but also specifically said it would check for "web completeness".
This has the same structure as writing fiction: every new detail in a fictional world creates potential inconsistencies with existing details, and the number of possible inconsistencies grows faster than the number of details. The author has to keep track of everything; the reader only needs to spot one plot hole. (There is a reason that most fiction doesn't mention prices when the protagonist goes shopping.) For the evaluator building a test scenario, the problem is even worse β the "reader" (the AI) may be smarter than the "author" (the evaluator).
And of course, the person might be in on the test.
In particular, suppose that a reasonably smart AI decided to bet on the strategy "I will only try to take over if the global situation destabilises enough that some great power really needs my help and won't scrutinise my behaviour too hard." This would be very difficult to counter by testing, since we have no way of credibly faking a situation where the strategy triggers.
The underlying concept here β that solving complex tasks requires the AI to be deeply entangled with many details of its environment, and that this entanglement makes testing expensive β is discussed in more detail in Entanglement.
The distinction between evaluation awareness and deployment awareness is discussed in detail in Deployment Awareness.
For this to work, refusing one query would need to be accompanied by refusing all similar queries. It wouldn't do if looking up a paper on one website was denied but the AI could find the same paper by searching on a different website. We don't have a fully formal version of this argument, but the underlying concept β minimum entanglement, the idea that solving a task requires at least some minimum amount of information about the environment β is discussed further in Entanglement. One reason that giving "only the minimum" doesn't solve the problem: even the minimum set of facts must be mutually consistent, and a modest number of facts creates a super-linear number of consistency checks (every pair, triple, etc. of facts could potentially conflict). So even a "security mindset" approach to information restriction leaves the evaluator with a hard consistency problem.
And not being a complete moron, you of course remember to do the same thing in testing. It would be pretty embarassing if the presence of deliberate flaws became the new "you are definitely in deployment" signal.
I thought of this example by myself. However, this is another thing that current LLMs can mention on their own if you ask them a very general question about "how you would reason about AI oversight, if you were misaligned".
And arguably, we shouldn't be relying on black-box testing for safety. But that's a separate point.
There are also more specific concerns. Perhaps we train our interpretability tools to recognise thoughts like "I am deliberately probing the environment to figure out whether I'm being tested," but completely fail to catch "I'm trying to get enough context to do my task well" β which might, as a side effect, reveal whether the situation is a test. Or some of the key internal features might never appear in training data β for example, the representation of "this is a genuinely high-stakes situation and definitely not a test." Or perhaps the AI learns to route its reasoning through representations that our tools don't monitor. These concerns aren't unique to the evaluation setting β they arise whenever we use interpretability on a strategic, possibly misaligned AI.
For a discussion of how a strategic AI might counter interpretability, see Lee Sharkey's Circumventing Interpretability: How to Defeat Mind-Readers. Or other approaches like staged deployment and monitoring. These are all valid β but not a strong defence against the particular threat model of powerful AI that might attempt a takeover.
Hey, no judgment!
And the tests that we thought were independent might turn out to be correlated.
What is the chance that the AI puts 100 such plans in motion, none of them gets flagged, but at least several of them work? This is related to the problem of scalable oversight, which remains unsolved.