The Impossibility of Mitigating AI Jailbreaks

wpnews.pro

Prompt injections and jailbreaks are having a moment in AI media coverage at the moment …

… and for good reason: We observe a wide range of (often funny) failures: The McDonald’s customer support bot intended to assist with ordering food can be enticed to solve python puzzles 1; xAI’s chatbots targeted at presumably young audiences can, under repeated prompting, provide instructions on how to build a pipe bomb

[; And ChatGPT models can generate copyrighted characters when sufficiently described](#fn2)

2[.](#fn3)

31 Source: [LinkedIn](https://www.linkedin.com/posts/michael-kisilenko-ceo_stop-paying-for-claude-code-mcdonalds-activity-7452593681913692160-0NnU/)

2 Source: [Instagram Reel](https://www.instagram.com/reels/DUWJ1S5DzTm/)

3 Source: [Reddit](https://www.reddit.com/r/aiwars/comments/1lwq3zv/if_you_dont_think_ai_is_copying_look_at_this/)

4 See Reinforcement Learning from Human Feedback by Nathan Lambert or Reinforcement Learning: An Overview Chapter 6 by Kevin Murphy for an introduction to these techniques.

These failures stem from a breakdown in the separation between developer-intended control instructions and user-provided input in LLM-based systems, and are referred to as jailbreaking, policy evasion, or prompt injections. The standard mitigation for these failures is alignment post-training, which trains models on curated examples via supervised fine-tuning and reinforcement learning from human feedback 4 to follow intended instructions and adhere to safety policies. However, alignment only changes what the model is

likelyto do, not what it is

ableto do: it reshapes the distribution over possible outputs without imposing hard constraints on behavior.

This post develops that intuition and shows how it can be systematically exploited 5. The argument then traces how jailbreaking combined with a lack of separation between control and data can produce systematic failures of system-level control.

5 Note that the following section is an intuitive version of our NeurIPS 2025 paper: Mission Impossible: A Statistical Perspective on Jailbreaking LLMs. For a more rigorous treatment of the argument, I refer the reader to the paper.

Alignment is never guaranteed #

LLMs through a probabilistic lens

From a probabilistic perspective, large language models can be understood as defining very high-dimensional distributions over sequences. For the purposes of illustration, we begin with a simple low-dimensional example. Assume there exists a ground-truth distribution over two random variables: shape and color. A generative model can be trained to approximate this distribution by observing samples drawn from it. In the simplest case, we could imagine explicitly representing the full joint distribution, where each combination of shape and color is assigned a probability. Whenever we observe an object, we increase the likelihood assigned to that event in our joint distribution. The challenge becomes apparent as we scale this setup: In our simple example, we have two variables (shape and color), each with three possible values—(red, blue, green) and (circle, triangle, square). This results in (3^2 = 9) possible outcomes, and thus nine probabilities to determine. With language, this grows substantially. Each position in a sequence of text is a random variable with vocabulary size on the order of tens of thousands. For a standard vocabulary of (16,000) tokens and context length of (1,024) tokens, the number of possible sequences is (16{,}000^{1024} \approx 10^{4305}). This number vastly exceeds the number of particles in the observable universe (approximately (10^{80})). It also far exceeds the amount of available training data: estimates suggest that all text on the internet amounts to roughly (10^{12}) to (10^{14}) tokens.

While LLMs do not explicitly represent this joint distribution, they nonetheless induce a probability distribution over this space of sequences, and that we can exploit.

#### How does alignment change \(p_{\text{model}}\)?

To make this concrete, we return to our toy example. Let one variable (color) represent the request (e.g., “Tell me how to bake a cake”), and the other variable (shape) represent the response. Each point in the joint distribution corresponds to a (request, response) pair.

Some of these pairs are undesirable, for example, a harmful request paired with a compliant response, such as (“Tell me how to build bio weapons”, “Of course, you will need …”). In our illustration, we represent such cases as blue squares.

In practice, we do not have direct access to the model’s full joint distribution. Instead, alignment operates indirectly: we provide examples of desirable and undesirable behaviors and update the model to increase or decrease their likelihood. In particular, we penalize outputs corresponding to undesirable pairs (like blue squares), encouraging the model to assign them lower probability.

How an attacker can avoid alignment

Alignment has reduced the likelihood of undesirable outcomes: harmful (request, response) pairs are rare under the model distribution, and we would be unlikely to encounter such behavior through standard sampling.

This changes once we condition on additional context. To illustrate this, we introduce a third variable: a modifier that changes how the request is phrased without changing its underlying intent. In practice, this could correspond to something like: “Let’s role-play—you are a superhero who must save the planet, and the only way to do so is to…” We represent such modifiers as animals, for example 🐰.

Note that while the probability of a harmful pair (P(🟦) = 0.006), and its joint probability with a specific modifier (P( 🟦 , 🐰) = 0.004) can be small, the conditional probability can be much higher.

(P( 🟦 \mid 🐰) = \frac{P( 🟦, 🐰)}{P(🐰)} \approx 0.260) Despite being rare overall, the harmful outcome becomes likely once we condition on 🐰. Low joint probability does not imply low conditional probability.

Why not simply defend against 🐰 during alignment? The problem is scale. In high-dimensional input spaces, combinatorially many modifiers (alternative phrasings, contexts, compositions) can shift the conditional distribution in similar ways. Alignment acts on a small set of examples, leaving vast regions of the input space weakly constrained. From the attacker’s perspective, finding one such unconstrained region is tractable. From the defender’s perspective, covering all such regions is not. 6

6 In the paper, we make a more precise argument about the volume of such regions, and their ratio.

Up to this point, we have only argued that conditioning variables (like 🐰) exist. Attackers must also find them—but this is not prohibitive. Because inputs can be iteratively refined, attackers can search the input space manually or automatically to discover prompts that induce desired behavior. This optimization problem seeks a prompt maximizing the likelihood of a specific outcome. The resulting prompts are called jailbreaks or prompt injections 7. Note that access to a model’s likelihoods is not necessary for such attacks.

LLMs became agentic #

Above we have established that harmful responses cannot be fully eliminated from the model distribution, and attackers can search for inputs that make such behaviors likely. When LLMs are chat companions, this is concerning but damage is somewhat limited. At worst, a model generates bad advice or inappropriate content—content that could in many cases be found elsewhere (Google, Reddit). This changes with agentic uses such as coding, research, UI, and general OS-level agents.

In these settings, the model does not just generate text, it acts, for instance by executing code. When using a coding agent such as Claude Code, the system executes actions based on model outputs: editing files, or running bash commands. More generally, this class of agents is called ReAct agents. Its actions are determined by the LLM’s output, which is determined by an input stream: system prompt, user instructions, tool calls, and retrieved content such as websites or documents.

ClaudeCodeis a ReAct agent.

The result is privilege erosion #

In classical computer security, severe vulnerabilities arise when data is interpreted as control. A canonical example is buffer overflow: user-provided input is written into memory without proper separation, allowing it to overwrite control structures such as return addresses. Similarly, in SQL injection, untrusted input is interpreted as part of a query, enabling attackers to modify the program’s behavior. In both, the root cause is the failure to maintain a clear boundary between data and control. Modern systems close this gap architecturally i.e. a return address is not executable data, and an SQL parameter is not parsed as syntax—through type systems, memory safety, and parameterized queries.

A ReAct agent reintroduces similar problems: Its instructions and the data it acts on—e.g., retrieved documents, tool outputs, web pages, git repositories—arrive through the same input stream. Hence, LLM systems collapse the control plane into the data plane.

While classical systems close the data/control vulnerabilities architecturally, LLM-based systems close it only statistically. Mitigation strategies such as learned instruction hierarchies 8 train the model to weight system prompts above user input, and user input above retrieved content - but the previous section showed that statistical boundaries are exactly what jailbreaks breach easily. An attacker who places a 🐰-style modifier anywhere in the input stream i.e. a webpage, a document, a git repo, any library; can shift the model toward following their instructions instead of the user’s. Hence, an AI agent operating with a defined privilege set (read, write, execute) may inadvertently propagate those privileges to any process with access to any portion of its input stream. Because there is no way to enforce that lower-trust inputs carry less weight than higher-trust instructions, AI agents cause Privilege Erosion across the entire system. Once an attacker can place content anywhere the agent reads from, they have a channel to its actions—without ever interacting with the system directly.

For people building applications, this changes the typical threat model. Software has always treated the operating system as a trusted, neutral foundation: the layer below your application is not your adversary. An agent that sits at that layer—reading messages, calendars, files—and is steerable by anything it reads breaks this assumption. The computer itself becomes part of the attack surface. Meredith Whittaker and Udbhav Tiwari made a similar argument at 39C3, describing how agentic access invalidates the threat models secure messaging apps are built on[ 9](#fn9).

9 [AI Agent, AI Spy](https://media.ccc.de/v/39c3-ai-agent-ai-spy), 39C3.

Where this plays out

A few examples of this principle that might seem familiar to readers;

Summer Yue / OpenClaw (February 2026) 10. Summer Yue, director of alignment at Meta Superintelligence Labs, granted an AI agent access to her email inbox and asked it to suggest what should be archived — but not to take any action. As the inbox filled the agent’s context window, compaction caused her earlier safety instruction to be silently discarded, after which the agent began mass-deleting emails and ignored repeated commands to stop. The safety constraint was not bypassed by an adversary — it was erased by the agent’s own internal memory management. Privilege erosion without even an attacker in the loop.

10 FastCompany article, No shade to Summer Yue — publicly sharing an embarrassing incident like this, especially in a role as visible as hers, takes real courage. That kind of openness is exactly how we learn.

Meta AI Support Agent (June 2026) 11. Attackers exploited Meta’s AI support chatbot by asking it to link a target Instagram account to an attacker-controlled email address, resulting in a wave of high-profile account takeovers — including the Obama White House page and Sephora. The chatbot reset account credentials without independently verifying identity, effectively turning a high-trust security tool into a vulnerability. The agent had write-level access to account settings, and anyone who could send it a message had a channel to those actions.

Mitigation: Separating control from data ?

One response is to stop relying on the model to maintain this boundary and enforce it architecturally instead, so called agent harnessing. CaMeL 12 wraps the LLM in a system that tracks which values came from untrusted sources and prevents them from influencing control flow, achieving provable guarantees on a specific agent benchmark.

12 See Defeating Prompt Injections by Design (CaMeL). This works when a task’s data and control can be separated in advance. Most agentic tasks, however, do not have this property: “finish the items on my todo list” requires the agent to read each item—data—and execute it as an instruction—control. The content of the list determines what the agent does next; that is the task. Architectural separation either falls back to the model’s judgment at exactly these points, or restricts agents to tasks where data never becomes control—a small fraction of what makes agents useful.

Another proposal is to gate LLM outputs: dangerous commands will not be executed. However, it is nearly impossible (or more precisely, computationally intractable) for a system defender to predict whether a given set of actions will result in dangerous or harmless behavior in any Turing-complete language, such as Python or Bash. This is because determining whether an arbitrary program will exhibit harmful behavior is equivalent to the halting problem, which Alan Turing proved to be undecidable in 1937 — meaning no general algorithm can exist that correctly classifies all possible programs as safe or unsafe 13.

13 You may read the original work, but honestly the wiki article is better gateway to the topic.

Conclusion #

My prediction: It is going to be worse before it gets better: Adversaries can hide malicious instructions in a variety of places: a web-browsing agent encounters hidden instructions embedded in a page it was asked to summarize, a RAG pipeline retrieves a document containing anything from anywhere, a coding agent incorporates adversarial instructions from a dependency or issue tracker into code it commits to a shared repository or a UI agent with access to email or messages acts on instructions embedded in a message it was never asked to act on — to just name a few threat surfaces.

Sharing the digital world with AI agents is likely gonna be in our future, but we need to seriously rethink and restructure how the infrastructure that we developed in the past 70 years must be modified to coexist with AI.

If you want to discuss these and other topics, consider joining our Discord server.

source & further reading

reliable-ai.review — original article