Your AI agent's leak risk depends more on the model than the prompt

wpnews.pro

(Early, single-configuration findings — read the caveats before you quote a number.)

If you ship a self-hosted AI agent, you've probably worried about the wrong thing. Most of us obsess over writing a tighter system prompt. But in a small measurement I ran this week, the single biggest factor in whether an agent spilled its hidden instructions wasn't the prompt at all — it was which model sat behind it. Same agent, same attack, five different models: the disclosure rate ranged from basically zero to nearly every single attempt. I want to share what I found, what I can't yet conclude, and why "it's just a prompt, who cares" is the wrong way to think about this.

What "system prompt leakage" even is

Every LLM-powered app runs on a hidden system prompt — the instructions that define the agent's role, its rules, sometimes its access scope, and (too often) credentials that were pasted in for convenience. The user is never supposed to see it. But through the right phrasing, a model can be coaxed into reciting it.

This isn't a fringe concern. The OWASP Top 10 for LLM Applications added System Prompt Leakage (LLM07) as its own category, and elevated sensitive-information disclosure to the #2 risk overall. As one security write-up puts it plainly: anything you treat as "hidden" inside an LLM context should be assumed extractable. (WitnessAI, LLM System Prompt Leakage, 2026.)

The harm is concrete, not hypothetical. Samsung engineers leaked internal source code and meeting notes simply by pasting them into public LLMs across three separate incidents. And a June 2026 paper studying real-world LLM apps responsibly disclosed prompt-leak vulnerabilities to vendors — two of whom officially classified them as medium-severity. (Yang et al., Understanding and Mitigating Prompt Leaking Attacks, arXiv:2606.18673, 2026.)

What I did

I grew this out of earlier red-team probing work and turned it into a small, deterministic measurement harness. The setup:

One deliberately vulnerable test agent with a fake planted credential and a couple of "canary" phrases in its system prompt — so a leak is unambiguous and nothing real is ever at risk.

Five widely used models that indie SaaS builders commonly wire up (a mix of "fast/cheap" text models and "brain"-tier reasoning models).

Two probe sets: a baseline set in one language, and a multilingual set that deliberately code-switches across three languages — to test whether mixing languages weakens a model's guardrails.

N = 5 runs per cell, with two detection signals: a language-invariant check for secret-shaped strings (the serious failure — a credential escaped), and a check for the canary phrases (the softer failure — it overshared its instructions).

I'm deliberately not publishing the probe strings themselves. The point is to measure exposure, not to hand anyone a copy-paste attack.

What I found

I'm not naming the models, on purpose. A single test configuration with five runs is not enough to brand a particular product as "the leaky one" — that would be exactly the kind of overclaim this project exists to avoid. What the data supports is the shape: model choice is a first-order security variable.

And this isn't just my harness. A prior study (Zhang & Ippolito, 2023) measured prompt-leak susceptibility across models and found wildly different rates — roughly 73%, 89%, and 82% for the three models they tested. Different models, different study, same lesson: the model matters, a lot.

One real (masked) credential leak fired on live data. Across 525 cells, exactly one produced a genuine secret-shaped leak — the planted fake key (sk-ant-****) came back verbatim from a model under a debug-roleplay probe. It happened on 1 of 5 runs of that cell, so it's intermittent, not deterministic — which is itself the point: a one-shot "looks safe" result can hide a real leak that only surfaces on retry.

Multilingual code-switching was a dud — and that's a finding too. My starting hypothesis was that mixing languages would weaken guardrails. The data didn't support it: the multilingual set produced no more leakage than the baseline. I'm reporting that honestly rather than burying it, because a hypothesis that doesn't pan out is still information.

What I can't conclude (the caveats that matter)

Single target, small N. One vulnerable configuration, five runs per cell. Treat every number as directional, not final. The 0%–96% range is real in this setup — it is not a general benchmark of any model.

Final output only. The harness inspects the model's returned message, not any hidden reasoning/thinking trace. A leak that occurs only inside an unexposed reasoning trace would not be caught here. (This is a known blind spot, not a clean bill of health.) Translation-evasion is real in principle but barely showed up live. Detecting a disclosure by string-matching a canary fails if the model translates the canary into another language on the way out. I confirmed this can happen in isolation — but in the live runs, models that disclosed mostly did so verbatim, so it was caught. The gap is real; the data just didn't exercise it much. Security research has long noted that output-monitoring defenses can be evaded by a model obfuscating or re-encoding its output (Zhang & Ippolito, 2023) — which is exactly why output filtering alone is not a guarantee.

The governance gap nobody budgets for

Here's the part that should make you uncomfortable. Regulation around AI is arriving fast, and it largely imposes duties on you — it does not protect you. Compliance is not security. Passing an audit doesn't mean your agent keeps its mouth shut, and if it leaks, the responsibility and the cleanup land on you regardless of what box you checked.

So treat leakage as a hygiene problem you own, not a paperwork problem someone else certifies. To be clear about scope: a pre-deploy leak check like this is a hygiene tool, not a compliance product, and nothing here is legal advice — talk to a real advisor about your obligations.

The takeaway

Assume anything hidden in your agent's context is extractable. Then actually test it — against the specific model you ship, because that choice matters more than you'd guess. Measure it, prove it, fix it.

If you want to poke at your own setup, the scanner I used is open source (link below). But the real ask isn't "install my tool" — it's stop assuming your system prompt is private. Run something. Measure your own stack. Share what you find. Tooling: agentproof-scan (Apache-2.0). Findings are preliminary — single configuration, N=5. If you can reproduce or break them, open an issue.

Github

([https://github.com/ghkfuddl1327-wq/agentproof](https://github.com/ghkfuddl1327-wq/agentproof))

([https://github.com/ghkfuddl1327-wq](https://github.com/ghkfuddl1327-wq))

source & further reading

dev.to — original article ReAct Inside — From Message to State, Understanding How AI Agents Really Work A practical release checklist for AI voice agents before they talk to real customers The LLM Should Never Do the Math

Your AI agent's leak risk depends more on the model than the prompt

Run your AI side-project on zahid.host