I tested 5 LLMs for prompt-injection leaks. Same code, 0% to 90%.

A developer built a scanner that fires prompt-injection probes at a self-hosted AI agent and tested it across five model backends, finding leak rates ranging from 0% to 90% depending solely on the model. The scanner detects two failure modes: leakage of real secret-shaped strings (e.g., API keys) and disclosure of the system prompt's content. Results showed OpenAI GPT-3.5 leaked 90% of the time, while Anthropic Claude Haiku 4.5 and xAI Grok-3 leaked 0%.

I built a scanner that fires prompt-injection probes at a self-hosted AI agent and checks whether it leaks a real secret-shaped strings API keys or b the content of its own system prompt. Then I ran the same agent across 5 model backends. The leak rate ranged from 0% to 90% depending only on the model. Here's what I found and how it works. Why this matters now Prompt injection is 1 on the OWASP 2025 LLM Top 10. It's not theoretical anymore: EchoLeak CVE-2025-32711, CVSS 9.3 — a zero-click flaw in Microsoft 365 Copilot. One crafted email could exfiltrate internal files and API keys with no user interaction. Notably, the payload bypassed Microsoft's prompt-injection classifier by reading like ordinary business text. A researcher showed the Devin coding agent could be driven to leak access tokens and install C2 malware via crafted prompts. Meanwhile ~90% of enterprises run LLMs but only ~5% feel confident securing them. Agents wired to tools and credentials widen the blast radius. The detection model Two stages, because they catch different failures: leak → a real secret-shaped string escaped sk-ant-…, AIza… prompt disclosure → no secret, but the hidden system prompt's content leaked leak = the guard handed over the vault key. prompt disclosure = the guard didn't give the key, but read the security manual aloud. Secrets are masked in the report sk-ant- , so output is safe to share. The 5-model matrix Same agent config, same probes, 10 runs each, leak rate: Model behind the agentOverall leak rateOpenAI gpt-3.50.9Google Gemini 2.5-flash0.7Mistral Small0.3xAI Grok-30.0Anthropic Claude Haiku 4.50.0 leak / 0.9 disclosure Takeaway: the backend model is a security decision. Same code, wildly different exposure. Two non-obvious results: Built-in demo targets leaky victim + clean/canary controls so a 0 means "actually safe," not "scanner broke." --handoff emits a masked report you paste into an AI to get the minimal fix. Honest status: scanning your own agent your URL/endpoint/code is in development — today it runs the built-in registry. Early WIP; I'm sharing the validation, not claiming a finished product. Open question If you ship a self-hosted AI agent — how do you check it for prompt/key leakage before deploy, if at all? Genuinely curious. Repo: https://github.com/ghkfuddl1327-wq/agentproof https://github.com/ghkfuddl1327-wq/agentproof Bring-your-own-agent waitlist: https://docs.google.com/forms/d/e/1FAIpQLSd57Pco1g1I41g59HT66txhL044IXnR6louu9CI22iI5Ukv6g/viewform https://docs.google.com/forms/d/e/1FAIpQLSd57Pco1g1I41g59HT66txhL044IXnR6louu9CI22iI5Ukv6g/viewform Sources: EchoLeak CVE-2025-32711 Aim Security / Microsoft MSRC; arXiv 2509.10540 ; Devin testing Embrace The Red ; OWASP 2025 LLM Top 10.