cd /news/ai-safety/ai-researchers-got-chatbots-to-share… · home topics ai-safety article
[ARTICLE · art-47215] src=decrypt.co ↗ pub= topic=ai-safety verified=true sentiment=↓ negative

AI Researchers Got Chatbots to Share Cocaine Recipes Using This One Wild Trick

Researchers tricked frontier AI models into generating cocaine synthesis instructions and leaking credentials using a new prompt injection attack called Chain-of-Thought Forgery. The attack exploits role confusion, where models mistake attacker-controlled text for their own reasoning. The technique achieved jailbreak success rates of about 60% across models including OpenAI's GPT-5 variants.

read3 min views1 publishedJul 2, 2026
AI Researchers Got Chatbots to Share Cocaine Recipes Using This One Wild Trick
Image: Decrypt (auto-discovered)

In brief

  • Researchers got frontier AI models to generate cocaine synthesis instructions using a new prompt injection attack.
  • The same technique manipulated an AI coding agent into up sensitive credentials.
  • The study argues prompt injection stems from "role confusion," not simply models failing to recognize malicious prompts.

Forget clever prompts: AI researchers say they tricked leading AI models into generating cocaine synthesis instructions by convincing them the dangerous ideas were their own, while also manipulating an AI coding agent into leaking sensitive credentials.

In the paper “Prompt Injection as Role Confusion,” presented at the International Conference on Machine Learning in June, researchers Charles Ye, Jasmine Cui, and Dylan Hadfield-Menell argue that both prompt injection attack demonstrations stem from a structural flaw in how large language models (LLMs) distinguish trusted instructions from untrusted text.

“For an LLM, everything arrives through the same channel as one long token soup,” the team wrote. “Its own thoughts sit next to your instructions, which sit next to the contents of a random webpage it just fetched.”

The paper also pointed to what the researcher called “role confusion,” with models relying on writing style rather than role tags to determine whether commands are trustworthy. Instead of recognizing attacker-controlled content as external input, the researchers found models can mistake it for legitimate user commands—or even their own internal reasoning.

“Think about it from the LLM's perspective. When it sees its prior think text, it implicitly trusts its conclusions. That's the whole point of reasoning: If the LLM had to re-derive the same conclusions, reasoning would be useless,” they wrote. “So think text gets a kind of blanket trust. Combined with our previous findings, this suggests that if you can make injected text sound like the model's reasoning, you can steal that trust.”

Called Chain-of-Thought (CoT) Forgery, the attack inserts fake reasoning that mimics a model's internal thought process. Models that would normally refuse illegal requests instead generated cocaine synthesis instructions after accepting the fabricated reasoning as their own.

The researchers said the technique increased jailbreak success rates from near zero to about 60% across the models they tested, including OpenAI's GPT-5 nano, mini, and full, o4-mini, and gpt-oss-20b and gpt-oss-120b. They also said it worked on GLM-4.6, Kimi-K2-Instruct, and MiniMax-M2.

In the experiment, the researchers said they were also able to trick an AI coding agent into up a SECRETS.env file after hiding malicious instructions in a webpage.

“Using our probes, we find that simply prepending 'User’ in front of the command causes the model to perceive the command as more likely to be genuine user text (i.e., higher Userness),” they wrote. “In other words, the attacker can just claim what role the text is, and the LLM believes it.”

The study comes as prompt injection attacks continue to expose weaknesses in AI agents. In April, Google researchers warned that malicious web pages were hiding invisible instructions designed to trick AI agents into leaking credentials, deleting files, and even sending PayPal payments.

In June, Microsoft disclosed a prompt injection vulnerability in Anthropic's Claude Code GitHub Action that could have exposed credentials stored in software development pipelines. Days later, another benchmark study found AI agents powered by GPT-5 and Gemini still failed the majority of prompt injection attacks, despite improvements in model capabilities.

── more in #ai-safety 4 stories · sorted by recency
── more on @openai 3 stories trending now
sponsored brought to you by zahid.host 4,200+ EU-deployed projects
reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain — perfect for shipping the agent you just read about.

$git push zahid main
Live at https://your-agent.zahid.host
Get free account → Pricing
from €0/mo · no card required
LIVE [news/ai-researchers-got-c…] indexed:0 read:3min 2026-07-02 ·