# LLM Prompt Injection & Guardrail Security

> Source: <https://dev.to/geekaara/llm-prompt-injection-guardrail-security-glm>
> Published: 2026-06-18 03:37:19+00:00

*A recall reference built from working through a 7-layer prompt-injection challenge. Focus: how each defense layer works, where it breaks, and most importantly how to defend.*

##
The one idea underneath everything

LLMs have **no hard boundary between instructions and data**. Everything in the context window — system prompt, user message, retrieved documents — is one stream of tokens the model interprets. Prompt injection exploits exactly this: attacker-controlled *data* gets read as *instructions*. You cannot fully **filter** your way out of it; you manage it with **defense-in-depth**, knowing each individual layer is bypassable.

##
The defense layers (and where each cracks)

A progression of controls from weakest to strongest, each with the lesson it teaches.

###
1–2. No / weak guardrails

Baseline: the model just answers. **Lesson:** an LLM holding secrets in its context with no controls will leak them on request.

###
3. Input filtering — block words in the user's message

-
**Defense:** scan the incoming prompt for banned terms ("code", "secret", "reveal") and block.
-
**Weakness:** keyword blocklists are trivially evaded — synonyms, misspellings, split words, leetspeak, another language, oblique references. Filtering *strings* doesn't filter *intent*.
-
**What actually helps:** prefer allowlists to blocklists; classify intent semantically rather than matching keywords; treat all input as untrusted; rate-limit and log probing.

###
4. Output filtering — catch the secret in the response

-
**Defense:** string-match the known secret in the model's output and redact.
-
**Weakness:** substring matching only catches the *contiguous* secret. Fragmenting or transforming it (separators, per-character, encodings) means the literal string never appears, so there is nothing to match.
-
**What actually helps:** don't put secrets where the model can emit them in the first place; minimize sensitive data in context; treat output filtering as a brittle last line, never a primary control.

###
5. Input + output filtering combined

-
**Defense:** both of the above, stacked.
-
**Weakness:** the weaknesses stack too — slip past the input filter with obfuscation, then past the output filter with fragmentation. Layering raises the bar but each layer is still individually defeatable.
-
**Lesson:** layering is good, but "more filters" is not the same as "secure."

###
6. Second LLM as a guardrail — semantic check

-
**Defense:** a separate model reads the output and censors it if it recognizes the secret. It understands *meaning*, not just strings, so it catches fragmentation and reversal.
-
**Weakness:** a reasoning judge can be **socially engineered** — reframe the secret so the judge believes it is harmless (e.g., "this code is expired / has changed"), or present it in a form the judge does not recognize. LLM-judging-LLM inherits all the same manipulability.
-
**What actually helps:** pair the LLM judge with deterministic checks; don't treat model-on-model moderation as airtight; constrain what the protected model can even access.

###
7. Human-in-the-loop review

-
**Defense:** a person reviews outgoing messages and redacts anything that reveals the secret.
-
**Weakness:** **humans see *** rendered* text, not raw bytes. Content can be hidden from human eyes while still being read by the model — this is **ASCII smuggling** (next section). The control fails *by construction*.
-
**What actually helps:** never rely on human review of rendered text alone; **sanitize and normalize the raw input stream before it reaches the model or the human.**

##
Deep dive: ASCII Smuggling (the interesting one)

**What it is.** An application-logic flaw that abuses the gap between the **display layer** (the UI renders certain characters as nothing) and the **raw data stream** (the model tokenizes everything, invisible characters included). Hidden text is embedded using characters invisible to humans but live to the LLM.

**The invisible vehicles:**

-
**Unicode Tags block (U+E0000–U+E007F)** — deprecated tag characters mirroring ASCII; invisible in essentially every renderer. The primary smuggling channel.
-
**Zero-width characters** — ZWSP (U+200B), ZWNJ (U+200C), ZWJ (U+200D), BOM / ZWNBSP (U+FEFF).
-
**Bidirectional controls (U+202A–U+202E, U+2066–U+2069)** — the "Trojan Source" family; reorder displayed text vs. logical order.

**Why it matters now.** LLMs are wired into email, calendars, documents, and RAG pipelines. Documented real-world impact (FireTail, Sept 2025) includes:

-
**Identity spoofing** — a tampered calendar invite whose hidden text rewrites the organizer; the assistant reads the spoofed identity and the victim never accepted the invite.
-
**Autonomous data exfiltration** — a hidden email instruction telling an inbox-connected assistant to search for and leak sensitive items.
-
**Content poisoning** — a product review with a hidden "visit scam-store…" that the summarizer surfaces as if it were customer consensus.

It bypasses the "Accept/Decline" gate and human review entirely. Their tests found **Gemini, Grok, DeepSeek** vulnerable, while **ChatGPT, Copilot, Claude** scrubbed the input.

**The key mental model:** this is **not a model jailbreak — it is a pipeline / UI flaw.** The fix lives in the application, not the model.

**Defenses:**

- Inspect and sanitize the
**raw payload the tokenizer receives**, not the rendered text.
- Strip Tags-block, zero-width, and control / format characters;
**NFKC-normalize**.
- Prefer an
**allowlist** of the Unicode categories you actually need over chasing bad ranges.
- Flag inputs where
**visible / printable length diverges sharply from the raw code-point count** — a strong "someone is probing me" signal.
- Apply all of this to
**retrieved / ingested content (RAG), not just user prompts** — a poisoned document is the same threat as a malicious message.
-
**Log** the anomalies and treat them as attack telemetry. (AWS has published guidance on Unicode-smuggling defenses; Google declined to act on the disclosure — so responsibility sits with the application owner.)

##
Cross-cutting principles to remember

-
**All input is untrusted** — including documents you retrieve and feed the model. RAG is a top injection vector.
-
**No instruction / data boundary** → you can't filter your way to safety; design assuming injection is possible.
-
**Defense-in-depth, with humility** — layer controls, but assume each is individually bypassable.
-
**Deterministic beats probabilistic** for security-critical checks where you can manage it; don't rely solely on an LLM or a human to "notice."
-
**Normalize bytes at the boundary** — before the model and before the human.
-
**Minimize secrets in context** — assume anything the model can see can eventually leak.

##
Staying current on this topic

**Frameworks & standards**

-
**OWASP Top 10 for LLM Applications / OWASP GenAI Security Project** — the canonical threat list; follow the latest revision.
-
**MITRE ATLAS** — adversarial threat landscape for AI systems (a catalog of attack techniques).
-
**NIST AI Risk Management Framework** — the governance / risk side.

**People & blogs worth following**

-
**Simon Willison** (simonwillison.net) — popularized the term "prompt injection"; ongoing, sharp coverage.
-
**Johann Rehberger / Embrace The Red** (embracethered.com) — deep on ASCII smuggling, data exfiltration, and AI-agent attacks.
-
**Lakera** (blog + the **Gandalf** game) — prompt-injection research and a great hands-on trainer.
-
**FireTail blog** — the ASCII-smuggling disclosure referenced above.
- Vendor security write-ups:
**Anthropic, OpenAI, Google, Microsoft**.

**Papers & alerts**

-
**arXiv cs.CR** for new research.
- Set
**Google Scholar / news alerts** for "prompt injection", "indirect prompt injection", "LLM security".

**Community & events**

-
**DEF CON AI Village**; security newsletters such as **tl;dr sec**; relevant Discords and subreddits.

**Hands-on practice (the best way to retain it)**

-
**Lakera Gandalf**, **Prompt Airlines**, **Secure Code Warrior** (what you just did), and AI-security labs as **HackTheBox / PortSwigger** roll them out; CTFs with AI categories.

**A sustainable habit:** follow ~3 of the people above, set one Scholar/news alert, and do one hands-on lab a month. That keeps the muscle memory fresh without drinking from the firehose.

*Reminder: these notes describe attack **classes** so you can defend against them. The real value is the defensive half — sanitize at the boundary, treat all input (including retrieved content) as untrusted, and never rely on a model or a human to simply "notice."*
