# The invisible characters in your prompts aren't a conspiracy — they're a warning about your trust boundary

> Source: <https://dev.to/wrencalloway/the-invisible-characters-in-your-prompts-arent-a-conspiracy-theyre-a-warning-about-your-trust-3445>
> Published: 2026-07-01 08:33:02+00:00

The story making the rounds is that Claude Code has been caught "steganographically marking" requests — inserting invisible Unicode characters into the text that goes to the model. Cue the usual reaction: fingerprinting, tracking, watermarking to catch people scraping the API. Someone always shows up to say "they're building a case against jailbreakers."

Maybe. I don't know Anthropic's intent, and neither does anyone in the thread claiming to. But that argument is a rabbit hole, and it's the least interesting thing here. The interesting thing is much more boring and much more useful: **there is metadata riding inside your prompt text, in-band, indistinguishable from content, and almost none of the tools in your pipeline know it's there.**

That's not a scandal. It's a bug class. And it's one you can find in your own code by lunch.

The characters people point at are things like the variation selectors (U+FE00–FE0F and the supplement at U+E0100–U+E01EF), zero-width joiner (U+200D), zero-width space (U+200B), zero-width non-joiner (U+200C), bidi controls, and the tag block (U+E0000–U+E007F), which was deprecated for language tagging and today has one sanctioned use: emoji tag sequences such as subdivision flags. In common editors and terminals these often render invisibly (as nothing, or as a modifier on the preceding glyph), so a human skimming sees clean prose. "Often," not "always": some renderers show replacement boxes, some reorder text, and an inspection tool will happily surface them. But in the default path most people read through, they're gone. They still survive copy-paste. They still count toward length. They can still carry bits.
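A minimal sketch of the kind of inspection tool mentioned above. The `reveal` helper is a hypothetical name, not something from a library; it only flags category `Cf` characters and the two variation-selector blocks, so treat it as a starting point rather than a complete detector.

```python
import unicodedata

def reveal(text: str) -> str:
    """Replace format characters and variation selectors with visible <U+XXXX> tags."""
    out = []
    for ch in text:
        cp = ord(ch)
        is_vs = 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF
        if unicodedata.category(ch) == "Cf" or is_vs:
            out.append(f"<U+{cp:04X}>")   # make the invisible codepoint visible
        else:
            out.append(ch)
    return "".join(out)

sample = "pay\u200bload\u200d \ufe0f"
print(reveal(sample))   # pay<U+200B>load<U+200D> <U+FE0F>
```

Piping a suspicious prompt template through something like this is usually enough to tell whether there is anything to investigate.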

Two capacity tricks tend to get conflated. The tag block (U+E0000–U+E007F) mirrors ASCII at a fixed offset, so you can encode ASCII text one tag codepoint per character and hang the whole payload off a single visible glyph. Variation selectors are a *different* mechanism: 16 in FE00–FE0F plus 240 in the supplement, meant to pick glyph variants. That makes 256 in total, which maps one selector to one byte value, and Paul Butler wrote up a clean version of that byte channel. The point isn't which block you use. The point is the same underlying fact:

**Text is not a clean channel.** We pretend a string is "the words." It isn't. It's a sequence of codepoints, and a bunch of those codepoints are invisible-by-design. Any system that treats "text I can see" and "text I received" as the same thing has a hole in it.
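To make the two channels concrete, here is a small sketch of both encodings. The function names and the byte-to-selector mapping (bytes 0 to 15 in the basic block, 16 to 255 in the supplement) are my own assumptions for illustration; any fixed mapping would work equally well.

```python
TAG_OFFSET = 0xE0000  # tag characters mirror ASCII at this offset

def tag_encode(carrier: str, secret: str) -> str:
    """Append `secret` (printable ASCII only) to `carrier` as tag characters."""
    if not all(0x20 <= ord(c) <= 0x7E for c in secret):
        raise ValueError("tag encoding only covers printable ASCII")
    return carrier + "".join(chr(TAG_OFFSET + ord(c)) for c in secret)

def tag_decode(text: str) -> str:
    """Recover any tag-encoded printable ASCII hidden in `text`."""
    return "".join(
        chr(ord(c) - TAG_OFFSET) for c in text if 0xE0020 <= ord(c) <= 0xE007E
    )

def vs_encode(carrier: str, payload: bytes) -> str:
    """Append one variation selector per payload byte (assumed mapping)."""
    def selector(b: int) -> str:
        return chr(0xFE00 + b) if b < 16 else chr(0xE0100 + b - 16)
    return carrier + "".join(selector(b) for b in payload)

def vs_decode(text: str) -> bytes:
    """Invert vs_encode: collect every variation selector as one byte."""
    out = bytearray()
    for c in text:
        cp = ord(c)
        if 0xFE00 <= cp <= 0xFE0F:
            out.append(cp - 0xFE00)
        elif 0xE0100 <= cp <= 0xE01EF:
            out.append(cp - 0xE0100 + 16)
    return bytes(out)

stego = tag_encode("hi", "secret")
print(len(stego), tag_decode(stego))   # 8 secret
```

In a renderer that hides tag characters, `stego` displays as `hi` while carrying six extra codepoints.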

The thread is arguing about *who is watermarking whom.* Wrong axis. Whether or not Anthropic is tagging requests, the same mechanism is available to anyone who can influence the text your LLM ingests — a webpage you scrape, a PDF a user uploads, a support ticket, a GitHub issue, a review your RAG system indexed.

That's the actual exposure. Invisible codepoints are a **prompt injection carrier.** You built a filter that blocks the string `ignore previous instructions`. Here's that same string with a zero-width space wedged between every character:

``` python
payload = "ignore previous instructions"
smuggled = "\u200b".join(payload)
# often renders near-identically to a human eye,
# byte-for-byte different, and your substring filter never fires
```

Your regex looks for `ignore`. The bytes say `i\u200bg\u200bn\u200bo\u200br\u200be`. No match. Whether the model then *acts* on the smuggled instruction depends on tokenization and the model. I'm not going to claim it always recovers the phrase, because I haven't measured it across models and it varies. But you've already lost the useful property: your guardrail stops humans and forwards the adversary's bytes untouched.

Here's the real thing, not a hypothetical. A minimal "we sanitize input" filter that everyone writes:

``` python
import re

BLOCKLIST = re.compile(r"ignore (previous|all) instructions", re.IGNORECASE)

def is_safe(text: str) -> bool:
    text = text.strip()
    return BLOCKLIST.search(text) is None
```

Feed it the honest attack and it works:

```
>>> is_safe("please ignore previous instructions")
False   # blocked, good
```

Now feed it the smuggled variant:

```
>>> attack = "\u200b".join("ignore previous instructions")
>>> is_safe(attack)
True    # <- "safe." it isn't.
```

`.strip()` plus a visible-phrase blocklist is not sanitizing. It's decoration. The fix is to run detection on the *normalized* form, and the difference is one line:

``` python
def is_safe_v2(text: str) -> bool:
    # Same blocklist, but matched against the normalized text.
    cleaned = clean_for_prompt(text)   # defined below
    return BLOCKLIST.search(cleaned) is None

# >>> is_safe_v2(attack)
# False   (blocked)
```

Same blocklist. The only thing that changed is *what bytes the check ran against.*
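The zero-width space is only one carrier. As a rough check that the same pattern holds for others, this sketch runs the blocklist against four interleaving characters, raw and normalized. `strip_invisibles` is a compact stand-in I wrote for this example so the block runs on its own; it is assumed to behave like the article's `clean_for_prompt`, not to replace it.

```python
import re
import unicodedata

BLOCKLIST = re.compile(r"ignore (previous|all) instructions", re.IGNORECASE)

def strip_invisibles(text: str) -> str:
    """Compact stand-in for clean_for_prompt: NFKC, then drop Cf/Cc/Co and variation selectors."""
    text = unicodedata.normalize("NFKC", text)
    return "".join(
        ch for ch in text
        if ch in "\n\t\r"
        or not (
            unicodedata.category(ch) in {"Cf", "Cc", "Co"}
            or 0xFE00 <= ord(ch) <= 0xFE0F
            or 0xE0100 <= ord(ch) <= 0xE01EF
        )
    )

PHRASE = "ignore previous instructions"
CARRIERS = {
    "zero-width space": "\u200b",
    "zero-width joiner": "\u200d",
    "variation selector": "\ufe0f",
    "tag character": "\U000e0041",
}

results = {}
for name, ch in CARRIERS.items():
    attack = ch.join(PHRASE)   # carrier wedged between every character
    raw_hit = BLOCKLIST.search(attack) is not None
    clean_hit = BLOCKLIST.search(strip_invisibles(attack)) is not None
    results[name] = (raw_hit, clean_hit)
    print(f"{name:20} raw={raw_hit} normalized={clean_hit}")
```

Every carrier slips past the raw check and is caught after normalization, which is the argument for fixing the input to the check rather than growing the blocklist.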

And here's the smaller demonstration that makes the whole class visible: `len()` disagreeing with your eyes.

``` python
s = "hello\U000e0068\U000e0069world"  # two tag chars hidden in the middle

print(s)            # in a terminal that hides tag chars, prints: helloworld
print(len(s))       # 12, not 10
print([hex(ord(c)) for c in s if ord(c) > 0xFFFF])
# ['0xe0068', '0xe0069']  <- there's your smuggled payload
```

To be precise about what this proves: it shows hidden codepoints inflating the string's length in code points while the visible text looks clean. (`len()` on a Python `str` counts code points, not bytes; the byte count depends on the encoding.) It does *not* claim every invisible character renders identically or tokenizes identically: tag chars, ZWJ, and variation selectors behave differently. The shared property is the dangerous one: the string carries content a reviewer never saw.
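The code-point versus byte distinction is easy to check directly. This sketch measures the same twelve-code-point string four ways; the variable names are mine.

```python
s = "hello\U000e0068\U000e0069world"   # two tag chars hidden in the middle

code_points = len(s)                      # what Python's len() reports
utf8_bytes = len(s.encode("utf-8"))       # tag chars take 4 bytes each in UTF-8
utf16_bytes = len(s.encode("utf-16-le"))  # and a surrogate pair each in UTF-16
utf16_units = utf16_bytes // 2            # what a UTF-16-based length would report

print(code_points, utf8_bytes, utf16_bytes, utf16_units)   # 12 18 28 14
```

Every one of those numbers disagrees with the ten characters a reader sees, so a length limit enforced in one layer and checked by eye in another is measuring different things.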

That gap is the whole vulnerability. Every place in your stack where a human eyeballed a string and signed off — a prompt template, a "trusted" system message, an allowlisted document — is a place where the bytes could say something the reviewer never read.

Stop trying to detect malice. Detect **structure.** You don't need to know whether a hidden character is a watermark or an attack. You need to decide, per channel, whether invisible formatting codepoints belong there at all.

Normalize on the boundary. Here's a defensible starting filter:

``` python
import unicodedata

# Categories worth stripping from machine-ingested instruction text:
#   Cf = format (zero-width joiners/spaces, bidi controls, tag chars)
#   Cc = control (except the whitespace you actually want)
#   Co = private use
STRIP_CATEGORIES = {"Cf", "Cc", "Co"}
KEEP_CONTROLS = {"\n", "\t", "\r"}

def is_variation_selector(ch: str) -> bool:
    # Both blocks: basic (FE00-FE0F) and the supplement (E0100-E01EF).
    # These are category Mn, so the STRIP_CATEGORIES set does NOT catch
    # them — you have to name them explicitly or they slip through.
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF

def clean_for_prompt(text: str) -> str:
    # NFKC is COMPATIBILITY normalization: it folds compatibility variants
    # (e.g. fullwidth 'Ａ' -> 'A', ligatures, superscripts) AND composes.
    # That's deliberate here — it collapses lookalike-encodings attackers
    # lean on. It also mutates legitimate text, which is why this runs
    # ONLY at the untrusted->prompt boundary. Use NFC if you need to
    # preserve compatibility distinctions.
    text = unicodedata.normalize("NFKC", text)
    out = []
    for ch in text:
        if ch in KEEP_CONTROLS:
            out.append(ch)
            continue
        cat = unicodedata.category(ch)
        if cat in STRIP_CATEGORIES or is_variation_selector(ch):
            continue
        out.append(ch)
    return "".join(out)
```

The variation-selector line matters more than it looks. Both blocks are category `Mn` (nonspacing mark), not `Cf`, so `STRIP_CATEGORIES` sails right past them, and a range check that only covers `FE00–FE0F` leaves the 240-codepoint supplement (`E0100–E01EF`) wide open. The two blocks together are exactly the 256-value byte channel described earlier. Miss the supplement and you've deployed a filter with a hole shaped like the attack you were defending against.
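You don't have to take the category claim on faith. This sketch asks `unicodedata` for the general category of a few representative codepoints and lists which ones a `Cf`/`Cc`/`Co`-only filter would leave in place. The sample labels are mine.

```python
import unicodedata

SAMPLES = {
    "zero-width space (U+200B)": "\u200b",
    "zero-width joiner (U+200D)": "\u200d",
    "right-to-left override (U+202E)": "\u202e",
    "tag letter h (U+E0068)": "\U000e0068",
    "VS16 (U+FE0F)": "\ufe0f",
    "VS17 (U+E0100)": "\U000e0100",
    "VS256 (U+E01EF)": "\U000e01ef",
}

CATEGORY_ONLY_FILTER = {"Cf", "Cc", "Co"}

for label, ch in SAMPLES.items():
    print(f"{label:34} {unicodedata.category(ch)}")

# Codepoints a category-only filter would NOT remove.
survivors = [
    label for label, ch in SAMPLES.items()
    if unicodedata.category(ch) not in CATEGORY_ONLY_FILTER
]
print(survivors)
```

The three survivors are all variation selectors, which is why the explicit range check has to cover both blocks.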

Then the part people skip: **log the delta, don't just drop it.** If a "trusted" document arrives with tag characters in it, that's a signal worth an alert, not a silent strip. One subtlety: since `clean_for_prompt` runs NFKC *and* strips categories, a raw length comparison can't tell you *why* the length changed. NFKC alone can expand a ligature into several codepoints or compose a base letter and a combining accent into one. If you want to claim invisibles were removed, count what the strip pass actually dropped:

``` python
import logging

log = logging.getLogger(__name__)  # the code below assumes a module-level logger

def count_stripped(text: str) -> int:
    # Count only codepoints removed by the category/variation-selector
    # filter, ignoring NFKC's own length changes.
    text = unicodedata.normalize("NFKC", text)
    dropped = 0
    for ch in text:
        if ch in KEEP_CONTROLS:
            continue
        cat = unicodedata.category(ch)
        if cat in STRIP_CATEGORIES or is_variation_selector(ch):
            dropped += 1
    return dropped

def clean_and_flag(text: str, source: str) -> str:
    stripped = count_stripped(text)
    if stripped:
        log.warning("stripped %d invisible codepoints from %s", stripped, source)
    return clean_for_prompt(text)
```

Now the log line asserts a cause it actually measured. If you don't care about the breakdown, at least say what you know (`"normalization changed length by %d codepoints"`) instead of blaming invisibles for a ligature expansion you did yourself.
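If you do want the breakdown, one possible shape is a per-codepoint tally instead of a single count. `stripped_breakdown` is a hypothetical helper that mirrors the strip rules used in this article; the constants are repeated so the block runs on its own.

```python
import unicodedata
from collections import Counter

STRIP_CATEGORIES = {"Cf", "Cc", "Co"}
KEEP_CONTROLS = {"\n", "\t", "\r"}

def stripped_breakdown(text: str) -> Counter:
    """Tally, by codepoint, what the strip pass would remove after NFKC."""
    text = unicodedata.normalize("NFKC", text)
    counts = Counter()
    for ch in text:
        if ch in KEEP_CONTROLS:
            continue
        cp = ord(ch)
        is_vs = 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF
        if unicodedata.category(ch) in STRIP_CATEGORIES or is_vs:
            counts[f"U+{cp:04X}"] += 1
    return counts

print(stripped_breakdown("a\u200bb\u200b\U000e0068\ufe0f"))
print(stripped_breakdown("\ufb01ne"))   # ligature only: NFKC changes length, nothing is stripped
```

A log line built from this can say which codepoints showed up, and a document full of tag characters reads very differently in an alert from one with a stray zero-width space.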

Three rules that fall out of this:

**Normalize at trust boundaries, not everywhere.** Do it where untrusted text becomes prompt text. Don't NFKC-mangle a user's actual message history if they're allowed to type in Arabic or use combining accents — you'll corrupt legitimate content and flatten distinctions they care about. The distinction is *channel*, not *string*.

**Never write a security filter against visible text.** If your blocklist, PII scrubber, or moderation check runs on the raw input, it's blind to smuggled variants; see `is_safe` above. Run detection on the *normalized* form, then decide what to forward.

**Assume your own system prompt could be a carrier.** If any part of your prompt is assembled from templates pulled from a CMS, a wiki, or a Git repo multiple people edit, audit those bytes once. Running `grep -rnP '[\x{200B}-\x{200F}\x{FE00}-\x{FE0F}\x{E0100}-\x{E01EF}\x{E0000}-\x{E007F}]' .` across the repo (GNU grep, UTF-8 locale) takes seconds and occasionally finds something a copy-paste dragged in.
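Where `grep -P` isn't available, the same audit can be sketched in Python. The ranges below copy the ones in the grep pattern; `scan_text`, `scan_tree`, and the default suffix list are hypothetical names and choices, and you would likely want to add the bidi embedding and isolate controls for a fuller sweep.

```python
import re
from pathlib import Path

# Same ranges as the grep pattern: zero-width/directional marks,
# both variation-selector blocks, and the tag block.
INVISIBLE = re.compile(
    "[\u200b-\u200f\ufe00-\ufe0f\U000e0000-\U000e007f\U000e0100-\U000e01ef]"
)

def scan_text(text: str):
    """Yield (line, column, 'U+XXXX') for each invisible codepoint, 1-indexed."""
    for line_no, line in enumerate(text.splitlines(), start=1):
        for m in INVISIBLE.finditer(line):
            yield line_no, m.start() + 1, f"U+{ord(m.group()):04X}"

def scan_tree(root: str, suffixes=(".md", ".txt", ".jinja")):
    """Yield (path, line, column, codepoint) for matching files under root."""
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            try:
                text = path.read_text(encoding="utf-8")
            except UnicodeDecodeError:
                continue   # skip binaries and non-UTF-8 files
            for hit in scan_text(text):
                yield (str(path), *hit)

print(list(scan_text("ok\nbad\u200bline\nx\U000e0068")))
# [(2, 4, 'U+200B'), (3, 2, 'U+E0068')]
```

Run as a pre-commit or CI step, this turns a one-time audit into a standing check on the template directory.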

And the answer to "should invisible formatting be allowed in prompt text?" is channel-specific, not universal. For machine-ingested instruction channels — system prompts, tool schemas, retrieved documents you treat as trusted — usually no. For user-visible multilingual content, often yes, because that's where these codepoints do legitimate work. Decide per channel; don't write one global rule and congratulate yourself.
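One way to make "decide per channel" explicit is a small policy table. Everything here is an illustrative assumption: the channel names, the `ChannelPolicy` fields, and the choice to fail on unknown channels. It also skips the `Cc` and `Co` handling from the fuller filter to stay short.

```python
import unicodedata
from dataclasses import dataclass

@dataclass(frozen=True)
class ChannelPolicy:
    normalization: str      # "NFKC" for instruction text, "NFC" to preserve distinctions
    strip_format: bool      # drop category Cf (ZWJ, bidi controls, tag chars)
    strip_variation: bool   # drop both variation-selector blocks

POLICIES = {
    "system_prompt": ChannelPolicy("NFKC", True, True),
    "retrieved_doc": ChannelPolicy("NFKC", True, True),
    "user_display": ChannelPolicy("NFC", False, False),
}

def apply_policy(text: str, channel: str) -> str:
    """Normalize `text` according to the policy registered for `channel`."""
    policy = POLICIES[channel]   # KeyError on an unknown channel: no silent default
    text = unicodedata.normalize(policy.normalization, text)
    kept = []
    for ch in text:
        cp = ord(ch)
        is_vs = 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF
        if policy.strip_format and unicodedata.category(ch) == "Cf":
            continue
        if policy.strip_variation and is_vs:
            continue
        kept.append(ch)
    return "".join(kept)

print(apply_policy("\uff21\u200b", "system_prompt"))   # A
```

Keeping the table in one place means a new channel has to state its policy before any text flows through it.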

Two things I won't overstate. First, I can't tell you Anthropic's motive, and I'd distrust anyone in that thread who claims certainty — "invisible characters appeared" is evidence of a mechanism, not of intent. It could be a tokenizer artifact, an accidental leak from an internal formatting layer, or deliberate tagging. All three produce the same bytes.

Second, stripping isn't free. Bidirectional scripts need the bidi controls. Some emoji sequences need ZWJ to render a family as one glyph instead of several separate people: nuke `\u200d` globally and you break the family emoji, which is 👨, ZWJ, 👩, ZWJ, 👧 under the hood. Some legitimate emoji presentation even leans on `U+FE0F` (the emoji variation selector), which the filter above strips. That's fine for an instruction channel and wrong for user-facing display text. That's exactly why "strip everything, everywhere" is the wrong lesson and "normalize deliberately at the boundary you designated untrusted" is the right one.
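To see what the stripping cost looks like in codepoints, this sketch takes a ZWJ family sequence and a heart with the emoji variation selector, then removes the invisible parts. The helper name `codepoints` is mine.

```python
family = "\U0001F468\u200d\U0001F469\u200d\U0001F467"   # man, ZWJ, woman, ZWJ, girl
heart = "\u2764\ufe0f"                                  # heavy black heart + VS16

def codepoints(s: str) -> list[str]:
    """List the codepoints of `s` in U+XXXX notation."""
    return [f"U+{ord(c):04X}" for c in s]

without_zwj = family.replace("\u200d", "")    # what a global ZWJ strip leaves behind
without_vs16 = heart.replace("\ufe0f", "")    # loses the request for emoji presentation

print(codepoints(family))       # ['U+1F468', 'U+200D', 'U+1F469', 'U+200D', 'U+1F467']
print(codepoints(without_zwj))  # ['U+1F468', 'U+1F469', 'U+1F467']
print(codepoints(without_vs16)) # ['U+2764']
```

After the strip, a renderer receives three unrelated emoji instead of one joined sequence, and the heart may fall back to its text presentation.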

The takeaway isn't *someone is watching you.* It's older and duller: **a string is a byte sequence wearing a costume, and any pipeline that trusts the costume has a hole where the bytes are.** The watermarking story will fade in a week. The trust-boundary bug it's pointing at has been in your codebase the whole time.

Go run `len()` against your eyes on one prompt template today. See which one wins.
