# Your LLM can't read. Here's the weird trick it uses instead

> Source: <https://dev.to/xplaination/your-llm-cant-read-heres-the-weird-trick-it-uses-instead-47bi>
> Published: 2026-06-13 01:38:02+00:00

Here's a fact that breaks people's mental model of large language models the first time they really sit with it:

**A language model never sees your words. Not one. It sees numbers — and only numbers.**

When you type `Hello, world` into ChatGPT, the model on the other end isn't reading English. By the time your text reaches the neural network, it's been chopped into chunks called **tokens** and each chunk has been swapped for an integer ID. The model is, underneath all the magic, a very expensive function that maps integers to integers. The "intelligence" is what happens in between.

Let's actually look at it.

``` python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the GPT-4 era tokenizer
ids = enc.encode("Hello, world")
print(ids)                       # -> [9906, 11, 1917]
print([enc.decode([i]) for i in ids])  # -> ['Hello', ',', ' world']
```

Three tokens. `Hello` is one. The comma is its own token. And `world`? It comes through as `' world'`, **with the leading space baked in.** That space is part of the token. This is not a rounding error; it's central to how the whole thing works.

A token is a frequent chunk of text. Not always a word, not always a letter — whatever the tokenizer found useful while it was trained on a mountain of text. Common words become single tokens. Rare words get shattered into pieces:

``` python
for word in ["playing", "tokenization", "antidisestablishmentarianism"]:
    print(word, "->", [enc.decode([i]) for i in enc.encode(word)])

# playing                      -> ['playing']
# tokenization                 -> ['token', 'ization']
# antidisestablishmentarianism -> ['ant', 'idis', 'establish', 'ment', 'arian', 'ism']
```

`playing` is so common it earns a single ID. `tokenization` splits into two. The long one gets diced into six. This is **Byte Pair Encoding**, an intimidating name for a refreshingly simple idea: start with individual characters (raw bytes, in practice), then repeatedly glue together the most common neighboring pair until you've built a vocabulary of roughly 50k–100k chunks. Frequent stuff ends up whole; rare stuff stays in pieces. Every model family ships with its own frozen vocabulary, which is why a token count from one tokenizer doesn't transfer to another.
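To make the "glue the most common pair" loop concrete, here is a toy sketch in plain Python. It is an illustration under simplifying assumptions, not how tiktoken is implemented: it works on characters instead of bytes, splits words on whitespace, and the helper names (`train_bpe`, `merge_pair`, `tokenize`) are made up for this example.

```python
from collections import Counter


def merge_pair(symbols: tuple, pair: tuple) -> tuple:
    """Replace every adjacent occurrence of `pair` in `symbols` with one glued symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(corpus: str, num_merges: int) -> list:
    """Learn up to `num_merges` merge rules from a whitespace-split toy corpus."""
    # Every word starts out as a tuple of single characters.
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # nothing left to merge
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the winning pair glued together.
        merged_words = Counter()
        for symbols, freq in words.items():
            merged_words[merge_pair(symbols, best)] += freq
        words = merged_words
    return merges


def tokenize(word: str, merges: list) -> list:
    """Apply the learned merges, in training order, to a new word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


corpus = "play play play play playing played player"
merges = train_bpe(corpus, num_merges=3)
print(tokenize("play", merges))     # -> ['play']
print(tokenize("playing", merges))  # -> ['play', 'i', 'n', 'g']
print(tokenize("zebra", merges))    # -> ['z', 'e', 'b', 'r', 'a']
```

The frequent word collapses into one chunk after three merges, while the unseen word stays in single characters, which is the same "frequent stuff whole, rare stuff in pieces" pattern described above.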

Here's the part that bites people in production: **you are billed in tokens, and your context window is measured in tokens — not characters, not words.** And tokens are sneakier than they look.

``` python
print(len(enc.encode("123456789")))   # -> 3   (numbers split oddly)
print(len(enc.encode("   ")))          # -> 1   (whitespace is real)
print(len(enc.encode("hello")))        # -> 1
print(len(enc.encode(" hello")))       # -> 1, but a DIFFERENT id than "hello"
```
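A likely reason for the three-token result on `123456789`: as far as I can tell, the published `cl100k_base` pre-tokenization pattern cuts digit runs into groups of at most three before any merging happens. The sketch below imitates only that one rule with a plain regex. `chunk_digits` is a hypothetical helper, not part of tiktoken, and other tokenizers may treat numbers differently.

```python
import re


def chunk_digits(text: str) -> list:
    """Greedy left-to-right groups of up to three ASCII digits (simplified stand-in)."""
    return re.findall(r"[0-9]{1,3}", text)


print(chunk_digits("123456789"))  # -> ['123', '456', '789']
print(chunk_digits("1234"))       # -> ['123', '4']
```

Because the grouping runs left to right, a four-digit number ends up as a three-digit chunk plus a lone digit, which is part of why arithmetic on long numbers can look odd from the model's side.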

A few consequences that trip people up:

- `"hello"` and `" hello"` are different tokens.
- The model sees `Q:` and `Q: ` (note the trailing space) as different starts.

The practical move: **count tokens before you send, not after you get the bill.** `len(enc.encode(prompt))` is the cheapest cost estimate you'll ever write, and it's also how you stop blowing past a context window at the worst possible moment.
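Here is one way that pre-flight check could look as a small helper. Treat it as a sketch: `check_budget` and `estimate_cost` are hypothetical names, the context limit and price are placeholder numbers you would replace with your provider's real figures, and the encoder is passed in as a function so the example runs without tiktoken installed.

```python
from typing import Callable, Sequence


def check_budget(
    prompt: str,
    encode: Callable[[str], Sequence],
    context_limit: int,
    reserved_for_reply: int = 0,
) -> dict:
    """Count prompt tokens and compare them against the space left in the window."""
    prompt_tokens = len(encode(prompt))
    available = context_limit - reserved_for_reply
    return {
        "prompt_tokens": prompt_tokens,
        "remaining": available - prompt_tokens,
        "fits": prompt_tokens <= available,
    }


def estimate_cost(num_tokens: int, usd_per_million_tokens: float) -> float:
    """Linear cost estimate; the price is an input, not a real quote."""
    return num_tokens / 1_000_000 * usd_per_million_tokens


# Stand-in encoder for the demo: whitespace split. This is NOT a real tokenizer.
report = check_budget("count before you send", str.split,
                      context_limit=10, reserved_for_reply=4)
print(report)  # -> {'prompt_tokens': 4, 'remaining': 2, 'fits': True}
```

With the `enc` object from earlier, you would pass `enc.encode` as the `encode` argument, so the count matches what that tokenizer actually produces.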

Almost every confusing LLM behavior has a tokenization fingerprint on it.

Once you can *see* the tokens, a lot of "why is the model doing that?" turns into "oh, of course it's doing that."

This is the first idea in a 10-part plain-English series I've been writing on how LLMs actually work under the hood: embeddings, attention, KV cache, quantization, RAG, the whole stack, no math degree required. If this scratched an itch, the full write-up with diagrams lives here: **How Language Becomes Numbers**.

Now I'm genuinely curious: **what's the weirdest tokenization edge case you've hit in production?** Emoji that exploded into six tokens, a regex that broke on token boundaries, a non-English prompt that quietly 3×'d your bill? Drop it in the comments — I collect these.
