Here's a fact that breaks people's mental model of large language models the first time they really sit with it:
A language model never sees your words. Not one. It sees numbers — and only numbers.
When you type "Hello, world" into ChatGPT, the model on the other end isn't reading English. By the time your text reaches the neural network, it's been chopped into chunks called tokens and each chunk has been swapped for an integer ID. The model is, underneath all the magic, a very expensive function that maps integers to integers. The "intelligence" is what happens in between.
Let's actually look at it.
import tiktoken
enc = tiktoken.get_encoding("cl100k_base") # the GPT-4 era tokenizer
ids = enc.encode("Hello, world")
print(ids) # -> [9906, 11, 1917]
print([enc.decode([i]) for i in ids]) # -> ['Hello', ',', ' world']
Three tokens. "Hello" is one. The comma is its own token. And "world"? It comes through as ' world', with the leading space baked in. That space is part of the token. This is not a rounding error; it's central to how the whole thing works.
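To make the "text in, integers out" idea concrete, here is a minimal sketch of a lookup-style tokenizer. It is not how tiktoken works internally (real BPE applies learned merge rules, not a longest-match search). The three IDs are copied from the output above; the entry for "world" without a space uses a made-up ID, and `toy_encode` and `toy_decode` are hypothetical names used only for illustration.

```python
# Toy vocabulary. The first three IDs come from the cl100k_base output above;
# 999_999 is a made-up ID for "world" with no leading space.
VOCAB = {"Hello": 9906, ",": 11, " world": 1917, "world": 999_999}
ID_TO_TEXT = {token_id: text for text, token_id in VOCAB.items()}


def toy_encode(text: str) -> list[int]:
    """Greedy longest-match lookup; raises if no vocabulary entry fits."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first so " world" wins over shorter pieces.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                pos = end
                break
        else:
            raise ValueError(f"no token covers {text[pos:]!r}")
    return ids


def toy_decode(ids: list[int]) -> str:
    """Map each ID back to its text chunk and concatenate."""
    return "".join(ID_TO_TEXT[i] for i in ids)


print(toy_encode("Hello, world"))     # [9906, 11, 1917]
print(toy_encode("world"))            # [999999]
print(toy_decode([9906, 11, 1917]))   # Hello, world
```

The same letters produce a different ID depending on whether a space comes first, which is the behavior the real tokenizer shows above.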
A token is a frequent chunk of text. Not always a word, not always a letter — whatever the tokenizer found useful while it was trained on a mountain of text. Common words become single tokens. Rare words get shattered into pieces:
for word in ["playing", "tokenization", "antidisestablishmentarianism"]:
    print(word, "->", [enc.decode([i]) for i in enc.encode(word)])
"playing" is so common it earns a single ID. "tokenization" splits into two. The long one gets diced into six. This is Byte Pair Encoding, an intimidating name for a refreshingly simple idea: start with single characters (raw bytes, in practice), then repeatedly glue together the most common neighboring pair until you've built a vocabulary of ~50k–100k chunks. Frequent stuff ends up whole; rare stuff stays in pieces. Every model ships with its own frozen vocabulary, which is why a token count from one model doesn't transfer to another.
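The merge loop is short enough to sketch in plain Python. This is a character-level toy under simplifying assumptions: production tokenizers work on bytes, pre-split the text first, and train on far more data. `train_bpe`, `apply_merges`, and the tiny corpus are all hypothetical, chosen only to show frequent words fusing while a rare one stays in pieces.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Return `symbols` with every adjacent occurrence of `pair` glued together."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Every word starts as a tuple of single characters.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count each neighboring pair, weighted by how often the word occurs.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # on ties, the first pair seen wins
        merges.append(best)
        # Rewrite every word with the winning pair merged.
        new_words = Counter()
        for symbols, freq in words.items():
            new_words[tuple(merge_pair(symbols, best))] += freq
        words = new_words
    return merges


def apply_merges(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


corpus = ["playing"] * 4 + ["play"] * 2 + ["zebra"]
merges = train_bpe(corpus, num_merges=6)

print(apply_merges("playing", merges))  # ['playing']
print(apply_merges("player", merges))   # ['play', 'e', 'r']
print(apply_merges("zebra", merges))    # ['z', 'e', 'b', 'r', 'a']
```

With only six merges, the frequent word collapses into one chunk, a related word reuses the "play" piece, and the rare word never gets merged at all.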
Here's the part that bites people in production: you are billed in tokens, and your context window is measured in tokens — not characters, not words. And tokens are sneakier than they look.
print(len(enc.encode("123456789"))) # -> 3 (numbers split oddly)
print(len(enc.encode(" "))) # -> 1 (whitespace is real)
print(len(enc.encode("hello"))) # -> 1
print(len(enc.encode(" hello"))) # -> 1, but a DIFFERENT id than "hello"
A few consequences that trip people up:
"hello" and " hello" are different tokens.
Likewise, "Q:" and a variant of "Q:" that differs only in whitespace read as different starts.

The practical move: count tokens before you send, not after you get the bill. len(enc.encode(prompt)) is the cheapest cost estimate you'll ever write, and it's also how you stop blowing past a context window at the worst possible moment.
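One way to turn that habit into a reusable check is sketched below. Everything here is an assumption for illustration: `check_budget` is a hypothetical helper, the price and window sizes are placeholder numbers, and `fake_encode` is a stand-in so the sketch runs with no dependencies. In real use you would pass enc.encode and your provider's actual limits and prices.

```python
from typing import Callable, Sequence


def check_budget(
    prompt: str,
    encode: Callable[[str], Sequence[int]],
    context_window: int,
    reserved_for_output: int,
    usd_per_million_input_tokens: float,
) -> dict:
    """Count tokens up front, then report estimated cost and whether it fits."""
    n_tokens = len(encode(prompt))
    return {
        "prompt_tokens": n_tokens,
        "estimated_input_cost_usd": n_tokens * usd_per_million_input_tokens / 1_000_000,
        # Leave room for the reply: prompt and output share the same window.
        "fits": n_tokens + reserved_for_output <= context_window,
    }


def fake_encode(text: str) -> list[int]:
    """Stand-in encoder: one 'token' per whitespace-separated word."""
    return list(range(len(text.split())))


report = check_budget(
    "count tokens before you send",
    encode=fake_encode,                # swap in enc.encode for real counts
    context_window=8,                  # placeholder limit
    reserved_for_output=4,             # placeholder reply budget
    usd_per_million_input_tokens=2.0,  # hypothetical price
)
print(report)
```

Passing the encoder in as a callable keeps the check independent of any one model's vocabulary, which matters because token counts do not transfer between models.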
Almost every confusing LLM behavior has a tokenization fingerprint on it.
Once you can see the tokens, a lot of "why is the model doing that?" turns into "oh, of course it's doing that."
This is the first idea in a 10-part plain-English series I've been writing on how LLMs actually work under the hood: embeddings, attention, KV cache, quantization, RAG, the whole stack, no math degree required. If this scratched an itch, the full write-up with diagrams lives here: How Language Becomes Numbers.
Now I'm genuinely curious: what's the weirdest tokenization edge case you've hit in production? Emoji that exploded into six tokens, a regex that broke on token boundaries, a non-English prompt that quietly 3×'d your bill? Drop it in the comments — I collect these.