Your LLM can't read. Here's the weird trick it uses instead

A developer explains that large language models never read text directly; instead, they process tokenized integers via Byte Pair Encoding. The post details how tokens—chunks of text like 'Hello' or ' world'—affect billing and context windows, and warns that common words may be single tokens while rare ones split into multiple pieces.

Here's a fact that breaks people's mental model of large language models the first time they really sit with it: a language model never sees your words. Not one. It sees numbers, and only numbers.

When you type `Hello, world` into ChatGPT, the model on the other end isn't reading English. By the time your text reaches the neural network, it's been chopped into chunks called tokens and each chunk has been swapped for an integer ID. The model is, underneath all the magic, a very expensive function that maps integers to integers. The "intelligence" is what happens in between.

Let's actually look at it.

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the GPT-4 era tokenizer

ids = enc.encode("Hello, world")
print(ids)                             # -> [9906, 11, 1917]
print([enc.decode([i]) for i in ids])  # -> ['Hello', ',', ' world']
```

Three tokens. `Hello` is one. The comma is its own token. And `world`? It comes through as `' world'`, with the leading space baked in. That space is part of the token. This is not a rounding error; it's central to how the whole thing works.

A token is a frequent chunk of text. Not always a word, not always a letter: whatever the tokenizer found useful while it was trained on a mountain of text. Common words become single tokens. Rare words get shattered into pieces:

```python
for word in ["playing", "tokenization", "antidisestablishmentarianism"]:
    print(word, "->", [enc.decode([i]) for i in enc.encode(word)])

# playing -> ['playing']
# tokenization -> ['token', 'ization']
# antidisestablishmentarianism -> ['ant', 'idis', 'establish', 'ment', 'arian', 'ism']
```

`playing` is so common it earns a single ID. `tokenization` splits into two. The long one gets diced into six.

This is Byte Pair Encoding, an intimidating name for a refreshingly simple idea: start with characters, then repeatedly glue together the most common neighboring pair until you've built a vocabulary of ~50k–100k chunks. Frequent stuff ends up whole; rare stuff stays in pieces.
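
To make that merge loop concrete, here's a toy sketch of the idea. It is not the code tiktoken or any production tokenizer runs: it works on characters rather than bytes, skips the pre-splitting real tokenizers do, and the function names (`merge_pair`, `learn_bpe_merges`, `tokenize`) and the tiny corpus are made up for illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Glue each left-to-right occurrence of `pair` in `symbols` into one symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Each distinct word starts as a tuple of single characters, with its count.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by how often the word occurs.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)  # ties: first pair seen wins
        merges.append(best)
        # Rewrite every word with the winning pair glued together.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def tokenize(word, merges):
    """Apply the learned merges to a new word, in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical corpus: "play" words are frequent, "token" is rare.
corpus = ["play"] * 10 + ["playing"] * 6 + ["player"] * 3 + ["token"] * 2
merges = learn_bpe_merges(corpus, num_merges=6)

print(tokenize("playing", merges))  # -> ['playing']
print(tokenize("player", merges))   # -> ['play', 'e', 'r']
print(tokenize("tokens", merges))   # -> ['t', 'o', 'k', 'e', 'n', 's']
```

Same pattern as the real tokenizer, just in miniature: with only six merges to spend, the frequent word ends up whole and the rare one stays in pieces.
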
Every model ships with its own frozen vocabulary, which is why a token count from one model doesn't transfer to another.

Here's the part that bites people in production: you are billed in tokens, and your context window is measured in tokens, not characters, not words. And tokens are sneakier than they look.

```python
print(len(enc.encode("123456789")))  # -> 3  numbers split oddly
print(len(enc.encode(" ")))          # -> 1  whitespace is real
print(len(enc.encode("hello")))      # -> 1
print(len(enc.encode(" hello")))     # -> 1, but a DIFFERENT id than "hello"
```

A few consequences that trip people up:

- `"hello"` and `" hello"` are different tokens.
- The model sees `Q:` and `Q: ` (with a trailing space) as different starts.

The practical move: count tokens before you send, not after you get the bill. `len(enc.encode(prompt))` is the cheapest cost estimate you'll ever write, and it's also how you stop blowing past a context window at the worst possible moment.

Almost every confusing LLM behavior has a tokenization fingerprint on it. Once you can see the tokens, a lot of "why is the model doing that?" turns into "oh, of course it's doing that."

This is the first idea in a 10-part plain-English series I've been writing on how LLMs actually work under the hood: embeddings, attention, KV cache, quantization, RAG, the whole stack, no math degree required. If this scratched an itch, the full write-up with diagrams lives here: How Language Becomes Numbers.

Now I'm genuinely curious: what's the weirdest tokenization edge case you've hit in production? Emoji that exploded into six tokens, a regex that broke on token boundaries, a non-English prompt that quietly 3×'d your bill? Drop it in the comments. I collect these.