{"slug": "your-llm-can-t-read-here-s-the-weird-trick-it-uses-instead", "title": "Your LLM can't read. Here's the weird trick it uses instead", "summary": "A developer explains that large language models never read text directly; instead, they process tokenized integers via Byte Pair Encoding. The post details how tokens—chunks of text like 'Hello' or ' world'—affect billing and context windows, and warns that common words may be single tokens while rare ones split into multiple pieces.", "body_md": "Here's a fact that breaks people's mental model of large language models the first time they really sit with it:\n\n**A language model never sees your words. Not one. It sees numbers — and only numbers.**\n\nWhen you type `Hello, world` into ChatGPT, the model on the other end isn't reading English. By the time your text reaches the neural network, it's been chopped into chunks called **tokens** and each chunk has been swapped for an integer ID. The model is, underneath all the magic, a very expensive function that maps integers to integers. The \"intelligence\" is what happens in between.\n\nLet's actually look at it.\n\n``` python\n# pip install tiktoken\nimport tiktoken\n\nenc = tiktoken.get_encoding(\"cl100k_base\")  # the GPT-4 era tokenizer\nids = enc.encode(\"Hello, world\")\nprint(ids)                       # -> [9906, 11, 1917]\nprint([enc.decode([i]) for i in ids])  # -> ['Hello', ',', ' world']\n```\n\nThree tokens. `Hello` is one. The comma is its own token. And `world`? It comes through as `' world'` — **with the leading space baked in.** That space is part of the token. This is not a rounding error; it's central to how the whole thing works.\n\nA token is a frequent chunk of text. Not always a word, not always a letter — whatever the tokenizer found useful while it was trained on a mountain of text. Common words become single tokens. 
Rare words get shattered into pieces:\n\n``` python\nfor word in [\"playing\", \"tokenization\", \"antidisestablishmentarianism\"]:\n    print(word, \"->\", [enc.decode([i]) for i in enc.encode(word)])\n\n# playing                      -> ['playing']\n# tokenization                 -> ['token', 'ization']\n# antidisestablishmentarianism -> ['ant', 'idis', 'establish', 'ment', 'arian', 'ism']\n```\n\n`playing` is so common it earns a single ID. `tokenization` splits into two. The long one gets diced into six. This is **Byte Pair Encoding** — an intimidating name for a refreshingly simple idea: start with characters, then repeatedly glue together the most common neighboring pair until you've built a vocabulary of ~50k–100k chunks. Frequent stuff ends up whole; rare stuff stays in pieces. Every model ships with its own frozen vocabulary, which is why a token count from one model doesn't transfer to another.\n\nHere's the part that bites people in production: **you are billed in tokens, and your context window is measured in tokens — not characters, not words.** And tokens are sneakier than they look.\n\n``` python\nprint(len(enc.encode(\"123456789\")))   # -> 3   (numbers split oddly)\nprint(len(enc.encode(\"   \")))          # -> 1   (whitespace is real)\nprint(len(enc.encode(\"hello\")))        # -> 1\nprint(len(enc.encode(\" hello\")))       # -> 1, but a DIFFERENT id than \"hello\"\n```\n\nA few consequences that trip people up:\n\n`\"hello\"` and `\" hello\"` are different tokens.\n\n`Q:` and `Q:` as different starts.\n\nThe practical move: **count tokens before you send, not after you get the bill.** `len(enc.encode(prompt))` is the cheapest cost estimate you'll ever write, and it's also how you stop blowing past a context window at the worst possible moment.\n\nAlmost every confusing LLM behavior has a tokenization fingerprint on it.\n\nOnce you can *see* the tokens, a lot of \"why is the model doing that?\" turns into \"oh, of course it's doing 
that.\"\n\nThis is the first idea in a 10-part plain-English series I've been writing on how LLMs actually work under the hood — embeddings, attention, KV cache, quantization, RAG, the whole stack, no math degree required. If this scratched an itch, the full write-up with diagrams lives here: **How Language Becomes Numbers**.\n\nNow I'm genuinely curious: **what's the weirdest tokenization edge case you've hit in production?** Emoji that exploded into six tokens, a regex that broke on token boundaries, a non-English prompt that quietly 3×'d your bill? Drop it in the comments — I collect these.", "url": "https://wpnews.pro/news/your-llm-can-t-read-here-s-the-weird-trick-it-uses-instead", "canonical_source": "https://dev.to/xplaination/your-llm-cant-read-heres-the-weird-trick-it-uses-instead-47bi", "published_at": "2026-06-13 01:38:02+00:00", "updated_at": "2026-06-13 02:17:40.794153+00:00", "lang": "en", "topics": ["large-language-models", "natural-language-processing", "developer-tools"], "entities": ["OpenAI", "ChatGPT", "tiktoken", "Byte Pair Encoding"], "alternates": {"html": "https://wpnews.pro/news/your-llm-can-t-read-here-s-the-weird-trick-it-uses-instead", "markdown": "https://wpnews.pro/news/your-llm-can-t-read-here-s-the-weird-trick-it-uses-instead.md", "text": "https://wpnews.pro/news/your-llm-can-t-read-here-s-the-weird-trick-it-uses-instead.txt", "jsonld": "https://wpnews.pro/news/your-llm-can-t-read-here-s-the-weird-trick-it-uses-instead.jsonld"}}