[ARTICLE · art-25823] src=dev.to pub= topic=large-language-models verified=true sentiment=· neutral

Your LLM can't read. Here's the weird trick it uses instead

A developer explains that large language models never read text directly; instead, they process tokenized integers via Byte Pair Encoding. The post details how tokens—chunks of text like 'Hello' or ' world'—affect billing and context windows, and warns that common words may be single tokens while rare ones split into multiple pieces.

3 min read · published Jun 13, 2026

Here's a fact that breaks people's mental model of large language models the first time they really sit with it:

A language model never sees your words. Not one. It sees numbers — and only numbers.

When you type "Hello, world" into ChatGPT, the model on the other end isn't reading English. By the time your text reaches the neural network, it's been chopped into chunks called tokens and each chunk has been swapped for an integer ID. The model is, underneath all the magic, a very expensive function that maps integers to integers. The "intelligence" is what happens in between.

Let's actually look at it.

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the GPT-4 era tokenizer
ids = enc.encode("Hello, world")
print(ids)                       # -> [9906, 11, 1917]
print([enc.decode([i]) for i in ids])  # -> ['Hello', ',', ' world']

Three tokens. "Hello" is one. The comma is its own token. And "world"? It comes through as ' world' with the leading space baked in. That space is part of the token. This is not a rounding error; it's central to how the whole thing works.
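To make the baked-in space concrete, here is a toy sketch. The four-entry vocabulary and the toy_encode / toy_decode helpers are invented for illustration, and the greedy longest-match lookup is a simplified stand-in, not the merge procedure a real tokenizer runs.

```python
# Hypothetical four-entry vocabulary; real vocabularies hold tens of thousands of entries.
VOCAB = {"Hello": 0, ",": 1, " world": 2, "world": 3}
ID_TO_TEXT = {token_id: text for text, token_id in VOCAB.items()}


def toy_encode(text):
    """Greedy longest-match lookup (a stand-in for illustration, not real BPE)."""
    ids, pos = [], 0
    while pos < len(text):
        # Pick the longest vocabulary entry that matches at this position.
        candidates = [t for t in VOCAB if text.startswith(t, pos)]
        if not candidates:
            raise ValueError(f"no token covers {text[pos]!r}")
        match = max(candidates, key=len)
        ids.append(VOCAB[match])
        pos += len(match)
    return ids


def toy_decode(ids):
    # No rule for re-inserting spaces is needed: they live inside the tokens.
    return "".join(ID_TO_TEXT[i] for i in ids)


print(toy_encode("Hello, world"))              # -> [0, 1, 2]
print(toy_encode("Hello,world"))               # -> [0, 1, 3]
print(toy_decode(toy_encode("Hello, world")))  # -> Hello, world
```

Because the space travels inside the token, decoding is plain concatenation, and "world" with and without the leading space gets a different ID.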

A token is a frequent chunk of text. Not always a word, not always a letter — whatever the tokenizer found useful while it was trained on a mountain of text. Common words become single tokens. Rare words get shattered into pieces:

for word in ["playing", "tokenization", "antidisestablishmentarianism"]:
    print(word, "->", [enc.decode([i]) for i in enc.encode(word)])

"playing" is so common it earns a single ID. "tokenization" splits into two. The long one gets diced into six. This is Byte Pair Encoding, an intimidating name for a refreshingly simple idea: start with single characters (or raw bytes, as tiktoken's encodings do), then repeatedly glue together the most common neighboring pair until you've built a vocabulary of ~50k–100k chunks. Frequent stuff ends up whole; rare stuff stays in pieces. Every model ships with its own frozen vocabulary, which is why a token count from one model doesn't transfer to another.
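The glue-the-most-common-pair loop fits in a few lines. What follows is a minimal character-level sketch under simplifying assumptions: the train_bpe name, the tiny corpus, and the tie-breaking rule are all invented here, and production tokenizers work on bytes with a pre-splitting step this toy skips.

```python
from collections import Counter


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Each word starts as a tuple of single characters, weighted by how often it occurs.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of symbols, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)  # ties go to the pair counted first
        merges.append(best)
        # Rewrite each word with the winning pair glued into one symbol.
        merged_words = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_words[tuple(out)] += freq
        words = merged_words
    return merges, words


# Made-up corpus: "playing" and "play" are frequent, "zing" is rare.
merges, words = train_bpe(["playing"] * 6 + ["play"] * 3 + ["zing"], num_merges=6)
print(merges)
print(list(words))
```

After six merges on this corpus, "playing" and "play" have each collapsed into a single symbol, while the rare "zing" is still split into "z" and "ing": frequent stuff whole, rare stuff in pieces.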

Here's the part that bites people in production: you are billed in tokens, and your context window is measured in tokens — not characters, not words. And tokens are sneakier than they look.

print(len(enc.encode("123456789")))   # -> 3   (numbers split oddly)
print(len(enc.encode("   ")))          # -> 1   (whitespace is real)
print(len(enc.encode("hello")))        # -> 1
print(len(enc.encode(" hello")))       # -> 1, but a DIFFERENT id than "hello"

A few consequences that trip people up:

"hello" and " hello" are different tokens.

Trailing whitespace counts too: a prompt ending in "Q:" and one ending in "Q: " give the model different starts.

The practical move: count tokens before you send, not after you get the bill. len(enc.encode(prompt)) is the cheapest cost estimate you'll ever write, and it's also how you stop blowing past a context window at the worst possible moment.
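As a rough sketch of that pre-flight check: the helper names, the price, and the window size below are all hypothetical, so substitute your provider's real numbers. The helpers take a token count, which in practice would come from len(enc.encode(prompt)).

```python
def estimate_cost(prompt_tokens, usd_per_million_tokens):
    """Input cost in USD for a prompt of `prompt_tokens` tokens."""
    return prompt_tokens * usd_per_million_tokens / 1_000_000


def fits_context(prompt_tokens, context_window, reserved_for_output=0):
    """True if the prompt plus the room reserved for the reply fits in the window."""
    return prompt_tokens + reserved_for_output <= context_window


# All numbers here are made up for illustration.
prompt_tokens = 12_000  # in practice: len(enc.encode(prompt))
print(estimate_cost(prompt_tokens, usd_per_million_tokens=2.50))                    # -> 0.03
print(fits_context(prompt_tokens, context_window=8_192, reserved_for_output=1_000))  # -> False
```

Reserving room for the reply matters because the output shares the same window as the prompt.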

Almost every confusing LLM behavior has a tokenization fingerprint on it.

Once you can see the tokens, a lot of "why is the model doing that?" turns into "oh, of course it's doing that."

This is the first idea in a 10-part plain-English series I've been writing on how LLMs actually work under the hood: embeddings, attention, KV cache, quantization, RAG, the whole stack, no math degree required. If this scratched an itch, the full write-up with diagrams lives here: How Language Becomes Numbers.

Now I'm genuinely curious: what's the weirdest tokenization edge case you've hit in production? Emoji that exploded into six tokens, a regex that broke on token boundaries, a non-English prompt that quietly 3×'d your bill? Drop it in the comments — I collect these.
