Your LLM can't read. Here's the weird trick it uses instead


[ARTICLE · art-25823] src=dev.to ↗ pub=2026-06-13T01:38Z topic=large-language-models verified=true sentiment=· neutral

A developer explains that large language models never read text directly; instead, they process tokenized integers via Byte Pair Encoding. The post details how tokens—chunks of text like 'Hello' or ' world'—affect billing and context windows, and warns that common words may be single tokens while rare ones split into multiple pieces.

3 min read · 21 views · published Jun 13, 2026

Here's a fact that breaks people's mental model of large language models the first time they really sit with it:

A language model never sees your words. Not one. It sees numbers — and only numbers.

When you type "Hello, world" into ChatGPT, the model on the other end isn't reading English. By the time your text reaches the neural network, it's been chopped into chunks called tokens and each chunk has been swapped for an integer ID. The model is, underneath all the magic, a very expensive function that maps integers to integers. The "intelligence" is what happens in between.

Let's actually look at it.

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the GPT-4 era tokenizer
ids = enc.encode("Hello, world")
print(ids)                       # -> [9906, 11, 1917]
print([enc.decode([i]) for i in ids])  # -> ['Hello', ',', ' world']

Three tokens. "Hello" is one. The comma is its own token. And "world"? It comes through as ' world', with the leading space baked in. That space is part of the token. This is not a rounding error; it's central to how the whole thing works.
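One way to see why the space has to live inside the token: decoding is plain concatenation, with nothing inserted between pieces. The sketch below is a hand-written three-entry table built from the IDs printed above, not the real tiktoken vocabulary, and the names PIECES and decode are made up for illustration.

```python
# Hand-copied excerpt of an id -> text table, using only the three IDs
# shown in the output above. A real vocabulary has on the order of 100k entries.
PIECES = {9906: "Hello", 11: ",", 1917: " world"}


def decode(ids):
    """Join token texts back to back; no separator is ever added."""
    return "".join(PIECES[i] for i in ids)


print(decode([9906, 11, 1917]))  # Hello, world
print(decode([9906, 1917]))      # Hello world
```

If the space were not stored in the token, the decoder would have no way to know where the gaps between words go.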

A token is a frequent chunk of text. Not always a word, not always a letter — whatever the tokenizer found useful while it was trained on a mountain of text. Common words become single tokens. Rare words get shattered into pieces:

for word in ["playing", "tokenization", "antidisestablishmentarianism"]:
    print(word, "->", [enc.decode([i]) for i in enc.encode(word)])

"playing" is so common it earns a single ID. "tokenization" splits into two. The long one gets diced into six. This is Byte Pair Encoding, an intimidating name for a refreshingly simple idea: start with the smallest units (single characters, or raw bytes in tiktoken's case), then repeatedly glue together the most common neighboring pair until the vocabulary reaches its target size, roughly 50k chunks for GPT-2-era tokenizers and about 100k for cl100k_base. Frequent stuff ends up whole; rare stuff stays in pieces. Each model family ships with its own frozen vocabulary, which is why a token count from one model doesn't reliably transfer to another.
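The merge loop described above fits in a few lines. What follows is a toy, character-level sketch under simplifying assumptions: production tokenizers work on bytes, pre-split the text with a regex, and learn tens of thousands of merges. The function names train_bpe, merge_pair, and encode, the tiny corpus, and the tie-breaking rule are all invented for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one glued symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Every word starts as a tuple of single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken by the pair itself
        # so the result is deterministic (an arbitrary choice for this toy).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def encode(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


corpus = ["playing"] * 4 + ["play"] * 3 + ["zip"]
merges = train_bpe(corpus, num_merges=6)
print(encode("playing", merges))  # ['playing']
print(encode("zip", merges))      # ['z', 'i', 'p']
```

The frequent word collapses into one piece after six merges, while the rare one never gets a merge of its own and stays as single characters. That is the same frequent-whole, rare-in-pieces behavior seen in the tiktoken output, at a much smaller scale.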

Here's the part that bites people in production: you are billed in tokens, and your context window is measured in tokens — not characters, not words. And tokens are sneakier than they look.

print(len(enc.encode("123456789")))   # -> 3   (numbers split oddly)
print(len(enc.encode("   ")))          # -> 1   (whitespace is real)
print(len(enc.encode("hello")))        # -> 1
print(len(enc.encode(" hello")))       # -> 1, but a DIFFERENT id than "hello"

A few consequences that trip people up:

"hello" and " hello" are different tokens.

Prompt prefixes behave the same way: the model sees "Q:" and a whitespace variant of "Q:" as different starts.

The practical move: count tokens before you send, not after you get the bill. len(enc.encode(prompt)) is the cheapest cost estimate you'll ever write, and it's also how you stop blowing past a context window at the worst possible moment.
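A pre-flight check along those lines might look like the sketch below. Everything in it is an assumption for illustration: the function name check_budget, its parameters, and the price and context-window numbers are placeholders, not any provider's real figures. The encoder is passed in as a callable, so in real use you would hand it enc.encode; the demo uses a whitespace splitter purely as a stand-in so the snippet runs without extra dependencies.

```python
def check_budget(prompt, encode, context_window, max_output_tokens,
                 usd_per_million_input):
    """Return (n_tokens, fits, estimated_input_cost_usd) for one prompt.

    `encode` is any callable that turns text into a sequence of tokens,
    for example `enc.encode` from tiktoken.
    """
    n_tokens = len(encode(prompt))
    # Leave room for the reply: prompt and output share the same window.
    fits = n_tokens + max_output_tokens <= context_window
    cost = n_tokens / 1_000_000 * usd_per_million_input
    return n_tokens, fits, cost


# Stand-in encoder only: whitespace-separated words are NOT real tokens.
toy_encode = str.split

n, fits, cost = check_budget(
    "count tokens before you send",
    toy_encode,
    context_window=8192,        # placeholder limit
    max_output_tokens=1024,     # placeholder reply budget
    usd_per_million_input=1.0,  # placeholder price
)
print(n, fits, cost)
```

Treat the result as an estimate. Chat-style APIs typically add a few tokens of formatting overhead per message, so the billed count can come out slightly higher than the raw prompt count.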

Almost every confusing LLM behavior has a tokenization fingerprint on it. Once you can see the tokens, a lot of "why is the model doing that?" turns into "oh, of course it's doing that."

This is the first idea in a 10-part plain-English series I've been writing on how LLMs actually work under the hood: embeddings, attention, KV cache, quantization, RAG, the whole stack, no math degree required. If this scratched an itch, the full write-up with diagrams lives here: How Language Becomes Numbers.

Now I'm genuinely curious: what's the weirdest tokenization edge case you've hit in production? Emoji that exploded into six tokens, a regex that broke on token boundaries, a non-English prompt that quietly 3×'d your bill? Drop it in the comments — I collect these.


Read original on dev.to → dev.to/xplaination/your-llm-cant-read-heres-the-…
