Every AI company talks about tokens.
Yet very few people actually know what a token is.
At first, I assumed a token was simply another word for a word. It turns out that’s not true. Once I understood tokens, almost every confusing part of modern AI suddenly made sense.
· Why AI Models Use Tokens Instead of Words
· Understanding Vectors and Embeddings
· Why Does the Model Use Embeddings?
· How the Model Understands Relationships
· How the Model Uses These Relationships
· Input Tokens vs Output Tokens
· Why Different Models Count Tokens Differently
· Why Learning About Tokens Matters
A token is a small piece of text that an AI model processes.
Think of tokens as the LEGO bricks of language.
Humans read:
I love pizza.
The model reads something closer to:
[I][love][pizza][.]
Each piece becomes a token. A token may be a full word, part of a word, a punctuation mark, or even an emoji.
The important thing is that AI models don’t actually read words the way humans do. They read tokens.
Human language is incredibly messy. Consider the words below:
If an AI stored every possible word separately, its vocabulary would become unimaginably large. Instead, modern language models reuse smaller pieces. It’s very similar to building with LEGO.
Rather than manufacturing thousands of different castles, LEGO creates a relatively small collection of bricks that can be combined into almost anything.
Language models do exactly the same thing.
Instead of memorizing every possible word, they learn reusable pieces that can be combined in millions of different ways. This makes the vocabulary smaller, more efficient, and much better at handling words the model has never seen before.
The process of splitting text into tokens is called tokenization.
Imagine we type:
Machine learning is amazing.
The tokenizer might split it into:
[Machine][learn][ing][is][amaz][ing][.]
Each token is then converted into a numerical ID, and identical tokens always share the same ID.
Something like:
Machine -> 5342
learn -> 8912
ing -> 221
is -> 321
amaz -> 7641
ing -> 221
. -> 13
After tokenization, the model no longer processes raw text. Instead, it works with numerical representations of tokens. During inference, it operates on embeddings derived from those token IDs.
Once tokenization is complete, the text has effectively been transformed into numbers that the model can process mathematically.
Those token IDs are then mapped to vectors, called embeddings, allowing the model to recognize relationships and patterns between tokens during both training and inference.
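To make the pipeline concrete, here is a minimal Python sketch of the tokenization step. The vocabulary and IDs are the illustrative ones from the example above, and the greedy longest-match rule is an assumption chosen for simplicity. Real tokenizers (byte-pair encoding, for instance) learn their pieces from data and may split the same text quite differently.

```python
# Toy vocabulary: pieces and IDs are illustrative, not from any real model.
VOCAB = {"Machine": 5342, "learn": 8912, "ing": 221, "is": 321, "amaz": 7641, ".": 13}


def tokenize(text, vocab):
    """Split text into vocabulary pieces using greedy longest-match."""
    tokens = []
    for word in text.split():
        start = 0
        while start < len(word):
            # Try the longest remaining piece first, then shrink it.
            for end in range(len(word), start, -1):
                piece = word[start:end]
                if piece in vocab:
                    tokens.append(piece)
                    start = end
                    break
            else:
                raise ValueError(f"no vocabulary piece matches {word[start:]!r}")
    return tokens


tokens = tokenize("Machine learning is amazing.", VOCAB)
ids = [VOCAB[t] for t in tokens]
print(tokens)  # ['Machine', 'learn', 'ing', 'is', 'amaz', 'ing', '.']
print(ids)     # [5342, 8912, 221, 321, 7641, 221, 13]
```

Notice that both occurrences of "ing" map to the same ID. That reuse is what keeps the vocabulary small.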
A token ID is just a label. To understand meaning, the model needs something much richer.
Now let’s slow down and unpack this idea, because it’s one of the most important concepts in AI.
A vector is simply an ordered list of numbers.
For example:
[0.2, -1.5, 3.1, 0.7]
Instead of representing a token with just one number (like ID = 5342), the model represents it using hundreds or even thousands of numbers together.
This list of numbers is called an embedding vector, or simply an embedding.
Models use embeddings because a single number can identify a token, but it cannot capture its meaning. During training, the model gradually adjusts these vectors so that words appearing in similar contexts end up close together.
The model isn’t explicitly told that cat and dog are related. Instead, it learns that relationship by seeing billions of examples during training.
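A small sketch can show what "close together" means in practice. The token IDs and three-number vectors below are hand-picked for illustration (real embeddings are learned during training and have hundreds or thousands of dimensions), and cosine similarity is one common way to measure closeness, assumed here rather than taken from any particular model.

```python
import math

# Hypothetical embedding table: token ID -> vector. Values are made up.
EMBEDDINGS = {
    101: [0.9, 0.8, 0.1],  # "cat"
    102: [0.8, 0.9, 0.2],  # "dog"
    103: [0.1, 0.2, 0.9],  # "car"
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


cat_dog = cosine_similarity(EMBEDDINGS[101], EMBEDDINGS[102])
cat_car = cosine_similarity(EMBEDDINGS[101], EMBEDDINGS[103])
print(f"cat vs dog: {cat_dog:.2f}")
print(f"cat vs car: {cat_car:.2f}")
```

With these made-up values, "cat" scores much closer to "dog" than to "car", which is the kind of relationship training is meant to produce.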
Each number in the vector captures a small aspect of meaning. You can think of it like describing a word from many different angles:
You don’t see these directly, but the model learns them during training.
Embeddings tell the model what each token means. Attention tells the model which earlier tokens are most important while predicting the next one.
Together, embeddings and attention allow the model to understand both the meaning of individual tokens and how they relate to one another within a sentence.
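For the curious, here is a heavily simplified sketch of how attention can turn vectors into importance weights. It assumes scaled dot-product scoring followed by a softmax, with hand-picked two-number vectors. Real models learn separate query, key, and value projections, so treat this as an illustration of the idea rather than an actual model.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys):
    """Score each key against the query (scaled dot product), then normalize."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)


# Hypothetical vectors for the earlier tokens in "The capital of France is".
earlier_tokens = ["The", "capital", "of", "France"]
keys = [[0.1, 0.0], [0.9, 0.2], [0.0, 0.1], [1.0, 0.8]]
query = [1.0, 1.0]  # made-up query for the position being predicted

weights = attention_weights(query, keys)
for token, weight in zip(earlier_tokens, weights):
    print(f"{token}: {weight:.2f}")
```

With these values, "France" receives the largest weight, followed by "capital", while "The" and "of" contribute little.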
Now imagine every word placed somewhere in a huge invisible map.
Words with similar meanings end up close to each other:
Instead of memorizing entire sentences, the model learns how words relate to one another based on where their embeddings are located on this map.
When the model reads a sentence, it doesn’t just look at individual words — it looks at how they relate to each other. It recognizes patterns it has learned during training.
Example 1: Food Prediction
I am eating ___
The model has seen many sentences like this during training.
Words like pizza, rice, and burger often appear after "I am eating".
So the model predicts something related to food.
Example 2: Country Knowledge
The capital of France is ___
The model has learned that France is strongly connected to Paris.
So it predicts Paris.
Example 3: Everyday Pattern
She is drinking ___
The model has seen patterns like drinking water, drinking juice, and drinking coffee.
So it predicts something that fits naturally after drinking.
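The three examples above can be mimicked with a toy predictor that simply counts which word follows which. This is only a sketch of the "learn patterns, then predict" idea. The tiny corpus is invented, and real language models use neural networks over tokens rather than count tables over words.

```python
from collections import Counter, defaultdict

# Invented training sentences, standing in for billions of real examples.
corpus = [
    "she is drinking water",
    "he is drinking water",
    "she is drinking juice",
    "i am eating pizza",
    "i am eating rice",
    "i am eating pizza",
]


def build_model(sentences):
    """Count how often each word follows each other word."""
    follow = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            follow[prev][nxt] += 1
    return follow


def predict(model, word):
    """Return the probability of each word seen after `word`."""
    counts = model[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


model = build_model(corpus)
probs = predict(model, "drinking")
best = max(probs, key=probs.get)
print(best)  # water
```

The predictor knows nothing about grammar or thirst. It picks "water" only because that continuation appeared most often.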
Key Idea
The AI model does not understand grammar the way humans do.
It is simply:
That’s how it works.
Imagine you’re traveling abroad. You don’t speak the local language. You rely on a translator.
Similarly, when you write a sentence, the model cannot understand the words directly. So first, the sentence is broken into small pieces. Then each piece is turned into numbers.
The model only works with those numbers.
Most beginners ignore tokens. Then they hit an AI limit and get confused.
Suppose a model has a context window of 128,000 tokens. That means everything must fit inside that limit.
Once the limit is exceeded, older information begins to fall out of the context window. This is why the model may seem to forget something mentioned earlier. In reality, it isn’t forgetting; it simply no longer has that information available in its working memory.
You can think of a context window as the model’s working memory. Imagine a whiteboard. Every token written on the whiteboard takes up space.
Small prompts use little space. Large documents use lots of space. Eventually the whiteboard becomes full. When that happens, older information has to be erased to make room for new tokens.
That’s essentially how context windows work.
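Here is a rough sketch of the whiteboard idea in Python. It assumes a chat application that drops the oldest messages first, and it counts tokens by splitting on spaces, which is only a stand-in for a real tokenizer. Actual applications vary: some reject oversized requests, and others summarize old messages instead of dropping them.

```python
def count_tokens(text):
    """Stand-in token counter: one token per whitespace-separated piece."""
    return len(text.split())


def trim_history(messages, max_tokens):
    """Keep the most recent messages that fit within max_tokens."""
    kept = []
    used = 0
    # Walk backwards from the newest message, stopping when space runs out.
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = ["My name is Ada", "I like tea", "What is my name"]
visible = trim_history(history, max_tokens=8)
print(visible)  # ['I like tea', 'What is my name']
```

The first message no longer fits, so the model would be asked "What is my name" without ever seeing the answer. That is the "forgetting" described above.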
People often ask:
How many words equal one token?
There isn’t an exact answer because tokenization varies. For English, a useful approximation is:
Different languages can produce very different token counts. Languages with complex scripts or longer words may require more tokens than English, or occasionally fewer.
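If you only need a ballpark figure, a tiny estimator is enough. The sketch below assumes the commonly quoted rule of thumb of roughly four characters per token for English text. That ratio is an assumption, not an exact property of any model, and it can be far off for other languages, code, or unusual text. For real numbers, use the tokenizer that ships with the model you are calling.

```python
import math

CHARS_PER_TOKEN = 4  # assumed rule of thumb for English, not an exact value


def estimate_tokens(text):
    """Roughly estimate the token count of English text from its length."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


prompt = "Explain machine learning in simple terms."
print(estimate_tokens(prompt))
```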
When using an AI API, you’ll usually see two separate token counts.
Input Tokens
Everything you send to the model. Example:
Explain machine learning in simple terms.
These are input tokens.
Output Tokens
Everything generated by the model. Example:
Machine learning is a method that allows computers to learn from data instead of being explicitly programmed…
These are output tokens.
Most providers calculate usage using both values.
Almost every AI provider charges per token, because tokens are a practical approximation of the computational work required to process a request.
For example,
The larger the token count, the more processing the model performs.
A short question with a brief answer costs very little.
Asking the model to summarize a 100-page document or write a detailed report requires significantly more computation, so it consumes more tokens and costs more.
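The arithmetic behind a bill is simple enough to sketch. The prices below are made up for illustration, and the split into separate input and output rates is an assumption about how a provider might charge. Check your provider’s pricing page for real figures.

```python
def request_cost(input_tokens, output_tokens,
                 input_price_per_million, output_price_per_million):
    """Return the cost of one request given per-million-token prices."""
    total = (input_tokens * input_price_per_million
             + output_tokens * output_price_per_million)
    return total / 1_000_000


# Hypothetical prices in dollars per million tokens.
INPUT_PRICE = 1.00
OUTPUT_PRICE = 4.00

short_question = request_cost(50, 200, INPUT_PRICE, OUTPUT_PRICE)
long_document = request_cost(60_000, 2_000, INPUT_PRICE, OUTPUT_PRICE)
print(f"short question: ${short_question:.5f}")
print(f"long document:  ${long_document:.5f}")
```

Under these assumed prices, the long document costs about eighty times more than the short question, purely because of the token counts.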
Here’s something many beginners don’t realize. There is no universal tokenizer. Each AI model uses its own tokenization strategy and vocabulary.
The exact same sentence might become:
This is why token counts differ across models such as:
Even though the sentence is identical, each tokenizer splits the text differently based on how it was designed and trained.
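To see how the same sentence can yield different counts, here are two deliberately simple tokenizers. Both are hypothetical and far cruder than anything a real model uses. The point is only that a different splitting strategy gives a different number of tokens.

```python
import re


def word_tokenizer(text):
    """Split into whole words and individual punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def chunk_tokenizer(text, size=3):
    """Split into fixed-size character chunks, spaces included."""
    return [text[i:i + size] for i in range(0, len(text), size)]


sentence = "Machine learning is amazing."
words = word_tokenizer(sentence)
chunks = chunk_tokenizer(sentence)
print(len(words), words)    # 5 tokens
print(len(chunks), chunks)  # 10 tokens
```

The sentence is identical, yet one tokenizer produces 5 tokens and the other 10.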
One token equals one word
Not always. A token may represent a full word, part of a word, punctuation, or even an emoji.
The AI model reads English
Not directly. The model processes numerical token IDs, not human-readable text.
All models count tokens the same way
Each model uses its own tokenizer, so token counts vary.
Understanding tokens helps explain many common AI behaviors.
Once you know what tokens are, you’ll better understand:
In many ways, tokens are the foundation of every modern Large Language Model.
Tokens are the first concept every AI learner should master.
Once you understand them, ideas like embeddings, context windows, prompt engineering, and AI pricing stop feeling mysterious.
In many ways, every conversation with an AI model begins with a token. Tokens are the building blocks of language for AI.
Every question you ask, every document you upload, and every response you receive is ultimately broken into tokens, converted into numbers, and processed mathematically. Those numbers are then mapped to embeddings, allowing the model to capture meaning and relationships between words.
The model doesn’t think the way humans do. Instead, it learns patterns from vast amounts of data and uses those patterns to predict what comes next.
Understanding tokens may seem like a small step, but it’s the foundation for understanding how modern AI models really work.
What Is a Token? ChatGPT’s Smallest Building Block Explained Simply was originally published in Towards AI on Medium.