Tokenization under the hood: BPE, WordPiece, SentencePiece, and Unigram compared

You deploy a chatbot. English queries average 42 tokens each. Then a Spanish-speaking user sends "¿Cómo puedo restablecer mi contraseña?" and it eats 103 tokens. Two weeks later, the same model starts outputting "Ġcon" at the edges of its generations and you cannot tell if it is a bug or a feature. The finance team flags a 40% month-over-month cost increase that no one can explain.

This is what happens when tokenization is treated as invisible plumbing. Every major LLM pipeline uses one of four subword tokenization algorithms, and the choice determines vocabulary size, handling of rare words, cross-language efficiency, and inference cost. Understanding which one your model uses -- and why -- is the difference between shipping a cost-efficient product and discovering mid-quarter that your token-per-query ratio quietly doubled.

Tokenization directly controls three things that hit your bottom line:

Inference cost. LLM APIs charge by token. A model using a 32K-vocab BPE tokenizer may break "restablecer" into 8 tokens, while a 100K-vocab Unigram tokenizer handles it in 3. Over a million queries, that difference adds up to real money.

Vocabulary coverage. Rare words, code syntax, and multilingual text stress the tokenizer. A poorly fitting vocabulary means longer sequences, which means slower generation and higher cost.

Model behavior. The tokenizer is the model's entire view of language. If your tokenizer encodes "cowboy" as ["cow", "boy"], the model learns something different than if it encodes it as ["c", "owb", "oy"]. This affects everything from spelling ability to cross-lingual transfer.
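To make the cost point concrete, here is a back-of-the-envelope sketch. The per-query token counts are the ones from the opening scenario; the price per million tokens is a made-up placeholder, not a real price list.

```python
def monthly_token_cost(queries: int, tokens_per_query: float,
                       usd_per_million_tokens: float) -> float:
    """Return the total input-token cost in USD for a batch of queries."""
    return queries * tokens_per_query * usd_per_million_tokens / 1_000_000

# Hypothetical inputs for illustration only.
QUERIES = 1_000_000
PRICE = 2.50  # USD per million input tokens (assumed placeholder)

english = monthly_token_cost(QUERIES, 42, PRICE)
spanish = monthly_token_cost(QUERIES, 103, PRICE)
print(f"42 tokens/query:  ${english:,.2f}")
print(f"103 tokens/query: ${spanish:,.2f}")
print(f"Ratio: {spanish / english:.2f}x")
```

The absolute numbers depend entirely on the assumed price. The ratio does not, which is why tokens per query is the figure worth tracking.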

Every modern tokenizer takes raw text, optionally pre-tokenizes it into words (splitting on whitespace and punctuation), then breaks words into subword units from a fixed-size vocabulary. The difference is in how that vocabulary is built and how segmentation decisions are made.

BPE was introduced in 1994 for data compression and adapted for neural machine translation by Sennrich et al. in 2016. OpenAI adopted it for GPT-2 and it remains the core of GPT-4o, Llama 3, and most modern LLMs.

How it works: Start with every individual character as a token. Count all adjacent token pairs, merge the most frequent pair into a new token, add it to the vocabulary, and repeat until you hit the target vocabulary size.

Target: 5 merges
Initial vocabulary (characters in the corpus): [l, o, w, e, r, s, t]
Training corpus: "low low low low low low low low lower lowest lowest lowest lowest lowest lowest lowest"
(8 x "low", 1 x "lower", 7 x "lowest"; pairs are counted inside words only)

Step 1: Count pairs -> ("l", "o") appears 16 times (tied with ("o", "w")), merge -> "lo"
Step 2: Count pairs -> ("lo", "w") appears 16 times, merge -> "low"
Step 3: Count pairs -> ("low", "e") appears 8 times, merge -> "lowe"
Step 4: Count pairs -> ("lowe", "s") appears 7 times (tied with ("s", "t")), merge -> "lowes"
Step 5: Count pairs -> ("lowes", "t") appears 7 times, merge -> "lowest"
...
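The merge loop can be sketched in a few lines of Python. This is a minimal teaching version run on the same toy corpus, assuming whitespace pre-tokenization and first-seen tie-breaking; production trainers work on bytes and are heavily optimized. The function name `train_bpe` is made up for this example.

```python
from collections import Counter

def train_bpe(corpus: str, num_merges: int):
    """Learn up to `num_merges` BPE merges from a whitespace-split corpus."""
    # Each distinct word becomes a tuple of symbols with its frequency.
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Ties go to the pair that was counted first.
        best, count = pairs.most_common(1)[0]
        merges.append((best, count))
        # Rewrite every word with the chosen pair fused into one symbol.
        new_words = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges

corpus = "low " * 8 + "lower " + "lowest " * 7
for (a, b), count in train_bpe(corpus, 5):
    print(f"merge ({a!r}, {b!r}) seen {count} times -> {a + b!r}")
```

The ordered merge list is the trained tokenizer: encoding new text means replaying these merges in the same order.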

BPE is greedy and deterministic: for any input, the segmentation is the same every time. The algorithm applies the learned merge rules in order. OpenAI's GPT-4o uses o200k_base (roughly 200K tokens), GPT-4 used cl100k_base (roughly 100K tokens), and GPT-2 used a 50,257-token vocabulary.

Who uses it: GPT-4o, GPT-4, GPT-3.5, Llama 2 (via SentencePiece), Llama 3 (tiktoken-style byte-level BPE), DeepSeek, Mistral.

Google introduced WordPiece for Japanese/Korean voice search in 2012, and it powered BERT in 2018. It is often described as "BPE but with likelihood instead of frequency."

How it works: The algorithm starts the same way as BPE -- character-level initial tokens -- but instead of counting raw frequencies, it merges the pair that maximizes the likelihood of the training data under the current vocabulary. In practice this means it picks the pair whose merge increases the corpus-likelihood the most.

Compare merge candidates:
  Merge ("a", "b") -> new token likelihood gain: 0.0032
  Merge ("th", "e") -> new token likelihood gain: 0.0417
  Merge ("ing", " ") -> new token likelihood gain: 0.0281

WordPiece picks ("th", "e") because the probability lift is largest.
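Google never released the original WordPiece training code, so the exact scoring rule is not public. A commonly described approximation (used, for example, in the Hugging Face course) scores a pair as its frequency divided by the product of the frequencies of its two parts. The sketch below assumes that formulation; the toy corpus and the function name are made up for illustration.

```python
from collections import Counter

def wordpiece_scores(words: dict[tuple[str, ...], int]) -> dict[tuple[str, str], float]:
    """Score each adjacent pair as freq(pair) / (freq(first) * freq(second))."""
    unit_freq = Counter()
    pair_freq = Counter()
    for symbols, freq in words.items():
        for s in symbols:
            unit_freq[s] += freq
        for pair in zip(symbols, symbols[1:]):
            pair_freq[pair] += freq
    return {
        (a, b): count / (unit_freq[a] * unit_freq[b])
        for (a, b), count in pair_freq.items()
    }

# Hypothetical toy corpus: word (already split into pieces) -> frequency.
toy = {
    ("h", "##u", "##g"): 10,
    ("p", "##u", "##g"): 5,
    ("p", "##u", "##n"): 12,
    ("b", "##u", "##n"): 4,
    ("h", "##u", "##g", "##s"): 5,
}
scores = wordpiece_scores(toy)
best = max(scores, key=scores.get)
most_frequent = ("##u", "##g")  # the pair plain BPE would pick (count 20)
print("highest score:     ", best, round(scores[best], 4))
print("most frequent pair:", most_frequent, round(scores[most_frequent], 4))
```

Under this rule, a pair made of two very common pieces is penalized, so the winner is not necessarily the most frequent pair. That is the practical difference from BPE.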

The result is that WordPiece tends to create tokens that are more linguistically meaningful -- common prefixes, suffixes, and root words -- compared to BPE's purely frequency-driven merges.

Who uses it: BERT, DistilBERT, ELECTRA, and most encoder-only models from Google.

SentencePiece is a framework by Google (Kudo and Richardson, 2018) that implements both BPE and Unigram tokenization. Its defining innovation: it operates directly on raw text without requiring a pre-tokenization step. Most tokenizers need whitespace/punctuation splitting before training, which ties them to a language-specific concept of "word." SentencePiece treats the input as a raw sequence of Unicode characters, with whitespace escaped as an ordinary symbol ("▁"), which makes it language-agnostic. An optional byte-fallback mode splits characters missing from the vocabulary into UTF-8 bytes.

Raw text: "Hello世界"
With pre-tokenization: ["Hello", "世界"]  <- language-dependent
SentencePiece raw: "H", "e", "l", "l", "o", "世", "界"  <- no pre-tokenization needed
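The trick that makes this work is treating whitespace as just another symbol. The following is a simplified sketch of that idea in plain Python, not the library itself: real SentencePiece also applies normalization and, by default, adds a leading "▁" to the text. The helper names are made up for this example.

```python
WS = "\u2581"  # the "▁" meta symbol SentencePiece uses to mark spaces

def escape_whitespace(text: str) -> str:
    """Replace spaces with the meta symbol so they become ordinary characters."""
    return text.replace(" ", WS)

def to_initial_symbols(text: str) -> list[str]:
    """Split escaped text into single Unicode characters, the starting units."""
    return list(escape_whitespace(text))

def detokenize(pieces: list[str]) -> str:
    """Concatenate pieces and turn the meta symbol back into spaces."""
    return "".join(pieces).replace(WS, " ")

symbols = to_initial_symbols("Hello 世界")
print(symbols)
print(detokenize(symbols))
```

Because the space survives as a symbol inside the pieces, detokenization is plain concatenation and needs no language-specific rules about where spaces belong.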

Who uses it: Llama 2, Gemma, T5, and XLNet (the last two in Unigram mode).

Unigram (Kudo, 2018) flips the problem around. Instead of greedily building up a vocabulary from characters, it starts with a large vocabulary of candidate tokens and prunes it down using a probabilistic model.

How it works: Unigram models each token as an independent event and learns a probability distribution over the vocabulary. The segmentation of a word is the sequence of tokens whose probabilities multiply to the highest score.

Vocabulary: {"UN": 0.02, "UNIC": 0.005, "NI": 0.01, "UNI": 0.015, "I": 0.03, "C": 0.04, "O": 0.02, "R": 0.01, "N": 0.02, ...}

Input: "UNICORN"
Candidate segmentations and their scores:
  UN + I + C + O + R + N  -> 0.02 * 0.03 * 0.04 * 0.02 * 0.01 * 0.02 = 9.6e-11
  UNI + C + O + R + N     -> 0.015 * 0.04 * 0.02 * 0.01 * 0.02 = 2.4e-9
  UNIC + O + R + N        -> 0.005 * 0.02 * 0.01 * 0.02 = 2.0e-8  <-- best

Unigram picks the highest-probability segmentation: UNIC + O + R + N

Because Unigram evaluates multiple candidate segmentations and chooses the best one probabilistically, it is slower to tokenize than BPE but produces more consistent token-to-meaning mappings. The probabilistic nature also enables subword regularization -- randomly sampling alternative segmentations during training to improve robustness.
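Enumerating every segmentation gets expensive fast, so implementations use dynamic programming (Viterbi) to find the best path. Here is a minimal sketch using the illustrative probabilities from the UNICORN example; the function name is made up, and the toy probabilities are not normalized.

```python
import math

def viterbi_segment(word: str, probs: dict[str, float]) -> tuple[list[str], float]:
    """Return the highest-probability segmentation and its probability."""
    n = len(word)
    # best[i] = (log-prob of the best segmentation of word[:i], start of last piece)
    best = [(-math.inf, -1)] * (n + 1)
    best[0] = (0.0, -1)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in probs and best[start][0] > -math.inf:
                score = best[start][0] + math.log(probs[piece])
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        raise ValueError("word cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end of the word to recover the pieces.
    pieces, end = [], n
    while end > 0:
        start = best[end][1]
        pieces.append(word[start:end])
        end = start
    return pieces[::-1], math.exp(best[n][0])

probs = {"UN": 0.02, "UNI": 0.015, "UNIC": 0.005, "NI": 0.01,
         "I": 0.03, "C": 0.04, "O": 0.02, "R": 0.01, "N": 0.02}
pieces, p = viterbi_segment("UNICORN", probs)
print(pieces, f"{p:.1e}")
```

Working in log space avoids underflow on long inputs, and the nested loop is quadratic in word length, which is where the extra cost relative to BPE comes from.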

Who uses it: T5, XLNet, ALBERT, and SentencePiece in Unigram mode.

Property	BPE	WordPiece	SentencePiece (BPE)	Unigram LM
Vocabulary building	Greedy merge by frequency	Greedy merge by likelihood	Greedy merge by frequency (same as BPE)	Start big, prune by likelihood
Pre-tokenization required	Yes (whitespace/punctuation)	Yes	No (raw text, spaces escaped as ▁)	No when run via SentencePiece
Deterministic segmentation	Yes	Yes	Yes	Yes by default (sampling optional)
Typical vocab size	32K-200K	30K	32K-128K	32K-256K
Speed	Fast	Fast	Fast	Medium (Viterbi decoding)
Multilingual handling	Weak (needs large vocab)	Moderate	Strong (no whitespace assumption)	Strong (no whitespace assumption, sampling)
Rare word handling	Decomposes to chars or bytes	Decomposes to chars, else [UNK]	Decomposes to chars (bytes with byte fallback)	Decomposes to subwords
Primary users	OpenAI, Meta (Llama 3), Mistral	Google (BERT)	Meta (Llama 2), Google (Gemma)	Google (T5, XLNet)

Here is a Python snippet using tiktoken (OpenAI's BPE tokenizer library) to see how different inputs break apart:

import tiktoken

enc = tiktoken.get_encoding("o200k_base")

test_strings = [
    "Hello, world!",
    "restablecer",          # Spanish
    "Das ist fantastisch",  # German
    "こんにちは",            # Japanese
    "def fibonacci(n): return n if n <= 1 else fibonacci(n-1) + fibonacci(n-2)",
]

for s in test_strings:
    tokens = enc.encode(s)
    token_strs = [enc.decode([t]) for t in tokens]
    print(f"{s!r:45s} -> {len(tokens):3d} tokens: {token_strs[:6]}...")

Output (illustrative for o200k_base; exact counts and splits depend on the encoding and library version):

Hello, world!                                    -> 4 tokens: ['Hello', ',', ' world', '!']
restablecer                                      -> 8 tokens: ['rest', 'able', 'cer', ...]
Das ist fantastisch                              -> 6 tokens: ['Das', ' ist', ' fant', 'ast', 'isch', ...]
こんにちは                                         -> 5 tokens: ['こ', 'ん', 'に', 'ち', 'は']
def fibonacci(n): return n if n <= 1 else ...   -> 22 tokens: ['def', ' fib', 'onacci', ...]

Notice how the Spanish word takes 8 tokens while an analogous English word of similar length might take 3-4. This is the cost asymmetry that shows up on your monthly bill.

Here is a diagram showing how a single word passes through each tokenizer type:

flowchart TD
    A["Input: 'unbelievable'"] --> B["Pre-tokenization<br/>(split on space/punct)"]
    B --> C{"Tokenizer type?"}

    C -->|BPE| D["Split into characters or bytes<br/>Apply learned merges in priority order<br/>e.g. 'un' + 'believable'<br/>Unmerged pieces stay as smaller units:<br/>'b' + 'el' + 'ievable' ..."]
    C -->|WordPiece| E["Greedy longest-match-first lookup<br/>Longest prefix in vocab: 'un'<br/>Try '##believable'<br/>If not found: shorter '##' pieces<br/>[UNK] if nothing matches"]
    C -->|SentencePiece| F["No pre-tokenization<br/>Whitespace escaped as '▁'<br/>BPE merge rules on raw Unicode text<br/>'un' + 'bel' + 'ievable'"]
    C -->|Unigram| G["Score all candidate segmentations<br/>Pick highest-probability path (Viterbi)<br/>'un' + 'believ' + 'able'<br/>Sampling optional during training"]

    D --> H["Output tokens"]
    E --> H
    F --> H
    G --> H
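The WordPiece branch of the diagram is the easiest to get wrong, because training is likelihood-driven but segmentation at inference time is a greedy longest-match lookup. A minimal sketch in the style of BERT's tokenizer, with a made-up vocabulary:

```python
def wordpiece_tokenize(word: str, vocab: set[str], unk: str = "[UNK]") -> list[str]:
    """Greedy longest-match-first segmentation of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical vocabulary for illustration.
vocab = {"un", "##believ", "##able", "##b", "##el"}
print(wordpiece_tokenize("unbelievable", vocab))
print(wordpiece_tokenize("xyz", vocab))
```

Note the failure mode: a single unmatched position turns the entire word into the unknown token, which is one reason byte-level schemes are preferred for open-ended text.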

Assuming all tokenizers handle multilingual text equally. BPE-based tokenizers that rely on space-prefix pre-tokenization (like cl100k_base) tend to be less efficient on CJK text, where whitespace does not separate words, and on Indic scripts, which are often underrepresented in the vocabulary. SentencePiece models often handle these better because they do not assume whitespace-delimited words. If your user base spans non-Latin scripts, check your tokenizer's cross-language efficiency before picking a model.

Tying your prompt design to the wrong encoding. An instruction like "Output the result as JSON" can tokenize to a different number of tokens, with different boundaries, under cl100k_base than under o200k_base. Developers who craft prompts for GPT-4 and then migrate to a model with a different tokenizer silently change where token boundaries fall in the prompt, which can shift output quality.

Ignoring the tokenizer's role in fine-tuning. When you fine-tune a model, you can extend the vocabulary -- but doing so requires initializing new embedding vectors, and the model will behave unpredictably with the new tokens for the first few thousand steps. Most practitioners are better off using the existing vocabulary and handling out-of-vocabulary tokens via character-level fallback.

The "split on prefix space" trap. Most GPT-style BPE tokenizers attach the preceding space to each word during pre-tokenization, so a mid-sentence word is encoded from the string " Hello", not "Hello". This means "Hello" at the very start of a text and " Hello" after a space map to different tokens, even though a reader sees the same word. With consistent spacing this is harmless. But if your text formatting changes -- doubled or missing spaces, non-standard punctuation -- you can tokenize the same semantic content into noticeably more tokens.
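You can see the effect with a simplified pre-tokenizer. The pattern below is an ASCII-only approximation of a GPT-2-style split rule, written for illustration; real encodings use Unicode character classes and extra rules for contractions.

```python
import re

# Simplified, ASCII-only stand-in for a GPT-2-style pre-tokenization pattern.
PRETOKENIZE = re.compile(
    r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+(?!\S)|\s+"
)

def pretokenize(text: str) -> list[str]:
    """Split text into chunks, keeping one leading space attached to each word."""
    return PRETOKENIZE.findall(text)

print(pretokenize("Hello world"))    # word at the start has no space prefix
print(pretokenize("world Hello"))    # same word mid-sentence carries a space
print(pretokenize("Hello  world"))   # a doubled space becomes its own chunk
```

Each chunk is then encoded independently, so "Hello" and " Hello" end up as different token IDs, and every stray extra space costs at least one additional token.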

Forgetting that tokenizer version matters. p50k_base, cl100k_base, and o200k_base all use BPE with different pre-tokenization rules and vocab sizes. A comparison of two models' outputs is meaningless if you used different tokenizers to count their tokens. Pin your tiktoken version (tiktoken==0.13.0 as of June 2026) and your encoding name in every evaluation script.

When you need exact character-level control. Tokenization destroys alignment between text characters and model internals. If you are building a spelling corrector, a character-level model (like ByT5 or CANINE) produces better results than any subword tokenizer.

When latency is the absolute priority. SentencePiece Unigram runs a Viterbi search over candidate segmentations for every input, while BPE and WordPiece use simpler greedy procedures that are typically faster. If you are working within single-digit millisecond TTFT budgets, use a pure BPE tokenizer and keep the vocabulary under 50K.

When you are building a single-language, domain-specific model. If your entire task is English medical text classification, you can build a custom BPE vocabulary (15K-20K tokens) that outperforms the general-purpose 100K vocabulary in both speed and perplexity. The general vocabularies are optimized for web-scale diversity, not domain density.

When you need reversible tokenization. Subword tokenization can be lossy. You cannot reconstruct the original string perfectly from the token IDs if the tokenizer applied normalization (lowercasing, NFKC Unicode normalization, etc.) or mapped unknown pieces to an unknown token. If you need exact round-trips, use a tokenizer that works on raw bytes without normalization, such as the byte-level BPE encodings in tiktoken or the byte-level input of ByT5.
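A quick way to see where the loss comes from is to apply the normalization step on its own. This sketch uses only the standard library; the sample string is made up and contains a ligature and a numero sign, both of which NFKC rewrites.

```python
import unicodedata

original = "\ufb01ne caf\u00e9 \u21165"  # "ﬁne café №5" with a ligature and a numero sign

# NFKC normalization, applied by some tokenizers before segmentation,
# replaces compatibility characters with their plain equivalents.
normalized = unicodedata.normalize("NFKC", original)
print(repr(original), "->", repr(normalized))
print("changed by normalization:", normalized != original)

# A byte-level view is always reversible: UTF-8 bytes round-trip exactly.
byte_ids = list(original.encode("utf-8"))
restored = bytes(byte_ids).decode("utf-8")
print("byte round-trip exact:", restored == original)
```

Once the ligature has been rewritten, no decoder can tell whether the source text contained it, so any offsets computed on the original string no longer line up.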

When you are benchmarking across model families. Comparing GPT-4o (200K vocab, BPE) against Llama 2 (32K vocab, SentencePiece BPE) by token count is comparing apples to oranges. Always benchmark on character or byte cost, not token cost, when models use different tokenizers.
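One tokenizer-independent yardstick is UTF-8 bytes per token. The sketch below takes any `encode` callable; the two encoders here are deliberately trivial stand-ins so the example runs without dependencies. In practice you would pass a real one, for example the `encode` method of a tiktoken encoding.

```python
from typing import Callable, Sequence

def bytes_per_token(text: str, encode: Callable[[str], Sequence]) -> float:
    """Average UTF-8 bytes covered by each token; higher means more efficient."""
    n_tokens = len(encode(text))
    if n_tokens == 0:
        raise ValueError("encoder produced no tokens")
    return len(text.encode("utf-8")) / n_tokens

# Stand-in encoders for illustration only.
def whitespace_encode(text: str) -> list[str]:
    return text.split()

def char_encode(text: str) -> list[str]:
    return list(text)

sample = "¿Cómo puedo restablecer mi contraseña?"
for name, enc in [("whitespace", whitespace_encode), ("char", char_encode)]:
    print(f"{name:10s} {bytes_per_token(sample, enc):.2f} bytes/token")
```

Reporting this ratio per language next to accuracy makes cross-model comparisons meaningful even when vocabularies differ by an order of magnitude.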

Token counts for the same text can differ by 15-30% between cl100k_base and o200k_base.

Once you know which tokenizer your model uses, the next question is how to prepare your data so that tokenizer wastes as few tokens as possible. That means strategic prompt design, choosing the right model for your language mix, and building evaluation pipelines that measure token efficiency alongside accuracy. We will cover token-efficient prompt engineering in the next post -- including a concrete method for estimating your per-user token consumption before you deploy.

source & further reading

Original article on dev.to