Tokenization under the hood: BPE, WordPiece, SentencePiece, and Unigram compared

A developer compares four subword tokenization approaches (BPE, WordPiece, SentencePiece, and Unigram) used in major LLMs, explaining how tokenizer choice directly affects inference cost, vocabulary coverage, and model behavior. The analysis shows that a poorly chosen tokenizer can double token-per-query ratios and inflate costs, as in the opening scenario where a Spanish query consumed 103 tokens versus 42 for English. The post details how BPE (used by GPT-4o, Llama 3), WordPiece (BERT), SentencePiece (Llama 2, Gemma), and Unigram (T5) differ in vocabulary building and segmentation, with practical implications for deployment.

You deploy a chatbot. English queries average 42 tokens each. Then a Spanish-speaking user sends "¿Cómo puedo restablecer mi contraseña?" and it eats 103 tokens. Two weeks later, the same model starts outputting "Ġcon" at the edges of its generations and you cannot tell if it is a bug or a feature. The finance team flags a 40% month-over-month cost increase that no one can explain.

This is what happens when tokenization is treated as invisible plumbing. Every major LLM pipeline uses one of four subword tokenization algorithms, and the choice determines vocabulary size, handling of rare words, cross-language efficiency, and inference cost. Understanding which one your model uses -- and why -- is the difference between shipping a cost-efficient product and discovering mid-quarter that your token-per-query ratio quietly doubled.

Tokenization directly controls three things that hit your bottom line:

Inference cost. LLM APIs charge by token. A model using a 32K-vocab BPE tokenizer may break "restablecer" into 8 tokens, while a 100K-vocab Unigram tokenizer handles it in 3. Over a million queries, that difference adds up to real money.

Vocabulary coverage. Rare words, code syntax, and multilingual text stress the tokenizer. A poorly fitting vocabulary means longer sequences, which means slower generation and higher cost.

Model behavior. The tokenizer is the model's entire view of language. If your tokenizer encodes "cowboy" as ["cow", "boy"], the model learns something different than if it encodes it as ["c", "owb", "oy"]. This affects everything from spelling ability to cross-lingual transfer.

Every modern tokenizer takes raw text, optionally pre-tokenizes it into words (splitting on whitespace and punctuation), then breaks words into subword units from a fixed-size vocabulary. The difference is in how that vocabulary is built and how segmentation decisions are made.
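The two-stage pipeline just described (pre-tokenize, then split words into vocabulary units) can be sketched in a few lines. This is a toy illustration, not how any production tokenizer is implemented: the regex, the tiny `vocab` set, and the helper names `pre_tokenize`, `split_word`, and `tokenize` are all assumptions made for the example, and greedy longest-match is used only as a simple stand-in for the segmentation step.

```python
import re


def pre_tokenize(text):
    """Split raw text into words and standalone punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def split_word(word, vocab):
    """Greedy longest-match split of one word into vocabulary pieces.

    Falls back to a single character when no vocabulary entry matches.
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it is in the vocabulary (or empty).
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            end = start + 1  # character-level fallback
        pieces.append(word[start:end])
        start = end
    return pieces


def tokenize(text, vocab):
    """Pre-tokenize, then split every word into subword pieces."""
    return [piece for word in pre_tokenize(text) for piece in split_word(word, vocab)]


# Hypothetical toy vocabulary, chosen so "cowboys" splits cleanly.
vocab = {"cow", "boy", "s", "ride"}
print(tokenize("cowboys ride!", vocab))
# ['cow', 'boy', 's', 'ride', '!']
```

Swapping in a different `vocab` changes the split for the same input, which is the point the "cowboy" example makes: the vocabulary decides what the model sees.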
BPE was introduced in 1994 for data compression and adapted for neural machine translation by Sennrich et al. in 2016. OpenAI adopted it for GPT-2 and it remains the core of GPT-4o, Llama 3, and most modern LLMs.

How it works: Start with every individual character as a token. Count all adjacent token pairs, merge the most frequent pair into a new token, add it to the vocabulary, and repeat until you hit the target vocabulary size.

```
Training corpus: "low" x8, "lower" x1, "lowest" x7
Initial vocabulary (characters in the corpus): l, o, w, e, r, s, t
Vocabulary size goal: 12 (7 characters + 5 merges)

Step 1: ("l", "o") appears 16 times      -> merge into "lo"
Step 2: ("lo", "w") appears 16 times     -> merge into "low"
Step 3: ("low", "e") appears 8 times     -> merge into "lowe"
Step 4: ("lowe", "s") appears 7 times    -> merge into "lowes"  (tied with ("s", "t"); the first pair wins here)
Step 5: ("lowes", "t") appears 7 times   -> merge into "lowest"
```

BPE is greedy and deterministic: for any input, the segmentation is the same every time. The algorithm applies the learned merge rules in order. OpenAI's GPT-4o uses o200k_base (roughly 200K tokens), GPT-4 used cl100k_base (roughly 100K tokens), and GPT-2 used a 50,257-token vocabulary.

Who uses it: GPT-4o, GPT-4, GPT-3.5, Llama 2 (via SentencePiece), Llama 3, DeepSeek, Mistral.

Google introduced WordPiece for Japanese/Korean voice search in 2012, and it powered BERT in 2018. It is often described as "BPE but with likelihood instead of frequency."

How it works: The algorithm starts the same way as BPE -- character-level initial tokens -- but instead of counting raw frequencies, it merges the pair that maximizes the likelihood of the training data under the current vocabulary. In practice this means it picks the pair whose merge increases the corpus likelihood the most.
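To make the contrast between the two merge rules concrete, here is a minimal training sketch on the "low / lower / lowest" corpus. It is a simplification under stated assumptions: the function names are invented for this example, ties are broken by first occurrence, the `##` continuation prefix of real WordPiece vocabularies is omitted, and the WordPiece score `count(a, b) / (count(a) * count(b))` is the formulation commonly given in tutorials rather than a confirmed description of Google's original implementation.

```python
from collections import Counter


def pair_and_symbol_counts(words):
    """Count adjacent pairs and single symbols, weighted by word frequency."""
    pairs, symbols = Counter(), Counter()
    for word, freq in words.items():
        for sym in word:
            symbols[sym] += freq
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs, symbols


def merge_pair(words, pair):
    """Rewrite every word so each occurrence of `pair` becomes one symbol."""
    merged = {}
    for word, freq in words.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged


def train(corpus, num_merges, rule="bpe"):
    """Learn up to `num_merges` merges using the BPE or WordPiece-style rule."""
    words = {tuple(w): f for w, f in Counter(corpus.split()).items()}
    merges = []
    for _ in range(num_merges):
        pairs, symbols = pair_and_symbol_counts(words)
        if not pairs:
            break
        if rule == "bpe":
            score = lambda p: pairs[p]  # raw pair frequency
        else:
            # Pair frequency normalized by the frequency of its parts.
            score = lambda p: pairs[p] / (symbols[p[0]] * symbols[p[1]])
        best = max(pairs, key=score)  # ties: first pair seen wins
        merges.append(best)
        words = merge_pair(words, best)
    return merges


corpus = "low " * 8 + "lower " + "lowest " * 7
print(train(corpus, 5, rule="bpe"))
# [('l', 'o'), ('lo', 'w'), ('low', 'e'), ('lowe', 's'), ('lowes', 't')]
print(train(corpus, 1, rule="wordpiece"))
# [('s', 't')]
```

On this corpus the frequency rule starts with the most common pair ("l", "o"), while the normalized score prefers ("s", "t"), a pair whose parts almost never occur apart. That is the sense in which the two algorithms share a loop but differ in what they merge first.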
```
Compare merge candidates:
Merge "a", "b"     -> new token likelihood gain: 0.0032
Merge "th", "e"    -> new token likelihood gain: 0.0417
Merge "ing", " "   -> new token likelihood gain: 0.0281

WordPiece picks "th", "e" because the probability lift is largest.
```

The result is that WordPiece tends to create tokens that are more linguistically meaningful -- common prefixes, suffixes, and root words -- compared to BPE's purely frequency-driven merges.

Who uses it: BERT, DistilBERT, ELECTRA, and most encoder-only models from Google.

SentencePiece is a framework by Google (Kudo and Richardson, 2018) that wraps both BPE and Unigram tokenization. Its defining innovation: it operates directly on raw text without requiring a pre-tokenization step. Most tokenizers need whitespace/punctuation splitting before training, which ties them to a language-specific concept of "word." SentencePiece treats the input as a raw sequence of Unicode characters and encodes whitespace as an ordinary symbol (▁), which makes it language-agnostic.

```
Raw text:               "Hello世界"
With pre-tokenization:  "Hello", "世界"                         <- language-dependent
SentencePiece raw:      "H", "e", "l", "l", "o", "世", "界"     <- no pre-tokenization needed
```

Who uses it: Llama 2, Gemma, T5, and XLNet (the last two in Unigram mode).

Unigram (Kudo, 2018) flips the problem around. Instead of greedily building up a vocabulary from characters, it starts with a large vocabulary of candidate tokens and prunes it down using a probabilistic model.

How it works: Unigram models each token as an independent event and learns a probability distribution over the vocabulary. The segmentation of a word is the sequence of tokens whose probabilities multiply to the highest score.
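Finding the highest-probability segmentation does not require enumerating every split; a Viterbi-style dynamic program over end positions is enough. The sketch below is one way to write it, assuming a small hand-picked probability table (the values are illustrative, not from a trained model) and a hypothetical helper name `viterbi_segment`. Log-probabilities are summed instead of multiplying raw probabilities, to avoid underflow on long inputs.

```python
import math


def viterbi_segment(text, probs):
    """Return (tokens, probability) of the most likely segmentation of `text`."""
    n = len(text)
    best_score = [-math.inf] * (n + 1)  # best log-prob of text[:end]
    best_start = [0] * (n + 1)          # where the last token of that path begins
    best_score[0] = 0.0
    max_len = max(len(token) for token in probs)

    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in probs and best_score[start] > -math.inf:
                score = best_score[start] + math.log(probs[piece])
                if score > best_score[end]:
                    best_score[end] = score
                    best_start[end] = start

    if best_score[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")

    # Walk the back-pointers from the end of the string to recover the tokens.
    tokens, end = [], n
    while end > 0:
        start = best_start[end]
        tokens.append(text[start:end])
        end = start
    return tokens[::-1], math.exp(best_score[n])


# Illustrative token probabilities (assumed for this example).
probs = {
    "UN": 0.02, "UNI": 0.015, "UNIC": 0.005, "NI": 0.01,
    "I": 0.03, "C": 0.04, "O": 0.02, "R": 0.01, "N": 0.02,
}
tokens, prob = viterbi_segment("UNICORN", probs)
print(tokens, prob)  # ['UNIC', 'O', 'R', 'N'] with probability of about 2e-8
```

The same table of scores is what makes subword regularization possible: instead of always following the best back-pointer, a trainer can sample among the scored alternatives.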
```
Vocabulary: {"UN": 0.02, "UNIC": 0.005, "NI": 0.01, "UNI": 0.015, ...}
Input: "UNICORN"

Candidate segmentations and their scores:
UN + I + C + O + R + N   -> 0.02 x 0.03 x 0.04 x 0.02 x 0.01 x 0.02 = 9.6e-11
UNI + C + O + R + N      -> 0.015 x 0.04 x 0.02 x 0.01 x 0.02       = 2.4e-9
UNIC + O + R + N         -> 0.005 x 0.02 x 0.01 x 0.02              = 2.0e-8   <-- best

Unigram picks the highest-probability segmentation: UNIC + O + R + N
```

Because Unigram evaluates multiple candidate segmentations and chooses the best one probabilistically, it is slower to tokenize than BPE but tends to produce more consistent token-to-meaning mappings. The probabilistic nature also enables subword regularization -- randomly sampling alternative segmentations during training to improve robustness.

Who uses it: T5, XLNet, ALBERT, and SentencePiece in Unigram mode.

| Property | BPE | WordPiece | SentencePiece BPE | Unigram LM |
|---|---|---|---|---|
| Vocabulary building | Greedy merge by frequency | Greedy merge by likelihood | Greedy merge by frequency (same as BPE) | Start big, prune by likelihood |
| Pre-tokenization required | Yes (whitespace/punctuation) | Yes | No (raw text) | No (raw text) |
| Deterministic segmentation | Yes | Yes | Yes | Yes by default (sampling possible) |
| Typical vocab size | 32K-200K | 30K | 32K-128K | 32K-256K |
| Speed | Fast | Fast | Fast | Medium (Viterbi decoding) |
| Multilingual handling | Weak (needs large vocab) | Moderate | Best (no whitespace assumption) | Best (no whitespace assumption + sampling) |
| Rare word handling | Decomposes to chars | Decomposes to chars | Decomposes to chars (or bytes with byte fallback) | Decomposes to subwords |
| Primary users | OpenAI, Meta, Mistral | Google (BERT) | Meta (Llama 2), Google (Gemma) | Google (T5, XLNet) |

Here is a Python snippet using tiktoken (OpenAI's BPE tokenizer library) to see how different inputs break apart:

```python
import tiktoken

# GPT-4o uses the o200k_base encoding
enc = tiktoken.get_encoding("o200k_base")

test_strings = [
    "Hello, world",
    "restablecer",          # Spanish
    "Das ist fantastisch",  # German
    "こんにちは",            # Japanese
    "def fibonacci(n): return n if n <= 1 else fibonacci(n-1) + fibonacci(n-2)",
]

for s in test_strings:
    tokens = enc.encode(s)
    # Decode each token ID on its own to see the individual pieces
    token_strs = [enc.decode([t]) for t in tokens]
    print(f"{s!r:45s} -> {len(tokens):3d} tokens: {token_strs[:6]}...")
```

Output (approximate for o200k_base; exact counts and splits depend on the encoding, so run the snippet to get real numbers):

```text
Hello, world                                  ->   3 tokens: 'Hello', ',', ' world'
restablecer                                   ->   8 tokens: 'rest', 'able', 'cer', ...
Das ist fantastisch                           ->   6 tokens: 'Das', ' ist', ' fant', 'ast', 'isch', ...
こんにちは                                     ->   5 tokens: 'こ', 'ん', 'に', 'ち', 'は'
def fibonacci(n): return n if n <= 1 else ... ->  22 tokens: 'def', ' fib', 'onacci', ...
```

Notice how the Spanish word takes 8 tokens while an analogous English word of similar length might take 3-4. This is the cost asymmetry that shows up on your monthly bill.

Here is a diagram showing how a single word passes through each tokenizer type:

```mermaid
flowchart TD
    A["Input: 'unbelievable'"] --> B["Pre-tokenization<br/>(split on space/punct)"]
    B --> C{"Tokenizer type?"}
    C -->|BPE| D["Lookup in vocab: 'un' + 'believable'<br/>If 'believable' not found:<br/>'b' + 'el' + 'ievable' ...<br/>Greedy character-level fallback"]
    C -->|WordPiece| E["Lookup longest prefix: 'un'<br/>Try '##believable'<br/>If not found: '##b' + '##el' + ...<br/>Likelihood-based merging"]
    C -->|SentencePiece| F["Raw-text segmentation<br/>No pre-tokenization<br/>BPE merge rules on raw text<br/>'un' + 'bel' + 'ievable'"]
    C -->|Unigram| G["Score all candidate segmentations<br/>Pick highest-probability path<br/>'un' + 'believ' + 'able'<br/>Probabilistic, may vary"]
    D --> H["Output tokens"]
    E --> H
    F --> H
    G --> H
```

Assuming all tokenizers handle multilingual text equally. BPE-based tokenizers that rely on space-prefix pre-tokenization, like cl100k_base, degrade significantly on CJK and Indic scripts where whitespace does not separate words. SentencePiece models handle these better because they do not assume whitespace-delimited words. If your user base spans non-Latin scripts, check your tokenizer's cross-language efficiency before picking a model.

Tying your prompt design to the wrong encoding.
An instruction like "Output the result as JSON" can cost a different number of tokens under cl100k_base than under o200k_base, so count it with each encoding rather than assuming. Developers who craft prompts for GPT-4 and then migrate to a model with a different tokenizer silently change where token boundaries fall in the prompt, which can shift output quality.

Ignoring the tokenizer's role in fine-tuning. When you fine-tune a model, you can extend the vocabulary -- but doing so requires initializing new embedding vectors, and the model will behave unpredictably with the new tokens for the first few thousand steps. Most practitioners are better off using the existing vocabulary and letting the tokenizer's character- or byte-level fallback handle unseen words.

The "split on prefix space" trap. Most GPT-style BPE tokenizers attach the preceding space to each word during pre-tokenization, so byte-pair encoding operates on the string " Hello", not "Hello". This means "Hello" at the very start of a text and " Hello" mid-sentence are different tokens, and the counts you expect only hold if the space prefix is consistent. If your text formatting changes -- removing or adding spaces, using non-standard punctuation -- you can tokenize the same semantic content into dramatically more tokens.

Forgetting that tokenizer version matters. p50k_base, cl100k_base, and o200k_base all use BPE with different pre-tokenization rules and vocab sizes. A comparison of two models' outputs is meaningless if you used different tokenizers to count their tokens. Pin your tiktoken version (tiktoken==0.13.0 as of June 2026) and your encoding name in every evaluation script.

When you need exact character-level control. Tokenization destroys alignment between text characters and model internals. If you are building a spelling corrector, a byte- or character-level model like ByT5 or CANINE can produce better results than a subword tokenizer.

When latency is the absolute priority. SentencePiece Unigram needs a Viterbi search over candidate segmentations, and WordPiece needs repeated longest-match lookups against the vocabulary. BPE is simpler and faster.
If you are measuring single-digit millisecond TTFT budgets, use a pure BPE tokenizer and keep the vocabulary under 50K.

When you are building a single-language, domain-specific model. If your entire task is English medical text classification, you can build a custom BPE vocabulary (15K-20K tokens) that can outperform the general-purpose 100K vocabulary in both speed and perplexity. The general vocabularies are optimized for web-scale diversity, not domain density.

When you need reversible tokenization. Subword tokenization can be lossy. You cannot reconstruct the original string perfectly from the token IDs if the tokenizer applied normalization (lowercasing, NFKC Unicode normalization, etc.). If you need byte-level round-trips, use a byte-level tokenizer like the one in ByT5.

When you are benchmarking across model families. Comparing GPT-4o (200K vocab, BPE) against Llama 2 (32K vocab, SentencePiece BPE) by token count is comparing apples to oranges. Always benchmark on character or byte cost, not token cost, when models use different tokenizers. Even cl100k_base and o200k_base can differ by 15-30% in token count on the same text.

When you know which tokenizer your model uses, the next question is how to prepare your data so that tokenizer wastes as few tokens as possible. That means strategic prompt design, choosing the right model for your language mix, and building evaluation pipelines that measure token efficiency alongside accuracy. We will cover token-efficient prompt engineering in the next post -- including a concrete method for estimating your per-user token consumption before you deploy.