Tokenization under the hood: BPE, WordPiece, SentencePiece, and Unigram compared

A developer compares four subword tokenization algorithms (BPE, WordPiece, SentencePiece, and Unigram) used in major LLMs, explaining how tokenizer choice directly impacts inference cost, vocabulary coverage, and model behavior. The analysis shows that a poorly chosen tokenizer can double token-per-query ratios and inflate costs, as seen when a Spanish query consumed 103 tokens versus 42 for English. The post details how BPE (used by GPT-4o, Llama 3), WordPiece (BERT), SentencePiece (Llama 2, Gemma), and Unigram differ in vocabulary building and segmentation, with practical implications for deployment.

You deploy a chatbot. English queries average 42 tokens each. Then a Spanish-speaking user sends "¿Cómo puedo restablecer mi contraseña?" and it eats 103 tokens. Two weeks later, the same model starts outputting "Ġcon" at the edges of its generations and you cannot tell if it is a bug or a feature. The finance team flags a 40% month-over-month cost increase that no one can explain.

This is what happens when tokenization is treated as invisible plumbing. Every major LLM pipeline uses one of four subword tokenization algorithms, and the choice determines vocabulary size, handling of rare words, cross-language efficiency, and inference cost. Understanding which one your model uses, and why, is the difference between shipping a cost-efficient product and discovering mid-quarter that your token-per-query ratio quietly doubled.

Tokenization directly controls three things that hit your bottom line:

Inference cost. LLM APIs charge by token. A model using a 32K-vocab BPE tokenizer may break "restablecer" into 8 tokens, while a 100K-vocab Unigram tokenizer handles it in 3. Over a million queries, that difference adds up to real money.

Vocabulary coverage. Rare words, code syntax, and multilingual text stress the tokenizer.
A poorly fitting vocabulary means longer sequences, which means slower generation and higher cost.

Model behavior. The tokenizer is the model's entire view of language. If your tokenizer encodes "cowboy" as ["cow", "boy"], the model learns something different than if it encodes it as ["c", "owb", "oy"]. This affects everything from spelling ability to cross-lingual transfer.

Every modern tokenizer takes raw text, optionally pre-tokenizes it into words (splitting on whitespace and punctuation), then breaks words into subword units from a fixed-size vocabulary. The difference is in how that vocabulary is built and how segmentation decisions are made.

Byte-pair encoding (BPE) was introduced in 1994 for data compression and adapted for neural machine translation by Sennrich et al. in 2016. OpenAI adopted it for GPT-2 and it remains the core of GPT-4o, Llama 3, and most modern LLMs.

How it works: Start with every individual character as a token. Count all adjacent token pairs, merge the most frequent pair into a new token, add it to the vocabulary, and repeat until you hit the target vocabulary size.

```
Training corpus: "low low low low low low low low lower lowest lowest lowest lowest lowest lowest lowest"
                 (low x8, lower x1, lowest x7)
Initial vocabulary: l, o, w, e, r, s, t
Target vocabulary size: 13 (7 characters + 6 merges)

Step 1: ("l", "o") appears 16 times     -> merge into "lo"
Step 2: ("lo", "w") appears 16 times    -> merge into "low"
Step 3: ("low", "e") appears 8 times    -> merge into "lowe"
Step 4: ("lowe", "s") appears 7 times   -> merge into "lowes"
Step 5: ("lowes", "t") appears 7 times  -> merge into "lowest"
Step 6: ("lowe", "r") appears 1 time    -> merge into "lower"

Pairs are counted inside words only; ties (steps 1 and 4) go to the pair seen first.
```

BPE is greedy and deterministic: for any input, the segmentation is the same every time. At tokenization time, the algorithm applies the learned merge rules in the order they were learned.
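The merge loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation any production tokenizer uses: the names `merge_pair`, `train_bpe`, and `encode` are hypothetical, words are split on whitespace only, and ties are broken by whichever pair was counted first.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the merged token."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-split corpus."""
    # Each distinct word starts as a tuple of characters, weighted by its count.
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent pairs inside words, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single token
        # Assumption: ties go to the pair that was counted first.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        words = Counter({merge_pair(s, best): f for s, f in words.items()})
    return merges

def encode(word, merges):
    """Segment one word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

CORPUS = "low " * 8 + "lower " + "lowest " * 7
merges = train_bpe(CORPUS, num_merges=6)
print(merges[:3])               # [('l', 'o'), ('lo', 'w'), ('low', 'e')]
print(encode("slow", merges))   # ['s', 'low']
print(encode("lowers", merges)) # ['lower', 's']
```

Replaying merges in learned order is what makes the segmentation deterministic: the same word always passes through the same sequence of rules.
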
OpenAI's GPT-4o uses o200k_base (about 200,000 tokens), GPT-4 used cl100k_base (about 100,000 tokens), and GPT-2 used a 50,257-token vocabulary.

Who uses it: GPT-4o, GPT-4, GPT-3.5, Llama 2 (via SentencePiece), Llama 3, DeepSeek, Mistral.

Google introduced WordPiece for Japanese/Korean voice search in 2012, and it powered BERT in 2018. It is often described as "BPE but with likelihood instead of frequency."

How it works: The algorithm starts the same way as BPE, with character-level initial tokens, but instead of counting raw frequencies, it merges the pair that maximizes the likelihood of the training data under the current vocabulary. In practice this means it picks the pair whose merge increases the corpus likelihood the most. A common formulation scores each pair as count(ab) / (count(a) × count(b)), which favors pairs whose parts rarely occur apart.

```
Compare merge candidates:
  Merge ("a", "b")    -> new token, likelihood gain: 0.0032
  Merge ("th", "e")   -> new token, likelihood gain: 0.0417
  Merge ("ing", " ")  -> new token, likelihood gain: 0.0281
```

WordPiece picks ("th", "e") because the probability lift is largest. The result is that WordPiece tends to create tokens that are more linguistically meaningful (common prefixes, suffixes, and root words) compared to BPE's purely frequency-driven merges.

Who uses it: BERT, DistilBERT, ELECTRA, and most encoder-only models from Google.

SentencePiece is a framework by Google (Kudo and Richardson, 2018) that wraps both BPE and Unigram tokenization. Its defining innovation: it operates directly on raw text without requiring a pre-tokenization step. Most tokenizers need whitespace/punctuation splitting before training, which ties them to a language-specific concept of "word." SentencePiece treats the input as a raw sequence of Unicode characters and encodes whitespace as an ordinary symbol (▁), which makes it language-agnostic.

```
Raw text:               "Hello世界"
With pre-tokenization:  ["Hello", "世界"]                       <- language-dependent
SentencePiece (raw):    ["H", "e", "l", "l", "o", "世", "界"]   <- no pre-tokenization needed
```

Who uses it: Llama 2, Gemma, T5, XLNet (in Unigram mode).

Unigram (Kudo, 2018) flips the problem around.
Instead of greedily building up a vocabulary from characters, it starts with a large vocabulary of candidate tokens and prunes it down using a probabilistic model.

How it works: Unigram models each token as an independent event and learns a probability distribution over the vocabulary. The segmentation of a word is the sequence of tokens whose probabilities multiply to the highest score.

```
Vocabulary: {"UN": 0.02, "UNIC": 0.005, "NI": 0.01, "UNI": 0.015, ...}
Input: "UNICORN"

Candidate segmentations and their scores:
  UN + I + C + O + R + N  -> 0.02 × 0.03 × 0.04 × 0.02 × 0.01 × 0.02 = 9.6e-11
  UNI + C + O + R + N     -> 0.015 × 0.04 × 0.02 × 0.01 × 0.02       = 2.4e-9
  UNIC + O + R + N        -> 0.005 × 0.02 × 0.01 × 0.02              = 2.0e-8   <-- best

Unigram picks the highest-probability segmentation: UNIC + O + R + N
```

Because Unigram evaluates multiple candidate segmentations and chooses the best one probabilistically, it is slower to tokenize than BPE but produces more consistent token-to-meaning mappings. The probabilistic nature also enables subword regularization: randomly sampling alternative segmentations during training to improve robustness.

Who uses it: T5, XLNet, ALBERT (all via SentencePiece in Unigram mode).
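Finding the best segmentation does not require enumerating every candidate; a Viterbi-style dynamic program over string positions is the usual approach. The sketch below is illustrative only: `viterbi_segment` is a hypothetical name, and the single-letter probabilities are assumptions read off the UNICORN example rather than values from a trained model.

```python
import math

def viterbi_segment(text, probs):
    """Return (tokens, probability) for the most probable segmentation of `text`."""
    n = len(text)
    # best[i] = (log-probability of the best split of text[:i], start of its last token)
    best = [(-math.inf, -1)] * (n + 1)
    best[0] = (0.0, -1)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            # Only extend prefixes that are themselves reachable.
            if piece in probs and best[start][0] > -math.inf:
                score = best[start][0] + math.log(probs[piece])
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Follow the back-pointers from the end of the string to recover the tokens.
    tokens, end = [], n
    while end > 0:
        start = best[end][1]
        tokens.append(text[start:end])
        end = start
    return tokens[::-1], math.exp(best[n][0])

# Toy vocabulary; single-letter entries are assumed from the worked example.
VOCAB = {
    "UN": 0.02, "UNIC": 0.005, "NI": 0.01, "UNI": 0.015,
    "I": 0.03, "C": 0.04, "O": 0.02, "R": 0.01, "N": 0.02,
}

tokens, prob = viterbi_segment("UNICORN", VOCAB)
print(tokens)  # ['UNIC', 'O', 'R', 'N']
```

Working in log space avoids underflow on long inputs, and the same table can be reused for sampling alternative segmentations if subword regularization is wanted.
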
| Property | BPE | WordPiece | SentencePiece BPE | Unigram LM |
|---|---|---|---|---|
| Vocabulary building | Greedy merge by frequency | Greedy merge by likelihood | Greedy merge by frequency (same as BPE) | Start big, prune by likelihood |
| Pre-tokenization required | Yes (whitespace/punctuation) | Yes | No (raw text) | No (raw text) |
| Deterministic segmentation | Yes | Yes | Yes | Yes by default (sampling possible) |
| Typical vocab size | 32K-200K | 30K | 32K-128K | 32K-256K |
| Speed | Fast | Fast | Fast | Medium (Viterbi decoding) |
| Multilingual handling | Weak (needs large vocab) | Moderate | Best (language-agnostic) | Best (language-agnostic + sampling) |
| Rare word handling | Decomposes to chars | Decomposes to chars | Decomposes to chars, or bytes with byte fallback | Decomposes to subwords |
| Primary users | OpenAI, Meta, Mistral | Google (BERT) | Meta (Llama 2), Google (Gemma) | Google (T5, XLNet) |

Here is a Python snippet using tiktoken (OpenAI's BPE tokenizer library) to see how different inputs break apart:

```python
import tiktoken

# GPT-4o uses the o200k_base encoding
enc = tiktoken.get_encoding("o200k_base")

test_strings = [
    "Hello, world",
    "restablecer",          # Spanish
    "Das ist fantastisch",  # German
    "こんにちは",             # Japanese
    "def fibonacci(n): return n if n <= 1 else fibonacci(n-1) + fibonacci(n-2)",
]

for s in test_strings:
    tokens = enc.encode(s)
    # Decode each token id on its own to see the individual pieces
    token_strs = [enc.decode([t]) for t in tokens]
    print(f"{s!r:45s} -> {len(tokens):3d} tokens: {token_strs[:6]}...")
```

Output (approximate for o200k_base; exact splits and counts can differ, so run the snippet to confirm):

```
Hello, world                                  ->  3 tokens: 'Hello', ',', ' world'
restablecer                                   ->  8 tokens: 'rest', 'able', 'cer', ...
Das ist fantastisch                           ->  6 tokens: 'Das', ' ist', ' fant', 'ast', 'isch', ...
こんにちは                                      ->  5 tokens: 'こ', 'ん', 'に', 'ち', 'は'
def fibonacci(n): return n if n <= 1 else ... -> 22 tokens: 'def', ' fib', 'onacci', ...
```

Notice how the Spanish word takes 8 tokens while an analogous English word of similar length might take 3-4. This is the cost asymmetry that shows up on your monthly bill.
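To put a rough number on that asymmetry, the arithmetic can be sketched as below, using the 42-token and 103-token query sizes from the opening example. The price and query volume are hypothetical placeholders, not any provider's actual rates, and `monthly_cost` is an illustrative helper.

```python
def monthly_cost(tokens_per_query, queries, usd_per_million_tokens):
    """Input-token spend in USD for a given query volume."""
    return tokens_per_query * queries * usd_per_million_tokens / 1_000_000

PRICE_PER_MILLION = 2.50  # hypothetical USD per 1M input tokens
QUERIES = 1_000_000       # hypothetical monthly query volume

english = monthly_cost(42, QUERIES, PRICE_PER_MILLION)
spanish = monthly_cost(103, QUERIES, PRICE_PER_MILLION)

print(f"English: ${english:,.2f}")                # English: $105.00
print(f"Spanish: ${spanish:,.2f}")                # Spanish: $257.50
print(f"Ratio:   {spanish / english:.2f}x")       # Ratio:   2.45x
```

The ratio depends only on tokens per query, so it holds at any price point; the absolute gap scales linearly with volume.
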
Here is a diagram showing how a single word passes through each tokenizer type: php flowchart TD A "Input: 'unbelievable'" -- B "Pre-tokenization