Equity with Efficiency: An Empirical Study of Tokenizers for Multilingual Large Language Models

A new empirical study systematically compares tokenizers for multilingual large language models across 11 Southeast Asian languages, finding that Parity-aware BPE achieves the best balance of compression efficiency and cross-lingual equity, while Morphology-Driven Byte Encoding offers superior semantic reasoning at higher computational cost. The results show that cross-lingual fairness and tokenization efficiency are not fundamentally opposed, providing guidance for designing equitable multilingual models.

arXiv:2606.15044v1 Announce Type: new Abstract: Multilingual large language models (LLMs) depend on subword tokenization to bridge discrete text and continuous neural representation. State-of-the-art multilingual LLMs often use byte-level Byte-Pair Encoding (BPE) tokenizers that structurally favor high-resource languages and Latin scripts. For speakers of underrepresented languages, particularly those across Southeast Asia, this bias inflates inference costs and widens cross-lingual capability gaps. We present the first systematic comparison of equitable tokenizers on a unified benchmark spanning 11 Southeast Asian languages. Beyond tokenizer-level analysis of compression efficiency and cross-lingual equity, we assess downstream task performance through controlled training of 1.5B-parameter language models on the same training data. Our results show that Parity-aware BPE lies on the Pareto frontier of the efficiency-equity trade-off, achieving strong compression parity at competitive cost. Morphology-Driven Byte Encoding delivers the best semantic reasoning performance through morphologically richer representations, albeit at higher computational expense. Byte Latent Transformer underperforms on downstream tasks, possibly because its architectural assumptions misalign with the constraints of limited low-resource training data. Together, our findings demonstrate that cross-lingual fairness and tokenization efficiency are not fundamentally at odds, and offer practical guidance for designing equitable multilingual models.
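
The abstract does not define how compression efficiency or cross-lingual equity is measured. As a rough illustration only, one common way to expose tokenizer bias is to tokenize parallel sentences and compare each language's token count with a reference language. The sketch below assumes that metric; the function names (`token_counts`, `parity_ratios`), the raw UTF-8 byte tokenizer, and the toy English/Thai sentence pair are all hypothetical and are not taken from the paper.

```python
from typing import Callable, Dict, List

Tokenizer = Callable[[str], List]


def utf8_byte_tokenize(text: str) -> List[int]:
    """Hypothetical stand-in tokenizer: one token per UTF-8 byte, no merges."""
    return list(text.encode("utf-8"))


def token_counts(tokenize: Tokenizer, parallel: Dict[str, List[str]]) -> Dict[str, int]:
    """Total number of tokens per language over a parallel corpus."""
    return {
        lang: sum(len(tokenize(sentence)) for sentence in sentences)
        for lang, sentences in parallel.items()
    }


def parity_ratios(counts: Dict[str, int], reference: str) -> Dict[str, float]:
    """Token count of each language divided by the reference language's count.

    A ratio of 1.0 means the same content costs the same number of tokens;
    a ratio above 1.0 means the language is more expensive to encode.
    """
    ref = counts[reference]
    return {lang: n / ref for lang, n in counts.items()}


# Toy parallel data (assumed for illustration): the same sentence in two languages.
parallel = {
    "en": ["I eat rice"],
    "th": ["ฉันกินข้าว"],
}

counts = token_counts(utf8_byte_tokenize, parallel)
ratios = parity_ratios(counts, reference="en")

for lang, ratio in ratios.items():
    print(f"{lang}: {counts[lang]} tokens, parity ratio {ratio:.2f}")
```

With a raw byte tokenizer the Thai sentence costs more tokens than its English counterpart, because each Thai character takes several UTF-8 bytes. That is the kind of script-level disparity the abstract attributes to byte-level BPE vocabularies dominated by Latin-script merges. Swapping in a trained tokenizer's encode function for `utf8_byte_tokenize` would give the same comparison for a real vocabulary.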