{"slug": "equity-with-efficiency-an-empirical-study-of-tokenizers-for-multilingual-large", "title": "Equity with Efficiency: An Empirical Study of Tokenizers for Multilingual Large Language Models", "summary": "A new empirical study systematically compares tokenizers for multilingual large language models across 11 Southeast Asian languages, finding that Parity-aware BPE achieves the best balance of compression efficiency and cross-lingual equity, while Morphology-Driven Byte Encoding offers superior semantic reasoning at higher computational cost. The results show that cross-lingual fairness and tokenization efficiency are not fundamentally opposed, providing guidance for designing equitable multilingual models.", "body_md": "arXiv:2606.15044v1\nAbstract: Multilingual large language models (LLMs) depend on subword tokenization to bridge discrete text and continuous neural representation. State-of-the-art multilingual LLMs often use Byte-level Byte-Pair Encoding (BPE) tokenizers that structurally favor high-resource languages and Latin scripts. For speakers of underrepresented languages, particularly those across Southeast Asia, this bias inflates inference costs and widens cross-lingual capability gaps. We present the first systematic comparison of equitable tokenizers on a unified benchmark spanning 11 Southeast Asian languages. Beyond tokenizer-level analysis of compression efficiency and cross-lingual equity, we assess downstream task performance through controlled 1.5B-parameter language model training using the same training data. Our results show that Parity-aware BPE lies on the Pareto frontier of the efficiency-equity trade-off, achieving strong compression parity at competitive cost. Morphology-Driven Byte Encoding delivers the best semantic reasoning performance through morphologically richer representations, albeit at a higher computational expense. 
Byte Latent Transformer underperforms on downstream tasks, possibly because its architectural assumptions misalign with the constraints of limited low-resource training data. Together, our findings demonstrate that cross-lingual fairness and tokenization efficiency are not fundamentally at odds, and offer practical guidance for designing equitable multilingual models.", "url": "https://wpnews.pro/news/equity-with-efficiency-an-empirical-study-of-tokenizers-for-multilingual-large", "canonical_source": "https://arxiv.org/abs/2606.15044", "published_at": "2026-06-16 04:00:00+00:00", "updated_at": "2026-06-16 04:23:11.017078+00:00", "lang": "en", "topics": ["large-language-models", "natural-language-processing", "ai-ethics"], "entities": ["arXiv", "Byte-level Byte-Pair Encoding", "Parity-aware BPE", "Morphology-Driven Byte Encoding", "Byte Latent Transformer"], "alternates": {"html": "https://wpnews.pro/news/equity-with-efficiency-an-empirical-study-of-tokenizers-for-multilingual-large", "markdown": "https://wpnews.pro/news/equity-with-efficiency-an-empirical-study-of-tokenizers-for-multilingual-large.md", "text": "https://wpnews.pro/news/equity-with-efficiency-an-empirical-study-of-tokenizers-for-multilingual-large.txt", "jsonld": "https://wpnews.pro/news/equity-with-efficiency-an-empirical-study-of-tokenizers-for-multilingual-large.jsonld"}}