QuechuaTok: Morphological Boundary Accuracy as a Necessary Metric for Tokenizer Evaluation in Agglutinative Low-Resource Languages

Researchers introduced QuechuaTok, a benchmark evaluating tokenization strategies for Southern Quechua, a low-resource agglutinative language. They found that BPE achieved the lowest fertility rate (fewest tokens per word) but only 6.67% morphological boundary accuracy, while the morphology-aware PRPE tokenizer reached 83.33% accuracy, showing that fertility rate alone is insufficient for evaluating tokenizers in such languages.

arXiv:2606.23943v1. Abstract: Tokenization is a foundational step in NLP pipelines, yet standard evaluation metrics such as fertility rate fail to capture morphological correctness for agglutinative languages. We present QuechuaTok, a systematic benchmark comparing four tokenization strategies (BPE, Unigram LM, WordPiece, and a morphology-aware PRPE tokenizer) for Southern Quechua (quz), a low-resource agglutinative language spoken by 8-10 million people in South America. Using a 200k-sentence corpus and the SQUOIA finite-state morphological analyzer (Rios, 2016) as silver standard, we evaluate three metrics: fertility rate, OOV rate, and morphological boundary accuracy (MorphAcc). Our results show that BPE achieves the lowest fertility rate (1.636 at 16k vocab) by memorizing surface word forms, while achieving only 6.67% MorphAcc. PRPE achieves 83.33% MorphAcc, the highest of all systems, demonstrating that fertility rate alone is insufficient to evaluate tokenizers for agglutinative languages. All code and models are publicly available at kaggle.com/code/macmaky/quechuatok
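
To make the three metrics concrete, here is a minimal Python sketch. It is not the benchmark's code: the abstract does not give exact definitions, so the sketch assumes fertility is the mean number of tokens per word, OOV rate is the share of words that produce an unknown token, and MorphAcc is the share of words whose token boundaries exactly match the reference segmentation. The function names, the `<unk>` symbol, and the two-word toy data are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Set

Tokenizer = Callable[[str], List[str]]


def boundaries(pieces: List[str]) -> Set[int]:
    """Character offsets of the internal cuts between consecutive pieces."""
    cuts, pos = set(), 0
    for piece in pieces[:-1]:
        pos += len(piece)
        cuts.add(pos)
    return cuts


def fertility(words: List[str], tokenize: Tokenizer) -> float:
    """Mean number of tokens per word (assumed definition)."""
    return sum(len(tokenize(w)) for w in words) / len(words)


def oov_rate(words: List[str], tokenize: Tokenizer, unk: str = "<unk>") -> float:
    """Share of words that yield at least one unknown token (assumed definition)."""
    return sum(unk in tokenize(w) for w in words) / len(words)


def morph_acc(words: List[str], tokenize: Tokenizer,
              gold: Dict[str, List[str]]) -> float:
    """Share of words whose cuts exactly match the reference cuts (assumed definition)."""
    hits = sum(boundaries(tokenize(w)) == boundaries(gold[w]) for w in words)
    return hits / len(words)


# Toy reference segmentations (illustrative, not taken from the benchmark):
# wasi-kuna-pi "in the houses", wasi-yki "your house".
gold = {"wasikunapi": ["wasi", "kuna", "pi"], "wasiyki": ["wasi", "yki"]}
words = list(gold)


def whole_word(word: str) -> List[str]:
    """Stand-in for a tokenizer that has memorized full surface forms."""
    return [word]


def morph_aware(word: str) -> List[str]:
    """Stand-in for a tokenizer that splits at the reference morpheme boundaries."""
    return gold[word]


for name, tok in [("whole-word", whole_word), ("morph-aware", morph_aware)]:
    print(name, fertility(words, tok), oov_rate(words, tok), morph_acc(words, tok, gold))
```

On this toy data the whole-word tokenizer gets the better fertility (1.0 versus 2.5) while matching none of the reference boundaries, and the morpheme-aligned tokenizer matches all of them. That mirrors, on a small scale, the trade-off the abstract reports between BPE and PRPE.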