QuechuaTok: Morphological Boundary Accuracy as a Necessary Metric for Tokenizer Evaluation in Agglutinative Low-Resource Languages

Researchers introduced QuechuaTok, a benchmark evaluating tokenization strategies for Southern Quechua, a low-resource agglutinative language. They found that BPE achieved the lowest fertility rate but only 6.67% morphological boundary accuracy, while the morphology-aware PRPE tokenizer reached 83.33% accuracy, showing that fertility rate alone is insufficient for evaluating tokenizers in such languages.

Published Jun 24, 2026

arXiv:2606.23943v1. Abstract: Tokenization is a foundational step in NLP pipelines, yet standard evaluation metrics such as fertility rate fail to capture morphological correctness for agglutinative languages. We present QuechuaTok, a systematic benchmark comparing four tokenization strategies (BPE, Unigram LM, WordPiece, and a morphology-aware PRPE tokenizer) for Southern Quechua (quz), a low-resource agglutinative language spoken by 8-10 million people in South America. Using a 200k-sentence corpus and the SQUOIA finite-state morphological analyzer (Rios, 2016) as a silver standard, we evaluate three metrics: fertility rate, OOV rate, and morphological boundary accuracy (MorphAcc). Our results show that BPE achieves the lowest fertility rate (1.636 at 16k vocab) by memorizing surface word forms, while achieving only 6.67% MorphAcc. PRPE achieves 83.33% MorphAcc, the highest of all systems, demonstrating that fertility rate alone is insufficient to evaluate tokenizers for agglutinative languages. All code and models are publicly available at kaggle.com/code/macmaky/quechuatok
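
To make the contrast between the two metrics concrete, here is a minimal Python sketch. It is not the authors' implementation. The abstract does not define MorphAcc precisely, so the sketch assumes a word-level reading: a word counts as correct only when the tokenizer's internal split points match the gold morpheme boundaries exactly. Fertility is computed as the average number of tokens per word. The function names (`boundaries`, `fertility`, `morph_acc`) and the two example words with their segmentations are illustrative assumptions, not data from the benchmark.

```python
from typing import List, Sequence, Set


def boundaries(segments: Sequence[str]) -> Set[int]:
    """Internal character offsets at which a segmentation splits a word."""
    offsets: Set[int] = set()
    pos = 0
    for seg in segments[:-1]:  # the end of the last segment is the word end, not a split
        pos += len(seg)
        offsets.add(pos)
    return offsets


def fertility(tokenized: List[List[str]]) -> float:
    """Average number of tokens produced per word."""
    return sum(len(tokens) for tokens in tokenized) / len(tokenized)


def morph_acc(tokenized: List[List[str]], gold: List[List[str]]) -> float:
    """Share of words whose split points equal the gold morpheme boundaries.

    Assumed definition (word-level exact match); the paper may define it differently.
    """
    hits = sum(boundaries(t) == boundaries(g) for t, g in zip(tokenized, gold))
    return hits / len(gold)


# Illustrative gold segmentations: wasi-kuna-pi ("in the houses"), rima-ni ("I speak").
gold = [["wasi", "kuna", "pi"], ["rima", "ni"]]

# Hypothetical tokenizer outputs for the same two words.
whole_word = [["wasikunapi"], ["rimani"]]         # memorizes surface forms
morph_aware = [["wasi", "kuna", "pi"], ["rima", "ni"]]
misaligned = [["wasik", "unapi"], ["rima", "ni"]]  # splits, but in the wrong place

for name, output in [("whole_word", whole_word),
                     ("morph_aware", morph_aware),
                     ("misaligned", misaligned)]:
    print(name, fertility(output), morph_acc(output, gold))
```

In this toy setup the whole-word tokenizer gets the best (lowest) fertility while matching no morpheme boundaries, and the morphology-aware one gets the worst fertility while matching all of them. That is the same pattern the abstract reports for BPE and PRPE, although the numbers here are made up and far too small to mean anything on their own.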
