Neural Machine Translation for Low-Resource Tangkhul-English



Researchers present a low-resource machine translation system for the Tangkhul-English language pair, achieving a BLEU score of 39.97 using a ByT5-large model fine-tuned on 38,336 parallel sentences. The study highlights orthographic challenges and domain bias in the training corpus, which consists of biblical, story, and conversational data.

Published Jun 25, 2026

arXiv:2606.25365v1

Abstract: We present a study on low-resource machine translation for the Tangkhul-English (nmf-en) language pair. Tangkhul is a severely under-resourced Tibeto-Burman language spoken primarily in Manipur, India, with virtually no prior natural language processing infrastructure. We describe two systems: (1) a primary system based on ByT5-large fine-tuned on 38,336 Tangkhul-English parallel sentence pairs, and (2) a contrastive system based on mT5-small fine-tuned on the same corpus. Our primary ByT5-large system achieves a corpus BLEU score of 39.97, chrF++ of 58.07, BERTScore F1 of 0.8104, and COMET (wmt22-comet-da) of 0.7302 on a held-out test set of 3,856 sentences. We further discuss the orthographic challenges specific to Tangkhul's Latin-script diacritics, the domain bias of our training corpus (which comprises biblical text, stories, and conversational data), and avenues for future improvement through data diversification and domain adaptation.
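The abstract does not say how diacritics were handled, but a small sketch may help show why orthography matters for a byte-level model such as ByT5. ByT5 reads text as UTF-8 bytes, each shifted by 3 to leave room for its three special tokens, so the same visible character can reach the model as different id sequences depending on whether it is stored precomposed or as a base letter plus a combining mark. The helper names `byt5_like_ids` and `normalize` below are hypothetical, the NFC step is an assumed preprocessing choice rather than the authors' pipeline, and the sample character is illustrative only, not taken from the paper's corpus.

```python
import unicodedata
from typing import List

# ByT5 reserves ids 0-2 for <pad>, </s>, <unk>; a UTF-8 byte b maps to id b + 3.
BYT5_OFFSET = 3
EOS_ID = 1


def byt5_like_ids(text: str) -> List[int]:
    """Approximate ByT5 input ids: shifted UTF-8 bytes plus end-of-sequence."""
    return [b + BYT5_OFFSET for b in text.encode("utf-8")] + [EOS_ID]


def normalize(text: str) -> str:
    """Assumed preprocessing step: compose characters into Unicode NFC form."""
    return unicodedata.normalize("NFC", text)


# Illustrative characters only, not drawn from the Tangkhul corpus.
precomposed = "\u0101"   # a with macron as a single code point
decomposed = "a\u0304"   # plain 'a' followed by a combining macron

print(byt5_like_ids(precomposed))   # two byte ids, then EOS
print(byt5_like_ids(decomposed))    # three byte ids, then EOS
print(byt5_like_ids(normalize(decomposed)) == byt5_like_ids(precomposed))
```

Under these assumptions, the two spellings differ at the byte level until both are normalized, which is one plausible way inconsistent diacritic encoding could fragment an already small training corpus.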

source & further reading


Read original on arxiv.org → arxiv.org/abs/2606.25365

