
Open but Incompatible: A License Compatibility Analysis of Corpora for Low-Resource African Languages

A new audit of over twenty African NLP corpus families finds recurring license incompatibilities: CC-BY-SA and CC-BY-NC datasets cannot be combined in a single published dataset, and a NoDerivs clause prohibits tokenisation and annotation. The study documents four failure modes with primary-source evidence, among them the JW300 corpus, removed from OPUS after a legal audit confirmed a Terms of Service violation, and the WAXAL corpus, whose CC-BY 4.0 claim is contradicted by its own HuggingFace dataset card. The paper closes with a pre-annotation due diligence checklist and a survey of legally clean enrichment opportunities for low-resource African languages.

Published Jun 30, 2026

arXiv:2606.28867v1

Abstract: Creative Commons licenses dominate African NLP corpus releases, but their compatibility rules are rarely applied. CC-BY-SA and CC-BY-NC cannot be combined in a single published dataset; a NoDerivs clause silently prohibits tokenisation and annotation. This paper audits the license provenance of over twenty corpus families used in African NLP, constructs a six-tier compatibility matrix, and applies it to three case-study languages: Kituba/Munukutuba, Zarma, and Moore. Four failure modes are documented with primary-source evidence: outright prohibition (JW300, removed from OPUS after a legal audit confirmed Terms of Service violation); composite license misrepresentation (WAXAL, whose CC-BY 4.0 claim is contradicted by its own HuggingFace dataset card); a NoDerivs clause hidden behind a CC-BY label (Tanzil); and data persistence failure (the Congolese Radio Corpus, where 402 of 405 source URLs are now dead). A pre-annotation due diligence checklist and a survey of legally clean enrichment opportunities close the paper.
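
To make the two rules named in the abstract concrete, here is a minimal Python sketch of a pre-merge license check. It is not the paper's six-tier matrix, which the abstract does not spell out. It encodes only two simplified assumptions: a NoDerivs (ND) clause blocks derived versions such as tokenised or annotated releases, and a ShareAlike license without a NonCommercial clause cannot be merged with a NonCommercial one. The function names and corpus names are hypothetical, and the output is a prompt for human review, not legal advice.

```python
import re
from itertools import combinations

def clauses(license_id: str) -> set[str]:
    """Extract Creative Commons clause codes from an identifier like 'CC-BY-NC 4.0'."""
    return set(re.findall(r"\b(BY|SA|NC|ND)\b", license_id.upper()))

def audit(corpora: dict[str, str]) -> list[str]:
    """Return human-readable warnings for a planned merge (simplified rules, see lead-in)."""
    problems = []

    # Rule 1 (assumption): NoDerivs forbids publishing adapted versions.
    for name, lic in corpora.items():
        if "ND" in clauses(lic):
            problems.append(f"{name} ({lic}): NoDerivs blocks derived releases")

    # Rule 2 (assumption): ShareAlike without NC conflicts with any NC license.
    for (a, lic_a), (b, lic_b) in combinations(corpora.items(), 2):
        ca, cb = clauses(lic_a), clauses(lic_b)
        a_sa_only = "SA" in ca and "NC" not in ca
        b_sa_only = "SA" in cb and "NC" not in cb
        if (a_sa_only and "NC" in cb) or (b_sa_only and "NC" in ca):
            problems.append(f"{a} ({lic_a}) + {b} ({lic_b}): ShareAlike vs NonCommercial")

    return problems

# Hypothetical corpora and licenses, for illustration only.
planned_merge = {
    "corpus_a": "CC-BY-SA 4.0",
    "corpus_b": "CC-BY-NC 4.0",
    "corpus_c": "CC-BY-ND 4.0",
    "corpus_d": "CC-BY 4.0",
}

for warning in audit(planned_merge):
    print(warning)
```

A check like this trusts the declared license string. The WAXAL and Tanzil cases in the abstract are ones where the label and the actual terms diverge, so the declared string still has to be verified against the primary source.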
