Open but Incompatible: A License Compatibility Analysis of Corpora for Low-Resource African Languages

wpnews.pro


Source: arxiv.org · Published: 2026-06-30T04:00Z · Topic: natural-language-processing


A new audit of over twenty corpus families used in African NLP finds widespread license incompatibilities: CC-BY-SA and CC-BY-NC datasets cannot be combined in a single published dataset, and a NoDerivs clause prohibits tokenisation and annotation. The study documents four failure modes with primary-source evidence, including the JW300 corpus, removed from OPUS after a legal audit confirmed a Terms of Service violation, and the WAXAL corpus, whose CC-BY 4.0 claim is contradicted by its own HuggingFace dataset card. The paper closes with a pre-annotation due diligence checklist and a survey of legally clean enrichment opportunities for low-resource African languages.

Published Jun 30, 2026

arXiv:2606.28867v1. Abstract: Creative Commons licenses dominate African NLP corpus releases, but their compatibility rules are rarely applied. CC-BY-SA and CC-BY-NC cannot be combined in a single published dataset; a NoDerivs clause silently prohibits tokenisation and annotation. This paper audits the license provenance of over twenty corpus families used in African NLP, constructs a six-tier compatibility matrix, and applies it to three case-study languages: Kituba/Munukutuba, Zarma, and Moore. Four failure modes are documented with primary-source evidence: outright prohibition (JW300, removed from OPUS after a legal audit confirmed Terms of Service violation); composite license misrepresentation (WAXAL, whose CC-BY 4.0 claim is contradicted by its own HuggingFace dataset card); a NoDerivs clause hidden behind a CC-BY label (Tanzil); and data persistence failure (the Congolese Radio Corpus, where 402 of 405 source URLs are now dead). A pre-annotation due diligence checklist and a survey of legally clean enrichment opportunities close the paper.
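The two compatibility rules stated in the abstract can be illustrated with a small Python sketch. This is not the paper's six-tier matrix, which the abstract does not spell out; it is a simplified check that assumes each license can be reduced to three flags (ShareAlike, NonCommercial, NoDerivs). The `License` class and the `combination_problems` function are hypothetical names introduced only for illustration, and the output is not legal advice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class License:
    """Hypothetical, simplified model of a Creative Commons license."""
    name: str
    share_alike: bool = False
    non_commercial: bool = False
    no_derivs: bool = False


CC_BY = License("CC-BY")
CC_BY_SA = License("CC-BY-SA", share_alike=True)
CC_BY_NC = License("CC-BY-NC", non_commercial=True)
CC_BY_ND = License("CC-BY-ND", no_derivs=True)


def combination_problems(licenses):
    """Return human-readable problems with publishing one derived
    dataset built from sources under the given licenses."""
    problems = []

    # Rule 1 (from the abstract): a NoDerivs clause rules out
    # publishing tokenised or annotated versions of the source.
    for lic in licenses:
        if lic.no_derivs:
            problems.append(
                f"{lic.name}: NoDerivs blocks tokenised or annotated releases"
            )

    # Rule 2 (assumed simplification): a ShareAlike source without
    # NonCommercial forces the combined dataset onto a license that
    # permits commercial use, which a NonCommercial source forbids.
    for sa in licenses:
        if not sa.share_alike or sa.non_commercial:
            continue
        for other in licenses:
            if other.non_commercial:
                problems.append(
                    f"{sa.name} and {other.name} cannot share one published dataset"
                )

    return problems


if __name__ == "__main__":
    for issue in combination_problems([CC_BY_SA, CC_BY_NC, CC_BY_ND]):
        print(issue)
```

Under these assumptions, `combination_problems([CC_BY, CC_BY_SA])` returns an empty list, while any mix containing both CC-BY-SA and CC-BY-NC is flagged. A real audit would also need to verify the license actually attached to each source, since the WAXAL and Tanzil cases in the abstract show that the advertised label can be wrong.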

source & further reading

arxiv.org — original article


Read original on arxiv.org → arxiv.org/abs/2606.28867

