[ARTICLE · art-38213] src=letsdatascience.com ↗ pub= topic=artificial-intelligence verified=true sentiment=↓ negative

The Atlantic Reveals 21 Million Songs Circulating in AI Datasets

The Atlantic's investigation, led by reporter Alex Reisner, found four public music datasets that together contain about 21.2 million copyrighted recordings used in AI training. The two largest, LAION-DISCO-12M and Sleeping-DISCO-9M, include about 12.6 million and 9 million tracks respectively. Google and Stability AI were identified as users of the smaller Free Music Archive dataset. The findings raise concerns about consent and licensing in generative audio development.

Published Jun 24, 2026 · 4 min read

An investigation published by The Atlantic, led by reporter Alex Reisner, found that four publicly circulating music datasets together contain roughly 21.2 million copyrighted recordings. Per The Atlantic, the largest collections are LAION-DISCO-12M (about 12.6 million tracks) and Sleeping-DISCO-9M (about 9 million tracks), with two smaller datasets including the Free Music Archive (around 100,000 tracks). The Atlantic also launched a free searchable tool, the AI Watchdog, that lets artists check whether their work appears in the collections. Reporting by DJ Mag and MusicTech notes that Google and Stability AI were identified as having drawn on the Free Music Archive dataset. The investigation highlights questions about consent, licensing, and how large-scale scraping is being used in generative audio development.

What happened

An investigation published by The Atlantic, led by reporter Alex Reisner, found that four music datasets circulating in the AI development ecosystem together contain approximately 21.2 million recordings. Per The Atlantic, the two largest datasets are LAION-DISCO-12M, containing about 12.6 million tracks, and Sleeping-DISCO-9M, containing about 9 million tracks. The remaining two collections include the Free Music Archive dataset at roughly 100,000 tracks and a second smaller compilation described in the reporting. The Atlantic made the collections searchable through a free tool called the AI Watchdog so artists, labels, and others can query whether specific works appear in those lists, according to the published report.

Technical details

Per The Atlantic's reporting summarized by MusicTech and WeRaveYou, LAION-DISCO-12M was assembled by LAION using an automated recursive crawl that matched seed artist lists to streaming URLs and was released under an Apache 2.0 licence. The Sleeping-DISCO-9M dataset was compiled by the Sleeping AI Research Collective and has also been hosted on platforms such as Hugging Face. The Atlantic's documentation and MusicTech note that many collections are distributed as lists of links rather than bundled audio files, and that developers commonly use automated download tools to retrieve audio at scale. MusicTech attributes to Reisner the observation that such download methods can bypass mechanisms that generate revenue for creators and may violate platform terms of service.
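
Because these collections are distributed as link lists rather than audio, a first audit step can happen without downloading anything. The sketch below counts entries per host to show where a list points. It assumes a hypothetical list of URL strings; the sample URLs and the `summarize_hosts` name are illustrative and are not taken from any of the datasets.

```python
from collections import Counter
from urllib.parse import urlparse


def summarize_hosts(urls):
    """Count link-list entries per host name, without fetching any URL."""
    counts = Counter()
    for url in urls:
        # hostname is None for malformed or relative entries
        host = urlparse(url.strip()).hostname
        counts[host or "<unparseable>"] += 1
    return counts


# Hypothetical sample entries, for illustration only.
sample_links = [
    "https://music.example.com/track/1",
    "https://music.example.com/track/2",
    "https://archive.example.org/audio/abc",
    "not a url",
]

if __name__ == "__main__":
    for host, n in summarize_hosts(sample_links).most_common():
        print(f"{host}: {n}")
```

A per-host count of this kind only indicates where links point; it says nothing about the licence status of any individual track.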

Industry context

Editorial analysis: Public reporting frames these findings as exposing a gap between how some developers describe training data and the practical realities of large-scale audio ingestion. The datasets mix commercially released music, independent releases, and Creative Commons material, which complicates questions of consent and compensation when the lists are repurposed for model training. Reactions quoted across outlets show artists and producers expressing concern after using the AI Watchdog tool to find their own work in the datasets.

Context and significance

Editorial analysis: For practitioners building or auditing generative audio systems, the investigation highlights two persistent operational risks: dataset provenance and licensing ambiguity. Large link-based collections reduce friction for experimentation but also lower the barrier to ingesting commercially released content without explicit rights clearance. That pattern raises legal exposure for downstream users, and it increases the importance of provenance tracking, auditable licences, and supplier due diligence in dataset pipelines.
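
One way to make a link-based dataset auditable is to record a fingerprint of the exact manifest that was ingested, so a later review can confirm which version of the list was used. The following is a minimal sketch under stated assumptions: the manifest is a hypothetical list of dictionaries with `url` and `licence` fields, and neither the field names nor the `manifest_fingerprint` function come from the reporting.

```python
import hashlib
import json


def manifest_fingerprint(entries):
    """Return a SHA-256 hex digest of a link manifest, independent of entry order."""
    # Canonicalise: sort keys within each entry, then sort the serialised entries.
    canonical = sorted(json.dumps(entry, sort_keys=True) for entry in entries)
    digest = hashlib.sha256("\n".join(canonical).encode("utf-8"))
    return digest.hexdigest()


# Hypothetical manifest entries, for illustration only.
manifest = [
    {"url": "https://music.example.com/track/1", "licence": "CC-BY-4.0"},
    {"url": "https://music.example.com/track/2", "licence": None},
]

if __name__ == "__main__":
    print(manifest_fingerprint(manifest))
```

Storing the digest alongside a training run gives auditors a fixed reference point: any change to a URL or licence field in the manifest produces a different fingerprint.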

What to watch

Editorial analysis: Observers and rights holders will likely monitor three indicators:

  • whether platform operators or major AI developers disclose more granular provenance for their audio training data
  • legal or regulatory actions prompted by evidence surfaced through the AI Watchdog tool
  • adoption of dataset filtering or provenance tooling by research groups and vendors

Reporting to date identifies Google and Stability AI as having drawn on the Free Music Archive dataset, per DJ Mag and MusicTech; however, The Atlantic notes that pinpointing which commercial systems used the larger link-based collections is difficult because training data disclosures remain sparse.

Implications for practitioners

Editorial analysis: Teams building generative audio models should treat large, publicly shared link lists as high-risk inputs until licence and provenance are validated. Structured approaches include maintaining link-level provenance metadata, prioritising openly licensed corpora, and integrating legal review into data ingestion workflows. The Atlantic's AI Watchdog provides an empirical starting point for rights holders seeking visibility, but it does not by itself resolve licensing or entitlement questions.
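
The idea of link-level provenance metadata with a default of holding back unvalidated inputs can be sketched as follows. Everything here is an assumption for illustration: the `LinkRecord` fields, the `partition_for_ingestion` helper, and the example licence allowlist are hypothetical, and a real allowlist would come from legal review rather than from code.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed allowlist for illustration; a real one would come from legal review.
ALLOWED_LICENCES = {"CC0-1.0", "CC-BY-4.0"}


@dataclass(frozen=True)
class LinkRecord:
    url: str
    source_dataset: str
    licence: Optional[str] = None  # None means the licence is not yet determined
    reviewed: bool = False         # True once legal review has signed off


def partition_for_ingestion(records):
    """Split records into (cleared, held); anything unvalidated is held back."""
    cleared, held = [], []
    for rec in records:
        ok = rec.reviewed and rec.licence in ALLOWED_LICENCES
        (cleared if ok else held).append(rec)
    return cleared, held


# Hypothetical records, for illustration only.
records = [
    LinkRecord("https://music.example.com/track/1", "example-list", "CC-BY-4.0", True),
    LinkRecord("https://music.example.com/track/2", "example-list", "CC-BY-4.0", False),
    LinkRecord("https://music.example.com/track/3", "example-list", None, True),
]

if __name__ == "__main__":
    cleared, held = partition_for_ingestion(records)
    print(f"cleared={len(cleared)} held={len(held)}")
```

The design choice is that a record is excluded unless both conditions hold, so a missing licence or a skipped review never results in silent ingestion.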

Overall, the reporting catalogues multiple public datasets and pairs them with a searchable tool, making the scale and composition of audio training material more visible and raising operational, legal, and ethical questions for researchers, vendors, and rights holders.

Scoring Rationale

The story materially raises dataset-provenance and licensing risks that affect generative audio projects and compliance processes. It is notable for scale and visibility but stops short of announcing regulatory or platform-wide changes.
