{"slug": "the-millions-of-songs-mashed-into-ai-generated-music", "title": "The Millions of Songs Mashed into AI-Generated Music", "summary": "A new investigation reveals four giant datasets of songs—one containing 12 million tracks—being shared within the AI-development community, used to train AI music generators like Suno. The datasets include hits from major artists such as Taylor Swift, Nirvana, and the Beatles, and have been linked to AI-generated songs that closely resemble copyrighted works, fueling lawsuits from record labels.", "body_md": "# The Millions of Songs Mashed Into AI-Generated Music\n\nExplore the astonishing amount of music available to AI developers.\n\nLast November, a pair of Olympic-bound figure skaters performed in a competition to a song with lyrics that sounded oddly familiar. “Every night we smash a Mercedes-Benz,” the singer began. It was one of several recognizable lines from the 1998 pop hit “You Get What You Give,” by the New Radicals. But the ice dancers’ song was otherwise different. The New Radicals’ message to angsty teenagers had been converted to Bon Jovi–style arena rock. If you knew “You Get What You Give,” this was a pretty strange variation on it.\n\nThe dancers had used music [generated by AI](https://techcrunch.com/2026/02/10/olympics-czech-ice-dancers-duo-ai-music/). Whatever model was involved had likely been trained on “You Get What You Give” and had copied some of the song’s content, as AI systems are [prone to do](https://www.theatlantic.com/technology/2026/01/ai-memorization-research/685552/). Such systems don’t always reproduce elements of existing songs in this way, but you’ll hear it now and then, and sometimes even more blatantly. 
For example, Suno, one of the most popular AI music generators, has pumped out tracks that strongly resemble Michael Jackson’s “[Thriller](https://suno.com/song/25022486-74a7-44d4-aeb1-d7eff220d1df),” Ed Sheeran’s “[Shape of You](https://suno.com/song/9e2f436f-1b39-47e2-891e-4bd0caa23ca0),” Chuck Berry’s “[Johnny B. Goode](https://suno.com/song/16df3d1e-f817-4904-b9a8-eb6b18b6583d),” Bill Haley & His Comets’ “[Rock Around the Clock](https://suno.com/song/143b2e40-ddf5-4610-96eb-b60c53965cc0),” B. B. King’s “[The Thrill Is Gone](https://suno.com/song/788fb38f-4499-4869-9be1-f704b3ce8068),” and others. Listen to Michael Jackson’s song alongside a Suno-generated track titled “Thriller”:\n\n### Thriller\n\n### Thriller\n\n(“Thriller” is just one of the dozens of examples provided by the major record labels in a lawsuit against Suno. You can hear two others below. Rachel Racusen, a spokesperson for Suno, told me that the platform uses “safeguards to protect against unauthorized distribution, impersonation and manipulations,” and directed me to a [LinkedIn post](https://www.linkedin.com/posts/jack-brody-84936737_last-week-we-shared-that-were-now-testing-activity-7470203287804420096-V4tq) by the company’s chief product officer saying that reproductions of training data “should not happen.” Racusen did not answer questions about the lawsuit or acknowledge any specific tracks that were used to train its models.)\n\nCases like these indicate something about how AI-based music products work. AI music generators can simulate human performances with surprising fidelity, but first they have to be trained on enormous quantities of those human performances. 
The actual recordings that go into any model are a closely guarded secret—AI companies have claimed they are proprietary—but the number of songs is almost certainly huge, spanning genres and time periods.\n\nAs part of my series of [investigations into AI training data](https://www.theatlantic.com/category/ai-watchdog/), I recently discovered four giant datasets of songs that are being shared within the AI-development community. One has 12 million tracks. Another has 9 million. The two smaller datasets each have more than 100,000. They include hits from major pop artists such as Bad Bunny, Nirvana, Taylor Swift, Billie Eilish, Pearl Jam, Elvis Costello, Sheryl Crow, and the Beatles. (The New Radicals’ “You Get What You Give” is in two of the datasets.) Jazz artists such as Miles Davis, John Zorn, and Vijay Iyer are featured, as are classical composers and tens of thousands of minor artists across genres. The 12-million-track dataset, on its own, would take 91 years to listen to.\n\nThese datasets are only four examples of the many sources available to AI developers. I found them by reading research papers published by developers and scouring AI data-sharing sites. The datasets have been downloaded thousands of times. [Google](http://arxiv.org/abs/2301.11325) has written about using one of them—more than 100,000 songs downloaded from the Free Music Archive, a site that allows free streaming for personal listening but requires payments for commercial use—to train AI models, and [Stability](https://arxiv.org/html/2407.14358) has used some songs from the same dataset. But because of the industry’s secrecy around training data, we don’t currently know who has used the others.\n\nWhat the datasets illustrate, primarily, is the scale and variety of music easily available to AI developers. 
Companies often claim to use only content that is freely available online, but the datasets reveal the quantity of downloadable music that developers can access even though it is not supposed to be free.\n\nThree of the datasets I found are distributed as a list of links to songs on YouTube or Spotify. AI developers download the actual audio using tools that automate the job, some of which allow developers to bypass logins, advertisements, and mechanisms that might earn money or subscribers for creators. Such tools violate the terms of service of these platforms. (The fourth dataset, the Free Music Archive collection, is distributed with MP3s.)\n\nThe datasets are similar in size to those that companies have used to train commercial-music-generating models. In 2022, Google [trained](http://arxiv.org/abs/2208.12415) a model on 44 million tracks, totaling 42 years of music. Suno [wrote](https://s3.documentcloud.org/documents/25023128/suno-legal-filing-8124.pdf) in a 2024 court filing that it trained its models on “essentially all music files of reasonable quality” that it could download from the internet. In 2020, OpenAI scraped 1.2 million songs from the web to train [a model](https://arxiv.org/abs/2005.00341) called Jukebox that was explicitly intended for generating [variations](https://soundcloud.com/openai_audio/classic-pop-in-the-style-of-elvis-presley?si=f995ebb6118d4168a0a5765440969925) on existing music.\n\nIn general, AI companies defend their right to train models on unlicensed music by arguing that the training is “fair use” under copyright law, meaning that AI models do not harm the market for creators’ work. This is a complex claim, and the legality likely depends on specifics of how an AI system is trained and deployed. Suno declined to comment on its legal arguments. 
Metin Parlak, a spokesperson for OpenAI, told me that the company has “always been transparent about how Jukebox was trained.” (The company published the procedure it used to train the model, though it did not list the songs.) Google also declined to comment for this article, but referred me to a [blog post](https://blog.google/innovation-and-ai/technology/ai/lyria-3-pro/) in which it says it has trained its audio-generating models on “materials that YouTube and Google has a right to use under our terms of service, partner agreements, and applicable law.” (YouTube is owned by Google.)\n\n### Shape of You\n\n### Girl, you know I want your love\n\nMusic-generating models work in a similar way to AI models that generate text: They break the training content down into tiny pieces (in this case, tiny snippets of audio rather than text) and “learn” about the context in which each piece appears. Then, when given a prompt (a context), they predict what piece comes next. The ease of generating AI music has quickly made it ubiquitous. Last September, Spotify [said](https://www.rollingstone.com/music/music-features/spotify-not-banning-ai-music-new-guidelines-1235434946/) it had removed 75 million “spammy” AI-generated tracks from its service. The streaming platform Deezer recently [reported](https://newsroom-deezer.com/2026/04/ai-generated-tracks-represent-44-of-new-uploaded-music/) that nearly half of the tracks it receives daily are AI generated. Unlike Spotify, Deezer excludes AI-generated tracks from its algorithmic recommendations, and labels albums that include AI tracks, although it does not display labels for individual tracks. Spotify does not label AI-generated music on its platform, nor does YouTube or Amazon Music.\n\nAmong the companies offering AI-music-generation products, Google is uniquely positioned to take advantage of a large existing audience. 
The tech giant has begun embedding the technology into its products: Google’s Gemini AI assistant can now generate 30-second music tracks based on a user’s uploaded text, photos, or video. And the company encourages video makers on YouTube to use AI-generated backing tracks, rather than licensing music from real musicians. For YouTubers who have gotten in trouble by using copyrighted music inappropriately, Google recently added a “Replace Song” button that will replace the music in their video with an AI-generated track.\n\nAI-generated music is being consumed directly on AI-product websites as well. Suno and its competitor Udio can be used as listening platforms much like Spotify or YouTube. The sites invite users to describe the music they want to hear, and can generate a track in a matter of seconds. The songs are mostly mundane, but can sound real enough that many listeners might struggle to recognize them as AI generated. (Udio did not respond to requests for comment.)\n\nIn an attempt to prevent their products from generating songs that duplicate existing music, AI companies implement detection software. But neither Suno nor Udio prevents users from generating songs in the style of real artists. Earlier this year, Sony [found](https://www.bbc.com/news/articles/cy57593gxe0o) 135,000 AI-generated tracks attributed to its artists on various streaming platforms. Although it’s not clear exactly which AI tools were used to generate those tracks, the technology is already harming artists’ ability to make a living from their music.\n\n### Johnny B. Goode\n\n### Deep down in Louisiana close to New Orle\n\nMusicians and labels have filed at least 12 lawsuits against AI companies for training models on copyrighted music. The music industry’s three major labels have sued both Suno and Udio, and others have sued Google, OpenAI, and smaller AI vendors. 
No rulings have been issued in these cases, but some of the labels have reached settlements with Suno and Udio.\n\nThe lawsuits allege copyright infringement, but even some artists who have chosen to share their music more freely still object to how AI companies are using their work. A case in point is the Free Music Archive. It was started in 2009 by the New Jersey radio station WFMU to serve the same purpose as radio—providing free music to listeners—but “designed for the age of the internet,” as the archive claimed on its original website. It’s a gold mine for rare, live, and non-mainstream recordings. And it’s a way for musicians to let listeners hear their music for free, typically while requiring that anyone who wants to make money from the music—say, by using it in a for-profit video—has to pay. Some artists also specify that their work cannot be used for commercial purposes.\n\nIn 2023, when Hessel van Oorschot, the head of Tribe of Noise, the company that operates the Free Music Archive, learned that Google was using FMA to train its AI models, he sent a letter demanding a discussion about consent and compensation. Van Oorschot described the response to me as “a big middle finger.” In a letter, which van Oorschot shared with me, Google refers to its privacy policy (which states that “we use publicly available information to help train Google’s AI models”) and goes on to argue that “we believe everyone benefits from a vibrant content ecosystem.” The company never directly addresses the Free Music Archive’s concerns.\n\nVan Oorschot, who is based in Amsterdam, told me he felt like he had no practical way to fight it. “For me to fly to America and start a lawsuit with Google” made no sense, he said.\n\nSome musicians have stopped sharing their music online because of concerns about their work being used against them by AI companies. Benn Jordan, a YouTuber who has made a living as a professional musician for more than 25 years, is one of them. 
He explained in [an April 2025 video](https://youtu.be/xMYm2d9bmEA) that he’d noticed tech companies were “scraping my music without my consent, then generating shittier music with it that is inadvertently associated with my name, and then attempting to resell that in the same economy in which I make money.” Jordan has developed a tool to “poison” generative-AI models. Essentially, his software adds noise to audio files that humans can’t hear but that confuses AI models. It’s the same technique used by some visual artists to fight the nonconsensual scraping of their work. The effectiveness of these tools has been debated, but researchers [have shown](http://arxiv.org/abs/2302.10149) that, in some cases, a few poisoned samples can significantly degrade an AI model.\n\nOn the Free Music Archive, the guitarist and singer Derek Clegg has been sharing his original, home-recorded songs for more than 15 years. Clegg told me he’s happy for people to put his music in the background of their personal videos, as long as they credit him. When people expect to make money from the use of his music, then they pay him for a license. More than 250 of Clegg’s songs are in the FMA dataset I found. I asked whether he would opt out of AI training if a mechanism for doing so existed. “Yeah, definitely,” he said.\n\nWhat bothers Clegg most is that AI companies take people’s music without consent, and without acknowledging that their tech products are entirely dependent on musicians. “It just seems dishonest. It seems like theft,” he said. 
“There’s going to have to be a reckoning.” That’s his hope, anyway.", "url": "https://wpnews.pro/news/the-millions-of-songs-mashed-into-ai-generated-music", "canonical_source": "https://www.theatlantic.com/technology/2026/06/ai-music-generators-suno-google-udio/687485/", "published_at": "2026-06-14 18:52:18+00:00", "updated_at": "2026-06-14 19:11:47.702166+00:00", "lang": "en", "topics": ["generative-ai", "ai-ethics", "ai-research", "ai-products"], "entities": ["Suno", "New Radicals", "Michael Jackson", "Taylor Swift", "Nirvana", "Beatles", "Billie Eilish", "Pearl Jam"], "alternates": {"html": "https://wpnews.pro/news/the-millions-of-songs-mashed-into-ai-generated-music", "markdown": "https://wpnews.pro/news/the-millions-of-songs-mashed-into-ai-generated-music.md", "text": "https://wpnews.pro/news/the-millions-of-songs-mashed-into-ai-generated-music.txt", "jsonld": "https://wpnews.pro/news/the-millions-of-songs-mashed-into-ai-generated-music.jsonld"}}