Almost every Irish and international artist you listen to is in the AI music datasets training AI models

Almost every music artist you can think of is included in the AI datasets uncovered and shared by The Atlantic, which may have been used as sources to train AI music generators.

What are you talking about, Niall?


The Atlantic has published four searchable databases as part of its “AI Watchdog” investigation, mapping out which songs may have been used to train AI music generators like Suno, Udio and Google’s models. Together the databases cover more than 20 million tracks. The reporting comes from journalist Alex Reisner, who has spent the past few years tracking AI training data across books, research papers and now music.

How big are the datasets, and where do they come from?

Four datasets in total. The largest, LAION-DISCO-12M, compiled by German nonprofit LAION, holds over 12.3 million tracks pulled from YouTube – around 91 years of music. A second dataset, SLEEPING-DISCO-9M, holds roughly 9 million tracks. Two smaller ones sit at around 100,000 songs each, with one built on the Free Music Archive, a project originally founded by New Jersey radio station WFMU. Most of these datasets don’t contain the actual audio files – just links to YouTube and Spotify, often gathered using tools that bypass logins, ads and the mechanisms that would normally earn money for the artist whose work is being scraped.
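As a rough sanity check on the LAION-DISCO-12M figures quoted above, the implied average track length can be worked out in a few lines. This is only a back-of-the-envelope sketch: both inputs are the rounded numbers from the reporting, the 365.25-day year is an assumption, and the function name is made up for illustration.

```python
# Back-of-the-envelope check using the rounded figures quoted in the article.
TRACKS = 12_300_000        # "over 12.3 million tracks"
YEARS_OF_AUDIO = 91        # "around 91 years of music"

# Assumption: an average year of 365.25 days.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def average_track_minutes(tracks: int, years: float) -> float:
    """Average track length in minutes implied by a total duration in years."""
    return years * MINUTES_PER_YEAR / tracks


avg = average_track_minutes(TRACKS, YEARS_OF_AUDIO)
print(f"Implied average track length: {avg:.1f} minutes")
```

The result comes out at just under four minutes per track, which is about what you would expect from a collection made up mostly of ordinary songs.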

Can I check if my own music is in there?


Yes. The tool requires no account, just a search box – enter an artist name, song title or album and it returns which dataset contains the material and, where known, when it was scraped. It’s already become something of a moment for individual musicians: checking your own name and seeing your own songs listed (or those of a favourite band) turns what was an abstract industry debate into something immediate and personal.

What Irish and international artists are on the datasets?

Irish artists including Lankum, CMAT, Pillow Queens, U2, The Script, Kettama, Sinead O’Connor, Kojaque, Gilla Band, Lisa Hannigan, Jazzy, Dermot Kennedy, Kneecap, James Vincent McMorrow, Khakikid, Rusangano Family, Curtisy and F3miii are in the datasets along with international artists like Olivia Rodrigo, Taylor Swift, Massive Attack, Tame Impala, Geese, Oklou, Lil Yachty, Dua Lipa, Charli XCX, Autechre, Boards Of Canada, Led Zeppelin, Nile Rodgers, The Doors and on and on – if it exists it’s probably referenced in the datasets.

Try it for yourself.

Does this prove any specific AI company used any specific dataset to train their models?

No, and The Atlantic is careful to say so. The investigation maps what is circulating within the AI-development community, not which company definitively trained on which dataset.

Google has stated its audio models were trained on “materials that YouTube and Google has a right to use under our terms of service”, whether that’s a dubious claim or not. Udio has acknowledged using publicly available online audio to assemble training material, while disputing that doing so constitutes infringement. Hmmm.

What’s the legal status of all this?

Unresolved. Universal Music Group, Sony Music and Warner sued Suno and Udio in June 2024, alleging mass copyright infringement, and Universal and Sony recently sought to add more than 61,000 additional recordings to the case against Suno – a move Suno has opposed. Suno is defending itself on fair-use grounds, arguing that training a generative model on copyrighted recordings is a transformative use under US copyright law.

A key summary-judgment hearing (RIAA vs. Suno & Udio) is scheduled for July 2026 before a judge in the District of Massachusetts, which could meaningfully shape the legal standard either way. As of now, none of the central legal questions have been decided on the merits.

Why does this matter beyond the lawsuits?

For years, AI music companies operated without disclosing their training data, asserting fair use while keeping the actual contents of their datasets secret. This investigation doesn’t settle the legal argument, but it does something the legal argument alone hasn’t managed – it makes the scale of the issue concrete and searchable by anyone, rather than leaving it as an abstract claim that AI companies train on generic “music from the internet”.

A comprehensive licensing settlement between labels and the AI companies could resolve a lot of this uncertainty; a court ruling either way on the fair-use question would do the rest. Until then, the gulf between what’s been scraped and what’s been licensed remains pretty big.

What can I do if…

You’re an artist and your music is in one of these datasets:

Document it. Take a screenshot of the search result showing your work in the database – this is now being cited as supporting evidence in the existing lawsuits and in smaller class actions filed by independent musicians who don’t have a label’s legal team behind them. If you’re with a label or a PRS/PPI-style collecting society, flag it to them directly; several are actively building cases around this exact kind of evidence right now.

Check your contracts. If you’re signed, find out whether your label or publisher’s agreement gives them the right to pursue AI training claims on your behalf, or whether that right sits with you. This is a genuinely new area and a lot of older contracts don’t clearly address it either way.

Be sceptical of “blanket licensing” pitches. Some AI companies have proposed retroactive licensing deals modelled on radio or streaming royalties – a collective fee rather than individual consent. Musicians’ rights organisations are pushing back hard on this framing, arguing that training a model to generate new music is fundamentally different from playing an existing recording, and deserves explicit consent and proper compensation rather than a blanket buyout. Worth having a view on this before anyone asks you to sign something.

You’re a fan:

The most useful thing you can do is simple: keep paying attention to where your money actually goes. Buying directly from artists – Bandcamp, vinyl at shows, platforms like Subvert that route money straight to the artist – matters more in this climate than ever, because it’s the revenue stream furthest removed from the scraping and licensing mess upstream.

If you use AI music tools yourself, it’s worth knowing what you’re actually engaging with. Asking an AI generator for “a song that sounds like [your favourite artist]” isn’t a neutral creative act – it’s drawing, often quite directly, on that artist’s actual recorded work without their consent or payment. That doesn’t mean never touching these tools, but going in with eyes open changes how you use them.

And generally: support the artists and labels pushing for transparency rather than the platforms resisting it. The pressure that got us to a searchable database in the first place came from lawsuits, journalism and public attention – not from AI companies volunteering the information.
