Atlantic reporter Alex Reisner has identified four datasets of music being used to train artificial intelligence models and built a fully searchable public database from them. The disclosure puts previously opaque training pipelines under direct scrutiny: Google and Stability AI have both confirmed use of the datasets in published research papers.
Scale of the Datasets
Two of the four collections are vast by any measure. One contains roughly 12 million tracks; a second holds approximately 9 million. The remaining two are smaller but still substantial, each exceeding 100,000 songs. Taken together, the datasets represent a sweeping cross-section of recorded music that AI developers have been able to pull from, often without public acknowledgment.
The datasets have been downloaded thousands of times, according to Reisner's reporting. Because access logs do not capture downstream usage, it is not possible to determine the full list of organizations or researchers who have drawn on them.
Confirmed Use by Major AI Players
Google and Stability AI are the two companies that have publicly acknowledged using the datasets, each doing so within research papers. Neither confirmation came as a voluntary disclosure ahead of Reisner's investigation; both surfaced through the academic record. The acknowledgments are significant because they connect household AI brands directly to specific music collections whose licensing terms were not designed with model training in mind.
The Free Music Archive dataset, one of the sources Reisner identified, is cleared for personal streaming use. However, the terms governing redistribution or use in AI training pipelines are a separate matter, and that distinction sits at the center of the broader copyright debate now sweeping the industry.
Why the Searchable Index Changes the Conversation
Making the data searchable shifts the power dynamic between rights holders and AI developers. Before Reisner's work, artists and labels had no practical way to determine whether their recordings had been swept into a training corpus. A public, queryable database inverts that asymmetry, giving musicians, publishers, and litigators a concrete tool to trace potential infringement.
The disclosure arrives as copyright lawsuits against AI companies have multiplied across content categories, including text, images, and code. Music has lagged slightly behind those fronts, but the existence of documented, multi-million-track datasets tied to named AI developers gives plaintiffs and regulators a factual foundation that earlier disputes sometimes lacked.
Reisner's methodology — identifying datasets, confirming their use through the research literature, and then surfacing the findings publicly — mirrors the approach that has driven accountability reporting in other corners of the AI training debate. The music industry will now have a clearer map of where to look.