The International Confederation of Music Publishers (ICMP) has compiled a damning dossier that accuses the world's largest AI companies of harvesting tens of millions of copyrighted songs without permission to train their generative models. The evidence, gathered over two years and shared with Billboard and other outlets, implicates Google, Microsoft, Meta, OpenAI, and Anthropic in what ICMP Director John Phelan calls "the largest IP theft in human history."
The investigation reveals that AI systems like Google's Gemini, Microsoft's Copilot, Meta's Llama 3, and OpenAI's Jukebox ingested music from licensed platforms such as YouTube and Spotify, as well as from open repositories and leaked datasets. ICMP's findings include URL lists, private dataset manifests, and model output analyses that appear to show verbatim reproductions of copyrighted lyrics and musical structures by artists including Beyoncé, Bob Dylan, The Weeknd, Mariah Carey, and Kanye West.
The Scale of the Alleged Infringement
ICMP's dossier paints a portrait of systematic, large-scale scraping. The organization claims that AI developers actively crawled public platforms and gathered audio files and lyrics en masse, converting them into training data for commercial and research models. According to ICMP, the process involved:
- Web crawlers and targeted scrapers that pulled content from YouTube, Spotify, and other services, often ignoring terms of service.
- GitHub repositories and leaked manifests containing direct links to copyrighted recordings.
- Public datasets like AudioSet and research corpora that were augmented with unlicensed material.
The trade body asserts that the ingestion went far beyond ephemeral caching: the copied material was allegedly used to train the models that now power developer APIs and consumer products. "Despite their public claims that they're not training upon copyright-protected works, we've caught many of them red-handed," Phelan said, emphasizing that the dossier includes "extensive evidence" of infringement.
Which Models and Companies Are Named?
The ICMP investigation casts a wide net, naming specific models and the companies behind them:
- OpenAI's Jukebox – A research project openly trained on 1.2 million songs, which OpenAI acknowledges included copyrighted material.
- Google Gemini – Google's flagship multimodal model, accused of scraping music for training.
- Microsoft Copilot – Microsoft's AI assistant, which allegedly used unlicensed music data in its development.
- Meta Llama 3 – Meta's openly released large language model, implicated in the evidence for audio and lyric ingestion.
- Anthropic Claude – Anthropic's conversational AI. The company previously reached an agreement with publishers over lyric output guardrails but is now named for potential training data violations.
- Audio-generation startups Suno and Udio – Both implicated through leaked dataset manifests and output tests that show stylistic mimicry.
The dossier alleges that these companies and their models made copies of recorded tracks, not just lyrics or metadata, to train systems capable of reproducing musical style, melody, and vocal likeness.
Copyright, Fair Use, and the Legal Battleground
At the heart of the dispute is whether training AI on copyrighted material constitutes infringement or a transformative fair use. Copyright law protects musical compositions (melody and lyrics) and sound recordings (the specific performance). Training a model requires making copies of the works it learns from, and that copying generally requires a license unless an exception applies.
In the United States, the fair use doctrine examines four factors: the purpose and character of the use, the nature of the copyrighted work, the amount and substantiality of the portion used, and the effect of the use on the potential market for the work. AI companies often argue that training is transformative because models learn statistical patterns rather than storing copies. However, courts have issued mixed rulings in analogous cases involving text and image models. The U.S. Copyright Office is studying the matter, while the EU is moving toward transparency mandates under the AI Act.
ICMP's dossier strengthens the hand of rightsholders by providing specific evidence of copying (URLs, dataset manifests, and output examples), which could help plaintiffs reach discovery in litigation. Several parallel lawsuits are already underway: labels and publishers have sued over lyric reproduction, and Anthropic's earlier agreement with publishers over output guardrails shows that lyric reproduction was already contested before the dossier appeared.
Company Responses and Silence
Most named companies either declined to comment or issued guarded statements following the ICMP revelations. OpenAI, for example, has previously emphasized that Jukebox was a research-only release and that the company has not pursued commercial music generation as a primary product line. However, the dossier's inclusion of models like Gemini and Copilot—unmistakably commercial products—undercuts the argument that all uses were purely experimental.
Some companies have pointed to output safeguards: filters that block verbatim lyrics, voice imitation restrictions, and watermarking. Yet ICMP's complaint centers on the ingestion stage—the act of copying itself—which such mitigations do not address. As the legal landscape shifts, the industry is bracing for a period of regulatory and judicial scrutiny that may redefine what constitutes lawful AI training.
Strengths of the ICMP Case
Several factors make the ICMP dossier unusually strong:
- Specificity: The evidence includes URLs and manifests tied to identifiable commercial works, moving beyond generalized accusations.
- Industry backing: ICMP represents major global publishers like Universal and Sony, which have the legal and forensic resources to pursue litigation.
- Corroboration: Prior lawsuits and independent investigations (e.g., leaked Udio and Suno datasets, lyric-copying incidents) align with ICMP's findings, creating a web of overlapping evidence.
These elements mean the dossier is not merely rhetorical; it provides a factual foundation that could survive early procedural challenges in court.
Weaknesses and Legal Hurdles
Despite its strengths, the case faces significant obstacles:
- Proving ingestion in discovery: AI companies may argue that training involved heterogeneous datasets where the presence of any single work is not dispositive. Plaintiffs must show that specific copyrighted works were used and that outputs materially replicate them—a technically demanding process.
- Transformative use defense: Courts have not yet settled whether training constitutes infringement. Companies will likely argue that models extract statistical patterns, not protected expression, and that any copying is ephemeral and non-consumptive.
- Jurisdictional complexity: Training spans multiple countries with different copyright regimes, complicating enforcement and legal strategy.
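To make the "outputs materially replicate them" hurdle more concrete, here is a minimal sketch of one way an analyst might quantify verbatim overlap between a model output and a reference text. It is an illustration only, not the method ICMP or any court uses: the function names, the 5-word window, and the placeholder strings are all assumptions, and a high score would indicate textual overlap, not legal infringement.

```python
import re


def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in text, ignoring punctuation."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(reference: str, output: str, n: int = 5) -> float:
    """Fraction of the output's n-grams that also occur in the reference."""
    out_grams = word_ngrams(output, n)
    if not out_grams:
        return 0.0  # output shorter than n words: nothing to compare
    return len(out_grams & word_ngrams(reference, n)) / len(out_grams)


# Placeholder strings standing in for a protected text and two model outputs.
reference = "the quick brown fox jumps over the lazy dog near the quiet river bank"
copied = "The quick brown fox jumps over the lazy dog!"
unrelated = "a slow green turtle walks under the busy bridge at noon today"

print(overlap_ratio(reference, copied))     # every 5-gram of the output is in the reference
print(overlap_ratio(reference, unrelated))  # no shared 5-grams
```

A word-level measure like this only speaks to lyrics. Showing that a melody or a sound recording was replicated would need audio-domain techniques that this sketch does not cover.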
Even with robust evidence, litigation could take years and cost millions, with uncertain outcomes.
Policy and Technical Responses
The dossier is already shaping the policy and commercial landscape:
- EU AI Act documentation requirements: The EU AI Act requires providers of general-purpose AI models to publish a summary of the content used for training, a transparency lever that rightsholders see as critical. ICMP is pushing for robust implementation.
- Copyright Office studies: The U.S. Copyright Office is examining how existing law applies to AI training, and Congress may consider opt-in or opt-out frameworks.
- Licensing marketplaces: Several proposals advocate for industry-wide licensing schemes where AI firms pay for training rights, creating auditable provenance chains.
- Technical safeguards: Watermarking, provenance tools, and dataset scanners can help detect unauthorized content, while output filters can reduce direct replication.
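As a rough illustration of the "dataset scanner" idea, the sketch below checks a training manifest against a list of identifiers that rightsholders have reserved. Everything in it is hypothetical: the CSV column names, the example URLs, the ISRC-style codes, and the function names are assumptions made for the example, and a real scanner would also need audio fingerprinting to catch files that carry no identifier.

```python
import csv
import io

# Hypothetical manifest: one row per training item. Column names are assumptions.
MANIFEST_CSV = """url,isrc
https://example.com/audio/001.wav,XX-AAA-24-00001
https://example.com/audio/002.wav,
https://example.com/audio/003.wav,xxbbb2400042
"""

# Hypothetical reservation list of ISRC-style codes supplied by rightsholders.
RESERVED_ISRCS = {"XXAAA2400001", "XXBBB2400042"}


def normalize_isrc(code: str) -> str:
    """Uppercase and strip separators so differently formatted codes compare equal."""
    return "".join(ch for ch in code.upper() if ch.isalnum())


def flag_reserved(manifest_text: str, reserved: set[str]) -> list[dict[str, str]]:
    """Return the manifest rows whose identifier appears in the reserved set."""
    rows = csv.DictReader(io.StringIO(manifest_text))
    return [row for row in rows if normalize_isrc(row["isrc"] or "") in reserved]


for row in flag_reserved(MANIFEST_CSV, RESERVED_ISRCS):
    print(row["url"])  # rows 001 and 003 match; row 002 has no identifier
```

Matching on identifiers is cheap and auditable, but it only works when manifests carry reliable metadata, which is one reason provenance documentation matters to rightsholders.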
Implications for Creators, Platforms, and Windows Users
The ICMP action has broad implications:
- Music creators and publishers: The moment offers leverage to secure new revenue streams and stronger protections. Publishers should register works in machine-readable format and use industry opt-out tools to signal reservation against scraping.
- Platforms and AI developers: Companies building generative audio features must implement strict provenance audits, negotiate licenses, and obtain warranties from model vendors. The risk of statutory damages for unlicensed training can be existential.
- Windows users and developers: As AI-powered music tools proliferate on Windows (via Copilot and third-party apps), users may see more limited imitative capabilities. Developers integrating audio generation should demand clear provenance documentation and vendor indemnities.
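One widely used, voluntary way to signal a reservation against scraping is a robots.txt file. The sketch below uses Python's standard `urllib.robotparser` to show how a well-behaved crawler could check such a signal before fetching a page. The crawler name "ExampleAIBot", the domain, and the rules are invented for illustration, and robots.txt is only one possible opt-out mechanism, not necessarily the industry tooling the article refers to.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a publisher might serve; "ExampleAIBot" is a made-up crawler.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


url = "https://publisher.example/lyrics/song-1"
print(may_fetch(ROBOTS_TXT, "ExampleAIBot", url))  # the AI crawler is disallowed
print(may_fetch(ROBOTS_TXT, "SomeBrowser", url))   # other agents are allowed
```

A signal like this depends entirely on crawlers choosing to honor it, which is why the dossier's allegations about ignored terms of service matter.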
The era of unchecked web scraping for AI is closing, replaced by an emerging regime of accountability and transactional licensing. For the Windows ecosystem—where Copilot and developer APIs are deeply integrated—the ICMP dossier is a clear signal to prioritize licensed, auditable training data.
The Path Ahead
ICMP's escalation comes at a pivotal moment. The music industry is pursuing a two-track strategy: litigation to establish legal precedents and policy advocacy to reshape the rules. The EU AI Act's documentation mandates could force unprecedented transparency, while U.S. court cases may resolve key fair use questions. In parallel, commercial negotiations may yield licensing frameworks that satisfy both AI innovators and creators.
The dossier reframes the debate from abstract concerns to concrete evidence. As discovery motions unfold and regulators sharpen their oversight, the AI companies named will face mounting pressure to prove their training data is clean—or pay the price.