How Google, Meta, and Microsoft Built AI on Scraped Copyrighted Content While Banning Everyone Else

A two-year investigation by the International Confederation of Music Publishers (ICMP) has produced a dossier documenting what the group calls "the largest IP theft in human history." It alleges that Google, Microsoft, Meta, OpenAI, and other tech giants systematically trained their AI models on copyrighted music, lyrics, and video content, even though their own terms of service prohibit the same kind of scraping on their platforms. The accusation, detailed in a Billboard exclusive and corroborated by a separate analysis in The Atlantic, exposes a glaring double standard at the heart of the generative AI boom: companies demand that others seek permission before using their data, yet operate vast, opaque data harvesting pipelines themselves.

ICMP director general John Phelan told Billboard that "tens of millions of works" are being infringed daily, pointing to evidence that Meta’s Llama 3 ingested lyrics by artists such as The Weeknd and Ed Sheeran, that AI music apps Udio and Suno scraped YouTube, and that Anthropic’s Claude replicated hundreds of song lyrics including "American Pie" and "Halo." Microsoft’s Copilot and Google’s Gemini similarly reproduced copyrighted lyrics, the dossier claims. The evidence, which ICMP says includes private datasets, leak-sourced manifests, and court filings, paints a picture of catalog-level ingestion rivaling the scale of any collection effort in digital history.

YouTube: The Mother Lode of AI Training Data

The music industry’s outcry is matched by findings from a parallel investigation into video training data. The Atlantic reported that at least 15.8 million YouTube videos from over 2 million channels have been downloaded without permission and bundled into at least 13 datasets used by companies including Microsoft, Meta, Amazon, Nvidia, Runway, ByteDance, Snap, and Tencent. A dedicated tool now lets creators search for their videos in these sets. Many of the datasets stripped titles and channel names but preserved unique YouTube IDs, so the mass archiving remains traceable even though it was never consent-based.
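Because the video IDs survive in these datasets, a lookup of this kind can be approximated with a few lines of code. The following is a minimal sketch, not the tool The Atlantic describes: it assumes a hypothetical manifest with one bare ID or YouTube URL per line, and it assumes IDs are 11 characters long, which matches current practice but is not a documented guarantee. The function names are illustrative.

```python
import re
from typing import Iterable, List, Optional
from urllib.parse import parse_qs, urlparse

# Assumption: video IDs are 11 characters from letters, digits, "-" and "_".
VIDEO_ID = re.compile(r"^[A-Za-z0-9_-]{11}$")
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}


def extract_video_id(entry: str) -> Optional[str]:
    """Return the video ID from a bare ID or a common YouTube URL, else None."""
    entry = entry.strip()
    if VIDEO_ID.match(entry):
        return entry
    parsed = urlparse(entry)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        candidate = parsed.path.lstrip("/")
    elif host in YOUTUBE_HOSTS:
        candidate = parse_qs(parsed.query).get("v", [""])[0]
    else:
        return None
    return candidate if VIDEO_ID.match(candidate) else None


def find_my_videos(manifest_lines: Iterable[str], my_video_ids: Iterable[str]) -> List[str]:
    """Return, sorted, the IDs from my_video_ids that appear in the manifest."""
    dataset_ids = set()
    for line in manifest_lines:
        video_id = extract_video_id(line)
        if video_id is not None:
            dataset_ids.add(video_id)
    return sorted(dataset_ids & set(my_video_ids))
```

A creator would pass the lines of a dataset manifest together with the IDs of their own uploads; any ID returned appears in that dataset. The check says nothing about how the video was used.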

These datasets, analyzed by Proof News and Wired, include the widely used “Pile” corpus containing 173,536 YouTube subtitle entries from 48,000 channels, as well as industrial collections such as HowTo100M, ACAV100M, and HD-VILA-100M. Often curated to favor high view counts, “aesthetic quality,” and camera movement, they fed text-to-video generators now embedded in consumer products: Meta’s Movie Gen, Snap’s AI Video Lenses, and Google’s Veo 3. A leaked spreadsheet from Runway, reported by 404 Media, showed channels prioritized for “super high quality sci-fi short films” and car cinematics, labeled “THE HOLY GRAIL OF CAR CINEMATICS SO FAR.”

The scale is not limited to external scrapers. Internal platform data is also harnessed—Google trained on at least 70 million YouTube clips, Meta on more than 65 million Instagram clips. Creators who built their livelihoods on these platforms now compete with synthetic content derived from their own work, often without compensation or consent.

A Double Standard in Plain Sight

Platform terms explicitly bar automated scraping, yet the companies behind those platforms have built their AI on scraped data. YouTube’s developer policies require prior written permission for automated collection beyond what robots.txt allows; Meta’s terms forbid unapproved automated data collection; and X, Google, OpenAI, Microsoft, and Adobe have similar clauses. ICMP’s Phelan underscored the contradiction: “All we hear from AI and tech companies is, ‘We need exceptions to build an open internet and access data, wholescale, without licenses, for our training.’ What our work on AI shows is that at the very same time, they’re demanding everybody else get prior written permission before using their content.”

The contractual inconsistency is not merely rhetorical—it has legal and competitive implications. Creators rely on platform terms to control reuse and monetize their work. When these same platforms turn a blind eye to scraping for AI training, they privilege their own commercial ambitions over the protections they offer users. Furthermore, selective enforcement of terms of service may be seen as a breach of good faith, opening the door to litigation and regulatory action.

Legal Battles and the Fair Use Question

Lawsuits are piling up. Music publishers are suing Anthropic, which has agreed to implement guardrails to prevent verbatim lyric reproduction as part of a litigation compromise. Suno faces its own lawsuit, as do Midjourney and other image-generation firms. The New York Times is locked in a high-stakes copyright suit against OpenAI. Each case chips away at the industry’s common defense: that training on publicly available data is fair use. While U.S. courts have yet to settle that question definitively, early rulings and settlements suggest that storing pirated copies or producing near-verbatim outputs can tip the scales toward infringement.

Court filings have already produced evidence of model outputs replicating copyrighted text. In the Anthropic case, plaintiffs cited Claude’s reproduction of lyrics from “American Pie” and “Halo.” Such reproductions undermine claims that models merely learn patterns and do not copy. Discovery orders in ongoing cases may force companies to reveal internal training manifests, shedding light on practices that have remained hidden behind corporate secrecy.

Transparency: Metadata Abounds, Yet Disclosure Lags

One of the most revealing aspects of the dossier is the granularity of the metadata attached to scraped content. Private manifests and leaked spreadsheets show tracks tagged with artist, genre, tempo, and explicit lyrics, and video datasets annotated with scene types and caption alignments. This level of curation demolishes the argument that disclosing training data sources is technically infeasible. On the contrary, it indicates that companies already possess detailed provenance records—the kind the EU’s AI Act may soon require for high-risk models.

The evidence suggests that traceability is not a technical hurdle but a choice, one that companies have so far avoided making public. The EU AI Act’s provisions on data governance and provenance logs signal a shift toward accountability, but enforcement timelines and technical standards remain works in progress. Without strong regulations, the onus falls on companies to voluntarily adopt measures that would resolve the double standard: systematic provenance logging, opt-in licensing at scale, and consistent enforcement of terms of service.

The Creator and Consumer Fallout

The economic threat is real. Generative AI can churn out background music, stock footage, and even news summaries, commoditizing creative labor and diverting revenue from licensing, performance royalties, and direct sales. Platforms that depend on creator content for engagement risk losing the trust of those same creators, as seen in recent moves by YouTube to let creators opt into third-party AI training—a concession born of mounting pressure. High-profile creators and large publishing houses can marshal public opinion and legal resources, and platform trust erodes further when creators feel exploited.

For consumers, the immediate impact is less tangible but no less significant: the content they value is increasingly generated by algorithms trained on uncompensated work, potentially lowering quality and diversity over time. The datasets themselves also carry contamination risks: poisoned or misattributed data can degrade model performance and complicate compliance audits, creating technical debt that will be expensive to fix.

Paths to a Sustainable Model

Resolving the double standard requires action on multiple fronts. Platforms and AI firms should implement mandatory provenance logging, recording per-work sources for datasets used in training and maintaining those records for a set period (e.g., 7–10 years). This is feasible, as leaked manifests demonstrate. They should also scale creator opt-in and licensing mechanisms; YouTube’s recent third-party training opt-in is a start but must expand industry-wide. Enforcement of platform terms of service must be consistent: if scraping is prohibited, it must be blocked regardless of who is doing the scraping.
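To make the provenance-logging proposal concrete, here is a minimal sketch of what a per-work record and a retention check might look like. Everything in it is an assumption for illustration: the field names, the JSON-lines log format, and the 7-year default (the low end of the range suggested above) do not come from the dossier, the EU AI Act, or any company's actual practice.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date


@dataclass(frozen=True)
class ProvenanceRecord:
    """One training-data entry. All field names are hypothetical."""
    work_id: str         # e.g. a YouTube video ID or other work identifier
    source_url: str      # where the work was obtained
    license_status: str  # e.g. "licensed", "opt-in", "unknown"
    dataset: str         # name of the dataset the work was added to
    ingested_on: str     # ISO 8601 date, e.g. "2024-02-29"


def to_log_line(record: ProvenanceRecord) -> str:
    """Serialize a record as one line of an append-only JSON-lines log."""
    return json.dumps(asdict(record), sort_keys=True)


def retention_expires(record: ProvenanceRecord, retention_years: int = 7) -> date:
    """Return the date on which the record may first be deleted."""
    ingested = date.fromisoformat(record.ingested_on)
    try:
        return ingested.replace(year=ingested.year + retention_years)
    except ValueError:
        # Ingested on Feb 29 and the target year is not a leap year.
        return ingested.replace(year=ingested.year + retention_years, day=28)


def must_retain(record: ProvenanceRecord, today: date, retention_years: int = 7) -> bool:
    """True while the record is still inside its retention window."""
    return today < retention_expires(record, retention_years)
```

An append-only log with one record per ingested work is the simplest structure an auditor could query by work identifier; a real system would also need integrity protection and access controls, which this sketch leaves out.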

Creators and publishers can use platform controls and metadata tools to signal reuse preferences, pursue collective licensing schemes that scale to AI training scenarios, and invest in watermarking technologies. Publishers holding large catalogs must negotiate contractual guarantees with AI firms that specify training data provenance and prohibit ingestion of unlicensed content.

Regulators should impose data provenance obligations on high-impact AI models, clarify the scope of fair use in training contexts, and support technical standards for watermarking, dataset manifests, and audit trails. The EU AI Act is a step in this direction, but global coordination will be necessary to prevent fragmentation.

Conclusion

The ICMP dossier and The Atlantic investigation have forced a long-overdue reckoning. The data double standard cannot hold: companies cannot simultaneously argue for broad data access to fuel their AI while locking down their own platforms against the very same practices. The technical breadcrumbs—manifest files, leaked spreadsheets, and metadata logs—prove that transparency is possible. The question is whether the industry will embrace it voluntarily or be compelled by regulation and litigation.

As lawsuits proceed and regulators sharpen their focus, the next chapter will be written in courtrooms and policy hearings. The evidence gathered so far suggests that the AI industry’s hunger for data has outpaced its respect for copyright, and the bill is coming due. For now, users and creators can use tools like the YouTube ID search to check whether their work has been swept into training datasets. It’s a small measure of visibility in an otherwise opaque system, but it hints at the transparency that must become the norm if generative AI is to build on a foundation of consent rather than contention.