What AI Owes to Creators: Carnegie Mellon Researcher Urges Congress to Confront the Cost of Unlicensed Training Data
By Jennifer Monahan
As generative artificial intelligence accelerates into the mainstream, a fundamental question is coming into focus: What happens when machines learn by consuming copyrighted works—without permission?
On July 16, Carnegie Mellon University Professor Michael D. Smith brought that question to the U.S. Senate. In testimony before the Senate Committee on the Judiciary’s Subcommittee on Crime and Counterterrorism, Smith outlined how the unchecked use of pirated content in AI training could erode the creative economy and distort the incentives that fuel innovation.
His message to lawmakers was grounded in more than two decades of empirical research on digital media markets, copyright enforcement, and technology’s impact on the creative industries.
“We’ve been here before,” Smith told the committee.
When Smith began studying these dynamics in the early 2000s, digital piracy was a relatively new problem for the creative industries. “Many in the tech community…argued that piracy was fair use because it would not harm legal sales, was unlikely to harm creativity, and any legislative efforts to curtail piracy would not only be ineffective but would also stifle innovation,” Smith said.
But the data told a different story.
Learning from the Past
Smith’s testimony drew from an extensive body of peer-reviewed literature, including a 2020 piracy landscape study conducted with colleagues Brett Danaher and Rahul Telang for the U.S. Patent and Trademark Office. That analysis concluded that digital piracy reduced both individual income for creators and broader investment in creative work. Importantly, it also showed that well-designed legislative interventions—such as anti-piracy enforcement and platform accountability—were effective in reversing those harms.
Today’s generative AI models, Smith argued, raise strikingly similar issues. In many cases, models are trained on vast datasets scraped from the internet, including books, music, news articles, and other copyrighted materials. While some companies have pursued licensing deals, others continue to rely on unlicensed sources.
The latter approach, Smith said, carries real risks—not just for authors and artists, but for the health of the entire creative ecosystem.
“If training on pirated data is considered legal, then gen AI firms will have strong incentives to add new content to online repositories of stolen works,” Smith explained.
Economic and Legal Ramifications
Smith’s testimony also spotlighted the economic distortions that emerge when some firms license data while others do not. Recent legal cases have surfaced internal discussions in which tech companies weighed the value of licensing agreements against the potential damage such deals could do to their legal defense strategies.
In one example cited during the hearing, a company worried that signing even a single license might weaken its argument for “fair use.” In such an environment, Smith said, creators are left with little negotiating power—essentially forced to accept unfavorable terms or risk having their work used without compensation.
The downstream effects could be significant: reduced public access to freely available content, weakened markets for licensed creative work, and a growing perception that intellectual property protections no longer apply in the age of AI.
A Path Toward Sustainability
Despite these challenges, Smith’s message was not one of pessimism. He pointed to past examples—like the emergence of legal streaming platforms such as Netflix and Spotify—as proof that sustainable, market-based solutions are possible when regulation and innovation work in tandem.
“A vibrant technology economy depends on a vibrant creative economy,” Smith told the subcommittee. “We found a way to make licensed streaming and sales channels work for consumers, copyright owners, and platforms in the early 2000s, and we must do the same for generative AI.”
That path, he suggested, will require renewed clarity in copyright policy, stronger enforcement mechanisms, and a shift in industry norms toward transparency and respect for original work.
Broader Implications
The testimony arrives at a time when questions about AI’s ethical, economic, and legal boundaries are drawing increasing scrutiny from policymakers and the public. Beyond the immediate concerns of licensing and compensation, the debate also touches on long-term issues of trust, labor displacement, and the integrity of information.
By emphasizing economic evidence and historical precedent, Smith’s remarks added weight to the growing consensus that generative AI, like past waves of innovation, requires thoughtful governance—not just technological optimism.
The stakes extend well beyond individual creators. They reach into the structures that support a functioning digital society: equitable markets, informed consumers, and the freedom to create without fear of exploitation.
As Congress weighs its next steps, Smith’s testimony serves as both a warning and a roadmap—one that underscores how policy can help ensure that the benefits of AI are shared, not stolen.