Microsoft and OpenAI, facing allegations of copyright infringement from The New York Times and other publishers, argued in federal court on Tuesday that their use of publicly available news articles to train large language models (LLMs) is lawful under the fair use doctrine.
The lawsuit, filed in December 2023 and now consolidated with similar claims from The New York Daily News and the Center for Investigative Reporting, contends that AI systems like ChatGPT and Microsoft Copilot have leveraged copyrighted material without authorization, undermining publishers’ revenues and intellectual property rights.
The plaintiffs assert that these AI models, trained on datasets containing millions of articles, can reproduce or summarize their content in ways that substitute for the original works. “This is about replacing the content, not transforming it,” said Ian Crosby, representing The New York Times.
Crosby warned that such practices could divert between 30% and 50% of online news traffic away from publishers’ websites.
Fair Use Doctrine at the Heart of the Case
OpenAI’s defense hinges on the argument that their use of news data is transformative and therefore protected by the fair use doctrine. Joseph Gratz, an attorney for OpenAI, explained to Judge Sidney Stein that ChatGPT processes data by breaking it into smaller units called tokens, allowing the model to recognize patterns and generate new content rather than directly replicating text.
Gratz said that regurgitating entire articles "is not what it is designed to do and not what it does," arguing that outputs resembling copyrighted material typically occur only when specific user prompts deliberately attempt to elicit such responses.
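The tokenization process Gratz described can be illustrated with a toy sketch. The greedy longest-match tokenizer and tiny vocabulary below are invented for illustration only; OpenAI's actual tokenizers use byte-pair encoding over vocabularies of roughly 100,000 tokens.

```python
# Toy illustration of splitting text into tokens (hypothetical vocabulary;
# not OpenAI's actual tokenizer, which uses byte-pair encoding).
def tokenize(text, vocab):
    """Greedily split text into the longest substrings found in vocab,
    falling back to single characters when nothing matches."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible match first.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

vocab = {"train", "ing", " data", " model"}
print(tokenize("training datas", vocab))
# The word "training" is split into the units "train" and "ing".
```

The point of the defense's argument is that the model learns statistical patterns over such units rather than storing articles verbatim; whether that framing satisfies fair use is the question before the court.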
Microsoft’s legal team supported these claims, drawing parallels between AI training and earlier technological innovations such as VCRs and copy machines, which were initially contested but ultimately deemed lawful.
They argued that fair use allows for the development of technologies that benefit society without compromising the rights of content creators. “Copyright law is no more an obstacle to the LLM than it was to the VCR (or the player piano, copy machine, personal computer, internet, or search engine),” the company stated in its court filings.
Publishers Claim Financial and Ethical Harm
The publishers argue that the unlicensed use of their content not only violates copyright law but also threatens their financial sustainability. The lawsuit highlights specific examples where AI tools summarize articles or provide product recommendations that bypass publishers’ paywalls.
According to the Times, Microsoft’s Bing Chat—now rebranded as Copilot—has redirected potential readers away from its affiliate platform Wirecutter, reducing traffic and revenue.
Steven Lieberman, representing The New York Daily News, criticized the tech companies’ reliance on sources like Common Crawl, a nonprofit organization that aggregates web data for public use. He described the practice as “free riding” on the work of journalists and publishers, enabling AI companies to monetize content they did not create or license.
While OpenAI argues that this approach democratizes access to data, critics point out that it includes copyrighted materials without proper vetting.
Compounding the issue is OpenAI’s use of retrieval-augmented generation (RAG), a method that integrates real-time information from the web into AI-generated responses. Although this technique enhances the relevance and accuracy of outputs, it raises questions about how publishers’ content is accessed and reproduced.
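A RAG pipeline can be sketched in miniature. The keyword-overlap retriever and the prompt format below are simplifying assumptions; production systems retrieve via web search or vector indexes before passing the retrieved text to the model, which is exactly the step that worries publishers.

```python
# Minimal sketch of retrieval-augmented generation (RAG). The retriever
# here ranks documents by word overlap with the query; real systems use
# live web search or embedding similarity. Names and documents are
# hypothetical examples, not from any actual system.
def retrieve(query, documents, k=1):
    """Return the k documents sharing the most words with the query."""
    q_words = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query, documents):
    """Prepend retrieved text to the user's question, as a RAG system would
    before calling the language model."""
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}"

docs = [
    "The best budget vacuum cleaner of 2024, reviewed and ranked.",
    "How to file taxes online this year.",
]
print(build_prompt("which budget vacuum is best", docs))
```

Because the retrieved context is drawn from publishers' pages at query time, RAG outputs can track source articles far more closely than anything learned during training, which is why the technique features in the plaintiffs' paywall-bypass claims.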
High Stakes: Potential Dataset Destruction and Financial Penalties
The lawsuit seeks billions of dollars in damages and calls for the destruction of datasets containing unauthorized materials. Such a ruling could have profound implications for OpenAI and Microsoft, forcing them to rebuild their AI systems using only licensed or public domain content.
Federal copyright law allows statutory damages of up to $150,000 per work for willful infringement, a figure that could escalate dramatically given the volume of material involved.
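A back-of-the-envelope calculation shows how quickly the statutory cap compounds. The work counts below are invented purely for arithmetic; the filings do not specify how many works a court might count.

```python
# Hypothetical scaling of willful-infringement statutory damages.
# Work counts are illustrative assumptions, not figures from the case.
MAX_WILLFUL = 150_000  # statutory cap per infringed work, in dollars

for works in (1_000, 100_000):
    total = works * MAX_WILLFUL
    print(f"{works:>7,} works -> ${total:,}")
# At 100,000 works, the theoretical maximum already reaches $15 billion.
```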
Delayed Media Manager Tool and Industry Responses
The lawsuit also underscores frustrations over OpenAI’s delayed rollout of its Media Manager tool, initially promised in May 2024 to give creators greater control over how their content is used in AI training datasets.
Critics argue that this failure leaves smaller publishers and independent creators with limited options for protecting their intellectual property.
While more than 200 publications, including TIME, The New Yorker, Vogue, Vanity Fair, Bon Appetit, and Wired, have secured licensing agreements with OpenAI, many smaller players lack the resources to negotiate similar deals.
The broader industry remains divided, with some companies embracing partnerships to license content for AI development, while others pursue litigation. In Canada, a coalition of publishers has filed lawsuits accusing OpenAI of “widespread scraping,” and prominent authors like Michael Chabon have voiced similar concerns.
Judge to Rule on Dismissal Motion
Judge Sidney Stein, who demonstrated a strong understanding of the technical issues during the hearing, has yet to rule on the defendants’ motion to dismiss.
Stein acknowledged the complexity of the case, stating that fair use would likely play a pivotal role in his decision. The outcome could set a critical precedent for how generative AI systems interact with copyrighted materials and the obligations of developers toward content creators.
As the legal proceedings continue, the implications extend far beyond OpenAI and Microsoft. This case has the potential to shape the future of generative AI, balancing innovation with the rights of publishers and creators.