Microsoft Is Getting Sued over Using Nearly 200,000 Pirated Books for AI Training

The AI copyright war pivots from 'fair use' to data piracy. A new lawsuit targeting Microsoft and a key ruling against Anthropic put the industry's data sourcing methods under intense legal scrutiny.

A chaotic flurry of landmark court rulings and new lawsuits has dramatically reshaped the legal war over artificial intelligence, creating a contradictory and perilous battlefield for the tech industry. While Meta scored a major victory this week in a case over its AI training data, the win stands in stark contrast to a nuanced ruling against AI firm Anthropic that has redefined the front lines of the copyright fight. The legal ground is shifting from what AI models produce to how they were built in the first place.

The core of the conflict was laid bare in two separate federal court decisions just days apart. On Wednesday, a judge ruled that Meta did not violate the law when it trained its AI on authors’ books. This decision came just after a landmark ruling in a case against Anthropic established a critical new precedent: while the act of training an AI model on copyrighted books may be a “transformative” fair use, the separate act of acquiring that data from pirated online libraries is not.

This new legal vulnerability is already being exploited. Underscoring the escalating stakes, a group of authors filed a new lawsuit against Microsoft on Tuesday, alleging the company used a massive collection of pirated books to train its Megatron AI models. The outcome of these battles will determine the legal foundation of the generative AI industry, placing the data supply chains of tech giants under intense scrutiny.

A Fractured Ruling on ‘Fair Use’

The moment that redefined the AI copyright war came from a summary judgment order issued on June 23 by Senior District Judge William Alsup. In a case against AI firm Anthropic, he found the company’s use of copyrighted books to train its Claude AI was “quintessentially transformative,” stating, “The technology at issue was among the most transformative many of us will see in our lifetimes.” However, Judge Alsup drew a sharp line, ruling that this protection does not excuse the underlying methods used to build the dataset. He concluded that a trial must proceed to determine the damages resulting from Anthropic’s use of pirated copies to build its library.

This ruling helps explain the seemingly contradictory outcome in Meta’s case just two days later. The two cases were at fundamentally different procedural stages. The Meta ruling dismissed the authors’ claim at an early stage, whereas the Anthropic decision was a more substantive judgment based on evidence, setting a powerful precedent that separates the legality of the training process from the legality of data acquisition.

The intensifying legal fight has exposed a deep fracture in how content owners are responding, with stakeholders pursuing divergent strategies of litigation, licensing, and outright prohibition. The Authors Guild has long been at the forefront of litigation, filing a suit against OpenAI back in 2023. The new lawsuit against Microsoft shows their strategy is adapting to the latest legal precedents.

In sharp contrast, some publishers are opting for deals. HarperCollins partnered with Microsoft to license its nonfiction works for AI training, a move that sparked backlash from authors over its proposed flat-fee compensation. Taking the opposite approach, the world’s largest trade publisher, Penguin Random House, has attempted to opt out of the AI ecosystem entirely. The company updated its copyright rules globally, adding a new clause to its books that explicitly forbids any reproduction of the work for the purpose of training AI systems.

From ‘Shadow Libraries’ to Verbatim Memorization

Fueling these lawsuits is a growing body of evidence challenging the tech industry’s claim that AI models only learn statistical patterns. The complaint against Microsoft alleges the company used a pirated dataset to create what the filing describes as a model “built to generate a wide range of expression that mimics the syntax, voice, and themes of the copyrighted works on which it was trained.” This accusation of building models on data from “shadow libraries” like LibGen and Books3 is central to the cases of Meta and Anthropic as well.

This legal argument is bolstered by technical research. A new academic study revealed that Meta’s Llama 3.1 model could reproduce a staggering 42% of Harry Potter and the Sorcerer’s Stone verbatim. The research paper lends concrete support to the argument that the models themselves contain what one legal expert called “a copy of part of the book in the model itself.” According to a legal analysis by the Electronic Frontier Foundation (EFF), the court’s focus on the initial act of piracy creates a vital firewall. EFF Legal Director Corynne McSherry noted, “The court’s firewall is critical. It prevents the transformative use doctrine from becoming a get-out-of-jail-free card for mass infringement.”

A Legal War Goes Global

This complex legal and ethical war is not confined to the United States. The BBC is threatening legal action against AI search engine Perplexity, accusing it of copyright infringement and joining a chorus of news organizations fighting back against data scraping. The battle has also reached India, where the Federation of Indian Publishers (FIP) has sued OpenAI over ChatGPT’s ability to generate detailed book summaries, which they argue directly undermines their market.

Governments are now being forced to react. Japan’s Agency for Cultural Affairs has convened an expert panel to urgently review the country’s copyright laws, which are currently considered permissive toward AI training. The panel aims to propose clearer guidelines by the end of the year, signaling that the era of ambiguous rules is rapidly coming to an end.

The legal landscape for generative AI is becoming more complex, not less. The focus has decisively shifted from the magic of the final product to the gritty details of data provenance. For tech companies, the message from the courts is becoming clear: the transformative nature of the technology will not excuse the original sin of theft. This new reality forces a reckoning with their data supply chains, likely ushering in an era of greater transparency and costly licensing deals that will fundamentally change the economics of the AI arms race.

Markus Kasanmascheff
Markus has been covering the tech industry for more than 15 years. He holds a Master’s degree in International Economics and is the founder and managing editor of Winbuzzer.com.
