Meta Trained Its AI on Copyrighted Books, Court Filing Reveals

A new court filing reveals that Meta used copyrighted books for AI training, with internal emails showing executives were aware of legal risks.

A newly released court filing has provided the clearest evidence yet that Meta used copyrighted books, obtained through sources like LibGen, to train its Llama AI models.

Internal emails included in the lawsuit show that employees were aware of the legal risks but continued the training. The documents also confirm that Meta halted licensing negotiations while proceeding with unapproved data.

The filing, part of a copyright lawsuit in the U.S. District Court for the Northern District of California, highlights how Meta’s AI division debated the use of pirated content. The lawsuit follows mounting legal pressure on AI companies, including OpenAI and Google, over their data-collection practices.

The new evidence suggests that Meta executives, including CEO Mark Zuckerberg, were informed about potential copyright violations but did not intervene.

Internal Emails Reveal Copyright Concerns

Emails included in the court filing show that Meta’s engineers and legal teams had conflicting views on whether using copyrighted books for AI training was legally defensible. In one discussion, an employee suggested acquiring the books first and escalating the decision to executives, rather than seeking permission up front:

“[M]y opinion would be (in the line of ‘ask forgiveness, not for permission’): we try to acquire the books and escalate it to execs so they make the call.”

Despite the risks, Meta proceeded with its AI training. The lawsuit alleges that this approach was intentional and designed to avoid the costs and restrictions of licensing agreements. The company’s decision aligns with earlier reports that Meta had suspended licensing negotiations while continuing to source unlicensed data.

Meta’s Use of LibGen and Torrenting for AI Training

The lawsuit explicitly claims that Meta acquired training data from Library Genesis (LibGen), a well-known online repository for pirated academic books. Internal discussions referenced the use of torrenting software to obtain books, with employees adjusting settings to reduce seeding activity—a common method to avoid detection when sharing files via peer-to-peer networks.

Further, the lawsuit claims that Meta deliberately removed Copyright Management Information (CMI) from its training data. The court filing states:

“Meta stripped CMI not just for training purposes but also to hide its copyright infringement, because stripping copyrighted works’ CMI prevents Llama from outputting copyright information that might alert Llama users and the public to Meta’s infringement.”

This accusation is particularly serious, as stripping CMI can be a violation of the Digital Millennium Copyright Act (DMCA) if done to conceal unauthorized use of copyrighted works.

Judge Rejects Meta’s Attempt to Hide AI Training Details

Meta attempted to redact parts of the court filings related to its AI training data, but Judge Vince Chhabria ruled against the company, stating:

“It is clear that Meta’s sealing request is not designed to protect against the disclosure of sensitive business information… Rather, it is designed to avoid negative publicity.”

The ruling reinforces the growing judicial scrutiny over AI companies’ training practices and their reluctance to disclose data sources. The court’s decision also aligns with broader concerns that AI firms are exploiting copyrighted materials without compensation to content creators.

Legal Pressure Mounts on AI Companies

Meta is not alone in facing legal challenges over AI training data. Other companies, including OpenAI and Google, are also dealing with lawsuits from authors, publishers, and media organizations. These cases argue that AI-generated outputs often mimic copyrighted texts, making unlicensed training a violation of intellectual property laws.

The latest of these high-profile cases is a copyright lawsuit filed against OpenAI by the Federation of Indian Publishers (FIP), which represents over 80% of India’s publishing industry. The New York Times is pursuing a separate case against Microsoft and OpenAI over the use of its news content.

While AI companies claim that trained models do not store direct copies of copyrighted content, plaintiffs argue that the outputs frequently resemble original works. Courts will now determine whether current copyright laws apply to AI training or whether legislative changes are needed.

Meta’s AI Strategy Faces Growing Scrutiny

As part of its AI development strategy, Meta has promoted Llama as an open-source alternative to proprietary models like OpenAI’s GPT. However, the latest lawsuit raises questions about whether the company is adhering to ethical AI practices.

Yann LeCun, Meta’s chief AI scientist, has previously defended unrestricted access to books and digital knowledge, arguing:

“Only a small number of book authors make significant money from book sales. This seems to suggest that most books should be freely available for download. The lost revenue for authors would be small, and the benefits to society large by comparison.”

LeCun’s comments reflect an ongoing ideological divide between AI developers and copyright holders. While some advocate for open-access AI training, authors and publishers argue that this approach disregards intellectual property rights.

What Comes Next?

With legal cases against multiple AI companies intensifying, Meta could face pressure to disclose more about its training data sources. If courts rule against the company, the AI industry may be forced to obtain explicit licensing agreements before using copyrighted material, reshaping how models like Llama are trained.

Regulators in the U.S. and Europe are already considering policies that could impose stricter controls on AI data sourcing. The legal decisions made in cases like this one may ultimately define the boundaries of AI training and copyright compliance for years to come.

Last Updated on March 3, 2025 11:28 am CET

Markus Kasanmascheff
Markus has been covering the tech industry for more than 15 years. He holds a Master’s degree in International Economics and is the founder and managing editor of Winbuzzer.com.