
Court Filing Reveals How Zuckerberg Approved Using Pirated Content for AI Training of Llama Models

Court documents show how Meta removed metadata from AI training datasets to obscure the use of copyrighted materials.


Meta Platforms is under intense legal scrutiny for its alleged use of pirated materials in training its Llama AI models. The company, led by CEO Mark Zuckerberg, is accused of employing unauthorized datasets from LibGen, a well-known source of pirated books and academic articles.

Newly filed documents in a lawsuit before the U.S. District Court for the Northern District of California (document 1, document 2) claim that Zuckerberg directly approved the dataset’s use, despite internal warnings about its legality.

Prominent authors, including Sarah Silverman and Ta-Nehisi Coates, are among the plaintiffs, arguing that Meta’s actions violate copyright law and the Digital Millennium Copyright Act (DMCA).

They also allege violations of California’s Comprehensive Computer Data Access and Fraud Act (CDAFA), pointing to torrenting activities and metadata stripping that concealed the origins of the data.

Torrenting is a peer-to-peer file-sharing technology that lets users download a file in small segments from multiple sources at once. While efficient for sharing large datasets, it is often used to distribute pirated content because it is decentralized and difficult to monitor.
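
To make the segment-based idea concrete, here is a minimal Python sketch that simulates it entirely in memory. It is an illustration under stated assumptions, not a real torrent client: the function names, the tiny piece size, and the dictionary "peers" are all hypothetical, and no network traffic is involved. The one detail taken from the original BitTorrent design is that each piece is checked against a SHA-1 hash before it is accepted.

```python
import hashlib

# Hypothetical, tiny piece size for illustration; real torrents use far larger pieces.
PIECE_SIZE = 4


def split_into_pieces(data, piece_size=PIECE_SIZE):
    """Cut a byte string into fixed-size pieces (the last one may be shorter)."""
    return [data[i:i + piece_size] for i in range(0, len(data), piece_size)]


def piece_hashes(pieces):
    """Compute the SHA-1 digest of every piece, as a torrent's metadata would list them."""
    return [hashlib.sha1(piece).digest() for piece in pieces]


def download(expected_hashes, peers):
    """Rebuild the file by taking each piece from the first peer that offers a valid copy.

    Each simulated peer is a dict mapping piece index -> piece bytes.
    """
    received = []
    for index, expected in enumerate(expected_hashes):
        for peer in peers:
            piece = peer.get(index)
            # Accept a piece only if its hash matches the published one.
            if piece is not None and hashlib.sha1(piece).digest() == expected:
                received.append(piece)
                break
        else:
            raise ValueError(f"no peer supplied a valid copy of piece {index}")
    return b"".join(received)


if __name__ == "__main__":
    original = b"an openly licensed sample file"
    pieces = split_into_pieces(original)
    hashes = piece_hashes(pieces)

    # Two simulated peers, each holding only half of the pieces.
    peer_a = {i: p for i, p in enumerate(pieces) if i % 2 == 0}
    peer_b = {i: p for i, p in enumerate(pieces) if i % 2 == 1}

    rebuilt = download(hashes, [peer_a, peer_b])
    print(rebuilt == original)
```

Neither simulated peer holds the complete file, yet the downloader can still reassemble it, which is why the technology scales well for large datasets. In a real swarm, a downloader also serves the pieces it already holds to others, the "seeding" behavior at issue in the lawsuit.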


Approval Despite Internal Objections

Internal documents and depositions reveal a troubling pattern of decision-making at Meta. Engineers voiced concerns about the use of LibGen, with one stating, “Torrenting from a [Meta-owned] corporate laptop doesn’t feel right.”

These objections were escalated to Zuckerberg, who ultimately approved the dataset’s use. An internal memo confirmed, “After escalation to MZ [Mark Zuckerberg], Meta’s AI team was approved to use LibGen.”

This approval occurred as Meta sought to enhance the capabilities of its Llama models, a critical part of its strategy to compete in the rapidly advancing AI sector. The LibGen dataset was reportedly used for both training and fine-tuning the models, providing the large-scale data necessary to develop language-processing capabilities.


Torrenting and Metadata Removal

The lawsuit accuses Meta of employing torrenting protocols to access and distribute the LibGen dataset. Torrenting inherently involves “seeding,” or sharing portions of downloaded content with other users.

According to testimony, Meta engineers configured torrenting settings to minimize visibility. As noted in the court filing, “Bashlykov configured the [torrent] settings so the smallest amount of seeding could occur,” an attempt to avoid detection while still participating in the file-sharing network.

In addition to torrenting, Meta reportedly stripped Copyright Management Information (CMI) from the training datasets. CMI is metadata attached to copyrighted works, such as the author’s name, publication date, and licensing information. Under the DMCA, removing CMI is illegal if it facilitates copyright infringement.

The plaintiffs argue that this removal was a deliberate act to obscure the dataset’s origins and prevent the Llama models from outputting identifiable copyrighted content.

As the lawsuit states, “Meta stripped CMI not just for training purposes but also to hide its copyright infringement, because stripping copyrighted works’ CMI prevents Llama from outputting copyright information that might alert Llama users and the public to Meta’s infringement.”

Yann LeCun, Meta’s chief AI scientist, hinted last year at how Meta thinks about copyrighted material when he suggested on X (formerly Twitter) that book authors should make their works freely available.

Legal and Ethical Implications

The legal arguments against Meta include claims under the DMCA for removing CMI and CDAFA for accessing and using pirated data without authorization. The plaintiffs allege that Meta’s torrenting and metadata removal were integral to concealing its use of copyrighted materials.

Judge Vince Chhabria, overseeing the case, criticized Meta’s attempts to redact substantial portions of the filing, noting, “It is clear that Meta’s sealing request is not designed to protect against the disclosure of sensitive business information… Rather, it is designed to avoid negative publicity.”

The allegations against Meta are part of a broader conversation about how AI models are trained. Large language models like Llama often rely on massive datasets that may include copyrighted material.

While companies like Meta argue that such usage falls under fair use, critics contend that it infringes on the rights of creators and highlights the need for clearer legal frameworks in AI development.

Broader Industry Context

This case is not an isolated incident. The rapid development of generative AI has led to several lawsuits against major tech companies, with creators and copyright holders questioning the legality and ethics of using their works without consent.

Meta’s case reflects a broader tension between technological innovation and intellectual property laws. The lawsuit also sheds light on operational decisions within Meta, where the push to stay competitive in AI seemingly outweighed ethical and legal considerations.

Meta’s practices raise questions about how companies balance innovation with compliance and accountability. The case could set a precedent for how courts handle the use of copyrighted material in AI training, potentially influencing regulations and industry standards.

Markus Kasanmascheff
Markus has been covering the tech industry for more than 15 years. He holds a Master’s degree in International Economics and is the founder and managing editor of Winbuzzer.com.
