Meta Platforms is under intense legal scrutiny for its alleged use of pirated materials in training its Llama AI models. The company, led by CEO Mark Zuckerberg, is accused of employing unauthorized datasets from LibGen, a well-known source of pirated books and academic articles.
Newly filed documents in a lawsuit pending before the U.S. District Court for the Northern District of California (document 1, document 2) claim that Zuckerberg directly approved the dataset’s use, despite internal warnings about its legality.
Prominent authors, including Sarah Silverman and Ta-Nehisi Coates, are among the plaintiffs, arguing that Meta’s actions violate copyright law and the Digital Millennium Copyright Act (DMCA).
They also allege violations of California’s Comprehensive Computer Data Access and Fraud Act (CDAFA), pointing to torrenting activities and metadata stripping that concealed the origins of the data.
Torrenting is a peer-to-peer file-sharing technology that lets users download a file in small segments from many sources at once. While efficient for distributing large datasets, it is frequently used to share pirated content because the network is decentralized and difficult to monitor.
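As a rough illustration of the piece-based idea described above, the toy sketch below assembles a file from segments held by different peers. It is conceptual only: the peer names, piece size, and data are hypothetical, and it does not implement the actual BitTorrent protocol, which additionally verifies each piece against hashes listed in the torrent file.

```python
# Toy illustration of piece-based downloading (not the real BitTorrent protocol).
# Peers, piece size, and file contents are hypothetical.

PIECE_SIZE = 4  # bytes per piece; real clients use far larger pieces (e.g. 256 KB-4 MB)

# Each hypothetical peer holds only some pieces of the same file.
peers = {
    "peer_a": {0: b"The ", 2: b" boo"},
    "peer_b": {1: b"full", 3: b"k..."},
}

def assemble(num_pieces: int) -> bytes:
    """Fetch each piece from whichever peer has it, then join them in order."""
    downloaded = {}
    for index in range(num_pieces):
        for peer_name, held_pieces in peers.items():
            if index in held_pieces:
                downloaded[index] = held_pieces[index]  # "download" from that peer
                break
    return b"".join(downloaded[i] for i in range(num_pieces))

print(assemble(4))  # b'The full book...'
```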
Approval Despite Internal Objections
Internal documents and depositions reveal a troubling pattern of decision-making at Meta. Engineers voiced concerns about the use of LibGen, with one stating, “Torrenting from a [Meta-owned] corporate laptop doesn’t feel right.”
These objections were escalated to Zuckerberg, who ultimately approved the dataset’s use. An internal memo confirmed, “After escalation to MZ [Mark Zuckerberg], Meta’s AI team was approved to use LibGen.”
This approval occurred as Meta sought to enhance the capabilities of its Llama models, a critical part of its strategy to compete in the rapidly advancing AI sector. The LibGen dataset was reportedly used for both training and fine-tuning the models, providing the large-scale data necessary to develop language-processing capabilities.
Torrenting and Metadata Removal
The lawsuit accuses Meta of employing torrenting protocols to access and distribute the LibGen dataset. Torrenting inherently involves “seeding,” or sharing portions of downloaded content with other users.
According to testimony, Meta engineers configured torrenting settings to minimize visibility. As noted in the court filing, “Bashlykov configured the [torrent] settings so the smallest amount of seeding could occur,” an attempt to avoid detection while still participating in the file-sharing network.
In addition to torrenting, Meta reportedly stripped Copyright Management Information (CMI) from the training datasets. CMI is metadata attached to copyrighted works, such as the author’s name, publication date, and licensing information. Removing CMI violates the DMCA when it is done knowing that it will facilitate or conceal copyright infringement.
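To make the CMI concept concrete, here is a minimal sketch, assuming a hypothetical PDF file and the open-source pypdf library, of how such identifying metadata can be read from an e-book. It is illustrative only and not a depiction of Meta’s internal tooling.

```python
# Illustrative sketch: inspecting the kind of embedded metadata that can serve as
# Copyright Management Information (CMI) in a PDF. The filename is hypothetical;
# requires the pypdf package (pip install pypdf).
from pypdf import PdfReader

reader = PdfReader("example_book.pdf")  # hypothetical e-book file
meta = reader.metadata  # the PDF's document-information dictionary, or None

# Fields like these identify the work, its author, and its provenance.
# Intentionally removing them to facilitate or conceal infringement is what
# the DMCA's CMI provisions prohibit.
if meta is not None:
    print("Title:  ", meta.title)
    print("Author: ", meta.author)
    print("Created:", meta.creation_date)
```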
The plaintiffs argue that this removal was a deliberate act to obscure the dataset’s origins and prevent the Llama models from outputting identifiable copyrighted content.
As the lawsuit states, “Meta stripped CMI not just for training purposes but also to hide its copyright infringement, because stripping copyrighted works’ CMI prevents Llama from outputting copyright information that might alert Llama users and the public to Meta’s infringement.”
Yann LeCun, Meta’s chief AI scientist, hinted last year at how Meta views copyrighted material when he suggested on X (formerly Twitter) that book authors should make their works freely available:
“Only a small number of book authors make significant money from book sales. This seems to suggest that most books should be freely available for download. The lost revenue for authors would be small, and the benefits to society large by comparison. https://t.co/4ObkW1tm85”
— Yann LeCun (@ylecun) January 1, 2024
Legal and Ethical Implications
The legal arguments against Meta include claims under the DMCA for removing CMI and under the CDAFA for accessing and using pirated data without authorization. The plaintiffs allege that Meta’s torrenting and metadata removal were integral to concealing its use of copyrighted materials.
Judge Vince Chhabria, overseeing the case, criticized Meta’s attempts to redact substantial portions of the filing, noting, “It is clear that Meta’s sealing request is not designed to protect against the disclosure of sensitive business information… Rather, it is designed to avoid negative publicity.”
The allegations against Meta are part of a broader conversation about how AI models are trained. Large language models like Llama often rely on massive datasets that may include copyrighted material.
While companies like Meta argue that such usage falls under fair use, critics contend that it infringes on the rights of creators and highlights the need for clearer legal frameworks in AI development.
Broader Industry Context
This case is not an isolated incident. The rapid development of generative AI has led to several lawsuits against major tech companies, with creators and copyright holders questioning the legality and ethics of using their works without consent.
Meta’s case reflects a broader tension between technological innovation and intellectual property laws. The lawsuit also sheds light on operational decisions within Meta, where the push to stay competitive in AI seemingly outweighed ethical and legal considerations.
Meta’s practices raise questions about how companies balance innovation with compliance and accountability. The case could set a precedent for how courts handle the use of copyrighted material in AI training, potentially influencing regulations and industry standards.