OpenAI, embroiled in a major copyright lawsuit initiated by the Authors Guild, is resisting new demands to review extensive company files, arguing that the proposed scope would overwhelm its resources.
The dispute is the latest phase of an ongoing legal battle over allegations that OpenAI’s AI models were trained on copyrighted books without the authors’ consent.
Attorney Carolyn M. Homer, representing OpenAI, detailed in a federal court filing that fulfilling the Authors Guild’s request would involve processing more than 886,000 documents, including files from cofounder Ilya Sutskever and other key personnel.
Authors Guild Lawyers Demand Document Review
The Authors Guild is pushing for access to documents from current and former OpenAI employees, such as former researcher Jan Leike and other technical staff. According to Homer, adding these custodians would mean reviewing over 375 gigabytes of data, on top of the 460,000 documents (approximately 359 gigabytes) from the existing 24 custodians.
OpenAI also noted that an estimated 71% of the requested material overlaps with data already under review, which it says would complicate the task.
This demand is part of a broader wave of actions by writers seeking to curb what they argue is the unauthorized use of their work by generative AI platforms. High-profile authors including George R.R. Martin and John Grisham have joined the legal push, accusing OpenAI of leveraging their work without proper licensing.
The Guild pointed to instances where ChatGPT generated detailed outlines that closely resembled copyrighted material, such as a hypothetical prequel to A Game of Thrones.
Copyright Challenges in Generative AI
This case is not an isolated incident. In the summer of 2023, authors Mona Awad and Paul Tremblay filed similar lawsuits, alleging OpenAI’s unauthorized use of their published works.
Comedian Sarah Silverman and other writers took further legal action, accusing both OpenAI and Meta of sourcing material from piracy sites like Z-Library to train their AI models. Such lawsuits have pushed for clearer regulations and a reassessment of how intellectual property laws intersect with the growing capabilities of generative AI.
Adding to OpenAI’s legal challenges, Raw Story Media and Alternet Media previously claimed that OpenAI had violated the DMCA by scraping their articles without retaining proper copyright management information (CMI).
However, a New York court dismissed the case on grounds of insufficient proof of actual harm. Judge Colleen McMahon ruled that plaintiffs failed to show that OpenAI’s generative models, which synthesize rather than replicate data, posed a substantial risk of reproducing their content verbatim.
Internal Tensions and Strategic Shifts at OpenAI
Alongside its legal issues, OpenAI has faced internal restructuring. In May 2024, co-founder and chief scientist Ilya Sutskever stepped down following disagreements over company strategy. His departure came months after the board’s failed attempt, in November 2023, to remove CEO Sam Altman.
That attempt, meant to address transparency concerns, met resistance from major investor Microsoft and ended with Altman’s reinstatement. Jakub Pachocki, known for pioneering deep learning research, was appointed to succeed Sutskever.
Jan Leike, who co-led OpenAI’s Superalignment team, also departed to join Anthropic, a rival AI firm prioritizing safety. Leike’s move reflects ongoing debates within the industry about the balance between innovation and caution, especially concerning scalable oversight and alignment in advanced AI models. Together, these departures show OpenAI trying to advance its AI technology while managing internal and external pressures.
Broader Implications for Copyright and Licensing
The Authors Guild’s push for access to more of OpenAI’s internal data signals a wider call for fairer licensing agreements and compensation for content creators. While OpenAI has established contracts with major publishers like Condé Nast, similar deals for smaller entities and individual authors remain sparse, fueling demands for more equitable practices.
The outcome of this lawsuit and related cases may set a precedent for how future generative AI models interact with copyrighted content and how intellectual property laws evolve to address these interactions.