Publisher Ziff Davis Sues OpenAI over AI Training Data Scraping from Its 45+ Media Websites

Digital publisher Ziff Davis has sued OpenAI, accusing the AI firm of intentionally infringing copyright and trademark rights by using content from its 45+ websites for model training.

The legal pressure on OpenAI intensifies as Ziff Davis, a large digital media conglomerate operating a portfolio of over 45 global websites including Mashable, PCMag, IGN, CNET, Lifehacker, and Eurogamer, has filed a lawsuit accusing the AI developer of extensive copyright and trademark violations.

Lodged in Delaware federal court, the suit contends OpenAI unlawfully utilized content from its properties, which attract an average of 292 million monthly visitors, to train the AI models powering services such as ChatGPT. Ziff Davis seeks damages potentially amounting to hundreds of millions of dollars and court orders to halt the alleged infringement.

Specific Allegations Detail Copyright and Technical Violations

Ziff Davis claims OpenAI “intentionally and relentlessly” copied its works, asserting this occurred even when Ziff Davis employed technical measures like robots.txt files – instructions telling web crawlers which parts of a site not to access – to prevent scraping.
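For readers unfamiliar with the mechanism, the following is a minimal sketch of how a well-behaved crawler would consult a robots.txt file before fetching a page. It uses Python's standard `urllib.robotparser` module. The robots.txt content, the `example.com` URL, the `ExampleBot` name, and the `is_allowed` helper are illustrative assumptions, not taken from the complaint or from any Ziff Davis site; `GPTBot` is the user-agent name OpenAI documents for its crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption for illustration only).
# It blocks the GPTBot user agent everywhere and allows all other crawlers.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Placeholder URL; example.com is reserved for documentation use.
url = "https://example.com/reviews/some-article"
gptbot_allowed = is_allowed(ROBOTS_TXT, "GPTBot", url)
other_allowed = is_allowed(ROBOTS_TXT, "ExampleBot", url)

print("GPTBot allowed:", gptbot_allowed)
print("ExampleBot allowed:", other_allowed)
```

Note that robots.txt is a voluntary convention: the file only states the publisher's wishes, and nothing in the protocol itself stops a crawler that chooses to ignore it.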

The complaint accuses OpenAI of reproducing Ziff Davis content both verbatim and in paraphrased form, of stripping essential Copyright Management Information (CMI) – metadata embedded in digital files identifying the work, its author, and copyright details – and of falsely attributing output generated by its models to Ziff Davis brands.

Furthermore, Ziff Davis asserts it identified hundreds of complete copies of its works within OpenAI’s publicly available WebText training dataset. Beyond monetary damages, the publisher is asking the court to prevent OpenAI from further using its content and potentially compel the destruction of datasets and AI models trained on its material.

Internal Research Bolsters Legal Claims

The legal action is substantiated by earlier research conducted within Ziff Davis itself. A study published in late 2024, co-authored by Ziff Davis AI attorney George Wukoson and CTO Joey Fortuna, examined publicly known AI training datasets. The Ziff Davis paper, “The Predominant Use of High-Authority Commercial Web Publisher Content to Train Leading LLMs,” argued that the process of curating datasets for foundational models like OpenAI’s GPT-2 and GPT-3 resulted in a disproportionately high volume of content from premium commercial publishers compared to their presence on the wider web.

Their analysis indicated a marked increase in the presence of content from 15 major publisher portfolios (including Ziff Davis) as datasets moved from raw web crawls (like Common Crawl, where their share was 0.44%) to cleaned versions (like C4, 1.55%) and finally to curated sets like OpenWebText (9.91%) and its successor OpenWebText2 (12.04%), a proxy for GPT-3’s training data.
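To put those percentages in perspective, the short sketch below computes how many times larger the publishers' share is in each dataset than in the raw Common Crawl baseline. The four share figures are the ones quoted above; the variable names and the choice of Common Crawl as the baseline are assumptions made for this illustration, not part of the Ziff Davis paper's method.

```python
# Share of content from the 15 publisher portfolios, in percent,
# as quoted in the article for each dataset.
publisher_share_pct = {
    "Common Crawl": 0.44,
    "C4": 1.55,
    "OpenWebText": 9.91,
    "OpenWebText2": 12.04,
}

baseline = publisher_share_pct["Common Crawl"]

# Ratio of each dataset's publisher share to the raw-crawl baseline.
overrepresentation = {
    name: share / baseline for name, share in publisher_share_pct.items()
}

for name, ratio in overrepresentation.items():
    print(f"{name}: {publisher_share_pct[name]:.2f}% ({ratio:.1f}x Common Crawl)")
```

On these figures, the publishers' share in OpenWebText2 works out to roughly 27 times their share in the raw crawl.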

Ziff Davis also connected this reliance on curated content to higher Domain Authority scores – a metric indicating a website’s relevance and influence – suggesting AI developers intentionally prioritized authoritative web content, much of which originates from established publishers, to build valuable AI models.

The lawsuit filing reportedly claims Ziff Davis attempted to discuss licensing with OpenAI over the past year regarding the alleged infringement but was “rebuffed.”

The complaint also accuses OpenAI of trying to conceal its practices by “abandoning its founding principle of openness.” Responding through a spokesperson to reports of the lawsuit, OpenAI stood by its established defense, stating its models are “grounded in fair use” – a legal doctrine allowing limited use of copyrighted material for certain purposes, though its application to large-scale AI training is fiercely debated.

OpenAI added that “ChatGPT helps enhance human creativity, advance scientific discovery and medical research, and enable hundreds of millions of people to improve their daily lives.” Ziff Davis declined further comment beyond the court filing, with sources suggesting part of its motivation was the hope that other publishers might follow its lead.

Publishers Confront AI: Litigation vs. Licensing Deals

Ziff Davis enters a legal arena already populated by major media players. The New York Times’s lawsuit against OpenAI and Microsoft, filed in December 2023, is actively progressing after a judge allowed core claims to move forward in late March.

Newspapers owned by Alden Global Capital and writers represented by the Authors Guild are also pursuing claims. Earlier this month, several of these copyright cases against OpenAI were consolidated by a U.S. judicial panel, indicating the courts are preparing to grapple with these complex issues collectively. These cases challenge the core “fair use” defense employed by AI companies, with statutory damages under U.S. copyright law reaching up to $150,000 per work for willful infringement.

While battling lawsuits, OpenAI has simultaneously pursued content licensing agreements. Deals with companies like News Corp, Axel Springer, The Associated Press, and The Washington Post – the latter announced just two days before the Ziff Davis filing – provide OpenAI with access to content, often focused on integrating real-time news into chatbot responses rather than bulk data for training.

This divergence highlights a fundamental split in the publishing industry’s approach to AI. Further complicating creator relations is OpenAI’s failure to deliver its promised “Media Manager” tool by the start of 2025, a system intended to give content owners more direct control over the inclusion of their work in AI training processes.

Markus Kasanmascheff
Markus has been covering the tech industry for more than 15 years. He holds a Master's degree in International Economics and is the founder and managing editor of Winbuzzer.com.
