Cohere Joins The Club Of AI Companies Sued For Breaching Copyright

Major publishers have filed a lawsuit against Cohere, alleging the AI firm has used copyrighted content to train its generative models without authorization.

Condé Nast, McClatchy, and other leading publishers have filed a lawsuit against Cohere, alleging the AI firm unlawfully used their copyrighted content to train its “Command Family” of generative models.

The lawsuit claims that Cohere’s AI systems, which employ retrieval-augmented generation (RAG), produce outputs that closely mirror original works, violating intellectual property protections. The case marks a critical moment in the growing conflict between AI developers and media organizations over content ownership and fair use.

The News/Media Alliance, a trade organization representing dozens of publishers in the lawsuit, emphasized the broader implications of unlicensed AI use in a statement.

“We are going to court to protect our rights. As generative AI becomes more prevalent, it is imperative that legal protections be enforced so that innovation can flourish responsibly. This not only protects investments in the creative process and developing intellectual property, but supports the quality of what users consume and the sustainability of the AI products themselves,” Danielle Coffey, President and CEO of the News/Media Alliance, commented.

Tony Hunter, Chair of the Board of the News/Media Alliance, stated, “Today marks a historic moment as our members unite to take a stand against the unlawful use of our intellectual property. This is a crucial step in protecting the value of our journalism.”

Pam Wasserstein, President and Vice Chair of Vox Media, remarked, “This is a case about the blatant theft of our original work to create a competing commercial product. While we welcome responsible technological innovation, with this litigation we’re putting AI companies on notice that they are not above the law and we will enforce our intellectual property rights.”

Generative AI and Data Sourcing Concerns

Generative AI models like Cohere’s Command Family rely on extensive datasets to generate human-like text. These systems often utilize resources such as Common Crawl, an open repository containing billions of web pages.
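To illustrate how openly accessible such crawl data is, the brief sketch below queries the public Common Crawl CDX index for pages captured from a publisher's domain. It is a minimal illustration only: the crawl identifier and domain shown are placeholders chosen for the example, not details taken from the lawsuit or from Cohere's training pipeline.

```python
# Illustrative sketch: listing pages from a (placeholder) publisher domain
# that appear in a given Common Crawl snapshot. Crawl ID and domain are
# assumptions for the example, not facts from the case.
import json
import urllib.parse
import urllib.request

CRAWL_ID = "CC-MAIN-2024-10"          # example crawl snapshot
DOMAIN = "example-publisher.com"       # placeholder domain

query = urllib.parse.urlencode({"url": f"{DOMAIN}/*", "output": "json", "limit": "5"})
url = f"https://index.commoncrawl.org/{CRAWL_ID}-index?{query}"

with urllib.request.urlopen(url) as resp:
    # The index returns one JSON record per line, each describing a captured page.
    for line in resp.read().decode().splitlines():
        record = json.loads(line)
        print(record.get("timestamp"), record.get("url"))
```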

While these datasets are integral to training language models, they include significant amounts of copyrighted material, raising questions about their legality. Critics argue that the widespread, unregulated use of proprietary content allows AI companies to sidestep licensing agreements while benefiting from high-quality materials.

The plaintiffs in the lawsuit against Cohere allege that the company’s AI models reproduce substantial portions of their proprietary content. According to the court filing, Cohere has “copied, trained on, and incorporated Plaintiffs’ proprietary content into its generative AI models without authorization, producing outputs that mimic Plaintiffs’ original works and diminish their economic value.” This accusation forms the foundation of the legal argument, which hinges on the balance between fair use and outright duplication.

The lawsuit further highlights the role of retrieval-augmented generation (RAG) in complicating these issues. Cohere’s AI systems use RAG technology to combine pre-trained datasets with real-time data retrieval during text generation, a process designed to enhance relevance and accuracy. While effective, the plaintiffs claim that this approach exacerbates the misuse of copyrighted material by directly integrating proprietary content into AI-generated responses.
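As a rough, generic illustration of the mechanism at issue, the sketch below shows how a retrieval-augmented generation pipeline pulls source passages and places them directly into the prompt that the language model answers from. It is not Cohere's actual system; the toy scoring function and the `generate()` stub are simplified stand-ins.

```python
# Generic RAG sketch, for illustration only. It does not reflect Cohere's
# implementation; scoring and generation are simplified placeholders.
from collections import Counter

def score(query: str, doc: str) -> int:
    # Toy relevance score: count of query words that also appear in the document.
    q_words = Counter(query.lower().split())
    d_words = set(doc.lower().split())
    return sum(count for word, count in q_words.items() if word in d_words)

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    # Rank the corpus by the toy score and keep the top-k passages.
    return sorted(corpus, key=lambda doc: score(query, doc), reverse=True)[:k]

def generate(prompt: str) -> str:
    # Stand-in for a language-model call; a real system would invoke an LLM here.
    return f"[model output conditioned on]:\n{prompt}"

corpus = [
    "Publisher article A: detailed reporting on topic X.",
    "Publisher article B: analysis of topic Y.",
    "Unrelated page about cooking.",
]

query = "What is the latest reporting on topic X?"
context = "\n".join(retrieve(query, corpus))
# Retrieved passages are inserted verbatim into the prompt, which is why
# plaintiffs argue RAG can surface proprietary text in a model's answers.
print(generate(f"Context:\n{context}\n\nQuestion: {query}"))
```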

The court filing elaborates on this argument, stating, “Through its retrieval-augmented generation process, Cohere directly accesses and incorporates proprietary publisher content, replicating their language, structure, and phrasing without meaningful transformation or authorization.”

Critics of Cohere’s practices argue that outputs produced by these models harm publishers by undermining the economic value of original journalism. This concern is amplified by the inclusion of proprietary material in training datasets, which publishers claim erodes their ability to control how their work is used. The case underscores the broader challenge of defining clear legal boundaries in the rapidly evolving generative AI landscape.

Federal copyright law allows statutory damages of up to $150,000 for each work willfully infringed, a figure that could escalate dramatically given the volume of material at issue; a finding of willful infringement covering even 1,000 registered works could, in principle, amount to $150 million.

An Avalanche of Lawsuits

The lawsuit against Cohere follows a similar case filed just weeks earlier against OpenAI by the Federation of Indian Publishers (FIP), which represents over 80% of India’s publishing industry.

In December 2023, The New York Times filed a lawsuit against OpenAI and Microsoft, alleging their AI tools, such as ChatGPT and Bing Chat, were trained on its articles without authorization.

The complaint argued that these models generated text resembling original reporting, bypassing paywalls and diverting readers from the publisher’s platform.

Microsoft and OpenAI recently argued in federal court that their use of publicly available news articles to train large language models (LLMs) is lawful under the fair use doctrine.

Licensing Agreements and Industry Collaboration

While lawsuits highlight the contentious relationship between publishers and AI developers, some companies have pursued licensing agreements to address these concerns. OpenAI, for example, has partnered with Vox Media and The Atlantic, allowing the AI company to use their archives in exchange for compensation and proper attribution.

Similar agreements were made with TIME magazine; with UK publisher Future PLC, covering more than 200 brands such as Tom’s Guide, PC Gamer, TechRadar, and Marie Claire; and with Condé Nast, covering content from The New Yorker, Vogue, Vanity Fair, Bon Appétit, and Wired.

These partnerships provide publishers with new revenue streams and enable AI developers to access high-quality content legally.

However, not all publishers view these agreements as a long-term solution. Many argue that such arrangements fail to address the broader issue of unlicensed AI training on datasets containing proprietary material.

The News/Media Alliance emphasizes that the lack of oversight in AI data sourcing creates an uneven playing field, stating, “As news, magazine, and media publishers, we serve an important role in keeping society informed and supporting the free flow of information and ideas, but we cannot continue to do so if AI companies like Cohere are able to undercut our businesses while using our own content to compete with us.”

For Cohere, which has not disclosed any licensing deals with publishers, the lawsuit demands both financial damages and the removal of proprietary content from its datasets.

The Economic Stakes for Publishers

At the heart of these disputes lies the financial sustainability of journalism. Publishers argue that unlicensed AI use undermines their ability to generate revenue through subscriptions and advertising.

The New York Times reported that AI-generated summaries of its articles could reduce traffic to its site by as much as 50%, a devastating blow to its business model. Similar concerns have been raised by other publishers, who see AI-generated outputs as a direct competitor to their original content.

The lawsuit against Cohere underscores these fears. By allegedly producing outputs that closely replicate proprietary material, the plaintiffs claim the AI firm has exploited their work without compensation, threatening the already fragile economics of journalism. The legal battle highlights the need for clearer regulations governing AI development and intellectual property rights.

The Future of Generative AI and Intellectual Property

The lawsuit against Cohere is one of several that could reshape how generative AI systems operate. If courts rule in favor of publishers, AI companies may be required to overhaul their data collection practices, potentially increasing costs and slowing innovation.

On the other hand, a ruling in favor of AI developers could embolden companies to continue using publicly available datasets, further complicating the debate over fair use and intellectual property.

For Cohere, the stakes extend beyond this lawsuit. The company’s reliance on retrieval-augmented generation and massive datasets raises questions about how AI systems can balance innovation with accountability. The plaintiffs in the case argue that stricter oversight and licensing requirements are necessary to protect content creators while allowing AI technologies to evolve responsibly.

As the industry grapples with these challenges, the outcomes of lawsuits like this one could define the future of AI development and its relationship with media organizations.

Markus Kasanmascheff
Markus has been covering the tech industry for more than 15 years. He holds a Master’s degree in International Economics and is the founder and managing editor of Winbuzzer.com.
