Mustafa Suleyman, head of Microsoft’s AI division since last March, has sparked a fierce debate by asserting that content on the open web is available for anyone to copy and use freely, including by scraping bots that collect data for training AI models.
His remarks arrive amid ongoing legal battles accusing both Microsoft and its partner OpenAI of unlicensed use of copyrighted material in training their AI systems.
AI Scraping and the Fair Use Debate
During an Aspen Ideas Festival interview with CNBC, Suleyman stated there has been a long-standing assumption, dating back to the ’90s, that online content is like “freeware” and can be reused without restriction. This viewpoint, however, is at odds with U.S. copyright law, which automatically protects original works once they are created.
The principle of fair use, a legal doctrine that permits limited use of copyrighted material without seeking permission, depends on several factors: the purpose and character of the use, the nature of the copyrighted work, the portion used, and the effect on the market value of the original work.
While some AI firms argue that using copyrighted content for training is allowed under fair use, the bluntness of Suleyman’s comments has drawn attention.
Suleyman did make an exception, noting that websites that explicitly opt out of scraping for any purpose other than search indexing merit respect. This distinction points to the issue’s complexity, as no universal consensus exists on the appropriate use of web content.
OpenAI has been proactive in securing content licensing deals with TIME magazine, The Atlantic, Vox Media, News Corp, and the Financial Times, reflecting a strategic shift towards more formalized content usage frameworks.
Robots.txt in the Spotlight
Suleyman mentioned the use of robots.txt, the file at the heart of the Robots Exclusion Protocol, which lets website owners tell search engines and other crawlers how to handle their content. Suleyman noted that while robots.txt may express a preference not to be scraped, the legal enforceability of this file is still to be determined by future court decisions.
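To illustrate how such a preference is expressed and read, here is a minimal sketch using Python’s standard `urllib.robotparser` module. The robots.txt rules, the crawler names `ExampleAIBot` and `SearchIndexer`, and the example.com URL are hypothetical and chosen only for illustration; they are not taken from any site or company mentioned in this article.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one AI crawler entirely, allow everyone else.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/articles/1"
    print(is_allowed(ROBOTS_TXT, "ExampleAIBot", page))   # False
    print(is_allowed(ROBOTS_TXT, "SearchIndexer", page))  # True
```

The check is purely advisory: nothing in the protocol stops a crawler from skipping it, which is why the question of enforceability ends up with the courts.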
The Robots Exclusion Protocol recently became a focal point of debate, as the AI search provider Perplexity AI, which is backed by Amazon founder Jeff Bezos and Nvidia, appears to ignore it when feeding its AI with relevant data. Amazon Web Services (AWS) is investigating allegations that Perplexity AI has been conducting unauthorized web scraping using AWS infrastructure.
The inquiry aims to determine whether Perplexity AI has breached the Robots Exclusion Protocol by extracting data from websites that explicitly restrict such activities.
The AWS investigation commenced following reports that claim Perplexity AI replicates existing articles from major news outlets. Forbes recently called out Perplexity AI for allegedly replicating its content without due credit.
Suleyman Says Opt-Out Should Be Respected
Supporters of broad data access believe it’s essential for AI progress to use freely available data for AI training, whereas critics argue that intellectual property rights must be respected. Several high-profile lawsuits have already been filed against Microsoft and OpenAI by copyright advocates.
Eight prominent newspapers owned by hedge fund Alden Global Capital have sued both companies, alleging that both ChatGPT and Copilot can reproduce exact passages from their articles. Separately, a group of thirteen other plaintiffs sued the companies over the training of artificial intelligence models with data scraped from the web.
The core of that second accusation is that the scraped data was used without securing proper consent from the individuals concerned. The lawsuit also claims continuous harvesting of personal information via API integrations with product offerings.
The outcomes of the lawsuits against Microsoft and OpenAI could serve as critical legal benchmarks for the tech community.
Last Updated on November 7, 2024 3:45 pm CET