Update May 6, 2025 4:53 am CEST: This article has been updated to incorporate clarifications provided by Google regarding its AI training data practices and publisher opt-out controls. The original story drew from initial reporting by Bloomberg, which has subsequently been updated. Google has emphasized that its “Google-Extended” opt-out mechanism is specifically for certain AI models like Gemini and is distinct from the established controls, such as robots.txt, that govern content use in Google Search.
Google uses web content to train its core search features, including the AI Overviews that summarize results at the top of the page. Recent court testimony from a company executive, and subsequent clarifications from Google, have shed light on how different publisher opt-out mechanisms function. Specifically, the “Google-Extended” tool, introduced so publishers can keep their content out of the training of AI models like Gemini, operates separately from the controls governing Google Search.
This distinction was a focal point during the remedies phase of the high-profile US v. Google antitrust case in Washington D.C. Eli Collins, a Vice President of Product at Google DeepMind, testified regarding the “Google-Extended” directive.
This directive, introduced in September 2023 and intended for a site’s robots.txt file, is designed to restrict data use by the Google DeepMind research division for models like Gemini. Collins confirmed that Google-Extended does not prevent the Google Search organization from using content to refine its own AI-driven features, as Search has its own set of controls.
Google clarified the scope of Google-Extended: “Google-Extended is a standalone product token that web publishers can use to manage whether content Google crawls from their sites may be used for training future generations of Gemini models that power Gemini Apps and Vertex AI API for Gemini and for grounding (providing content from the Google Search index to the model at prompt time to improve factuality and relevancy) in Gemini Apps and Grounding with Google Search on Vertex AI. Google-Extended does not impact a site’s inclusion in Google Search nor is it used as a ranking signal in Google Search.”
During the testimony, as reported by Bloomberg (whose article has since been updated), Department of Justice lawyer Diana Aguilar asked: “Once you take the Gemini” AI model “and put it inside the search org, the search org has the ability to train on the data that publishers had opted out of training [via Google-Extended], correct?”
Collins affirmed, “Correct — for use in search.” This confirmed that content opted out via Google-Extended could still be used by Search, because Google-Extended is not the control mechanism for Search data usage.
A Clarified Distinction, Under Antitrust Scrutiny
While Collins’ testimony brought the operational details of these controls into the antitrust spotlight, Google had previously indicated this separation. Shortly after introducing Google-Extended, the company clarified in October 2023 that this specific AI training control did not apply to its Search Generative Experience (SGE) – the experimental feature that evolved into AI Overviews.
At that time, Google stated SGE, being a Search feature, was governed by standard webmaster controls affecting search visibility, like `noindex` meta tags or traditional robots.txt `disallow` rules. Google reiterated this point, stating: “Google has a separate way for publishers to manage their content in Search via the well-established robots.txt web standard.”
A Google spokesperson had previously advised website administrators: “For Search, website administrators should continue to use the Googlebot user agent through robots.txt and the NOINDEX meta tag to manage their content in search results, including experiments like Search Generative Experience,” as reported by Search Engine Roundtable.
This setup means publishers need to understand the distinct purposes of the different controls. Google-Extended governs whether content is used to train general AI models like Gemini. Managing how content is used within Google Search and its AI features, such as AI Overviews, requires the established Search-specific controls (e.g., robots.txt directives for Googlebot).
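To make the separation concrete, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The two robots.txt bodies, the example.com URLs, and the helper name `is_allowed` are hypothetical and chosen for illustration. The sketch shows only how a standard robots.txt parser evaluates rules per user-agent token; it does not model how Google processes these files internally.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: opts out via the Google-Extended token only.
AI_OPT_OUT_ONLY = """\
User-agent: Google-Extended
Disallow: /
"""

# Hypothetical robots.txt: same opt-out, plus a Search-specific rule
# that keeps Googlebot out of /drafts/.
AI_AND_SEARCH_RULES = """\
User-agent: Google-Extended
Disallow: /

User-agent: Googlebot
Disallow: /drafts/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/drafts/story.html"  # placeholder URL
    rule_sets = [
        ("opt-out only", AI_OPT_OUT_ONLY),
        ("opt-out + Search rule", AI_AND_SEARCH_RULES),
    ]
    for label, rules in rule_sets:
        for agent in ("Google-Extended", "Googlebot"):
            print(f"{label:22} {agent:16} allowed={is_allowed(rules, agent, url)}")
```

Under these assumptions, the first file matches `Google-Extended` but leaves `Googlebot` unrestricted, which mirrors the point above: the Gemini training opt-out has no effect on Search unless a separate `Googlebot` rule is added.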
Some publishers have also explored “preview controls” (`nosnippet`, `max-snippet`), which Google suggests can limit how much content is displayed in AI Overviews. These controls affect display only: they do not stop the underlying data from being used for training unless Search controls restrict that use separately.
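For publishers auditing which preview controls a page actually declares, a small check along the following lines may help. It is a sketch using Python's standard `html.parser`; the class `RobotsMetaParser` and the function `preview_directives` are hypothetical names, and the sample markup is invented. It only reads `<meta name="robots">` and `<meta name="googlebot">` tags and says nothing about how Google interprets them.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from robots/googlebot meta tags in a page."""

    def __init__(self):
        super().__init__()
        self.directives = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None for valueless attributes.
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() not in ("robots", "googlebot"):
            return
        for raw in attr_map.get("content", "").split(","):
            directive = raw.strip().lower()
            if not directive:
                continue
            # "max-snippet:50" -> key "max-snippet", value "50";
            # "nosnippet" -> key "nosnippet", value None.
            key, _, value = directive.partition(":")
            self.directives[key.strip()] = value.strip() or None


def preview_directives(html: str) -> dict:
    """Return a mapping of declared robots meta directives to their values."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    sample = '<head><meta name="robots" content="max-snippet:50, noindex"></head>'
    print(preview_directives(sample))
```

A page that declares no such tag yields an empty mapping, meaning no preview limits are set at the page level in this check.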
The creation of the Google-Extended control itself followed pressure, notably from bodies like the French Competition Authority, which scrutinized Google’s AI data practices and the need for opt-out mechanisms.
Broader Industry Conflicts Over AI Data
Google’s data practices exist within a wider context of tension between AI developers and content creators. Numerous publishers and media groups have expressed alarm or taken action, such as proactively blocking AI web crawlers, over the uncompensated use of their material to build valuable AI models. In March, Cloudflare gave publishers an additional option by launching AI Labyrinth, a system that lures unauthorized AI crawlers into mazes of auto-generated content.
Meanwhile, lawsuits are ongoing, with publisher Ziff Davis suing OpenAI for allegedly scraping content from sites like PCMag and IGN while ignoring opt-out signals, and The New York Times pursuing a high-profile case against both OpenAI and Microsoft over alleged widespread copyright infringement.
While some AI companies like OpenAI are pursuing content licensing deals with publishers, Google has historically relied heavily on its ability to index the public web. It formalized that practice in a July 2023 privacy policy update stating: “For example, we use publicly available information to help train Google’s AI models and build products and features like Google Translate, Bard, and Cloud AI capabilities.” Google also secured a reported $60 million annual deal with Reddit in February 2024.
The scale of data involved is immense. An internal Google document referenced during Collins’ testimony indicated that applying publisher opt-outs (via Google-Extended) filtered out 80 billion content “tokens” (pieces of text data used for training) from a 160-billion-token dataset intended for DeepMind training. In effect, publisher preferences removed half of the collected data for that specific use case (i.e., training models like Gemini).
Testimony also touched upon internal discussions involving Google DeepMind CEO Demis Hassabis about the potential value of using Google’s vast search data, including ranking signals, to further enhance AI model performance, as reported by Bloomberg.
Antitrust Implications and Google’s Defense
This detailed look at Google’s data practices is central to the ongoing antitrust remedies trial. Judge Amit Mehta, having already found Google illegally maintained its search monopoly, must now decide on the DOJ’s proposed fixes. These include potentially forcing a sale of the Chrome browser and banning the types of exclusive default placement deals (including for AI like Gemini) that helped cement Google’s dominance.
The DOJ contends Google is unfairly leveraging its search power and data access in the AI sphere, pointing to large payments to Samsung for Gemini pre-installation as echoing past anticompetitive behavior.
Google counters that its success stems from superior products and that AI competition is robust, with chatbot makers often striking direct deals with content providers for specific data needs, bypassing reliance on web indexes. CEO Sundar Pichai argued strongly against the DOJ’s remedies, calling data-sharing demands a “de facto divestiture of search” that would undermine the company’s ability to fund research and development.
While Google previously introduced copyright indemnity for the output of certain enterprise AI tools, the recent clarifications underscore how important it is for publishers to understand the distinct controls available for managing their content’s use across Google’s various AI applications. A decision from Judge Mehta on the antitrust remedies is expected later this year.