OpenAI has launched GPTBot, a web crawler whose collected pages may be used to improve the company's AI models, such as those behind ChatGPT. OpenAI says crawled pages are filtered to exclude paywalled and privacy-sensitive sources, and website operators can opt out of being crawled.
“Web pages crawled with the GPTBot user agent may potentially be used to improve future models and are filtered to remove sources that require paywall access, are known to gather personally identifiable information (PII), or have text that violates our policies,” OpenAI said in the blog post.
Website operators can now block the crawler that OpenAI uses to scrape their site's content for training its AI models. The crawler, called GPTBot, can be blocked either by adding a rule to the site's robots.txt file or by blocking its IP address.
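As a rough illustration of the IP-based option, the sketch below checks whether a request's source address falls inside a list of blocked ranges. The ranges shown are placeholders taken from the reserved documentation blocks, not GPTBot's real addresses, and the names BLOCKED_RANGES and is_blocked are hypothetical. A real deployment would substitute the ranges OpenAI publishes and would more likely enforce them in a firewall or web server than in application code.

```python
import ipaddress

# Placeholder ranges (reserved documentation blocks), NOT GPTBot's actual addresses.
# Replace them with the ranges published by the crawler's operator.
BLOCKED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_blocked(client_ip: str) -> bool:
    """Return True if client_ip falls inside any blocked range."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in network for network in BLOCKED_RANGES)


if __name__ == "__main__":
    print(is_blocked("192.0.2.17"))   # inside the first placeholder range
    print(is_blocked("203.0.113.5"))  # outside both placeholder ranges
```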
OpenAI said this feature was designed to respect the preferences of website owners who may not want their data to be used for AI research. Website owners who do not want GPTBot to crawl their site can add the following lines to their robots.txt file:
User-agent: GPTBot
Disallow: /
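Site owners who want to confirm that such a rule behaves as intended can test it locally before deploying it. The sketch below uses Python's standard urllib.robotparser module to check a robots.txt body that disallows the GPTBot user agent. The helper name is_allowed and the example.com URL are illustrative assumptions, and the result only shows how a compliant parser reads the rule, not how any particular crawler actually behaves.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt body that disallows the GPTBot user agent site-wide.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/articles/1"  # illustrative URL
    print(is_allowed(ROBOTS_TXT, "GPTBot", page))        # False: GPTBot is disallowed
    print(is_allowed(ROBOTS_TXT, "SomeOtherBot", page))  # True: no rule applies to it
```

Because the rule names only GPTBot, other crawlers are unaffected unless the file adds rules for them as well.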
This feature may be the first step towards allowing internet users to choose whether their data is used for training large language models. The issue has been a source of controversy and debate, as many sites and creators have objected to AI companies using their data without consent or compensation.
Choosing How AI Accesses Your Website
Reddit and Twitter, for example, have tried to restrict AI companies' free use of their users' posts, while authors and other creatives have sued over alleged unauthorized use of their works. The question of data privacy and consent has also attracted the attention of lawmakers, who raised it in several Senate hearings on AI regulation last month.
Some companies and organizations have proposed different ways to mark data as not for training, such as a “NoAI” tag suggested by DeviantArt last year, or an anti-impersonation law advocated by Adobe. AI companies, including OpenAI, have also agreed with the White House to develop a watermarking system to let people know if something was generated by AI, but they have not made any commitments to stop using internet data for training.
Blocking GPTBot is one way for website owners to exercise some control over their data, but it does not affect data that has already been scraped from their site and used to train ChatGPT.