A GitHub-hosted project offers a curated robots.txt file designed to block known AI crawlers from accessing website content.
The initiative, called ai.robots.txt, aims to protect online materials from being used to train large language models (LLMs) without permission.
By offering a simple file that lists known AI crawlers and disallows them, the project invites developers to assert greater control over their data and encourages AI companies to adhere to ethical practices.
The project reflects growing frustration among developers and publishers with the opaque methods AI systems use to collect training data. While it cannot enforce compliance, its curated robots.txt puts the spotlight on the ethical responsibilities of AI companies as their technologies reshape the internet.
How the Curated Robots.txt Works
The offered robots.txt file includes an open-source list of user agent names associated with AI crawlers, sourced partly from Dark Visitors, an initiative that tracks bot activity.
Developers are encouraged to contribute updates by submitting pull requests on GitHub, ensuring that the list remains current as new bots emerge. The project gives site owners a much-needed tool for managing how their content is accessed and used.
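The file itself uses plain robots.txt syntax: the listed user agents are grouped together and disallowed from the entire site. The excerpt below is only an illustration, not the maintained list; the agent names shown are publicly documented AI crawlers, and the actual file on GitHub covers many more.

    # Illustrative excerpt; the maintained list on GitHub is much longer.
    User-agent: GPTBot
    User-agent: CCBot
    User-agent: Google-Extended
    User-agent: anthropic-ai
    User-agent: PerplexityBot
    Disallow: /

Served from the site root as /robots.txt, this single group asks every listed agent to stay away from the whole site.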
The curated file’s effectiveness, however, hinges on voluntary compliance by AI companies. Many AI crawlers operate outside the ethical boundaries respected by traditional web crawlers like Googlebot.
Advanced techniques such as headless browsing, which enables bots to mimic human behavior, make it harder to identify and block unauthorized access.
Server-side measures, such as IP blocking and customized firewall rules, offer additional protection but are not foolproof.
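As a sketch of that server-side approach, and explicitly not code from the ai.robots.txt project, the minimal WSGI middleware below returns a 403 Forbidden response to any request whose User-Agent matches a small, assumed blocklist; a real deployment would load the maintained list rather than hard-coding names.

    # Minimal sketch of server-side user agent blocking. The names below
    # are an assumed subset, not the project's maintained list.
    BLOCKED_AGENTS = ("GPTBot", "CCBot", "PerplexityBot", "anthropic-ai")

    def block_ai_crawlers(app):
        """Wrap a WSGI app and reject known AI crawler user agents."""
        def middleware(environ, start_response):
            user_agent = environ.get("HTTP_USER_AGENT", "")
            if any(bot.lower() in user_agent.lower() for bot in BLOCKED_AGENTS):
                start_response("403 Forbidden", [("Content-Type", "text/plain")])
                return [b"Forbidden"]
            return app(environ, start_response)
        return middleware

Because the check keys on the self-reported User-Agent header, it stops only bots that identify themselves honestly, which is exactly the limitation the Perplexity case below illustrates.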
More and More Crawlers Are Harvesting Data for AI
Microsoft’s Bing crawler reportedly respects robots.txt for its search index, as became clear when Reddit began offering its content exclusively to Google and blocking other search engines such as Bing and DuckDuckGo. That case, however, was primarily about crawling pages for search rather than about training large language models.
As the case of Meta shows, big tech companies do not shy away from shady tactics to obtain data for their AI training. The company has reportedly been using unauthorized datasets of pirated books and academic articles.
YouTube creators are affected in a similar way, as lawsuits filed against the Google subsidiary and Nvidia show; the suits allege that videos were used without permission for AI training.
Perplexity AI: A Case with Compliance Issues
The need for advanced crawling bot blocking became particularly evident last year through incidents involving Perplexity AI. Developer Robb Knight uncovered that Perplexity AI accessed content from his websites, Radweb and MacStories, despite explicit robots.txt directives and server-side blocks configured to return “403 Forbidden” responses.
An analysis of server logs revealed that PerplexityBot used deceptive techniques to bypass the restrictions, such as operating through headless browsers and masking its identity behind generic user agent strings, for instance posing as Google Chrome on Windows.
These methods allowed it to evade detection while scraping restricted content. Initially, Perplexity AI denied that it circumvented these restrictions, but it later admitted to ethical lapses, stating: “Summarizing restricted content should not have happened.”
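Knight’s conclusions came from reading raw server logs. The snippet below is a purely hypothetical illustration of that kind of check, not his actual tooling: it scans a log in the common Apache/nginx combined format and flags requests that present a browser-like User-Agent while fetching paths that robots.txt disallows (the /private/ prefix is assumed).

    # Hypothetical log check, not Knight's actual analysis: flag requests
    # that claim to be a desktop browser yet hit paths disallowed for bots.
    import re

    # Assumes the Apache/nginx "combined" access log format.
    LOG_LINE = re.compile(
        r'^(\S+) \S+ \S+ \[[^\]]*\] "GET (\S+)[^"]*" \d+ \S+ "[^"]*" "([^"]*)"'
    )
    DISALLOWED_PREFIXES = ("/private/",)  # assumed Disallow rules

    def suspicious_requests(log_path):
        """Yield (ip, path, agent) for browser-looking hits on blocked paths."""
        with open(log_path) as log:
            for line in log:
                match = LOG_LINE.match(line)
                if not match:
                    continue
                ip, path, agent = match.groups()
                if path.startswith(DISALLOWED_PREFIXES) and "Chrome" in agent:
                    yield ip, path, agent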
MacStories’ Federico Viticci confirmed Knight’s findings, explaining that additional server-level measures had been deployed to block PerplexityBot. However, even these advanced protections were not foolproof, highlighting the difficulty of ensuring compliance with ethical standards in web crawling.
In Perplexity AI’s case, Knight noted that its IP ranges did not match any publicly known company-owned addresses, complicating enforcement efforts further. This highlights the need for more robust tools and regulatory frameworks to address the challenges posed by increasingly sophisticated AI bots.
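Where a crawler operator does publish its address ranges, the reverse check is straightforward. The sketch below uses Python’s standard ipaddress module with RFC 5737 documentation-only placeholder blocks, since no verified Perplexity ranges were available to Knight:

    # Check whether a client IP falls inside an operator's published ranges.
    # The CIDR blocks are documentation placeholders, not real crawler
    # addresses.
    from ipaddress import ip_address, ip_network

    PUBLISHED_RANGES = [ip_network("192.0.2.0/24"), ip_network("198.51.100.0/24")]

    def is_official_bot(client_ip: str) -> bool:
        """Return True if client_ip lies inside any published crawler range."""
        addr = ip_address(client_ip)
        return any(addr in net for net in PUBLISHED_RANGES)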
Perplexity is not alone in this practice, however, as the mounting number of copyright lawsuits against AI developers shows. The New York Times is embroiled in a costly lawsuit against Microsoft and OpenAI over alleged content theft.
The case is just one example of a larger wave of dissatisfaction among media outlets, which have called for stricter standards to govern AI data collection.