Developer Alleges Perplexity AI Ignores Robots.txt Protocols

Developer Robb Knight found Perplexity AI was accessing MacStories content despite there being a direct block on the crawler.

A developer has raised concerns about how Perplexity AI handles the web directives meant to control bot access. Despite blocks set at several levels, Perplexity AI appears to bypass them to gather content.

Robb Knight, leveraging his background in web development, noticed unauthorized access by Perplexity AI on protected pages of both Radweb and MacStories. This prompted a more thorough investigation into the bot's behavior.

Radweb specializes in building web and mobile applications, with expertise in cloud technology. It offers services such as designing user interfaces, implementing server infrastructure, and coding front-end and back-end functionality. MacStories is a website and a network of podcasts covering Apple products.

Measures to Block AI Bots

Starting March 30th, attempts were made to keep out PerplexityBot by disallowing it in robots.txt files. Robots.txt is a text file on a website that tells crawlers, such as those run by search engines, which parts of the site they can and can't access. It works as a set of instructions that marks certain areas as off-limits, but it depends on the crawler choosing to honor it.
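To illustrate how a well-behaved crawler is expected to read such a file, here is a minimal Python sketch using the standard library's robots.txt parser. The robots.txt content and the example.com URL are assumptions made for illustration; the article does not quote the actual files used by Radweb or MacStories.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the real files are not quoted in the article.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/some-article"  # placeholder URL
    print("PerplexityBot allowed:", is_allowed("PerplexityBot", page))
    print("SomeOtherBot allowed:", is_allowed("SomeOtherBot", page))
```

The check happens entirely on the crawler's side. Nothing in the file itself stops a client that skips this step, which is why server-side blocking was added later.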

By June 14th, stricter server-side blocks were added in nginx to return a 403 Forbidden status to blocked user agents. Federico Viticci of MacStories confirmed the direct blocking on Mastodon. Despite these measures, Perplexity AI continued to scrape restricted content, raising suspicions about the bot's compliance.
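The decision rule behind that kind of block can be sketched in a few lines of Python. This is a stand-in for the logic only, not the nginx configuration used on either site, and the list of blocked substrings is an assumption.

```python
from http import HTTPStatus

# Assumed block list; the real server's list is not given in the article.
BLOCKED_AGENT_SUBSTRINGS = ("perplexitybot",)


def status_for(user_agent: str) -> HTTPStatus:
    """Return 403 for a blocked user agent and 200 for anything else."""
    ua = (user_agent or "").lower()
    if any(token in ua for token in BLOCKED_AGENT_SUBSTRINGS):
        return HTTPStatus.FORBIDDEN
    return HTTPStatus.OK


if __name__ == "__main__":
    for agent in ("PerplexityBot/1.0", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"):
        print(agent, "->", int(status_for(agent)))
```

A rule like this only sees the User-Agent header the client chooses to send, so it blocks a bot that identifies itself and nothing more.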

Testing Blocking Effectiveness

The developer ran tests to confirm the blocking mechanisms were working. A detection project showed that the server correctly recognized PerplexityBot and blocked it with a 403 response. This led to direct inquiries to Perplexity AI about how it was accessing the content.
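A check of this kind can be approximated locally. The sketch below is not the developer's detection project. It starts a throwaway server on 127.0.0.1 that applies an assumed user-agent rule, then requests the same page with two different User-Agent headers. The handler, the helper names, and both header strings are hypothetical.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class BlockingHandler(BaseHTTPRequestHandler):
    """Toy server: 403 for a declared PerplexityBot user agent, 200 otherwise."""

    def do_GET(self):
        ua = self.headers.get("User-Agent", "").lower()
        status = 403 if "perplexitybot" in ua else 200
        self.send_response(status)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def fetch_status(url: str, user_agent: str) -> int:
    """Request url with the given User-Agent and return the HTTP status code."""
    # An empty ProxyHandler makes the request go straight to the local server.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with opener.open(request, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code


def run_check() -> dict:
    """Start the toy server on a free port and probe it with two user agents."""
    server = HTTPServer(("127.0.0.1", 0), BlockingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = f"http://127.0.0.1:{server.server_port}/"
        return {
            "declared_bot": fetch_status(url, "PerplexityBot/1.0"),
            "browser_like": fetch_status(url, "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"),
        }
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(run_check())
```

The declared bot is refused while the browser-like request goes through, which shows both that the block works and where its limit lies.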

Initially, Perplexity AI denied having the capabilities to bypass robots.txt or gather restricted content. However, they eventually acknowledged that summarizing such content went against ethical standards and should not have happened, sparking further scrutiny of their practices.

Server Log Examination

An analysis of server logs revealed that, rather than using its declared PerplexityBot identifier, the AI used a common user agent string typically associated with Google Chrome on Windows 10. This suggested the use of headless browsers to circumvent the blocks while hiding the crawler's identity.
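The kind of log check described here can be sketched as follows, assuming the server writes the common "combined" access log format. The two log lines are invented for illustration, with documentation-reserved IP addresses and a placeholder date. They are not taken from the developer's logs.

```python
import re
from collections import Counter

# Combined log format: ip - - [time] "request" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

# Invented sample lines (reserved example IPs, placeholder date and versions).
SAMPLE_LOG = [
    '192.0.2.10 - - [01/Jan/2000:10:00:00 +0000] "GET /post HTTP/1.1" 403 0 '
    '"-" "PerplexityBot/1.0"',
    '203.0.113.7 - - [01/Jan/2000:10:00:05 +0000] "GET /post HTTP/1.1" 200 5120 '
    '"-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/100.0.0.0 Safari/537.36"',
]


def summarize(lines):
    """Count requests by (declares PerplexityBot?, HTTP status)."""
    counts = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        declared = "perplexitybot" in match["agent"].lower()
        counts[(declared, int(match["status"]))] += 1
    return counts


if __name__ == "__main__":
    for (declared, status), total in sorted(summarize(SAMPLE_LOG).items()):
        label = "declared bot" if declared else "other agent"
        print(f"{label:12} status={status} requests={total}")
```

A request carrying a generic browser user agent falls into the "other agent" bucket and looks like an ordinary visitor. That is why the logs alone cannot attribute such traffic without some other signal, such as a known IP range.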

The developer noted that it is difficult to keep Perplexity AI away from the content, because the IP addresses involved did not fall within any known ranges for the company. He is now considering a GDPR request and joining Perplexity AI's Discord to find out more.

Forbes Complaint Against Perplexity AI

Forbes has recently called out Perplexity AI for allegedly copying its content without giving credit. The dispute revolves around an article on Eric Schmidt's drone company that Forbes claims was copied by Perplexity in an AI-generated podcast. Forbes was part of a group of news services that criticized Perplexity last week, and has followed up its complaint with a full feature. 

Perplexity AI was started in 2022 by Aravind Srinivas, Denis Yarats, Johnny Ho, and Andrew Konwinski. The company has secured over $100 million in venture capital and is currently seeking to raise an additional $250 million, with a target valuation of $2.5 billion to $3 billion.

Luke Jones
Luke has been writing about all things tech for more than five years. He is following Microsoft closely to bring you the latest news about Windows, Office, Azure, Skype, HoloLens and all the rest of their products.