Allegations of Improper Data Collection Are Not New for Perplexity
Artificial intelligence firm Perplexity is facing allegations that it sidestepped long-standing internet norms in pursuit of data. Cloudflare accused the AI search engine startup, which positions itself as a Google alternative, of ignoring website restrictions and disguising its scraping activity.
The network security and infrastructure company’s engineers on Monday outlined what they described as behavior consistent with attempts to bypass content restrictions. Despite publishers explicitly disallowing Perplexity in their robots.txt files – the decades-old standard used to signal what content should not be indexed or scraped – the company allegedly continued to access site content.
“This undeclared crawler utilized multiple IPs not listed in Perplexity’s official IP range,” Cloudflare wrote. Researchers said Perplexity’s bots rotated IP addresses and altered their user-agent strings to resemble a Google Chrome browser on macOS, tactics commonly used to bypass firewall rules and detection systems.
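For illustration only, here is a minimal Python sketch of why that tactic works: the User-Agent header is entirely self-reported, so a crawler that swaps its declared name for a generic Chrome-on-macOS string slips past any filter that matches on the crawler's identity. Both user-agent strings below are approximations, not the exact strings Cloudflare observed.

```python
# Approximation of Perplexity's declared crawler identity, per the company's
# published bot documentation; the exact string may differ.
DECLARED_UA = "Mozilla/5.0 (compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)"

# A generic Chrome-on-macOS browser identity, the kind Cloudflare says the
# undeclared crawler presented instead.
SPOOFED_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/124.0.0.0 Safari/537.36"
)

def blocked_by_naive_filter(user_agent: str) -> bool:
    # A typical firewall rule matches on the crawler's self-reported name.
    return "PerplexityBot" in user_agent or "Perplexity-User" in user_agent

# The declared identity is blocked; the spoofed one sails through, because
# the User-Agent header is whatever the client chooses to send.
for ua in (DECLARED_UA, SPOOFED_UA):
    print(blocked_by_naive_filter(ua), "-", ua[:40])
```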
Cloudflare reported observing this behavior across tens of thousands of domains, amounting to millions of content requests per day. “We were able to fingerprint this crawler using a combination of machine learning and network signals,” the company wrote.
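Cloudflare has not published its detection logic, but a toy heuristic conveys the idea of combining signals: a request that claims to be Chrome while missing headers real Chrome routinely sends, arriving from network space outside a crawler's published IP ranges, is suspect. Everything below is a hypothetical sketch, not Cloudflare's method.

```python
def looks_like_stealth_crawler(headers: dict[str, str], ip_in_published_range: bool) -> bool:
    """Toy heuristic for flagging a browser-impersonating crawler.

    Hypothetical sketch only: Cloudflare's production system combines
    machine learning with far more network signals than shown here.
    """
    ua = headers.get("User-Agent", "")
    claims_chrome_on_mac = "Chrome" in ua and "Macintosh" in ua
    # Modern Chrome sends client-hint headers such as Sec-CH-UA by default;
    # their absence alongside a Chrome user agent is a cheap mismatch signal.
    missing_client_hints = "Sec-CH-UA" not in headers
    return claims_chrome_on_mac and missing_client_hints and not ip_in_published_range

# A request claiming to be Chrome on macOS, sending no client hints, from an
# IP outside any published crawler range, trips all three signals.
print(looks_like_stealth_crawler(
    {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Chrome/124.0 Safari/537.36"},
    ip_in_published_range=False,
))  # True
```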
The Robots Exclusion Protocol was introduced in 1994 by engineer Martijn Koster and formally standardized as RFC 9309 in 2022. It allows websites to set boundaries for crawlers. Compliance with the protocol is voluntary, but it is a widely observed norm among legitimate web crawlers.
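By way of example, a publisher opting out would serve directives like these from robots.txt at the site root. The two user-agent tokens are Perplexity's documented crawler names, cited by Cloudflare below; the blanket Disallow is an assumption for illustration.

```
# robots.txt served at https://example.com/robots.txt
User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /
```

A robots.txt file is a request, not an access control: nothing technically prevents a crawler from fetching pages anyway, which is why publishers pair it with firewall rules.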
Cloudflare said the crawler activity continued despite sites implementing both robots.txt restrictions and web application firewall rules to block Perplexity's known user agents, PerplexityBot and Perplexity-User. When those were blocked, Perplexity allegedly switched to alternate methods that obscured its identity. Cloudflare said it has removed Perplexity from its list of verified bots and introduced new detection rules to block future activity.
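A web application firewall rule of the kind the affected sites deployed might match on the declared user agents, roughly like this Cloudflare custom-rule expression (illustrative; the field and operator names follow Cloudflare's Rules language):

```
(http.user_agent contains "PerplexityBot") or (http.user_agent contains "Perplexity-User")
```

Paired with a block action, an expression like this stops the declared crawlers, but, as Cloudflare's report underscores, it is trivially evaded by the user-agent rotation described above, which is why the company has moved to fingerprint-based detection and its verified-bots list instead.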
Perplexity has not issued a formal public statement, but company spokesperson Jesse Dwyer told TechCrunch that Cloudflare’s blog post was a “sales pitch” and that the bot named in it “isn’t even ours.” Dwyer also claimed the screenshots in the post showed that “no content was accessed.” Perplexity did not respond to Information Security Media Group’s request for comment.
Allegations of improper data collection are not new for Perplexity. Forbes had previously accused the company of publishing an article that appeared to closely mirror its reporting, describing it as “cynical theft.” Wired also reported suspicious bot traffic that appeared to ignore robots.txt exclusions and linked the activity to Perplexity. In both cases, the accusations involved scraping or summarizing proprietary content without attribution or permission.
Other AI firms have faced similar scrutiny. Reddit in June sued Anthropic, alleging that the AI company scraped content in violation of its user agreement and California’s competition law. Reddit CEO Steve Huffman told The Verge that companies such as Microsoft, Anthropic and Perplexity acted “as though all of the content on the Internet is free for them to use.”
The dynamic between AI firms and content providers is growing increasingly contentious. In the web’s early days, search crawlers provided mutual value: they helped users find websites and drove traffic and revenue to publishers. AI bots, by contrast, use scraped data for model training or on-the-fly retrieval, often returning no direct benefit to publishers.
Bot mitigation firm TollBit has documented the rise in scraping activity. Its State of the Bots Q1 2025 report recorded an 87% quarter-over-quarter increase in scraping, with the share of bots ignoring robots.txt directives jumping from 3.3% to 12.9%. In March alone, TollBit logged 26 million scrapes that bypassed such directives.
The imbalance between scraped data and value returned to publishers is stark. On sites monitored by TollBit, Bing generated one human referral for every 11 scrapes. OpenAI’s ratio was 179:1. For Perplexity, it was 369:1. Anthropic’s bots reportedly performed 8,692 scrapes per referred visitor.
Perplexity has attempted to address some of these concerns through its Publishers’ Program, which offers payment to select content providers. Other AI companies, such as OpenAI, have signed or are eyeing licensing deals with major publishers. Reddit has monetized its data through partnerships as well. But many websites are excluded from such agreements and unauthorized scraping continues (see: OpenAI and Microsoft Face New York Times Copyright Lawsuit).
Cloudflare recently introduced a tool to block AI bots from scraping content and launched a marketplace that allows publishers to charge AI companies for access. The company’s executives have argued that AI poses structural challenges to how the web currently functions for content creators and site operators.