Imagine you owned a bookstore. Most of your revenue depends on customers coming in and buying books, so you set up different aspects of the business around that activity. You might put low-cost “impulse buy” items near the checkout or start selling coffee as a convenience. You might even partner with publishers to put displays of popular bestsellers in high-visibility locations in the store to drive sales.
Now imagine one day a robot comes in to buy books on behalf of someone. It ignores the displays, the coffee kiosk, and the tchotchkes near the till. It just grabs the book the person ordered, pays for it, and walks out. The next day 4 robots come in, then 12 the day after that. Soon, robots outnumber the human customers in your store, whose numbers dwindle by the day. You see very few sales from nonbook items, publishers stop bothering with those displays, and the coffee goes cold. Revenue plummets.
In response, you might start charging robots a fee to enter your store, and if they don’t pay it, you deny them entry. But then one day a robot that looks just like a human comes in—to the point that you can’t tell the difference. What do you do then?
This analogy is basically what the publishing world is going through right now, with bot traffic to media websites skyrocketing over the past three months. That’s according to new data from TollBit, which recently published its State of the Bots report for the first quarter of 2025. Even more concerning, however, is that the most popular AI search engines are choosing to ignore long-respected standards for blocking bots, in some cases arguing that when a search “agent” acts on behalf of an individual user, the bot should be treated as human.
The robot revolution
TollBit’s report paints a fast-changing picture of what’s happening with AI search. Over the past several months, AI companies have either introduced search abilities or greatly increased their search activity. Bot scraping focused on retrieval-augmented generation (RAG), which is distinct from training data collection, increased 49% over the previous quarter. Anthropic’s Claude notably introduced search, and in the same period ChatGPT (the world’s most popular chatbot by far) had a spike in users, plus deep research tools from all the major providers began to take hold.
At the same time, publishers increased their defenses. The report reveals that media websites in January were using various methods to block AI bots four times as often as they had a year before. The first line of defense is to adjust a website’s robots.txt file, which tells crawlers which specific bots are welcome and which ones are forbidden from accessing the content.
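As a sketch, a robots.txt file that welcomes a traditional search crawler but turns away AI crawlers might look like the following. GPTBot and ClaudeBot are published user-agent tokens for OpenAI’s and Anthropic’s crawlers; which bots a given site actually blocks is the publisher’s call:

```
# Allow Google's traditional search indexer
User-agent: Googlebot
Allow: /

# Disallow OpenAI's crawler
User-agent: GPTBot
Disallow: /

# Disallow Anthropic's crawler
User-agent: ClaudeBot
Disallow: /
```

The catch, as the report makes clear, is that nothing in the protocol enforces these rules; a crawler simply chooses whether to read and obey them.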
The thing is, adhering to robots.txt is ultimately an honor system and not really enforceable. And the report indicates more AI companies are treating it as such: Among sites in TollBit’s network, bot scrapes that ignore robots.txt increased from 3.3% to 12.9% in just one quarter.
Part of that increase is due to a relatively new stance the AI companies have taken, and it’s subtle but important. Broadly speaking, there are three different kinds of bots that scrape or crawl content:
- Training bots: These are bots that crawl the internet to scrape content to provide training data for AI models.
- Search indexing bots: Bots that go out and crawl the web to ensure the model has fast access to important information outside its training set (which is usually out of date). This is a form of RAG.
- User agent bots: Also a form of RAG, these are crawlers that go out to the web in real time to find information directly in response to a user query, regardless of whether the content it finds has been previously indexed.
Because No. 3 is an agent acting on behalf of a human, AI companies argue that it’s an extension of that user behavior and have essentially given themselves permission to ignore robots.txt settings for that use case. This isn’t guesswork—Google, Meta, and Perplexity have made it explicit in their developer notes. This is how you get human-looking robots in the bookstore.
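To make the enforcement problem concrete, here is a minimal sketch of the kind of server-side check a publisher might run, assuming naive substring matching against known AI-crawler tokens (the token list uses real published user-agent names; the function name is ours):

```python
# Hypothetical server-side check: flag requests whose User-Agent
# announces a known AI crawler. Substring matching is naive by
# design -- a bot that presents a browser-like User-Agent (the
# "human-looking robot" in the bookstore analogy) sails right
# through, which is exactly the problem the article describes.
KNOWN_AI_CRAWLERS = ("GPTBot", "ClaudeBot", "PerplexityBot", "CCBot")

def is_declared_ai_bot(user_agent: str) -> bool:
    """Return True if the User-Agent announces a known AI crawler."""
    return any(token in user_agent for token in KNOWN_AI_CRAWLERS)

# A crawler that identifies itself honestly is easy to catch...
print(is_declared_ai_bot("Mozilla/5.0 (compatible; GPTBot/1.0)"))  # True
# ...but one masquerading as a browser is indistinguishable from a human.
print(is_declared_ai_bot("Mozilla/5.0 (Windows NT 10.0) Chrome/124.0"))  # False
```

This kind of check only works against bots that volunteer their identity, which is why the relabeling of user agent bots as “human” activity matters so much.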
When humans go to websites, they see ads. Humans can be intrigued or enticed by other content, such as a link to a podcast about the same topic as an article they’re reading. Humans can decide whether or not to pay for a subscription. Humans sometimes choose to make a transaction based on the information in front of them.
Bots don’t really do any of that (not yet, anyway). Large parts of the internet economy depend on human attention to websites, but as the report shows, that behavior drops off massively when someone uses AI to search the web—AI search engines provide very little in the way of referral traffic compared to traditional search. This of course is what’s behind many of the lawsuits now in play between media companies and AI companies. How that is resolved in the legal realm is still TBD, but in the meantime, some media sites are choosing to block bots—or at least are attempting to—from accessing their content at all.
For user agent bots, however, that ability has been taken away. The AI companies have always seen data harvesting in the way that’s most favorable to their insatiable demand for it, famously claiming that data only needs to be “publicly available” to qualify as training data. Even when they claim to respect robots.txt for their search engines, it’s an open secret that they sometimes use third-party scrapers to bypass it.
Unmasking the bots
So apart from suing and hoping for the best, how can publishers regain some, well, agency in the emerging world of agent traffic? If you believe AI substitution threatens your growth, there are additional defenses to consider. Hard paywalls are easier to defend, both technically and legally, and there are several companies (including TollBit, but there are others, such as ScalePost) that specialize in redirecting bot traffic to paywalled endpoints specifically for bots. If the robot doesn’t pay, it’s denied the content, at least in theory.
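In principle, a bot paywall is HTTP’s long-dormant 402 Payment Required status finally earning its keep. The sketch below is purely illustrative, assuming a simple token scheme; it is not TollBit’s or ScalePost’s actual API, and the token store and bot detection are stand-ins:

```python
# Illustrative bot-paywall logic: a declared bot must present a
# valid payment token or it gets 402 Payment Required instead of
# the article. All names here are hypothetical, not a real API.
PAID_TOKENS = {"tok-abc123"}  # tokens issued to paying bots (stand-in)
AI_BOT_TOKENS = ("GPTBot", "ClaudeBot", "PerplexityBot")

def serve(user_agent: str, payment_token, article: str):
    """Return an (HTTP status, body) pair for a content request."""
    is_bot = any(token in user_agent for token in AI_BOT_TOKENS)
    if is_bot and payment_token not in PAID_TOKENS:
        return 402, "Payment Required"
    return 200, article

print(serve("GPTBot/1.0", None, "full text"))          # (402, 'Payment Required')
print(serve("GPTBot/1.0", "tok-abc123", "full text"))  # (200, 'full text')
```

As with robots.txt, the weak point is the first line: the scheme only bills bots that identify themselves.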
Collective action is another possibility. I doubt publishers would launch a class action around this specific relabeling of user agents, but it does provide more ammunition in broader copyright lawsuits. Besides going to court, industry associations could come out against the move. The News/Media Alliance in particular has been very vocal about AI companies’ alleged transgressions of copyright.
The idea of treating agentic activity as the equivalent of human activity has consequences that go beyond the media. Any content or tool that’s been traditionally available for free will need to reevaluate that access now that robots are destined to be a growing part of the mix. If there was any doubt that simply updating robots.txt instructions is no longer adequate, the TollBit report puts it to rest.
The stance that “AI is just doing what humans do” is often used as a defense for when AI systems ingest large amounts of information and then produce new content based on it. Now the makers of those systems are quietly extending that idea, allowing their agents to effectively impersonate humans while shopping the web for data. Until it’s clear how to build profitable stores for robots, there should be a way to force their masks off.