One year ago, we declared the first Content Independence Day, and we gave website owners the means to take back control of their content. The deal between crawlers and website owners that had held up for 30 years — we crawl you, and you get referrals — was no longer true. AI was taking everything and sending back nothing, presenting an existential threat to website owners. And so we launched a one-click "Block AI Bots" option, along with a Pay-Per-Crawl marketplace.
A lot has changed in a year. Last July, conversations around “AI bots” centered around blocking AI training without compensation, pointing to the win–lose deal where content was used for model training with no value driven back to the website owner. But a desire for more nuance has emerged: Content owners still want to be able to protect their content, and they should be compensated for the original content that they work hard to create, curate, and share. We also know that locking down content isn’t a one-size-fits-all solution; website owners want more options than resorting to “block all automation, every time.”
If you run a small site, the problem isn’t just that someone could train models on your content — it's that nobody can find you in the first place. So you have to make a Faustian bargain: either show up in search and let AI train on you, or risk losing discoverability. This unfairly advantages incumbent search providers if they use the same bots for both search and training; and this unfair advantage incentivizes new players to be evasive as they try to close the competitive gap.
Today, AI can be in anything. Google search has changed from being sorted by AI to being a full answer engine that answers your question directly on the results page. And Google is not unique in this position — this is the direction in which “search” is moving.
We could debate the cutoff for what qualifies as “AI” today, just to find that the standard changes tomorrow. So, instead of defining a bot primarily as “AI” or not, our updated approach to classification will ask deeper questions about bot or agent behavior: What are they doing on my site? What are they storing? And how will they reshare my content?
To address these questions, we need a more nuanced view — a pragmatic taxonomy that aligns with the AI use cases our customers care about. So we are opening the discussion beyond AI training alone and focusing on three AI use cases that we want all customers to be able to manage:
Search: any behavior that collects or indexes your content, so it can answer questions about it later. The key is that Search is proactively building a database of your site to later respond to queries with. Site owners should expect to get referral traffic or other equitable compensation as a result.
Agent: automated behavior that is acting, usually in real time, on a person's behalf, to get something done right now. This includes chat fetch bots (e.g., ChatGPT-User) and browser-use agents (e.g., Gemini or Claude driving Chrome). The key is that it visits your web application in order to complete a job, and often there's a human waiting on the other end.
Training: a crawler taking your content to train or fine-tune a model. The key is that your data is permanently absorbed into the underlying architecture of the AI to improve its capabilities.
Many popular crawlers on the web fall into one of the classifications above; some fall into multiple. We classify plenty of other behaviors beyond the three above — including ads verification, feed fetching, and agentic transactions (more on this below). But we believe it should be simple for all website owners to manage access for these three AI-centered use cases. We believe that bot operators should separate their crawlers because that creates more transparency for website owners: allowing them to better understand why a given crawler is visiting them, as well as to better manage the access they extend to that crawler. If a company runs automation that builds Search indexes, acts as an Agent, and collects data to Train their models, then we strongly encourage that company to separate the automation into three separate crawlers.
We want a classification system that is scalable and representative of the world of automated traffic as it evolves. Tracking a bot’s purposes is nothing new, but our new taxonomy involves a few updates that better represent the state of bot traffic today. Most notably, we want to recognize that bots that have multiple purposes should be tracked with all purposes, not just one of them.
New options to manage AI traffic
We want to provide more options for managing different kinds of AI traffic to ***all* website owners on the Cloudflare network.**
The managed preset to “Block AI bots” that we’ve announced in the past included single-purpose bots that crawled data for model training, as shown below:
Screenshot of the existing setting to manage AI bot traffic on July 1, 2025.
But not all AI use is the same, and we want our customers to have the controls they need. So, we’re launching the ability to manage AI traffic based on ***three* major use cases: Search, Agent, and Training** crawlers. With these new options, our customers, including those on our Free tier, can more finely tune how they manage AI bot traffic.
Screenshot of the new options to manage AI bot traffic on July 1, 2026.
On September 15, 2026, we’ll be setting new defaults for each of these three classifications. For all new domains onboarding to Cloudflare, the categories of Training and Agent will be blocked by default **on the pages that display ads,** while Search will remain allowed by default.
An ad is a signal that a website owner meant for a person to land there and see it — something monetizable that fuels the business. So, on those pages, we treat human attention as the end goal, and keep away the bots that may prevent this attention (i.e., Training and Agent bots). On the other hand, Search is the behavior that most naturally funnels back visitors, and we believe it’s in the interest of most site owners to allow this.
Another change that will apply on September 15 is that multi-purpose crawlers (specifically those that combine Search with Training) will be allowed or blocked according to all of their behaviors, in line with our call for transparency for website owners. Since the defaults will be enforced by the most restrictive applicable rules, multi-purpose crawlers such as Googlebot, Applebot, and BingBot will be blocked for customers who have chosen to block Training (either through the new options to manage AI traffic, or through the legacy Block AI bots service).
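As a rough illustration of the "most restrictive applicable rule" idea, the sketch below blocks a crawler as soon as any one of its classifications is blocked. The names `Category` and `decide` are hypothetical, and this is not Cloudflare's implementation; it also ignores the distinction between pages that do and do not display ads.

```python
from enum import Enum


class Category(Enum):
    """The three AI use cases named in this post (hypothetical type)."""
    SEARCH = "search"
    AGENT = "agent"
    TRAINING = "training"


def decide(bot_categories: set[Category], blocked: set[Category]) -> str:
    """Return "block" if any of the bot's categories is blocked, else "allow"."""
    # Most restrictive rule wins: one blocked category blocks the whole bot.
    return "block" if bot_categories & blocked else "allow"


# A crawler that combines Search with Training, on a site that blocks Training.
multi_purpose = {Category.SEARCH, Category.TRAINING}
print(decide(multi_purpose, blocked={Category.TRAINING}))      # block
print(decide({Category.SEARCH}, blocked={Category.TRAINING}))  # allow
```

Under this reading, an operator that splits Search and Training into separate crawlers keeps its Search crawler allowed even when Training is blocked.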
Of course, customer choice is paramount: if a website owner wants to opt out of these new default configurations, they can easily mark this in their Security settings any time leading up to September 15, which will confirm that they want no changes on Training crawlers that also crawl for Search purposes. We’ll also continue to notify customers of the upcoming change to defaults as we approach September 15 to ensure that customers who want to choose settings different from the defaults have the opportunity to do so.
BotBase: a new visibility plane for Enterprise customers
We’re also excited to launch a major visibility update as a new feature of Enterprise Bot Management. As Cloudflare’s directory of tracked bots has grown, so has the desire to manage these bots in sensible groupings and to understand more detail about a particular bot.
Introducing BotBase. BotBase is our new database tracking all known bots, including Verified bots and agents. This database provides a comprehensive, searchable view of our entire directory of bots, directly on the Cloudflare dashboard. We’re tackling visibility first, but, later this year, we’ll expand BotBase to provide a direct control center for known automated content on your website.
With this new view, Enterprise Bot Management customers can see the full catalogue of all Verified bots/agents and where they are classified in this updated taxonomy — a view we’ve never shown dynamically on the Cloudflare dashboard before. Customers who want to precisely target a specific bot can also easily filter for all traffic from this bot, plus copy the detection ID to use in Security rules. All of this is now live within a dedicated page, which can be accessed through the Bot Management configuration card.
As we built BotBase, we wanted to account for all of the pieces of information that would allow us to build scalable, powerful insights from bot to bot. One of these pieces is a cornerstone for our updated taxonomy, which is based on what a bot may do on your site — its behavior. We separate these classifications as shared below, and each bot is classified with one or more of these behaviors.
| Bot classification | Behaviors and uses |
| --- | --- |
| ***Search*** | ***Crawling to scan your site to help it appear in search engine results*** |
| ***Agent*** | ***User-directed agents visiting a page on behalf of a human*** |
| ***Training*** | ***Crawling to train or fine-tune models*** |
| Transact | Checkout actions on behalf of users |
| Data Collection | Includes price scraping, competitive intelligence gathering, and third-party analytics |
| Security Testing | Includes vulnerability scanning and penetration testing |
| SEO | SEO crawling, site auditing, accessibility checks |
| Ads Verification | Ad placement verification, ad fraud detection |
| Social / Link Preview | Link previews for social platforms and messaging apps |
| Feed Fetching | Includes RSS readers, podcast aggregators, and news feed bots |
| Monitoring & Operations | Includes uptime monitoring, webhooks, and health checks |
Bold italicized rows indicate the new configurable options that are available to all customers.
How does a crawler use my content?
Another piece of information we’ve heard is important to our customers is a bot’s **content use: what a bot may keep and reshare after it has crawled your content.** To address this, we are building capabilities for Bot Management customers to select and block based on the “content use.” This setting can be set to one of three levels, from least to most permissive:
immediate: interact, but store and reuse nothing
reference (default): index, excerpt, and link back
full: summarize and reproduce
These values can be combined with bot classifications to express nuanced rules, such as “allow all bots that are used for Search, SEO, and Ads Verification, but only up to the `reference` use level.” This allows website owners to make decisions in sensible groupings rather than manage individual bot-by-bot rules.
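To make the combination of classification and content use concrete, here is a minimal sketch of the example rule quoted above. The names `ContentUse`, `ALLOWED_CLASSES`, `MAX_USE`, and `is_allowed` are hypothetical. It assumes that every classification a bot carries must be in the allowed set, consistent with the most-restrictive approach described earlier; the actual rule semantics may differ.

```python
from enum import IntEnum


class ContentUse(IntEnum):
    """Content-use levels, ordered from least to most permissive."""
    IMMEDIATE = 0  # interact, but store and reuse nothing
    REFERENCE = 1  # index, excerpt, and link back
    FULL = 2       # summarize and reproduce


# Hypothetical rule: allow Search, SEO, and Ads Verification up to "reference".
ALLOWED_CLASSES = {"Search", "SEO", "Ads Verification"}
MAX_USE = ContentUse.REFERENCE


def is_allowed(classifications: set[str], content_use: ContentUse) -> bool:
    """Allow a bot only if all of its classifications are allowed
    and its declared content use does not exceed the configured ceiling."""
    if not classifications:
        return False  # assumption: unclassified bots never match this rule
    return classifications <= ALLOWED_CLASSES and content_use <= MAX_USE


print(is_allowed({"Search"}, ContentUse.REFERENCE))              # True
print(is_allowed({"Search"}, ContentUse.FULL))                   # False
print(is_allowed({"Search", "Training"}, ContentUse.IMMEDIATE))  # False
```

Modeling the levels as an ordered enum lets a single comparison express "only up to this level."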
To further support this, starting today, we're testing a new signal, `use`, that extends Content Signals and lives in your robots.txt. It adds a fourth, optional field to the three fields of the first version of Content Signals, expressing the same preference as above:
use=immediate
use=reference
use=full
As with all other items listed in the robots.txt file, the values of content use signal a website owner’s preference, rather than issuing blocks directly. We’re now adding support for this extension: all customers who have already enabled managed robots.txt (which prepends to robots.txt the preference that crawling for search is okay, but that crawling for training is not) will now have the additional preference of `use=reference` added to their robots.txt.
User-agent: *
Content-Signal: search=yes,ai-train=no
Allow: /
*The contents of Cloudflare managed robots.txt with the original Content Signals values.*
User-agent: *
Content-Signal: search=yes,ai-train=no,use=reference
Allow: /
The contents of Cloudflare managed robots.txt with the added parameter.
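For crawler operators who want to read this preference, the following is a minimal sketch of parsing a `Content-Signal` line into key/value pairs. The function name `parse_content_signal` is hypothetical, and the sketch assumes the comma-separated `key=value` layout shown in the examples above rather than implementing a full robots.txt parser.

```python
def parse_content_signal(line: str) -> dict[str, str]:
    """Parse one 'Content-Signal: k=v,k=v' line into a dict of signals."""
    name, _, value = line.partition(":")
    if name.strip().lower() != "content-signal":
        raise ValueError("not a Content-Signal line")
    signals: dict[str, str] = {}
    for item in value.split(","):
        key, sep, val = item.partition("=")
        if sep:  # skip empty or malformed items that have no '='
            signals[key.strip()] = val.strip()
    return signals


signals = parse_content_signal(
    "Content-Signal: search=yes,ai-train=no,use=reference"
)
print(signals.get("use", "unspecified"))  # reference
```

Because the `use` field is optional, a reader of the file should handle its absence explicitly rather than assume a value.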
We’re also starting to track content uses for every bot in BotBase, and when we discover a bot abusing these signals, it will lose the “Verified” status, resulting in it no longer being allowed. Today, bots that reproduce in full cannot have the Verified status.
What does it mean for a bot to be Verified?
Speaking of “Verified,” the definition of Verified is being updated to reflect the upcoming changes to default allow and block baselines. Previously, all Verified bots were allowed by default, which was reflected in our basic Bot Fight Mode offering to block unwanted automatic traffic and in our rule templates for Enterprise Bot Management customers.
Starting today, we’re adjusting this to add nuance: non-verified bots are still default blocked, but we are no longer viewing Verified as “default allowed.” Now, the Verified label makes a bot allowable with its relevant category, meaning the allowed category (e.g., allowing Search) will determine what is allowed to access a website.
To balance this change, we’re opening up the process of becoming a Verified bot, and making it more transparent, too. To "Verify" a bot, a bot operator needs to show two things: that they represent themselves honestly, and that they don't abuse the access that honesty earns. And to make this easier on bot operators, we’re currently building management tools for bot operators to better ensure they are accurately represented by Cloudflare’s classification system (to be announced in the near future).
A preview screenshot of the upcoming platform built directly for bot operators who are part of or want to be a part of BotBase, the next generation of the Cloudflare Bots Directory.
Experimenting with transitive trust
One more piece: The bot (or agent) at your door increasingly isn't run by the company that built it. A platform like Cloudflare’s Developer Platform runs automations for thousands of different operators at once, ranging from enterprises to a developer you've never heard of. You might trust Stripe, but you don't necessarily trust everyone who wired Stripe's tools into a weekend project.
We call the case of (site owner → bot-owning company → end user) a matter of transitive trust, and we're proposing to use the existing Forwarded header, defined in RFC 7239, which rides along with the request and allows “proxy components to disclose information lost in the proxying process.”
This is similar to what X-Forwarded-For does for IP addresses, or X-Forwarded-Host does to preserve the original Host header. So when a website owner says, "Allow this operator," that preference will hold, whether the operator comes to you directly or through three layers of intermediaries that are trusted. More details can be found in our documentation, with a brief example to show the format below.
Forwarded: for="openai"
With the content-use extension discussed above added, the header would look something like the following, specifying how the operator says it will use the content it accesses:
Forwarded: for="openai";use="reference"
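On the receiving side, a site could read these parameters with a small parser. The sketch below is illustrative only: `parse_forwarded` is a hypothetical helper, `use` is the proposed extension parameter described above, and the parsing is deliberately simplified. It splits hops on commas and parameters on semicolons, as in RFC 7239, but it does not handle commas, semicolons, or escaped characters inside quoted strings.

```python
def parse_forwarded(value: str) -> list[dict[str, str]]:
    """Parse a Forwarded header value into one dict of parameters per hop."""
    hops: list[dict[str, str]] = []
    for element in value.split(","):       # one element per intermediary
        params: dict[str, str] = {}
        for pair in element.split(";"):    # parameters within one element
            key, sep, val = pair.partition("=")
            if sep:
                # Parameter names are case-insensitive; drop surrounding quotes.
                params[key.strip().lower()] = val.strip().strip('"')
        if params:
            hops.append(params)
    return hops


hops = parse_forwarded('for="openai";use="reference"')
print(hops)  # [{'for': 'openai', 'use': 'reference'}]
```

A header that passes through several intermediaries yields one entry per hop, which is where a per-hop trust check would apply.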
This also lines up the incentive model we want to foster. Losing trusted status across the more than 20% of web domains that sit behind Cloudflare is a deterrent with teeth. Trust becomes something you can carry with you, and something you can lose.
However, as bot traffic blends with human traffic, it’s possible that this system of transitive trust doesn’t carry beyond the users who can afford to be identifiable. The measures we are proposing today help to convey trust, but they won’t fit the entire web for all time. Small sources of traffic need privacy, and companies that want to preserve their own privacy commitments should be able to explore fair building blocks for the future of an agentic Internet, such as private rate limiting.
These are small changes that move in the same direction: site owners get more control over who uses their content, and how. We believe the new defaults we discussed today and will soon implement are ones that encourage transparency and are more reflective of where the world is going.
Of course, the ebbs and flows of the web will continue shifting under us, and we'll keep adjusting with them. But the direction won't change, because it's the one Cloudflare started with: a web ecosystem built around trust, one where the people who make things can decide how they're used, and one where being honest about what you do earns you more access, not less.
These new options to manage AI traffic are live now, and can be configured by all existing customers in their zone Settings. Not on Cloudflare yet? Start for free to set the traffic controls that you want today.
Happy Content Independence Day.