# AI-Aware robots.txt: Let the Right Agents In

> Source: <https://blog.r-lopes.com/posts/agent-readiness-robots-txt>
> Published: 2026-07-02 14:00:00+00:00

*Part of the Agent Readiness course — the web standards that decide whether an AI agent can read, understand, and act on your site. Measure any page with the Core Agent Vitals analyzer.*

## What it is

`robots.txt` is a plain-text file at your site root (`/robots.txt`) that tells automated clients which paths they may fetch. It's been the crawler contract for search engines for 30 years. What changed: the clients now include **AI crawlers** (`GPTBot`, `ClaudeBot`, `PerplexityBot`, `Google-Extended`, `CCBot`, and others) that gather the content models cite when a user asks about your product, docs, or brand.

## Why agents need it

A well-behaved AI crawler reads `robots.txt` **before** it fetches anything else. If your rules disallow it, it leaves, and your content never enters the corpus the model draws on. The failure is silent: no error, no warning, just absence. You don't rank zero; you don't exist in the answer.

Two common ways this happens by accident:

- A blanket `Disallow: /` left over from a staging config.
- An allowlist written for `Googlebot` that never added the AI user-agents, so they fall through to a restrictive `*` rule.
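
The second case is easy to reproduce locally. The sketch below uses Python's standard-library `urllib.robotparser` against a hypothetical Googlebot-only allowlist; the rules and the `your-site.com` URL are illustrative, not taken from a real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical allowlist written with only Googlebot in mind.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://your-site.com/docs/"
for agent in ("Googlebot", "GPTBot", "ClaudeBot"):
    # Agents without a named group fall through to the * group.
    verdict = "allowed" if parser.can_fetch(agent, url) else "blocked"
    print(f"{agent}: {verdict}")
```

Only `Googlebot` matches a named group here; the AI user-agents land on the `*` group and are turned away.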

Getting this right is the cheapest, highest-leverage agent-readiness fix there is.

## How to implement

Allow reputable AI crawlers on public content, block only what's genuinely private, and point them at your sitemap:

```
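# A crawler that matches a named group ignores the * group below,
# so the private-path rules are repeated here
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: PerplexityBot
User-agent: Google-Extended
Allow: /
Disallow: /admin/
Disallow: /cart/
Disallow: /account/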

# Everyone else: public content ok, keep private areas out
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /cart/
Disallow: /account/

Sitemap: https://your-site.com/sitemap.xml
```

Decide deliberately whether you *want* to be in training/answer corpora. Blocking `GPTBot` is a valid business choice; just make it a choice, not an accident.

## Validate

```
curl -s https://your-site.com/robots.txt
```

Confirm the AI user-agents you care about are allowed and no stray `Disallow: /` applies to them. The [Core Agent Vitals analyzer](https://agentvitals.dev/analyze) runs this check under **Agent Discoverability**: it parses your rules and flags any major AI bot that's blocked from public content.
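
For a scripted version of the manual check, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The helper name `blocked_agents` and the `AI_AGENTS` tuple are hypothetical; the tuple simply repeats the crawlers named in this lesson. One caveat: Python's parser applies the first matching rule in file order, which can differ from the longest-match precedence described in RFC 9309 when a group mixes `Allow` and `Disallow`, so treat the result as a rough check rather than a verdict.

```python
from urllib.robotparser import RobotFileParser

# Illustrative list: the crawlers named in this lesson. Extend as needed.
AI_AGENTS = ("GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended", "CCBot")


def blocked_agents(robots_txt: str, url: str, agents=AI_AGENTS) -> list[str]:
    """Return the agents that this robots.txt text forbids from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


# A stale staging config blocks every crawler, AI or not.
staging = "User-agent: *\nDisallow: /\n"
print(blocked_agents(staging, "https://your-site.com/"))

# A deliberate choice: only GPTBot is kept out of public content.
selective = "User-agent: GPTBot\nDisallow: /\n\nUser-agent: *\nDisallow: /admin/\n"
print(blocked_agents(selective, "https://your-site.com/"))
```

To check a live site instead of a string, `RobotFileParser` also offers `set_url()` followed by `read()`, which fetches and parses the file in one step.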

## Common mistakes

- **Treating robots.txt as security.** It's advisory. Well-behaved bots honor it; nothing enforces it. Never put "secret" URLs behind a `Disallow`: you're just publishing their location.
- **A stale `Disallow: /`.** The single most common cause of total agent invisibility. Check it whenever you promote to a new environment.
- **Allowlisting only `Googlebot`.** New AI user-agents ship constantly. Either allow `*` for public content or keep the named-bot list current.
- **Blocking your own assets.** Disallowing `/js/` or `/api/` can stop a rendering crawler from seeing content that only appears after those load.
- **No `Sitemap:` line.** robots.txt is the canonical place to advertise your sitemap; omitting it makes agents work harder to find your deep pages (next lesson).
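
The last mistake is also straightforward to test for. A small sketch, assuming Python 3.8+ (where `RobotFileParser.site_maps()` is available); the helper name `advertised_sitemaps` is hypothetical:

```python
from urllib.robotparser import RobotFileParser


def advertised_sitemaps(robots_txt: str) -> list[str]:
    """Return the URLs from any Sitemap: lines, or an empty list if there are none."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None, not [], when the file has no Sitemap: line.
    return parser.site_maps() or []


with_sitemap = "User-agent: *\nDisallow: /admin/\n\nSitemap: https://your-site.com/sitemap.xml\n"
without_sitemap = "User-agent: *\nDisallow: /admin/\n"

print(advertised_sitemaps(with_sitemap))
print(advertised_sitemaps(without_sitemap))
```

An empty result is a cue to add the `Sitemap:` line before moving on to the next lesson.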

*Next: Sitemaps for Agent Discovery — the table of contents that gets your deep pages into agent answers.*
