Part of the Agent Readiness course — the web standards that decide whether an AI agent can read, understand, and act on your site. Measure any page with the Core Agent Vitals analyzer.
What it is
robots.txt is a plain-text file at your site root (/robots.txt) that tells automated clients which paths they may fetch. It has been the crawler contract for search engines for 30 years. What changed: the clients now include AI crawlers (GPTBot, ClaudeBot, PerplexityBot, Google-Extended, CCBot, and others) that gather the content models cite when a user asks about your product, docs, or brand.
Why agents need it
A well-behaved AI crawler reads robots.txt before it fetches anything else. If your rules disallow it, it leaves, and your content never enters the corpus the model draws on. The failure is silent: no error, no warning, just absence. You don't rank zero; you don't exist in the answer.
Two common ways this happens by accident:
- A blanket Disallow: / left over from a staging config.
- An allowlist written for Googlebot that never added the AI user-agents, so they fall through to a restrictive * rule.
Getting this right is the cheapest, highest-leverage agent-readiness fix there is.
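Both accidents can be reproduced offline with Python's standard-library urllib.robotparser. This is a minimal sketch: the two rule sets and the /docs/ path are made-up examples, the helper name allowed is hypothetical, and Python's parser may differ in edge cases from the matching logic a given crawler uses.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Return True if user_agent may fetch path under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


# Accident 1 (hypothetical file): blanket rule left over from staging.
STAGING_LEFTOVER = """\
User-agent: *
Disallow: /
"""

# Accident 2 (hypothetical file): allowlist written for Googlebot only.
GOOGLEBOT_ONLY = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""

for bot in ("Googlebot", "GPTBot", "ClaudeBot"):
    print(
        bot,
        allowed(STAGING_LEFTOVER, bot, "/docs/"),
        allowed(GOOGLEBOT_ONLY, bot, "/docs/"),
    )
```

Under the staging leftover every bot is refused. Under the Googlebot-only allowlist, Googlebot gets in while GPTBot and ClaudeBot fall through to the * group and are refused, with nothing in your logs to say so.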
How to implement
Allow reputable AI crawlers on public content, block only what's genuinely private, and point them at your sitemap:
# One shared group: a crawler that matches a named group ignores the * group,
# so the private-path rules must sit in the same group as the named agents.
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: PerplexityBot
User-agent: Google-Extended
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /cart/
Disallow: /account/

Sitemap: https://your-site.com/sitemap.xml
Decide deliberately whether you want to be in training/answer corpora. Blocking GPTBot is a valid business choice; just make it a choice, not an accident.
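One way to keep the decision explicit is to generate the file from a per-bot policy instead of editing it by hand. The sketch below is an illustration, not a prescribed tool: render_robots and its parameters are hypothetical names, and the allow/block split in the usage call is an example decision only. It puts every allowed agent in one group with the private-path rules, because under the Robots Exclusion Protocol (RFC 9309) a crawler that finds a group naming it does not also apply the * group.

```python
def render_robots(allowed_bots, blocked_bots, private_paths, sitemap_url):
    """Build robots.txt text from an explicit allow/block decision per bot."""
    lines = []
    # One shared group for every allowed agent plus the catch-all, so the
    # private-path rules apply to all of them.
    for bot in [*allowed_bots, "*"]:
        lines.append(f"User-agent: {bot}")
    lines.append("Allow: /")
    lines.extend(f"Disallow: {path}" for path in private_paths)

    # Bots you have decided to block get their own group.
    if blocked_bots:
        lines.append("")
        lines.extend(f"User-agent: {bot}" for bot in blocked_bots)
        lines.append("Disallow: /")

    lines.append("")
    lines.append(f"Sitemap: {sitemap_url}")
    return "\n".join(lines) + "\n"


print(render_robots(
    allowed_bots=["ClaudeBot", "PerplexityBot", "Google-Extended"],
    blocked_bots=["GPTBot"],  # example decision only, not a recommendation
    private_paths=["/admin/", "/cart/", "/account/"],
    sitemap_url="https://your-site.com/sitemap.xml",
))
```

Keeping the policy in version control next to this kind of generator means a blocked bot shows up as a reviewed line in a diff, not as a surprise in production.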
Validate
curl -s https://your-site.com/robots.txt
Confirm the AI user-agents you care about are allowed and no stray Disallow: / applies to them. The Core Agent Vitals analyzer runs this check under Agent Discoverability: it parses your rules and flags any major AI bot that's blocked from public content.
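The same check can be approximated locally. The following sketch feeds robots.txt text (for example, the saved output of the curl command) to Python's urllib.robotparser and lists which AI user-agents are refused on a public path. The function name blocked_ai_bots, the bot list, and the sample file are assumptions for illustration; this is not how the Core Agent Vitals analyzer is implemented.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens named in this lesson; extend the tuple as needed.
AI_BOTS = ("GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended", "CCBot")


def blocked_ai_bots(robots_txt, public_path="/", bots=AI_BOTS):
    """Return the bots from `bots` that may not fetch public_path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in bots if not parser.can_fetch(bot, public_path)]


# Hypothetical file contents: CCBot is blocked on purpose, others are not.
sample = """\
User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /admin/
"""

print(blocked_ai_bots(sample))  # ['CCBot']
```

An empty list means none of the listed bots is refused on that path. Anything else should match a decision you actually made.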
Common mistakes
- Treating robots.txt as security. It's an advisory. Well-behaved bots honor it; nothing enforces it. Never put "secret" URLs behind a Disallow: you're just publishing their location.
- A stale Disallow: /. The single most common cause of total agent invisibility. Check it whenever you promote to a new environment.
- Allowlisting only Googlebot. New AI user-agents ship constantly. Either allow * for public content or keep the named-bot list current.
- Blocking your own assets. Disallowing /js/ or /api/ can stop a rendering crawler from seeing content that only appears after those load.
- No Sitemap: line. robots.txt is the canonical place to advertise your sitemap; omitting it makes agents work harder to find your deep pages (next lesson).
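Several of these mistakes can be caught with a simple line scan before a deploy. The sketch below is a rough lint, not a full parser: lint_robots and asset_prefixes are hypothetical names, the scan ignores which user-agent group a rule belongs to, and the /js/ and /api/ prefixes are just the examples from the list above.

```python
def lint_robots(robots_txt, asset_prefixes=("/js/", "/api/")):
    """Return warnings for a few common robots.txt mistakes (group-agnostic)."""
    disallowed, has_sitemap = [], False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "disallow" and value:
            disallowed.append(value)
        elif field == "sitemap" and value:
            has_sitemap = True

    warnings = []
    if "/" in disallowed:
        warnings.append("blanket 'Disallow: /' present; confirm it is intentional")
    for path in disallowed:
        # Normalise so "/js" and "/js/app.js" both match the "/js/" prefix,
        # while "/json/" does not.
        normalised = path.rstrip("/") + "/"
        if any(normalised.startswith(prefix) for prefix in asset_prefixes):
            warnings.append(f"asset path disallowed: {path}")
    if not has_sitemap:
        warnings.append("no Sitemap: line")
    return warnings


print(lint_robots("User-agent: *\nDisallow: /js/\nDisallow: /admin/\n"))
```

For that input the scan reports the disallowed /js/ path and the missing Sitemap: line. It also makes the first mistake concrete: every path collected in disallowed is readable by anyone who fetches the file.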
Next: Sitemaps for Agent Discovery — the table of contents that gets your deep pages into agent answers.