llms.txt vs robots.txt vs ai.txt: The Developer's Cheat Sheet

The article clarifies the distinct purposes of three files used to manage AI crawler access to websites: **robots.txt** controls which pages crawlers like Googlebot and GPTBot can access; **llms.txt** provides structured context and documentation that helps AI systems understand a site's content; and **ai.txt** is a newer proposal for declaring how AI assistants may use a site's content when reading it in real time. The author recommends using all three for complete coverage, but estimates that combining robots.txt with llms.txt provides about 90% of the value for most developers today.

Every Next.js developer building a public site in the last 18 months has hit the same wall: you Google "how to control what AI crawlers read" and get three different answers pointing at three different files: `robots.txt`, `llms.txt`, and `ai.txt`.

They are not the same thing. They do not talk to the same audience. And using the wrong one, or none at all, means AI search engines are either ignoring your content entirely or indexing pages you never intended them to.

This is the one-stop breakdown I wish existed when I was figuring this out.

## The three files at a glance

| | robots.txt | llms.txt | ai.txt |
|---|---|---|---|
| Proposed by | Martijn Koster (1994) | Answer.AI + community | AI-txt.com initiative |
| Primary audience | Web crawlers (Googlebot, Bingbot, etc.) | LLM training & AI search crawlers | AI assistants (ChatGPT, Claude, Gemini) |
| Format | Key-value directives | Markdown / structured text | Key-value + JSON blocks |
| Spec status | RFC standard (RFC 9309, 2022), universally supported | Emerging, growing adoption | Early proposal, limited adoption |
| Honored by | All major search engines | Anthropic, Perplexity, some others | No major adopter yet |
| Location | yourdomain.com/robots.txt | yourdomain.com/llms.txt | yourdomain.com/ai.txt |

**The short mental model:** `robots.txt` is for Googlebot. `llms.txt` is for ClaudeBot and GPTBot when they are building knowledge, not just indexing. `ai.txt` is a newer proposal that tries to cover AI assistants reading your site in real time. Use all three if you want complete coverage, but `robots.txt` + `llms.txt` is where you get 90% of the value today.

## robots.txt: the original gatekeeper

`robots.txt` has been around since 1994. Every well-behaved crawler (Googlebot, Bingbot, DuckDuckBot, GPTBot, ClaudeBot, PerplexityBot) checks it before crawling. Compliance is voluntary, but the major crawlers document that they honor it. If you block GPTBot in `robots.txt`, it will not crawl your site for training data or AI-search indexing.
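To make "checks it before crawling" concrete, here is a minimal TypeScript sketch of how a compliant crawler might evaluate a `robots.txt` file. It is an illustration under stated assumptions, not any vendor's implementation: `parseRobots` and `isAllowed` are hypothetical helper names, user-agent matching is simplified to an exact, case-insensitive token match, and path wildcards (`*`, `$`) are not handled. It follows the usual conventions that a named group replaces the `*` group, the longest matching path wins, and `Allow` wins ties.

```typescript
// Illustrative sketch only: parseRobots and isAllowed are hypothetical helpers,
// not part of Next.js or of any crawler's real implementation.
type Rule = { allow: boolean; path: string };
type Group = { agents: string[]; rules: Rule[] };

function parseRobots(text: string): Group[] {
  const groups: Group[] = [];
  let current: Group | null = null;
  let lastWasAgent = false;

  for (const raw of text.split(/\r?\n/)) {
    const line = raw.split('#')[0].trim(); // drop comments
    const idx = line.indexOf(':');
    if (idx === -1) continue;
    const key = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();

    if (key === 'user-agent') {
      // Consecutive User-agent lines share one group; otherwise start a new one.
      if (current === null || !lastWasAgent) {
        current = { agents: [], rules: [] };
        groups.push(current);
      }
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else if (key === 'allow' || key === 'disallow') {
      // An empty value (e.g. "Disallow:") restricts nothing, so it is skipped.
      if (current !== null && value !== '') {
        current.rules.push({ allow: key === 'allow', path: value });
      }
      lastWasAgent = false;
    }
  }
  return groups;
}

function isAllowed(robotsTxt: string, userAgent: string, path: string): boolean {
  const groups = parseRobots(robotsTxt);
  const ua = userAgent.toLowerCase();

  // A crawler that has its own group ignores the "*" group entirely.
  const named = groups.filter((g) => g.agents.some((a) => a === ua));
  const chosen =
    named.length > 0 ? named : groups.filter((g) => g.agents.some((a) => a === '*'));

  let best: Rule | null = null;
  for (const group of chosen) {
    for (const rule of group.rules) {
      if (!path.startsWith(rule.path)) continue;
      const longer = best === null || rule.path.length > best.path.length;
      const tieAllow =
        best !== null && rule.path.length === best.path.length && rule.allow;
      if (longer || tieAllow) best = rule;
    }
  }
  return best === null ? true : best.allow; // no matching rule means allowed
}

const sample = [
  'User-agent: *',
  'Disallow: /admin/',
  '',
  'User-agent: GPTBot',
  'Disallow: /',
  '',
  'User-agent: ClaudeBot',
  'Allow: /blog/',
  'Disallow: /',
].join('\n');

console.log(isAllowed(sample, 'GPTBot', '/blog/post'));     // false
console.log(isAllowed(sample, 'ClaudeBot', '/blog/post'));  // true
console.log(isAllowed(sample, 'Googlebot', '/admin/users')); // false
```

The longest-match rule is why `Allow: /blog/` beats `Disallow: /` for ClaudeBot: the more specific path takes precedence regardless of the order the lines appear in.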
Basic syntax:

```
User-agent: *
Disallow: /admin/
Disallow: /private/

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Allow: /blog/
Disallow: /
```

`User-agent: *` applies to all crawlers. A named user-agent group overrides it for that specific bot: a crawler that finds its own group follows only that group and ignores the `*` rules. `Allow` and `Disallow` are path-based. The original spec had no wildcards, though most modern crawlers support them.

In Next.js App Router, generate it dynamically from `app/robots.ts`:

```typescript
// app/robots.ts
import type { MetadataRoute } from 'next'

export default function robots(): MetadataRoute.Robots {
  return {
    rules: [
      {
        userAgent: '*',
        allow: '/',
        disallow: ['/admin/', '/api/private/'],
      },
      {
        userAgent: 'GPTBot',
        allow: ['/blog/', '/services/'],
        disallow: ['/'],
      },
    ],
    sitemap: 'https://yourdomain.com/sitemap.xml',
  }
}
```

Next.js renders this as `text/plain` at `/robots.txt` automatically. No separate file needed.

**What robots.txt does NOT do:** it is not access control and it is not retroactive. It asks compliant crawlers not to fetch the listed paths; it does not remove copies fetched before the rule existed, and a disallowed URL can still become known through backlinks or sitemaps. If GPTBot fetched your `/admin/` page before you added the rule, it may already have cached it. Anything genuinely private needs authentication, not a `Disallow` line.

## llms.txt: built for the AI era

`llms.txt` was proposed by Answer.AI and picked up by Anthropic, Perplexity, and others as a structured way to tell LLMs what your site is actually about: not just what they can crawl, but what context they should carry when reasoning about your content.

Unlike `robots.txt`, which is access control, `llms.txt` is documentation. Think of it as a README for your site aimed at language models.

Basic structure:

```markdown
# YourSite

> One-line description of what this site is and who it's for.

A few sentences of context. What does this site do? Who is the author? What should an LLM understand before citing any page from this domain?

## Blog

- [Post title one](https://yourdomain.com/blog/post-one): One-line summary.
- [Post title two](https://yourdomain.com/blog/post-two): One-line summary.

## Services

- [Service name](https://yourdomain.com/services/name): What this service does in one line.

## Contact

- Author: Your Name
- Email: you@yourdomain.com
- LinkedIn: linkedin.com/in/yourhandle
```

The format is intentionally plain Markdown. No special parser needed; any LLM can read it. The `/llms.txt` path is the convention; some sites also serve `/llms-full.txt` with deeper content for models that want more context.

In Next.js, generate it dynamically from `app/llms.txt/route.ts`:

```typescript
// app/llms.txt/route.ts
import { blogPosts } from '@/data/blog-posts'

export async function GET() {
  const lines = [
    '# YourSite',
    '',
    '> AI-search-ready Next.js development and SEO consulting.',
    '',
    'This site covers AI engineering, GEO/AEO, and production Next.js patterns.',
    '',
    '## Blog',
    '',
    // One Markdown link per post, followed by its one-line summary
    ...blogPosts.map(
      (p) => `- [${p.title}](https://yourdomain.com/blog/${p.slug}): ${p.excerpt}`
    ),
    '',
    '## Services',
    '',
    '- [AI-Search Consulting](https://yourdomain.com/services): End-to-end GEO and AEO for Next.js sites.',
  ]

  return new Response(lines.join('\n'), {
    headers: { 'Content-Type': 'text/plain; charset=utf-8' },
  })
}
```

This keeps `llms.txt` in sync with your actual content automatically, with no manual updates.

**Who reads llms.txt today:** ClaudeBot (Anthropic's crawler), PerplexityBot, some versions of GPTBot. Adoption is growing fast. If you are publishing content you want AI search engines to cite accurately, this file is non-negotiable.

## ai.txt: the wildcard

`ai.txt` is a newer proposal from a different working group. Where `llms.txt` focuses on what an LLM should know about your site, `ai.txt` focuses on granting or denying permission for AI assistants to use your content in responses.

Basic syntax:

```
# ai.txt
Version: 1.0

[permissions]
allow: true
commercial-use: false
training: false
real-time-access: true

[attribution]
require: true
format: "Source: {title} ({url})"

[contact]
email: you@yourdomain.com
```

**Honest assessment:** `ai.txt` has minimal support right now. No major AI company officially reads it. The spec is still evolving.
That said, if the initiative gains traction (similar to how `robots.txt` went from informal convention to de-facto standard), having it early costs nothing and signals intent. For most developers today: add it, keep it simple, and do not spend more than 10 minutes on it.

## How a crawler actually decides what to read

The flow for a modern AI crawler like GPTBot or ClaudeBot hitting your domain looks roughly like this:

1. Fetch `robots.txt`: am I allowed to crawl this path?
2. If allowed, fetch the page HTML
3. Fetch `llms.txt` (periodically, not per request): what is this site actually about?
4. Check `ai.txt` if the implementation supports it
5. Index the content with the context from steps 2–4 combined

The key insight: `robots.txt` is checked **per crawl request**. `llms.txt` is fetched periodically and cached, so it shapes how the model understands your whole site over time, not just whether it can read one page.

## Putting it all together in Next.js App Router

Here is the complete implementation for a Next.js 15 App Router site:

```
app/
├── robots.ts          ← generates /robots.txt
├── sitemap.ts         ← generates /sitemap.xml
├── llms.txt/
│   └── route.ts       ← generates /llms.txt (dynamic, always in sync)
public/
└── ai.txt             ← static file, update manually
```

`robots.ts` (full version with AI crawler rules):

```typescript
// app/robots.ts
import type { MetadataRoute } from 'next'

const BASE_URL = process.env.NEXT_PUBLIC_SITE_URL ?? 'https://yourdomain.com'

export default function robots(): MetadataRoute.Robots {
  return {
    rules: [
      // Default: allow everything
      { userAgent: '*', allow: '/' },
      // GPTBot: allow blog, services, and tools; block admin and API routes.
      // (To block everything else too, use disallow: ['/'] as in the Google-Extended rule.)
      {
        userAgent: 'GPTBot',
        allow: ['/blog/', '/services/', '/tools/'],
        disallow: ['/admin/', '/api/'],
      },
      // ClaudeBot: same rules
      {
        userAgent: 'ClaudeBot',
        allow: ['/blog/', '/services/', '/tools/'],
        disallow: ['/admin/', '/api/'],
      },
      // PerplexityBot: full access (it drives meaningful referral traffic)
      { userAgent: 'PerplexityBot', allow: '/' },
      // Google-Extended (used for Gemini training): restrict to blog only
      {
        userAgent: 'Google-Extended',
        allow: ['/blog/'],
        disallow: ['/'],
      },
    ],
    sitemap: `${BASE_URL}/sitemap.xml`,
  }
}
```

`llms.txt/route.ts` (dynamic, pulls from your data layer):

```typescript
// app/llms.txt/route.ts
import { blogPosts } from '@/data/blog-posts'
import { servicePosts } from '@/data/services'

const BASE_URL = process.env.NEXT_PUBLIC_SITE_URL ?? 'https://yourdomain.com'

export async function GET() {
  const content = `# YourSite

> One-sentence description of your site and its purpose.

Two or three sentences giving an LLM the context it needs to cite your site accurately. What topics do you cover? Who is the author? What makes this site's perspective unique?

## Blog

${blogPosts.map((p) => `- [${p.title}](${BASE_URL}/blog/${p.slug}): ${p.excerpt}`).join('\n')}

## Services

${servicePosts.map((s) => `- [${s.title}](${BASE_URL}/services/${s.slug}): ${s.summary}`).join('\n')}

## Author

- Name: Your Name
- Site: ${BASE_URL}
- Expertise: Your primary expertise areas
`

  return new Response(content, {
    headers: {
      'Content-Type': 'text/plain; charset=utf-8',
      // Cache for an hour; serve stale for up to a day while revalidating
      'Cache-Control': 'public, max-age=3600, stale-while-revalidate=86400',
    },
  })
}
```

`public/ai.txt` (static, update when the spec stabilizes):

```
# ai.txt
Version: 1.0

[permissions]
allow: true
commercial-use: false
training: false
real-time-access: true

[attribution]
require: true

[contact]
site: https://yourdomain.com
```

## Which file do you actually need?
Start here:

- Building a public Next.js site? → Add `robots.txt` first. Always.
- Want AI search engines (ChatGPT, Perplexity, Claude) to cite your content accurately? → Add `llms.txt`. This is the highest-leverage file for AI-search visibility right now.
- Want to future-proof against the `ai.txt` spec gaining enforcement? → Add a simple `public/ai.txt`. It costs 10 lines.
- Want to block AI crawlers from training on your content? → Set `Disallow: /` for `GPTBot`, `ClaudeBot`, and `Google-Extended` in `robots.txt`. This is the only one of the three files that the major AI crawlers say they honor today; none of them is technically enforced.

The one mistake I see most often: developers add `robots.txt` but skip `llms.txt`, then wonder why ChatGPT gives wrong answers about their site even though Googlebot indexes it fine. Googlebot and GPTBot-for-knowledge are completely different crawlers with different purposes.

## Three things to verify right now

Open your terminal and check these three URLs on your live site:

```bash
curl https://yourdomain.com/robots.txt
curl https://yourdomain.com/llms.txt
curl https://yourdomain.com/ai.txt
```

If any returns a 404 or an HTML error page, that file is missing. For `robots.txt`, a missing file means all crawlers assume full access, which is usually fine for public sites, but you lose granular control. For `llms.txt`, missing means LLMs are forming their understanding of your site from raw page HTML with no structured context, which makes inaccurate citations more likely.

If you want a deeper look at how AI crawlers read Next.js sites specifically (what RSC payloads they fetch, how streaming affects what they see, and which metadata fields they actually use), I have a [longer writeup on the AI-search architecture patterns I use in production](https://mudassirkhan.me/blog) that goes further than this cheat sheet. And if you want this wired up on your own site end-to-end, that is [exactly the kind of work I take on](https://mudassirkhan.me/services).
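The three-URL check above can also be scripted, which is handy in CI or after a deploy. The sketch below is a minimal TypeScript version under a few assumptions: `classifyResponse` and `checkSite` are hypothetical helper names, the runtime provides a global `fetch` (Node 18+ or similar), and "HTML error page" is approximated by a `text/html` content type. It is a starting point rather than a complete validator, since it does not inspect the file contents.

```typescript
// Illustrative sketch: classifyResponse and checkSite are hypothetical helpers.
type FileCheck = { path: string; ok: boolean; reason: string };

// Pure decision logic: a file counts as present only if it returns a 2xx
// status and is not served as HTML (a common sign of an error/fallback page).
function classifyResponse(
  path: string,
  status: number,
  contentType: string | null
): FileCheck {
  if (status === 404) {
    return { path, ok: false, reason: 'missing (404)' };
  }
  if (status < 200 || status >= 300) {
    return { path, ok: false, reason: `unexpected status ${status}` };
  }
  if ((contentType ?? '').toLowerCase().includes('text/html')) {
    return { path, ok: false, reason: 'served as HTML, likely an error or fallback page' };
  }
  return { path, ok: true, reason: 'ok' };
}

// Fetches the three files from the given origin and classifies each response.
async function checkSite(origin: string): Promise<FileCheck[]> {
  const paths = ['/robots.txt', '/llms.txt', '/ai.txt'];
  return Promise.all(
    paths.map(async (path) => {
      const res = await fetch(new URL(path, origin).toString());
      return classifyResponse(path, res.status, res.headers.get('content-type'));
    })
  );
}

// Usage (performs real network requests, so it is left commented out here):
// checkSite('https://yourdomain.com').then((results) => console.table(results));
```

Keeping the classification separate from the network call makes the rules easy to unit-test and to adjust, for example if your host serves a custom 404 with a non-HTML content type.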
If your own llms.txt or robots.txt setup looks different from what I showed here — especially if you are on an older Next.js version or using the Pages Router — drop it in the comments. Curious what variations people are running in production.