{"slug": "llms-txt-vs-robots-txt-vs-ai-txt-the-developer-s-cheat-sheet", "title": "llms.txt vs robots.txt vs ai.txt: The Developer's Cheat Sheet", "summary": "The article clarifies the distinct purposes of three files used to manage AI crawler access to websites: **robots.txt** controls which pages crawlers like Googlebot and GPTBot can access; **llms.txt** provides structured context and documentation for AI models like ClaudeBot to understand a site's content; and **ai.txt** is a newer proposal for real-time AI assistant reading. The author recommends using all three for complete coverage, but notes that combining robots.txt with llms.txt provides 90% of the value for most developers today.", "body_md": "# llms.txt vs robots.txt vs ai.txt: The Developer's Cheat Sheet\n\nEvery Next.js developer building a public site in the last 18 months has hit the same wall: you Google \"how to control what AI crawlers read\" and get three different answers pointing at three different files: `robots.txt`, `llms.txt`, and `ai.txt`. They are not the same thing. They do not talk to the same audience. And using the wrong one (or none at all) means AI search engines are either ignoring your content entirely or indexing pages you never intended them to.\n\nThis is the one-stop breakdown I wish existed when I was figuring this out.\n\n## The three files at a glance\n\n| | `robots.txt` | `llms.txt` | `ai.txt` |\n|---|---|---|---|\n| Proposed by | Martijn Koster (1994) | Answer.AI + community | AI-txt.com initiative |\n| Primary audience | Web crawlers (Googlebot, Bingbot, etc.) | LLM training & AI search crawlers | AI assistants (ChatGPT, Claude, Gemini) |\n| Format | Key-value directives | Markdown / structured text | Key-value + JSON blocks |\n| Spec status | Standardized as RFC 9309, universally supported | Emerging, growing adoption | Early proposal, limited adoption |\n| Enforced by | All major search engines | Anthropic, Perplexity, some others | No major enforcer yet |\n| Location | `yourdomain.com/robots.txt` | `yourdomain.com/llms.txt` | `yourdomain.com/ai.txt` |\n\nThe short mental model: `robots.txt` is for Googlebot. `llms.txt` is for ClaudeBot and GPTBot when they are building knowledge, not just indexing. `ai.txt` is a newer proposal that tries to cover AI assistants reading your site in real time. Use all three if you want complete coverage, but `robots.txt` + `llms.txt` is where you get 90% of the value today.\n\n## robots.txt: the original gatekeeper\n\n`robots.txt` has been around since 1994. Well-behaved crawlers (Googlebot, Bingbot, DuckDuckBot, GPTBot, ClaudeBot, PerplexityBot) check it before crawling. If you block `GPTBot` in `robots.txt`, a compliant GPTBot will not crawl your site for training data or AI-search indexing.\n\n**Basic syntax:**\n\n```\nUser-agent: *\nDisallow: /admin/\nDisallow: /private/\n\nUser-agent: GPTBot\nDisallow: /\n\nUser-agent: ClaudeBot\nAllow: /blog/\nDisallow: /\n```\n\n`User-agent: *` applies to all crawlers. A group that names a specific user-agent replaces the `*` group for that bot rather than adding to it. 
`Allow` and `Disallow` are path-based. The original 1994 convention had no wildcards, though most modern crawlers support `*` and `$` patterns.\n\n**In Next.js App Router**, generate it dynamically from `app/robots.ts`:\n\n``` ts\n// app/robots.ts\nimport type { MetadataRoute } from 'next'\n\nexport default function robots(): MetadataRoute.Robots {\n  return {\n    rules: [\n      {\n        userAgent: '*',\n        allow: '/',\n        disallow: ['/admin/', '/api/private/'],\n      },\n      {\n        userAgent: 'GPTBot',\n        allow: ['/blog/', '/services/'],\n        disallow: ['/'],\n      },\n    ],\n    sitemap: 'https://yourdomain.com/sitemap.xml',\n  }\n}\n```\n\nNext.js renders this as `text/plain` at `/robots.txt` automatically. No separate file needed.\n\n**What robots.txt does NOT do:** it is not an enforcement mechanism. Compliance is voluntary, so it only restrains crawlers that choose to honor it, and it does not remove anything a crawler fetched before the rule existed. A disallowed URL can also still be discovered through backlinks or sitemaps and listed without its content. If GPTBot crawled your `/admin/` page before you disallowed it, a copy may already be cached. Protect private routes with authentication, not `robots.txt`.\n\n## llms.txt: built for the AI era\n\n`llms.txt` was proposed by Answer.AI and picked up by Anthropic, Perplexity, and others as a structured way to tell LLMs *what your site is actually about*: not just what they can crawl, but what context they should carry when reasoning about your content.\n\nUnlike `robots.txt`, which governs crawl access, `llms.txt` is documentation. Think of it as a README for your site aimed at language models.\n\n**Basic structure:**\n\n```\n# YourSite\n\n> One-line description of what this site is and who it's for.\n\nA few sentences of context. What does this site do? 
Who is the author?\nWhat should an LLM understand before citing any page from this domain?\n\n## Blog\n\n- [Post title one](https://yourdomain.com/blog/post-one): One-line summary.\n- [Post title two](https://yourdomain.com/blog/post-two): One-line summary.\n\n## Services\n\n- [Service name](https://yourdomain.com/services/name): What this service does in one line.\n\n## Contact\n\n- Author: Your Name\n- Email: you@yourdomain.com\n- LinkedIn: linkedin.com/in/yourhandle\n```\n\nThe format is intentionally plain Markdown. No special parser is needed: any LLM can read it. The `/llms.txt` path is the convention; some sites also serve `/llms-full.txt` with deeper content for models that want more context.\n\n**In Next.js**, generate it dynamically from `app/llms.txt/route.ts`:\n\n``` ts\n// app/llms.txt/route.ts\n// blogPosts is your own data module; each entry is assumed to have title, slug, and excerpt.\nimport { blogPosts } from '@/data/blog-posts'\n\nexport async function GET() {\n  const lines = [\n    '# YourSite',\n    '',\n    '> AI-search-ready Next.js development and SEO consulting.',\n    '',\n    'This site covers AI engineering, GEO/AEO, and production Next.js patterns.',\n    '',\n    '## Blog',\n    '',\n    ...blogPosts.map(p => `- [${p.title}](https://yourdomain.com/blog/${p.slug}): ${p.excerpt}`),\n    '',\n    '## Services',\n    '',\n    '- [AI-Search Consulting](https://yourdomain.com/services): End-to-end GEO and AEO for Next.js sites.',\n  ]\n\n  return new Response(lines.join('\\n'), {\n    headers: { 'Content-Type': 'text/plain; charset=utf-8' },\n  })\n}\n```\n\nThis keeps `llms.txt` in sync with your actual content automatically, with no manual updates.\n\n**Who reads llms.txt today:** reportedly ClaudeBot (Anthropic's crawler), PerplexityBot, and some versions of GPTBot, though vendors do not formally document this. Adoption is growing fast. If you are publishing content you want AI search engines to cite accurately, this file is well worth adding.\n\n## ai.txt: the wildcard\n\n`ai.txt` is a newer proposal from a different working group. 
Where `llms.txt` focuses on what an LLM should *know* about your site, `ai.txt` focuses on granting or denying *permission* for AI assistants to use your content in responses.\n\n**Basic syntax:**\n\n```\n# ai.txt\nVersion: 1.0\n\n[permissions]\nallow: true\ncommercial-use: false\ntraining: false\nreal-time-access: true\n\n[attribution]\nrequire: true\nformat: \"Source: {title} ({url})\"\n\n[contact]\nemail: you@yourdomain.com\n```\n\nHonest assessment: `ai.txt` has minimal enforcer support right now. No major AI company officially reads it. The spec is still evolving. That said, if the initiative gains traction (similar to how `robots.txt` went from informal convention to de-facto standard), having it early costs nothing and signals intent.\n\nFor most developers today: add it, keep it simple, and do not spend more than 10 minutes on it.\n\n## How a crawler actually decides what to read\n\nThe flow for a modern AI crawler like GPTBot or ClaudeBot hitting your domain:\n\n1. Fetch `robots.txt`: am I allowed to crawl this path?\n2. If allowed, fetch the page HTML\n3. Fetch `llms.txt` (periodically, not per-request): what is this site actually about?\n4. Check `ai.txt` if the implementation supports it\n5. Index the content with the context from steps 2–4 combined\n\nThe key insight: `robots.txt` rules are applied *per crawl request*. 
`llms.txt` is fetched periodically and cached, so it shapes how the model understands your whole site over time, not just whether it can read one page.\n\n## Putting it all together in Next.js App Router\n\nHere is the complete implementation for a Next.js 15 App Router site:\n\n```\napp/\n├── robots.ts          ← generates /robots.txt\n├── sitemap.ts         ← generates /sitemap.xml\n├── llms.txt/\n│   └── route.ts       ← generates /llms.txt (dynamic, always in sync)\npublic/\n└── ai.txt             ← static file, update manually\n```\n\n**robots.ts** (full version with AI crawler rules):\n\n``` ts\n// app/robots.ts\nimport type { MetadataRoute } from 'next'\n\nconst BASE_URL = process.env.NEXT_PUBLIC_SITE_URL ?? 'https://yourdomain.com'\n\nexport default function robots(): MetadataRoute.Robots {\n  return {\n    rules: [\n      // Default: allow everything\n      { userAgent: '*', allow: '/' },\n\n      // GPTBot: allow blog, services, and tools; block everything else.\n      // A named group replaces the '*' group for that bot, so '/' is disallowed explicitly.\n      {\n        userAgent: 'GPTBot',\n        allow: ['/blog/', '/services/', '/tools/'],\n        disallow: ['/'],\n      },\n\n      // ClaudeBot: same rules\n      {\n        userAgent: 'ClaudeBot',\n        allow: ['/blog/', '/services/', '/tools/'],\n        disallow: ['/'],\n      },\n\n      // PerplexityBot: full access (it drives meaningful referral traffic)\n      { userAgent: 'PerplexityBot', allow: '/' },\n\n      // Google-Extended (used for Gemini training): restrict to blog only\n      {\n        userAgent: 'Google-Extended',\n        allow: ['/blog/'],\n        disallow: ['/'],\n      },\n    ],\n    sitemap: `${BASE_URL}/sitemap.xml`,\n  }\n}\n```\n\n**llms.txt/route.ts** (dynamic, pulls from your data layer):\n\n``` ts\n// app/llms.txt/route.ts\nimport { blogPosts } from '@/data/blog-posts'\nimport { servicePosts } from '@/data/services'\n\nconst BASE_URL = process.env.NEXT_PUBLIC_SITE_URL ?? 
'https://yourdomain.com'\n\nexport async function GET() {\n  const content = `# YourSite\n\n> [One-sentence description of your site and its purpose.]\n\n[Two or three sentences giving an LLM the context it needs to cite your site accurately.\nWhat topics do you cover? Who is the author? What makes this site's perspective unique?]\n\n## Blog\n\n${blogPosts.map(p => `- [${p.title}](${BASE_URL}/blog/${p.slug}): ${p.excerpt}`).join('\\n')}\n\n## Services\n\n${servicePosts.map(s => `- [${s.title}](${BASE_URL}/services/${s.slug}): ${s.summary}`).join('\\n')}\n\n## Author\n\n- Name: Your Name\n- Site: ${BASE_URL}\n- Expertise: [Your primary expertise areas]\n`\n\n  return new Response(content, {\n    headers: {\n      'Content-Type': 'text/plain; charset=utf-8',\n      'Cache-Control': 'public, max-age=3600, stale-while-revalidate=86400',\n    },\n  })\n}\n```\n\n**public/ai.txt** (static, update when the spec stabilizes):\n\n```\n# ai.txt\nVersion: 1.0\n\n[permissions]\nallow: true\ncommercial-use: false\ntraining: false\nreal-time-access: true\n\n[attribution]\nrequire: true\n\n[contact]\nsite: https://yourdomain.com\n```\n\n## Which file do you actually need?\n\n**Start here:**\n\n- Building a public Next.js site? → Add `robots.txt` first. Always.\n- Want AI search engines (ChatGPT, Perplexity, Claude) to cite your content accurately? → Add `llms.txt`. This is the highest-leverage file for AI-search visibility right now.\n- Want to future-proof against the `ai.txt` spec gaining enforcement? → Add a simple `public/ai.txt`. It costs 10 lines.\n- Want to block AI crawlers from training on your content? → Set `Disallow: /` for `GPTBot`, `ClaudeBot`, and `Google-Extended` in `robots.txt`. This is the only file these crawlers are documented to honor today.\n\nThe one mistake I see most often: developers add `robots.txt` but skip `llms.txt`, then wonder why ChatGPT gives wrong answers about their site even though Googlebot indexes it fine. 
Googlebot and GPTBot-for-knowledge are completely different crawlers with different purposes.\n\n## Three things to verify right now\n\nOpen your terminal and check these three URLs on your live site:\n\n```\ncurl https://yourdomain.com/robots.txt\ncurl https://yourdomain.com/llms.txt\ncurl https://yourdomain.com/ai.txt\n```\n\nIf any returns a 404 or an HTML error page, that file is missing. For `robots.txt`, a missing file means all crawlers assume full access, which is usually fine for public sites, but you lose granular control. For `llms.txt`, a missing file means LLMs form their understanding of your site from raw page HTML with no structured context, which can lead to inaccurate citations.\n\nIf you want a deeper look at how AI crawlers read Next.js sites specifically (what RSC payloads they fetch, how streaming affects what they see, and which metadata fields they actually use), I have a longer writeup on [the AI-search architecture patterns I use in production](https://mudassirkhan.me/blog) that goes further than this cheat sheet.\n\nAnd if you want this wired up on your own site end-to-end, [that is exactly the kind of work I take on](https://mudassirkhan.me/services).\n\n*If your own llms.txt or robots.txt setup looks different from what I showed here, especially if you are on an older Next.js version or using the Pages Router, drop it in the comments. 
Curious what variations people are running in production.*", "url": "https://wpnews.pro/news/llms-txt-vs-robots-txt-vs-ai-txt-the-developer-s-cheat-sheet", "canonical_source": "https://dev.to/mudassirworks/llmstxt-vs-robotstxt-vs-aitxt-the-developers-cheat-sheet-1p72", "published_at": "2026-05-23 12:40:02+00:00", "updated_at": "2026-05-23 13:04:05.970587+00:00", "lang": "en", "topics": ["developer-tools", "artificial-intelligence", "large-language-models", "web3", "enterprise-software"], "entities": ["Googlebot", "ClaudeBot", "GPTBot", "Bingbot", "DuckDuckBot", "PerplexityBot", "Next.js", "robots.txt"], "alternates": {"html": "https://wpnews.pro/news/llms-txt-vs-robots-txt-vs-ai-txt-the-developer-s-cheat-sheet", "markdown": "https://wpnews.pro/news/llms-txt-vs-robots-txt-vs-ai-txt-the-developer-s-cheat-sheet.md", "text": "https://wpnews.pro/news/llms-txt-vs-robots-txt-vs-ai-txt-the-developer-s-cheat-sheet.txt", "jsonld": "https://wpnews.pro/news/llms-txt-vs-robots-txt-vs-ai-txt-the-developer-s-cheat-sheet.jsonld"}}