You put a robots.txt on your site to tell search crawlers what to ignore. You add a sitemap.xml to help them find everything. These standards work because crawlers visit your site repeatedly — on a schedule, automatically, indefinitely. Instructions you leave in files become part of an ongoing conversation between your server and the crawler.
llms.txt doesn't work like that. That's the thing most articles about it miss, and it's the reason the standard is simultaneously more limited and more interesting than it sounds.
llms.txt is a proposed standard, not officially adopted by any major AI provider, created by Jeremy Howard of Answer.AI. The idea: place a Markdown file at your domain root (yourdomain.com/llms.txt) that describes your site and lists your important pages with brief descriptions. Clean, human-readable, structured for AI consumption rather than HTML parsing.
A minimal example:
> Senior developer helping businesses build MVPs, integrate APIs,
> and escape broken projects. Based in Finland, working across Europe.
## Services
- [MVP Development](https://iurii.rogulia.fi/services/mvp-development): End-to-end MVP builds in 6–12 weeks
- [API Integrations](https://iurii.rogulia.fi/services/api-integrations): Connecting third-party services and internal systems
- [Fractional CTO](https://iurii.rogulia.fi/services/fractional-cto): Technical leadership without a full-time hire
## Blog
- [Blog](https://iurii.rogulia.fi/blog): Technical articles on Next.js, Node.js, automation, and architecture
## Contact
- [Contact](https://iurii.rogulia.fi/contact): Project inquiries
The format is intentionally minimal. No schema.org, no JSON-LD, no semantic HTML — just structured Markdown that describes who you are and what matters on your site.
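If your page list already lives in code (a CMS export, a route manifest), you could generate the file instead of writing it by hand. Here's a minimal sketch. The `LlmsSite`, `LlmsSection`, and `LlmsLink` shapes are hypothetical names made up for illustration, not part of the proposal. The H1 title on the first line follows the llms.txt proposal, which puts the site or project name first.

```typescript
// Hypothetical data shapes for illustration; the proposal does not define them.
interface LlmsLink {
  title: string;
  url: string;
  description: string;
}

interface LlmsSection {
  heading: string;
  links: LlmsLink[];
}

interface LlmsSite {
  name: string; // rendered as the H1 title
  summary: string; // rendered as the blockquote under the title
  sections: LlmsSection[];
}

// Render the structure as Markdown: H1, blockquote summary, then one H2 per section.
function renderLlmsTxt(site: LlmsSite): string {
  const lines: string[] = [`# ${site.name}`, "", `> ${site.summary}`];
  for (const section of site.sections) {
    lines.push("", `## ${section.heading}`, "");
    for (const link of section.links) {
      lines.push(`- [${link.title}](${link.url}): ${link.description}`);
    }
  }
  return lines.join("\n") + "\n";
}

// Example input with placeholder values.
const example = renderLlmsTxt({
  name: "Example Site",
  summary: "One sentence: what you do and who you serve.",
  sections: [
    {
      heading: "Contact",
      links: [
        {
          title: "Contact",
          url: "https://example.com/contact",
          description: "Project inquiries",
        },
      ],
    },
  ],
});
```

The returned string is the whole file, so a build step could write it wherever your framework serves static assets from.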
Here's what changes how you should think about llms.txt: AI systems interact with your site in three distinct ways, and llms.txt is potentially relevant to all of them.
Googlebot visits your site on a schedule and re-fetches robots.txt regularly (Google generally caches the file for up to 24 hours). Instructions you add today take effect within about a day. The relationship is continuous and ongoing.
Base model training works differently. A training crawl happens at a specific point in time, data is collected, the model is trained. After that — the base model doesn't come back. What it knows about your site is frozen from whenever that crawl happened, and it stays frozen until the next training run, which might be six months or two years later. A file present on your domain at crawl time may be included in training corpora — if it's downloaded, retained, and not filtered out. None of that is guaranteed, but the file costs you nothing to place.
Runtime AI systems are a separate layer. Perplexity, Bing Copilot, and similar systems retrieve web content during inference — they're not frozen snapshots. In practice this often means search API snippets and cached content rather than a full site crawl, but the direction of travel is toward richer context retrieval. If they eventually start parsing llms.txt (none currently does by default), having a structured description means your site context is immediately legible without parsing navigation, sidebars, and boilerplate.
There's also a third category: AI agents — autonomous systems that browse sites to complete tasks. These are crawlers by design, and structured context files are exactly the kind of signal they're built to consume. Of the three scenarios, agents are the most plausible near-term use case for llms.txt; the training data angle is more speculative.
The practical implication: llms.txt could be useful across all three access patterns. None is guaranteed. That's fine — the cost of placing the file is low enough that you don't need high confidence to justify it.
The honest numbers first. In one 30-day log analysis of roughly 1,000 domains, GPTBot (OpenAI), ClaudeBot (Anthropic), and PerplexityBot registered zero requests specifically for llms.txt. No major AI provider has officially announced support for the standard. A Google engineer publicly compared it to the meta keywords tag — a standard once considered essential, now completely ignored by every major search engine.
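You can run the same kind of check against your own access logs. The sketch below is not the method used in the analysis cited above. It assumes your server writes the common "combined" log format, and it matches crawlers by simple user-agent substring. Both are assumptions you may need to adapt.

```typescript
// User-agent substrings for the crawlers named above.
const AI_BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot"];

// Assumes combined log format: captures the request path and the final
// quoted field (the user agent).
const LOG_LINE = /"(?:GET|HEAD) (\S+) [^"]*" \d{3} \S+ "[^"]*" "([^"]*)"/;

// Count requests for /llms.txt per AI crawler.
function countLlmsTxtHits(logLines: string[]): Record<string, number> {
  const hits: Record<string, number> = {};
  for (const bot of AI_BOTS) {
    hits[bot] = 0;
  }
  for (const line of logLines) {
    const match = LOG_LINE.exec(line);
    if (match === null) continue;
    const path = match[1] ?? "";
    const userAgent = match[2] ?? "";
    // Ignore query strings so /llms.txt?x=1 still counts.
    if (path.split("?")[0] !== "/llms.txt") continue;
    for (const bot of AI_BOTS) {
      if (userAgent.indexOf(bot) !== -1) {
        hits[bot] = (hits[bot] ?? 0) + 1;
      }
    }
  }
  return hits;
}
```

Feed it the lines of a log file and you get one counter per crawler. All zeros after a month would match what the analysis above reported.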
The spec has no RFC. No formal adoption process. It's a community proposal that gained momentum because it arrived at the right moment — when everyone is rethinking AI visibility — not because there's a concrete implementation roadmap with committed parties.
This is the part that most llms.txt evangelism skips. Current support is essentially zero.
The argument for adding llms.txt isn't that it works today. It's about asymmetric cost and benefit:
The effort is fifteen minutes. Create a Markdown file, write a clear description of your site, list your important pages. Done. No server configuration, no deployment scripts, no ongoing maintenance. You write it once.
The downside is minimal. A public Markdown file is unlikely to harm performance, crawl budget, Core Web Vitals, or existing SEO signals. The main risk is publishing inaccurate, overpromising, or strategically sensitive information — which is a content problem, not a format problem.
The upside could be years of compounding benefit. If any major AI provider adopts the standard and begins parsing llms.txt during training crawls, your structured description is already there — placed years before your competitors thought to add it. If AI-powered web agents (which do crawl sites actively, not just at training time) start reading it for context, you're already covered.
Some optional standards eventually matter — but only after platforms commit to them. Canonical tags had Google's explicit support from day one. JSON-LD got traction because search engines documented exactly how they used it. llms.txt has no committed consumer yet. The analogy is aspirational, not predictive.
What it does share with those standards: low cost of early adoption. The question isn't whether llms.txt will succeed — it's whether the cost of betting on it justifies the potential upside. Low probability. High upside if it lands. Near-zero cost regardless.
In Next.js, create public/llms.txt and it's automatically served at /llms.txt:
> One sentence: what you do and who you serve.
## Core Pages
- [Page Title](https://iurii.rogulia.fi/path): What this page is for
- [Page Title](https://iurii.rogulia.fi/path): What this page is for
## Key Content
- [Section](https://iurii.rogulia.fi/path): What readers find here
## Contact
- [Contact](https://iurii.rogulia.fi/contact): How to reach you
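A quick structural check can catch a malformed file before you deploy it. The llms.txt proposal expects an H1 title as the first line, followed by optional summary text and H2 sections containing link lists. The sketch below checks only that basic shape. The function name `checkLlmsTxt` is hypothetical, and the requirement for absolute http(s) URLs is a stricter assumption of this example, not a rule from the proposal.

```typescript
// Returns a list of problems; an empty list means the basic shape looks right.
function checkLlmsTxt(text: string): string[] {
  const problems: string[] = [];
  const lines = text.split(/\r?\n/).filter((line) => line.trim() !== "");

  // The first non-empty line should be the H1 title.
  const first = lines[0] ?? "";
  if (!/^# \S/.test(first)) {
    problems.push("first line should be an H1 title, e.g. '# Site Name'");
  }

  // Link lines look like "- [Title](https://...)" with an optional ": notes".
  const linkLine = /^- \[[^\]]+\]\(https?:\/\/[^)\s]+\)(: .+)?$/;
  let inSection = false;
  for (const line of lines.slice(1)) {
    if (/^## \S/.test(line)) {
      inSection = true;
      continue;
    }
    if (/^- /.test(line)) {
      if (!inSection) {
        problems.push(`list item outside a '##' section: ${line}`);
      } else if (!linkLine.test(line)) {
        problems.push(`malformed link line: ${line}`);
      }
    }
  }
  return problems;
}
```

Other lines, such as the blockquote summary or free-form paragraphs, are deliberately ignored, since the proposal allows them.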
Guidelines that actually matter:
An extended variant, llms-full.txt, can contain your full documentation or page content as raw text, for AI systems that want complete context rather than a structured index. Link to it from llms.txt if you create it:
> Description.
## Full Content
- [Complete site content](https://iurii.rogulia.fi/llms-full.txt): Full text of all pages for AI systems
## Key Pages
- ...
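If you do create llms-full.txt, one simple approach is to concatenate the plain text of each page under its own heading. Nothing here prescribes a layout for that file, so the sketch below is just one hypothetical arrangement. The `PageText` type, the `Source:` line, and the function name are all made up for illustration.

```typescript
// Hypothetical input shape: the already-extracted plain text of one page.
interface PageText {
  title: string;
  url: string;
  body: string;
}

// Join all pages into one document: site title first, then one H2 block per page.
function renderLlmsFullTxt(siteName: string, pages: PageText[]): string {
  const parts: string[] = [`# ${siteName}`];
  for (const page of pages) {
    parts.push(`## ${page.title}\n\nSource: ${page.url}\n\n${page.body.trim()}`);
  }
  return parts.join("\n\n") + "\n";
}
```

Keeping the source URL next to each block lets a reader, human or machine, trace any passage back to the page it came from.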
Verify it's accessible after deploying:
curl -I https://yourdomain.com/llms.txt
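To script that check, you could parse the header output instead of reading it by eye. A minimal sketch, assuming the raw text printed by curl -I is passed in as a string. Treating text/plain or text/markdown as the acceptable content types is an assumption of this example, not something the proposal mandates, and `checkHeadOutput` is a hypothetical name.

```typescript
interface HeadCheck {
  ok: boolean;
  status: number;
  contentType: string;
}

// Parse the output of `curl -I` and decide whether the file looks reachable.
function checkHeadOutput(raw: string): HeadCheck {
  const lines = raw.split(/\r?\n/);

  // The status line looks like "HTTP/2 200" or "HTTP/1.1 200 OK".
  const statusMatch = /^HTTP\/\S+ (\d{3})/.exec(lines[0] ?? "");
  const status = statusMatch ? Number(statusMatch[1]) : 0;

  // Header names are case-insensitive, so match without regard to case.
  let contentType = "";
  for (const line of lines.slice(1)) {
    const header = /^content-type:\s*(.+)$/i.exec(line);
    if (header) {
      contentType = (header[1] ?? "").trim().toLowerCase();
    }
  }

  // Assumption: a text content type is what we want for a Markdown file.
  const ok = status === 200 && /^text\/(plain|markdown)\b/.test(contentType);
  return { ok, status, contentType };
}
```

A 200 with an HTML content type is worth a second look: it often means a catch-all route answered instead of the static file.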
Since llms.txt support is minimal right now, be clear about what genuinely affects how AI systems represent your site:
<article>, <section>, and a proper heading hierarchy make content parseable without JS execution.
One more honest note: modern AI systems already extract structure from HTML reasonably well. Elements like <article>, <h1>, and <nav> give models enough signal to understand what a page is about without a dedicated hints file. llms.txt is an optimization for cases where clean structure matters and parsing overhead is worth reducing. It's not a necessity, and it won't compensate for thin content or missing structured data.
llms.txt complements the list above. It doesn't replace any of it, and it belongs last in the priority order — not first.
No confirmed adoption from major AI providers. Zero observed crawl requests in real data. A Google engineer dismissed it. No formal backing.
Add it anyway. Fifteen minutes, zero ongoing cost, low probability of payoff, high upside if it lands. That math works even with pessimistic assumptions.
The deeper point is about how AI and search crawlers differ. robots.txt is an ongoing instruction to a system that visits on a schedule. llms.txt spans three access patterns: base model training (snapshot, frozen until next cycle), live inference systems (Perplexity, Copilot, RAG pipelines), and autonomous agents that browse by design. It doesn't behave like robots.txt or sitemap.xml — it's a structured introduction that can be consumed once and relied on later, or referenced repeatedly by runtime systems. Write it accordingly: stable, accurate, and worth being "frozen in time" if a model or system sees it once and relies on it later.
If you're building technical infrastructure that needs to stay visible across both traditional search and emerging AI systems — structured data, semantic HTML, proper crawl architecture — that's part of the work I do on every project. I covered the server-side foundations in the pikkuna.fi build: 30 languages, correct canonical tags, and structured data that survives localization. For MVP development or a Technical SEO Audit of your discoverability stack — get in touch if you need a developer who builds this in by default rather than bolts it on at the end.