Part of the Agent Readiness course. Measure any page with the Core Agent Vitals analyzer.
What it is
An XML sitemap (/sitemap.xml) is a machine-readable list of every public URL on your site, each with an optional <lastmod> date. It's the standard way to tell crawlers "here is everything worth indexing, and here's when it last changed." The format is defined at sitemaps.org.
Why agents need it
Agents and crawlers discover pages two ways: by following links, and by reading your sitemap. Link-following alone is shallow — it finds what's reachable from your homepage in a few hops and misses the long tail: individual products, doc pages, pricing tiers, deep articles. Those deep pages are exactly what answer specific user questions.
A sitemap flattens your whole site into one list an agent can consume in a single fetch, and <lastmod> tells it what changed so it re-fetches the right pages instead of re-crawling everything or nothing. No sitemap = your deep inventory is invisible unless an agent happens to click its way there.
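To make the re-fetch logic concrete, here is a minimal sketch of how a crawler might use <lastmod> to pick pages to revisit. The function name urls_changed_since and the "re-fetch when <lastmod> is missing" policy are assumptions for illustration, not a description of any particular agent. It also assumes <lastmod> starts with a full YYYY-MM-DD date.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Namespace used by the sitemaps.org 0.9 schema.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_changed_since(sitemap_xml: str, last_crawl: date) -> list[str]:
    """Return <loc> values whose <lastmod> is later than last_crawl.

    Hypothetical policy: entries without <lastmod> are always returned,
    because the sitemap gives no signal about them.
    """
    root = ET.fromstring(sitemap_xml)
    changed = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        if loc is None:
            continue
        # Assumes the value begins with YYYY-MM-DD (a time part is ignored).
        if lastmod is None or date.fromisoformat(lastmod.strip()[:10]) > last_crawl:
            changed.append(loc.strip())
    return changed


SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://your-site.com/</loc><lastmod>2026-07-01</lastmod></url>
  <url><loc>https://your-site.com/docs/quickstart</loc><lastmod>2026-06-28</lastmod></url>
</urlset>"""

print(urls_changed_since(SAMPLE, date(2026, 6, 30)))
# ['https://your-site.com/']
```

Only the homepage is returned, because the quickstart page last changed before the previous crawl.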
How to implement
Generate sitemap.xml at build time from your routes (every major framework and CMS has a plugin), and list real, canonical, public URLs:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://your-site.com/</loc>
    <lastmod>2026-07-01</lastmod>
  </url>
  <url>
    <loc>https://your-site.com/docs/quickstart</loc>
    <lastmod>2026-06-28</lastmod>
  </url>
</urlset>
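If your stack has no plugin, a build step along these lines could produce the file above. This is a sketch only: build_sitemap and the pages mapping (path to content-change date) are hypothetical names, and your build tool would need to supply the real routes and dates. It relies on ET.indent, which requires Python 3.9 or later.

```python
import xml.etree.ElementTree as ET
from datetime import date

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(base_url: str, pages: dict[str, date]) -> str:
    """Render a sitemap from a {path: last content change} mapping."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for path, changed in sorted(pages.items()):
        url = ET.SubElement(urlset, "url")
        # ElementTree escapes characters such as & in the URL text.
        ET.SubElement(url, "loc").text = base_url.rstrip("/") + path
        ET.SubElement(url, "lastmod").text = changed.isoformat()
    ET.indent(urlset)  # pretty-print; Python 3.9+
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body + "\n"


# Hypothetical routes; a real build would read these from the router or CMS.
pages = {
    "/": date(2026, 7, 1),
    "/docs/quickstart": date(2026, 6, 28),
}
print(build_sitemap("https://your-site.com", pages))
```

Write the returned string to sitemap.xml in your build output so the file is regenerated on every deploy.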
For large sites (more than 50,000 URLs or 50 MB uncompressed), split into multiple sitemaps and reference them from a sitemap_index.xml. Then advertise it in robots.txt:
Sitemap: https://your-site.com/sitemap.xml
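The split for large sites can be sketched as follows. The part file names (sitemap-1.xml, sitemap-2.xml, ...) and the helper names are assumptions; the sketch enforces only the URL-count limit, so a real build would also need to check each file against the 50 MB size limit.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000  # per-file URL limit from the sitemap protocol


def split_urls(urls: list[str], size: int = MAX_URLS_PER_FILE) -> list[list[str]]:
    """Cut the URL list into consecutive chunks of at most `size` entries."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_sitemap_index(base_url: str, part_count: int) -> str:
    """Render an index that points at sitemap-1.xml .. sitemap-N.xml.

    The part file naming is a hypothetical convention, not a requirement.
    """
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for n in range(1, part_count + 1):
        entry = ET.SubElement(index, "sitemap")
        ET.SubElement(entry, "loc").text = f"{base_url.rstrip('/')}/sitemap-{n}.xml"
    ET.indent(index)  # Python 3.9+
    body = ET.tostring(index, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body + "\n"


urls = [f"https://your-site.com/products/{i}" for i in range(120_001)]
parts = split_urls(urls)
print([len(p) for p in parts])  # [50000, 50000, 20001]
print(build_sitemap_index("https://your-site.com", len(parts)))
```

Each chunk would then be written as its own sitemap file, and the Sitemap: line in robots.txt would point at the index instead of a single sitemap.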
Validate
curl -s https://your-site.com/sitemap.xml | head -20
Confirm valid XML, real <loc> entries, and recent <lastmod> values. The Core Agent Vitals analyzer checks for the sitemap at /sitemap.xml and /sitemap_index.xml, validates it has URL entries, and flags a stale one.
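The same three checks can be scripted for CI. The sketch below is not how the Core Agent Vitals analyzer works internally; check_sitemap and the 365-day staleness threshold are assumptions chosen for illustration, and it expects <lastmod> values that start with a full YYYY-MM-DD date.

```python
import xml.etree.ElementTree as ET
from datetime import date, timedelta

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def check_sitemap(xml_text: str, today: date, max_age_days: int = 365) -> list[str]:
    """Return a list of problems; an empty list means all checks passed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"not valid XML: {exc}"]

    locs = [el.text.strip() for el in root.findall("sm:url/sm:loc", NS) if el.text]
    if not locs:
        return ["no <loc> entries found"]

    problems = []
    for loc in locs:
        if not loc.startswith(("http://", "https://")):
            problems.append(f"<loc> is not an absolute URL: {loc}")

    lastmods = [el.text.strip() for el in root.findall("sm:url/sm:lastmod", NS) if el.text]
    if not lastmods:
        problems.append("no <lastmod> values")
    else:
        newest = max(date.fromisoformat(value[:10]) for value in lastmods)
        # Assumed threshold: nothing changed in max_age_days counts as stale.
        if today - newest > timedelta(days=max_age_days):
            problems.append(f"newest <lastmod> is {newest.isoformat()}, sitemap looks stale")
    return problems


GOOD = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://your-site.com/</loc><lastmod>2026-07-01</lastmod></url>
</urlset>"""

print(check_sitemap(GOOD, today=date(2026, 7, 2)))  # []
```

In a pipeline, you would feed it the body returned by the curl command above and fail the build when the list is non-empty.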
Common mistakes
No sitemap at all. The default for many hand-built sites, and a silent cap on how much of your site agents can find.
Faked lastmod. Setting every page's lastmod to today (or build time) trains crawlers to ignore the signal. Emit the real content-change date.
Listing non-canonical or redirecting URLs. Every <loc> should be a canonical, indexable URL that returns 200: not a redirect, not a noindex page.
Forgetting the robots.txt reference. Without the Sitemap: line, agents have to guess the location.
Letting it drift. A sitemap generated once and never regenerated slowly diverges from reality. Build it in your pipeline so it can't rot.
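One way to avoid a faked lastmod while still regenerating the sitemap on every build is to bump the date only when a page's content actually changes. The sketch below does this by comparing content hashes against the previous build. The function update_lastmods and the idea of persisting a small {path: (hash, lastmod)} state file between builds are assumptions, not something the sitemap protocol prescribes.

```python
import hashlib
from datetime import date

# Build state: path -> (sha256 of rendered content, lastmod as YYYY-MM-DD).
State = dict[str, tuple[str, str]]


def update_lastmods(previous: State, pages: dict[str, str], today: date) -> State:
    """Keep the old lastmod for unchanged pages; stamp today on new or edited ones."""
    result: State = {}
    for path, content in pages.items():
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        old = previous.get(path)
        if old is not None and old[0] == digest:
            result[path] = old  # content identical, so the date stays put
        else:
            result[path] = (digest, today.isoformat())
    return result


# First build: every page is new, so every page gets the build date.
first = update_lastmods({}, {"/": "home v1", "/pricing": "plans v1"}, date(2026, 6, 28))

# Second build: only /pricing changed, so only its lastmod moves.
second = update_lastmods(first, {"/": "home v1", "/pricing": "plans v2"}, date(2026, 7, 1))
print({path: lastmod for path, (_, lastmod) in second.items()})
# {'/': '2026-06-28', '/pricing': '2026-07-01'}
```

Pages removed from the build drop out of the state automatically, which also keeps the sitemap from drifting.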
Next: JSON-LD Structured Data — telling agents what a page is, not just what links to it.