Ever needed to inventory every page on a website? Extract clean text content for AI training? Or audit meta tags across an entire domain?
I built a Sitemap Content Extractor that does exactly this: feed it a sitemap.xml URL and it crawls every page, extracting structured content.
Gzipped .xml.gz sitemaps work out of the box.
You can run it directly on Apify Store, no setup required.
Just provide a sitemap URL (for example, https://example.com/sitemap.xml).
{
  "url": "https://pydantic.dev/docs/",
  "title": "Pydantic Docs - Validation, AI Agents, Logfire Observability",
  "content": "Full extracted article text...",
  "wordCount": 131,
  "metaDescription": "Pydantic documentation...",
  "h1Headings": ["Pydantic Docs"],
  "lastmod": "2025-01-15",
  "extractedAt": "2026-07-04T10:45:00Z"
}
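Records shaped like the one above are easy to post-process. Here is a minimal sketch of a content audit over a list of such records. The function name `audit_records`, the specific checks, and the thresholds (300 words, a 2024-01-01 cutoff) are assumptions chosen for illustration, not something the extractor itself does.

```python
from datetime import date


def audit_records(records, min_words=300, stale_before=date(2024, 1, 1)):
    """Return {url: [issue, ...]} for every record that fails a check.

    The checks and default thresholds are illustrative assumptions.
    """
    issues = {}
    for rec in records:
        found = []
        if rec.get("wordCount", 0) < min_words:
            found.append("thin content")
        if not rec.get("metaDescription"):
            found.append("missing meta description")
        if len(rec.get("h1Headings") or []) != 1:
            found.append("expected exactly one H1")
        lastmod = rec.get("lastmod")
        # lastmod may be a date or a full timestamp; compare on the date part.
        if lastmod and date.fromisoformat(lastmod[:10]) < stale_before:
            found.append("stale lastmod")
        if found:
            issues[rec["url"]] = found
    return issues
```

With the defaults above, the sample record would be flagged only for thin content, since its word count is 131.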
Crawl your entire site and identify pages that need attention, for example by their lastmod dates.
Extract clean text from documentation sites for fine-tuning LLMs. The trafilatura extraction removes navigation, ads, and boilerplate, leaving only the main content.
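For the LLM use case, the extracted records usually need to be turned into a training file. The sketch below converts records into JSON Lines, skipping empty or very short pages. The helper name `to_jsonl_lines`, the `text`/`source` field names, and the 50-word minimum are assumptions; adapt them to whatever format your fine-tuning pipeline expects.

```python
import json


def to_jsonl_lines(records, min_words=50):
    """Yield one JSON line per record that has enough text to be useful.

    The output schema ({"text", "source"}) is an assumed convention,
    not the extractor's own format.
    """
    for rec in records:
        text = (rec.get("content") or "").strip()
        # Fall back to a whitespace word count if wordCount is missing.
        word_count = rec.get("wordCount", len(text.split()))
        if not text or word_count < min_words:
            continue
        yield json.dumps({"text": text, "source": rec.get("url")}, ensure_ascii=False)
```

Writing the result to disk is then a matter of joining the lines with newlines and saving them to a `.jsonl` file.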
Inventory a competitor's entire content strategy — how many pages, how much content per page, what topics they cover.
Before migrating a legacy site, extract all content into structured JSON for easy import into a new CMS.
The extractor is built in Python 3.12 and uses trafilatura for content extraction.
It handles both <urlset> (regular sitemaps) and <sitemapindex> (nested sitemaps), following child sitemaps recursively.
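To make the recursive handling concrete, here is a rough sketch using only the standard library. It is not the extractor's actual code: the function names (`decode_sitemap`, `iter_page_urls`, `fetch_http`) are hypothetical, and it assumes sitemaps declare the standard sitemaps.org namespace. The `fetch` callable is passed in so the parsing logic can be exercised without network access.

```python
import gzip
import urllib.request
import xml.etree.ElementTree as ET
from typing import Callable, Iterator, Optional, Set, Tuple

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def decode_sitemap(raw: bytes) -> bytes:
    """Decompress gzip payloads (detected by magic bytes); pass others through."""
    return gzip.decompress(raw) if raw[:2] == b"\x1f\x8b" else raw


def iter_page_urls(
    sitemap_url: str,
    fetch: Callable[[str], bytes],
    seen: Optional[Set[str]] = None,
) -> Iterator[Tuple[str, Optional[str]]]:
    """Yield (page_url, lastmod) pairs, descending into sitemap indexes."""
    seen = set() if seen is None else seen
    if sitemap_url in seen:  # guard against indexes that reference each other
        return
    seen.add(sitemap_url)

    root = ET.fromstring(decode_sitemap(fetch(sitemap_url)))
    if root.tag.endswith("sitemapindex"):
        # Nested sitemap: each <sitemap><loc> points at a child sitemap.
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            if loc.text:
                yield from iter_page_urls(loc.text.strip(), fetch, seen)
    else:
        # Regular sitemap: each <url> has a <loc> and an optional <lastmod>.
        for url in root.findall("sm:url", NS):
            loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
            lastmod = url.findtext("sm:lastmod", default=None, namespaces=NS)
            if loc:
                yield loc, lastmod


def fetch_http(url: str) -> bytes:
    """Simple fetcher for real use; swap in your own client as needed."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read()


# Example (performs network requests, so it is left commented out):
# for page_url, lastmod in iter_page_urls("https://example.com/sitemap.xml", fetch_http):
#     print(page_url, lastmod)
```

Detecting gzip by its magic bytes rather than by the `.gz` extension also covers servers that deliver compressed sitemaps under a plain `.xml` URL.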
Try it now on the Apify Store
No registration needed — just paste a sitemap URL and hit run.
What would you use a sitemap extractor for? Let me know in the comments!