Source: dev.to

How to Extract Clean Content From Any Website Sitemap (For SEO Audits & AI Training)

A developer built a Sitemap Content Extractor that crawls every page from a sitemap.xml URL and extracts structured content such as title, meta description, headings, and clean text. The tool, available on Apify Store, handles nested sitemaps and uses trafilatura to remove boilerplate, making it useful for SEO audits, AI training data extraction, and content migration.

Published Jul 4, 2026

Ever needed to inventory every page on a website? Extract clean text content for AI training? Or audit meta tags across an entire domain?

I built a Sitemap Content Extractor that does exactly this: feed it a sitemap.xml URL and it crawls every page, extracting structured content.

Compressed .xml.gz sitemap files work out of the box.

You can run it directly on Apify Store, with no setup required.

Just provide a sitemap URL (e.g. https://example.com/sitemap.xml).

Each extracted page comes back as a JSON record like this:

{
  "url": "https://pydantic.dev/docs/",
  "title": "Pydantic Docs - Validation, AI Agents, Logfire Observability",
  "content": "Full extracted article text...",
  "wordCount": 131,
  "metaDescription": "Pydantic documentation...",
  "h1Headings": ["Pydantic Docs"],
  "lastmod": "2025-01-15",
  "extractedAt": "2026-07-04T10:45:00Z"
}
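If you post-process these records in Python, it can help to load them into a typed structure first. The following is a minimal sketch, not part of the extractor itself: the `PageRecord` class and its snake_case attribute names are hypothetical, and only the JSON keys come from the example above.

```python
from __future__ import annotations

import json
from dataclasses import dataclass
from datetime import date, datetime

# Sample record, copied from the example output above.
SAMPLE = """
{
  "url": "https://pydantic.dev/docs/",
  "title": "Pydantic Docs - Validation, AI Agents, Logfire Observability",
  "content": "Full extracted article text...",
  "wordCount": 131,
  "metaDescription": "Pydantic documentation...",
  "h1Headings": ["Pydantic Docs"],
  "lastmod": "2025-01-15",
  "extractedAt": "2026-07-04T10:45:00Z"
}
"""


@dataclass
class PageRecord:
    """Hypothetical typed view of one extracted page."""

    url: str
    title: str
    content: str
    word_count: int
    meta_description: str
    h1_headings: list[str]
    lastmod: date | None
    extracted_at: datetime

    @classmethod
    def from_dict(cls, raw: dict) -> PageRecord:
        lastmod = raw.get("lastmod")
        return cls(
            url=raw["url"],
            title=raw.get("title", ""),
            content=raw.get("content", ""),
            word_count=int(raw.get("wordCount", 0)),
            meta_description=raw.get("metaDescription", ""),
            h1_headings=list(raw.get("h1Headings", [])),
            # Sitemap lastmod may be a date or a full timestamp; keep the date part.
            lastmod=date.fromisoformat(lastmod[:10]) if lastmod else None,
            # Normalise the trailing "Z" so older Python versions can parse it.
            extracted_at=datetime.fromisoformat(
                raw["extractedAt"].replace("Z", "+00:00")
            ),
        )


record = PageRecord.from_dict(json.loads(SAMPLE))
print(record.url, record.word_count, record.lastmod)
```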

Crawl your entire site and identify pages that need attention, using fields such as lastmod dates.

Extract clean text from documentation sites for fine-tuning LLMs. The trafilatura extraction removes navigation, ads, and boilerplate, leaving only the main content.

Inventory a competitor's entire content strategy — how many pages, how much content per page, what topics they cover.

Before migrating a legacy site, extract all content into structured JSON for easy import into a new CMS.
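The audit and training-data use cases above mostly come down to filtering the per-page records. Here is a rough sketch of how that could look, assuming the records are plain dicts with the keys from the example output. The specific checks, the `MIN_WORDS` threshold, and both function names are my own illustrative assumptions, not features of the extractor.

```python
import json

MIN_WORDS = 200  # hypothetical cut-off for "thin" pages; tune for your site


def audit_issues(record: dict, min_words: int = MIN_WORDS) -> list:
    """Return a list of human-readable problems found in one record."""
    issues = []
    if not (record.get("metaDescription") or "").strip():
        issues.append("missing meta description")
    if not record.get("h1Headings"):
        issues.append("missing h1")
    if record.get("wordCount", 0) < min_words:
        issues.append("thin content")
    return issues


def to_training_jsonl(records: list, min_words: int = MIN_WORDS) -> str:
    """Keep pages with enough text and emit one JSON object per line."""
    lines = [
        json.dumps({"url": r["url"], "text": r["content"]}, ensure_ascii=False)
        for r in records
        if r.get("content") and r.get("wordCount", 0) >= min_words
    ]
    return "\n".join(lines)


records = [
    {
        "url": "https://example.com/a",
        "content": "Short page text.",
        "wordCount": 131,
        "metaDescription": "A short page.",
        "h1Headings": ["Page A"],
    },
    {
        "url": "https://example.com/b",
        "content": "Long article text.",
        "wordCount": 850,
        "metaDescription": "",
        "h1Headings": [],
    },
]

for r in records:
    print(r["url"], audit_issues(r))
print(to_training_jsonl(records))
```

The same filtered JSON Lines output could also serve as an import file for a content migration, provided the target CMS accepts that format.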

The extractor is built in Python 3.12 and uses trafilatura for content extraction.

It handles both <urlset> (regular sitemaps) and <sitemapindex> (nested sitemaps), following child sitemaps recursively.
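To illustrate the recursion, here is a minimal standard-library sketch of sitemap traversal. It is not the extractor's actual code: `iter_sitemap_urls`, the injectable `fetch` callable, and the in-memory `FAKE_SITE` are all hypothetical names used for the example. It also shows one possible way to accept .xml.gz files, by checking for the gzip magic bytes.

```python
from __future__ import annotations

import gzip
import xml.etree.ElementTree as ET
from typing import Callable, Iterator

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream


def local_name(tag: str) -> str:
    """Strip the XML namespace: '{ns}urlset' -> 'urlset'."""
    return tag.rsplit("}", 1)[-1]


def iter_sitemap_urls(
    url: str,
    fetch: Callable[[str], bytes],
    seen: set | None = None,
) -> Iterator[tuple[str, str | None]]:
    """Yield (page_url, lastmod) pairs, descending into child sitemaps."""
    seen = set() if seen is None else seen
    if url in seen:  # guard against sitemaps that reference each other
        return
    seen.add(url)

    data = fetch(url)
    if data[:2] == GZIP_MAGIC:
        data = gzip.decompress(data)

    root = ET.fromstring(data)
    kind = local_name(root.tag)
    for entry in root:
        fields = {local_name(c.tag): (c.text or "").strip() for c in entry}
        loc = fields.get("loc")
        if not loc:
            continue
        if kind == "sitemapindex":
            yield from iter_sitemap_urls(loc, fetch, seen)
        elif kind == "urlset":
            yield loc, fields.get("lastmod") or None


# In-memory stand-in for HTTP so the sketch runs without network access.
FAKE_SITE = {
    "https://example.com/sitemap.xml": b"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/pages.xml.gz</loc></sitemap>
</sitemapindex>""",
    "https://example.com/pages.xml.gz": gzip.compress(
        b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc><lastmod>2025-01-15</lastmod></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""
    ),
}

pages = list(
    iter_sitemap_urls("https://example.com/sitemap.xml", FAKE_SITE.__getitem__)
)
for page_url, lastmod in pages:
    print(page_url, lastmod)
```

For real sites, `fetch` could be something like `lambda u: urllib.request.urlopen(u).read()`. Passing it in as a parameter keeps the traversal logic testable offline.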

Try it now on the Apify Store

No registration needed — just paste a sitemap URL and hit run.

What would you use a sitemap extractor for? Let me know in the comments!
