# How to Extract Clean Content From Any Website Sitemap (For SEO Audits & AI Training)

> Source: <https://dev.to/oaida_adrian_afa2428f63d0/how-to-extract-clean-content-from-any-website-sitemap-for-seo-audits-ai-training-15a9>
> Published: 2026-07-04 10:50:51+00:00

Ever needed to inventory every page on a website? Extract clean text content for AI training? Or audit meta tags across an entire domain?

I built a **Sitemap Content Extractor** that does exactly this: feed it a `sitemap.xml` URL and it crawls every page, extracting structured content.

`.xml.gz` files work out of the box.

You can run it directly on [Apify Store](https://apify.com/darknezz/sitemap-content-extractor), with no setup required.

Just provide a sitemap URL (e.g. `https://example.com/sitemap.xml`).

```
{
  "url": "https://pydantic.dev/docs/",
  "title": "Pydantic Docs - Validation, AI Agents, Logfire Observability",
  "content": "Full extracted article text...",
  "wordCount": 131,
  "metaDescription": "Pydantic documentation...",
  "h1Headings": ["Pydantic Docs"],
  "lastmod": "2025-01-15",
  "extractedAt": "2026-07-04T10:45:00Z"
}
```
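Records in this shape are straightforward to post-process. As a rough sketch, the snippet below flags pages that look thin, stale, or incomplete. The helper `audit_record` and the thresholds `MIN_WORDS` and `STALE_BEFORE` are hypothetical names chosen for illustration; they are not part of the extractor, and the values are assumptions you would tune for your own site.

```python
from datetime import date

# Hypothetical thresholds, not extractor settings; adjust to taste.
MIN_WORDS = 300
STALE_BEFORE = date(2025, 6, 1)


def audit_record(record: dict) -> list[str]:
    """Return a list of issue labels for one extracted page record."""
    issues = []
    if not record.get("metaDescription"):
        issues.append("missing meta description")
    if not record.get("h1Headings"):
        issues.append("missing h1")
    if record.get("wordCount", 0) < MIN_WORDS:
        issues.append("thin content")

    lastmod = record.get("lastmod")
    if lastmod:
        try:
            # Keep only the YYYY-MM-DD part of a longer timestamp.
            if date.fromisoformat(lastmod[:10]) < STALE_BEFORE:
                issues.append("stale lastmod")
        except ValueError:
            # Partial dates such as "2025" are skipped rather than guessed.
            pass
    return issues
```

Run over a full result set, this gives a quick per-URL list of things to look at before digging into individual pages.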

Crawl your entire site and identify pages with problems such as outdated content (old `lastmod` dates).

Extract clean text from documentation sites for fine-tuning LLMs. The trafilatura extraction removes navigation, ads, and boilerplate, leaving only the main content.
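To make the extraction step concrete, here is a minimal sketch of how one page's HTML could be reduced to a record with trafilatura. It is not the extractor's actual code: `extract_main_text` and `build_record` are hypothetical helpers, and the record only covers a subset of the output fields.

```python
from datetime import datetime, timezone


def extract_main_text(html: str) -> str:
    """Strip boilerplate with trafilatura and return the main text."""
    import trafilatura  # third-party: pip install trafilatura

    # extract() returns None when it finds no main content.
    return trafilatura.extract(html, include_comments=False) or ""


def build_record(url, html, lastmod=None, extractor=extract_main_text) -> dict:
    """Assemble a simplified record (hypothetical subset of the fields)."""
    content = extractor(html)
    return {
        "url": url,
        "content": content,
        "wordCount": len(content.split()),
        "lastmod": lastmod,
        "extractedAt": datetime.now(timezone.utc).isoformat(timespec="seconds"),
    }
```

The `extractor` parameter is there so the text-cleaning step can be swapped or stubbed in tests. In practice you would pass in HTML fetched with something like `trafilatura.fetch_url(url)`.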

Inventory a competitor's entire content strategy — how many pages, how much content per page, what topics they cover.

Before migrating a legacy site, extract all content into structured JSON for easy import into a new CMS.

The extractor is built in Python 3.12 and uses trafilatura for main-content extraction.

It handles both `<urlset>` (regular sitemaps) and `<sitemapindex>` (nested sitemaps), following child sitemaps recursively.
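The recursive handling described above can be sketched with the standard library alone. This illustrates the general technique, not the extractor's real implementation: `iter_sitemap_urls` and its `fetch` callback are assumed names, and the gzip check relies on the file's magic bytes rather than the URL suffix.

```python
import gzip
import xml.etree.ElementTree as ET
from typing import Callable, Iterator, Optional

# Namespace defined by the sitemaps.org protocol.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def iter_sitemap_urls(
    sitemap_url: str,
    fetch: Callable[[str], bytes],
    seen: Optional[set] = None,
) -> Iterator[tuple[str, Optional[str]]]:
    """Yield (page_url, lastmod) pairs, following nested sitemaps."""
    seen = set() if seen is None else seen
    if sitemap_url in seen:  # guard against sitemaps that reference each other
        return
    seen.add(sitemap_url)

    data = fetch(sitemap_url)
    if data[:2] == b"\x1f\x8b":  # gzip magic bytes, i.e. a .xml.gz file
        data = gzip.decompress(data)
    root = ET.fromstring(data)

    if root.tag == f"{NS}sitemapindex":
        # A sitemap index lists child sitemaps; recurse into each one.
        for loc in root.iterfind(f"{NS}sitemap/{NS}loc"):
            if loc.text:
                yield from iter_sitemap_urls(loc.text.strip(), fetch, seen)
    elif root.tag == f"{NS}urlset":
        # A regular sitemap lists page URLs with optional lastmod dates.
        for entry in root.iterfind(f"{NS}url"):
            loc = entry.findtext(f"{NS}loc")
            if loc:
                lastmod = entry.findtext(f"{NS}lastmod")
                yield loc.strip(), lastmod.strip() if lastmod else None
```

Passing `fetch` in as a callback keeps the parsing logic independent of the HTTP client. A simple choice would be `lambda u: urllib.request.urlopen(u).read()`, though a real crawler would also want timeouts and retries.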

Try it now on the [Apify Store](https://apify.com/darknezz/sitemap-content-extractor)

No registration needed — just paste a sitemap URL and hit run.

*What would you use a sitemap extractor for? Let me know in the comments!*
