How to Extract Clean Content From Any Website Sitemap (For SEO Audits & AI Training)

[ARTICLE · art-47837] src=dev.to ↗ pub=2026-07-04T10:50Z topic=developer-tools verified=true sentiment=↑ positive

A developer built a Sitemap Content Extractor that crawls every page from a sitemap.xml URL and extracts structured content such as title, meta description, headings, and clean text. The tool, available on Apify Store, handles nested sitemaps and uses trafilatura to remove boilerplate, making it useful for SEO audits, AI training data extraction, and content migration.

Published Jul 4, 2026 · 1 min read

Ever needed to inventory every page on a website? Extract clean text content for AI training? Or audit meta tags across an entire domain?

I built a Sitemap Content Extractor that does exactly this: feed it a sitemap.xml URL and it crawls every page, extracting structured content.
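To give a feel for what "structured content" means per page, here is a minimal sketch that pulls the title, meta description, and h1 headings out of raw HTML. It uses only the standard library's `html.parser` and is not the extractor's actual code; the names `PageMetaParser` and `extract_page_meta` are hypothetical.

```python
from html.parser import HTMLParser


class PageMetaParser(HTMLParser):
    """Collects <title>, the meta description, and <h1> text from one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta_description = ""
        self.h1_headings = []
        self._current = None  # tag whose text is currently being captured
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("title", "h1"):
            self._current = tag
            self._buffer = []
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.meta_description = (attrs.get("content") or "").strip()

    def handle_data(self, data):
        if self._current:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._current:
            return
        # Collapse whitespace so multi-line headings become one clean string.
        text = " ".join("".join(self._buffer).split())
        if tag == "title":
            self.title = text
        elif text:
            self.h1_headings.append(text)
        self._current = None


def extract_page_meta(html: str) -> dict:
    """Return a dict using the same field names as the extractor's output."""
    parser = PageMetaParser()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title,
        "metaDescription": parser.meta_description,
        "h1Headings": parser.h1_headings,
    }
```

Real-world pages are messier than this handles (multiple titles, Open Graph descriptions, malformed markup), so treat it as an illustration of the idea rather than a replacement for the tool.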

Gzipped sitemaps (.xml.gz files) work out of the box.

You can run it directly on Apify Store, no setup required.

Just provide a sitemap URL (for example, https://example.com/sitemap.xml).

{
  "url": "https://pydantic.dev/docs/",
  "title": "Pydantic Docs - Validation, AI Agents, Logfire Observability",
  "content": "Full extracted article text...",
  "wordCount": 131,
  "metaDescription": "Pydantic documentation...",
  "h1Headings": ["Pydantic Docs"],
  "lastmod": "2025-01-15",
  "extractedAt": "2026-07-04T10:45:00Z"
}
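Records shaped like the sample above are easy to post-process. As a minimal sketch, the function below flags a few common SEO problems in one record. The checks and the thresholds (`MIN_WORDS`, `STALE_BEFORE`) are assumptions picked for illustration, not something the extractor defines; only the field names come from the sample output.

```python
from datetime import date

# Hypothetical thresholds; tune them for your own site.
MIN_WORDS = 300
STALE_BEFORE = date(2025, 6, 1)


def audit_page(record: dict) -> list[str]:
    """Return a list of issue labels for one extracted page record."""
    issues = []
    if not record.get("metaDescription"):
        issues.append("missing meta description")
    if not record.get("h1Headings"):
        issues.append("missing h1")
    if record.get("wordCount", 0) < MIN_WORDS:
        issues.append("thin content")
    lastmod = record.get("lastmod")
    # lastmod may be a plain date or a full timestamp; compare the date part.
    if lastmod and date.fromisoformat(lastmod[:10]) < STALE_BEFORE:
        issues.append("stale lastmod")
    return issues
```

Run over every record in a dataset export, this gives a per-URL list of issues that can be dumped straight into a spreadsheet.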

Crawl your entire site and identify pages with problems such as stale content (based on lastmod dates).

Extract clean text from documentation sites for fine-tuning LLMs. The trafilatura extraction removes navigation, ads, and boilerplate, leaving only the main content.

Inventory a competitor's entire content strategy — how many pages, how much content per page, what topics they cover.

Before migrating a legacy site, extract all content into structured JSON for easy import into a new CMS.

The extractor is built in Python 3.12 and uses trafilatura to strip boilerplate from each page.

It handles both <urlset> (regular sitemaps) and <sitemapindex> (nested sitemaps), following child sitemaps recursively.
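The recursive handling of the two sitemap types can be sketched roughly as follows. This is a standard-library illustration under my own assumptions, not the extractor's source: `fetch_bytes`, `parse_sitemap`, and `collect_urls` are hypothetical names, and gzipped sitemaps are detected by their magic bytes rather than by file extension.

```python
import gzip
import xml.etree.ElementTree as ET
from urllib.request import urlopen

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def fetch_bytes(url: str) -> bytes:
    """Download a sitemap; swap in your own HTTP client as needed."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def parse_sitemap(data: bytes):
    """Return (kind, entries); kind is 'urlset' or 'sitemapindex',
    entries are (loc, lastmod) tuples with lastmod possibly None."""
    if data[:2] == b"\x1f\x8b":  # gzip magic number, covers .xml.gz files
        data = gzip.decompress(data)
    root = ET.fromstring(data)
    kind = root.tag.removeprefix(SITEMAP_NS)
    entries = []
    for node in root:
        loc = node.findtext(f"{SITEMAP_NS}loc")
        if loc:
            entries.append((loc.strip(), node.findtext(f"{SITEMAP_NS}lastmod")))
    return kind, entries


def collect_urls(sitemap_url, fetch=fetch_bytes, seen=None):
    """Walk a sitemap, recursing into child sitemaps of a <sitemapindex>."""
    seen = set() if seen is None else seen
    if sitemap_url in seen:  # guard against sitemaps that reference each other
        return []
    seen.add(sitemap_url)
    kind, entries = parse_sitemap(fetch(sitemap_url))
    if kind == "sitemapindex":
        pages = []
        for loc, _lastmod in entries:
            pages.extend(collect_urls(loc, fetch, seen))
        return pages
    return entries
```

Passing `fetch` as a parameter keeps the traversal testable without network access: in tests it can be replaced with a lookup into a dict of canned sitemap bytes.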

Try it now on the Apify Store

No registration needed — just paste a sitemap URL and hit run.

What would you use a sitemap extractor for? Let me know in the comments!


Read original on dev.to → dev.to/oaida_adrian_afa2428f63d0/how-to-extract-…
