How to Extract Clean Content From Any Website Sitemap (For SEO Audits & AI Training)

A developer built a Sitemap Content Extractor that crawls every page from a sitemap.xml URL and extracts structured content such as title, meta description, headings, and clean text. The tool, available on Apify Store, handles nested sitemaps and uses trafilatura to remove boilerplate, making it useful for SEO audits, AI training data extraction, and content migration.

Ever needed to inventory every page on a website? Extract clean text content for AI training? Or audit meta tags across an entire domain?

I built a Sitemap Content Extractor that does exactly this: feed it a sitemap.xml URL and it crawls every page, extracting structured content. Gzipped .xml.gz sitemaps work out of the box.

You can run it directly on Apify Store (https://apify.com/darknezz/sitemap-content-extractor), no setup required. Just provide a sitemap URL:

https://example.com/sitemap.xml

Each crawled page comes back as a JSON record like this:

{
  "url": "https://pydantic.dev/docs/",
  "title": "Pydantic Docs - Validation, AI Agents, Logfire Observability",
  "content": "Full extracted article text...",
  "wordCount": 131,
  "metaDescription": "Pydantic documentation...",
  "h1Headings": "Pydantic Docs",
  "lastmod": "2025-01-15",
  "extractedAt": "2026-07-04T10:45:00Z"
}

Crawl your entire site and identify pages with: lastmod dates

Extract clean text from documentation sites for fine-tuning LLMs. The trafilatura extraction removes navigation, ads, and boilerplate, leaving only the main content.

Inventory a competitor's entire content strategy: how many pages they publish, how much content each page holds, and what topics they cover.

Before migrating a legacy site, extract all content into structured JSON for easy import into a new CMS.

The extractor is built in Python 3.12 and uses: It handles both
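To make the sitemap handling concrete, here is a minimal sketch of how nested sitemaps and gzipped files could be walked using only the Python standard library. It is not the extractor's actual code: `parse_sitemap`, `iter_page_urls`, and the in-memory `FAKE_SITE` are hypothetical names chosen for illustration, and the `fetch` callable stands in for a real HTTP download. The trafilatura content-extraction step is left out.

```python
import gzip
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
GZIP_MAGIC = b"\x1f\x8b"


def parse_sitemap(raw):
    """Parse sitemap bytes and return (kind, entries).

    kind is "index" for a <sitemapindex> or "urlset" for a <urlset>.
    """
    if raw[:2] == GZIP_MAGIC:  # .xml.gz payloads start with the gzip magic bytes
        raw = gzip.decompress(raw)
    root = ET.fromstring(raw)
    kind = "index" if root.tag.endswith("sitemapindex") else "urlset"
    child = "sm:sitemap" if kind == "index" else "sm:url"
    entries = []
    for node in root.findall(child, NS):
        loc = node.findtext("sm:loc", default="", namespaces=NS).strip()
        lastmod = node.findtext("sm:lastmod", default=None, namespaces=NS)
        if loc:
            entries.append({"loc": loc, "lastmod": lastmod.strip() if lastmod else None})
    return kind, entries


def iter_page_urls(url, fetch, seen=None):
    """Yield page entries, following nested sitemap indexes recursively.

    fetch is any callable mapping a URL to raw bytes (an assumption of this
    sketch); seen guards against sitemaps that reference each other.
    """
    seen = set() if seen is None else seen
    if url in seen:
        return
    seen.add(url)
    kind, entries = parse_sitemap(fetch(url))
    for entry in entries:
        if kind == "index":
            yield from iter_page_urls(entry["loc"], fetch, seen)
        else:
            yield entry


# In-memory stand-in for a website, so the sketch runs without network access.
INDEX = b"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-docs.xml.gz</loc></sitemap>
</sitemapindex>"""

CHILD = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/</loc><lastmod>2025-01-15</lastmod></url>
  <url><loc>https://example.com/docs/install/</loc></url>
</urlset>"""

FAKE_SITE = {
    "https://example.com/sitemap.xml": INDEX,
    "https://example.com/sitemap-docs.xml.gz": gzip.compress(CHILD),
}

pages = list(iter_page_urls("https://example.com/sitemap.xml", FAKE_SITE.__getitem__))
for page in pages:
    print(page["loc"], page["lastmod"])
```

Passing `fetch` in as a parameter keeps the parsing logic testable offline; in a real run it could be replaced by something like `lambda u: urllib.request.urlopen(u).read()`, with whatever timeouts and retries the crawl needs.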