{"slug": "how-to-extract-clean-content-from-any-website-sitemap-for-seo-audits-ai-training", "title": "How to Extract Clean Content From Any Website Sitemap (For SEO Audits & AI Training)", "summary": "A developer built a Sitemap Content Extractor that crawls every page from a sitemap.xml URL and extracts structured content such as title, meta description, headings, and clean text. The tool, available on Apify Store, handles nested sitemaps and uses trafilatura to remove boilerplate, making it useful for SEO audits, AI training data extraction, and content migration.", "body_md": "Ever needed to inventory every page on a website? Extract clean text content for AI training? Or audit meta tags across an entire domain?\n\nI built a **Sitemap Content Extractor** that does exactly this: feed it a `sitemap.xml` URL and it crawls every page, extracting structured content.\n\n`.xml.gz` files work out of the box.\n\nYou can run it directly on [Apify Store](https://apify.com/darknezz/sitemap-content-extractor), no setup required.\n\nJust provide a sitemap URL (for example, `https://example.com/sitemap.xml`).\n\n```\n{\n  \"url\": \"https://pydantic.dev/docs/\",\n  \"title\": \"Pydantic Docs - Validation, AI Agents, Logfire Observability\",\n  \"content\": \"Full extracted article text...\",\n  \"wordCount\": 131,\n  \"metaDescription\": \"Pydantic documentation...\",\n  \"h1Headings\": [\"Pydantic Docs\"],\n  \"lastmod\": \"2025-01-15\",\n  \"extractedAt\": \"2026-07-04T10:45:00Z\"\n}\n```\n\nCrawl your entire site and identify pages with:\n\n`lastmod` dates)\n\nExtract clean text from documentation sites for fine-tuning LLMs. The trafilatura extraction removes navigation, ads, and boilerplate, leaving only the main content.\n\nInventory a competitor's entire content strategy: how many pages, how much content per page, what topics they cover.\n\nBefore migrating a legacy site, extract all content into structured JSON for easy import into a new CMS.\n\nThe extractor is built in Python 3.12 and uses:\n\nIt handles both `<urlset>` (regular sitemaps) and `<sitemapindex>` (nested sitemaps), following child sitemaps recursively.\n\nTry it now on the [Apify Store](https://apify.com/darknezz/sitemap-content-extractor).\n\nNo registration needed; just paste a sitemap URL and hit run.\n\n*What would you use a sitemap extractor for? Let me know in the comments!*", "url": "https://wpnews.pro/news/how-to-extract-clean-content-from-any-website-sitemap-for-seo-audits-ai-training", "canonical_source": "https://dev.to/oaida_adrian_afa2428f63d0/how-to-extract-clean-content-from-any-website-sitemap-for-seo-audits-ai-training-15a9", "published_at": "2026-07-04 10:50:51+00:00", "updated_at": "2026-07-04 11:19:08.813188+00:00", "lang": "en", "topics": ["developer-tools", "artificial-intelligence", "large-language-models"], "entities": ["Apify", "trafilatura", "Pydantic", "Python"], "alternates": {"html": "https://wpnews.pro/news/how-to-extract-clean-content-from-any-website-sitemap-for-seo-audits-ai-training", "markdown": "https://wpnews.pro/news/how-to-extract-clean-content-from-any-website-sitemap-for-seo-audits-ai-training.md", "text": "https://wpnews.pro/news/how-to-extract-clean-content-from-any-website-sitemap-for-seo-audits-ai-training.txt", "jsonld": "https://wpnews.pro/news/how-to-extract-clean-content-from-any-website-sitemap-for-seo-audits-ai-training.jsonld"}}