{"slug": "show-hn-cli-for-crawling-documentation-sites-into-markdown-with-defuddle", "title": "Show HN: CLI for crawling documentation sites into Markdown with defuddle", "summary": "Docrawl, a lightweight Node.js CLI tool for crawling documentation sites and converting them into Markdown, has been released. The tool supports static and server-rendered docs platforms like Docusaurus, VitePress, and MkDocs, and is designed to feed content into LLM contexts, RAG pipelines, and local knowledge bases without requiring a browser or JavaScript execution.", "body_md": "`docrawl` is a lightweight Node.js CLI for crawling documentation sites and converting them into Markdown with [defuddle](https://github.com/kepano/defuddle).\n\nIt is built for static and server-rendered docs sites such as Docusaurus, VitePress, MkDocs, GitBook exports, and Obsidian Publish. It does not run a browser and does not execute page JavaScript.\n\n`docrawl` is useful when you want to:\n\n- turn docs sites into Markdown for LLM context\n- build local knowledge bases\n- feed content into RAG pipelines\n- archive clean docs content without a browser dependency\n\n- Node.js `>= 20`\n\nRun without installing:\n\n```\nnpx docrawl --help\n```\n\nInstall globally:\n\n```\nnpm install -g docrawl\n```\n\nThen run:\n\n```\ndocrawl --help\n```\n\nInstall dependencies:\n\n```\nnpm install\n```\n\nBuild:\n\n```\nnpm run build\n```\n\nRun the CLI from the project workspace:\n\n```\nnpm run start -- --help\n```\n\nRun tests:\n\n```\nnpm test\n```\n\nCrawl a documentation site:\n\n```\ndocrawl crawl <url> [options]\n```\n\nExamples:\n\n```\n# Crawl a docs section into ./output\ndocrawl crawl https://docs.example.com/guide/\n\n# Run a smaller smoke test first\ndocrawl crawl https://docs.example.com/guide/ --max-pages 10 --depth 1 --verbose\n\n# Merge everything into one file\ndocrawl crawl https://docs.example.com/guide/ --single-file --output ./context.md\n\n# Crawl the full hostname, not only the seed path subtree\ndocrawl crawl https://docs.example.com --domain --max-pages 
200\n```\n\nOptions:\n\n```\n-o, --output <path>    Output directory or file path\n-s, --single-file      Merge all pages into one Markdown file\n    --domain           Crawl the whole hostname, not just the seed path\n    --depth <n>        Maximum crawl depth\n    --max-pages <n>    Maximum pages to process (default: 500)\n    --concurrency <n>  Concurrent requests (default: 3)\n    --delay <ms>       Delay between requests per worker (default: 500)\n    --lang <code>      Preferred language, BCP 47\n    --no-sitemap       Disable sitemap discovery\n    --include <glob>   Include URL glob pattern, repeatable\n    --exclude <glob>   Exclude URL glob pattern, repeatable\n    --verbose          Verbose progress logging\n```\n\nParse a single page:\n\n```\ndocrawl parse <url> [options]\n```\n\nExamples:\n\n```\n# Parse one page as Markdown\ndocrawl parse https://docs.example.com/guide/intro\n\n# Parse one page as JSON\ndocrawl parse https://docs.example.com/guide/intro --json\n```\n\nOptions:\n\n```\n-j, --json         Output full JSON response\n    --lang <code>  Preferred language, BCP 47\n```\n\nBy default, `docrawl crawl` writes one Markdown file per successful page and a `manifest.json`.\n\nExample layout:\n\n```\noutput/\n├── getting-started/\n│   ├── introduction.md\n│   └── quickstart.md\n└── manifest.json\n```\n\nEach Markdown file includes frontmatter with fields such as:\n\n- `title`\n- `sourceUrl`\n- `finalUrl`\n- `canonicalUrl`\n- `crawledAt`\n- `depth`\n- `wordCount`\n- `contentHash`\n\nWith `--single-file`, `docrawl` writes:\n\n- one merged Markdown file\n- one adjacent manifest file named like `<name>.manifest.json`\n\nThe merged file includes a table of contents and one section per successful page.\n\nExample:\n\n```\ndocrawl crawl https://docs.example.com --single-file --output ./context.md\n```\n\nProduces:\n\n```\ncontext.md\ncontext.manifest.json\n```\n\n`docrawl` currently does not handle:\n\n- JavaScript-rendered SPAs that need browser execution\n- login-gated or authenticated content\n- 
asset downloading\n- `robots.txt` compliance\n- resumable crawls\n- incremental recrawls\n- full navigation reconstruction", "url": "https://wpnews.pro/news/show-hn-cli-for-crawling-documentation-sites-into-markdown-with-defuddle", "canonical_source": "https://github.com/artemnistuley/docrawl", "published_at": "2026-06-03 20:08:28+00:00", "updated_at": "2026-06-03 20:49:36.676660+00:00", "lang": "en", "topics": ["ai-tools", "ai-infrastructure"], "entities": ["docrawl", "defuddle", "Node.js", "Docusaurus", "VitePress", "MkDocs", "GitBook", "Obsidian Publish"], "alternates": {"html": "https://wpnews.pro/news/show-hn-cli-for-crawling-documentation-sites-into-markdown-with-defuddle", "markdown": "https://wpnews.pro/news/show-hn-cli-for-crawling-documentation-sites-into-markdown-with-defuddle.md", "text": "https://wpnews.pro/news/show-hn-cli-for-crawling-documentation-sites-into-markdown-with-defuddle.txt", "jsonld": "https://wpnews.pro/news/show-hn-cli-for-crawling-documentation-sites-into-markdown-with-defuddle.jsonld"}}