
Show HN: CLI for crawling documentation sites into Markdown with defuddle

Docrawl, a lightweight Node.js CLI tool for crawling documentation sites and converting them into Markdown, has been released. The tool supports static and server-rendered docs platforms like Docusaurus, VitePress, and MkDocs, and is designed to feed content into LLM contexts, RAG pipelines, and local knowledge bases without requiring a browser or JavaScript execution.

2 min read · published Jun 3, 2026

docrawl is a lightweight Node.js CLI for crawling documentation sites and converting them into Markdown with defuddle.

It is built for static and server-rendered docs sites such as Docusaurus, VitePress, MkDocs, GitBook exports, and Obsidian Publish. It does not run a browser and does not execute page JavaScript.

docrawl is useful when you want to:

  • turn docs sites into Markdown for LLM context

  • build local knowledge bases

  • feed content into RAG pipelines

  • archive clean docs content without a browser dependency

Requirements:

  • Node.js >= 20

Run without installing:

npx docrawl --help

Install globally:

npm install -g docrawl

Then run:

docrawl --help

For local development, install dependencies:

npm install

Build:

npm run build

Run the CLI from the project workspace:

npm run start -- --help

Run tests:

npm test

Crawl a documentation site:

docrawl crawl <url> [options]

Examples:

docrawl crawl https://docs.example.com/guide/

docrawl crawl https://docs.example.com/guide/ --max-pages 10 --depth 1 --verbose

docrawl crawl https://docs.example.com/guide/ --single-file --output ./context.md

docrawl crawl https://docs.example.com --domain --max-pages 200
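
When the crawl is launched from a script rather than typed by hand, it can help to assemble the command line programmatically. The following is a minimal Python sketch that builds an argument list using only flags that appear in the examples above. The function name `build_crawl_command` and its parameter names are hypothetical and not part of docrawl.

```python
import shlex
from typing import List, Optional


def build_crawl_command(
    url: str,
    output: Optional[str] = None,
    single_file: bool = False,
    whole_domain: bool = False,
    max_pages: Optional[int] = None,
    depth: Optional[int] = None,
) -> List[str]:
    """Build an argv list for `docrawl crawl` (hypothetical helper)."""
    cmd = ["docrawl", "crawl", url]
    if single_file:
        cmd.append("--single-file")
    if whole_domain:
        cmd.append("--domain")
    if max_pages is not None:
        cmd += ["--max-pages", str(max_pages)]
    if depth is not None:
        cmd += ["--depth", str(depth)]
    if output is not None:
        cmd += ["--output", output]
    return cmd


cmd = build_crawl_command(
    "https://docs.example.com/guide/", output="./context.md", single_file=True
)
# Print a shell-quoted version for logging or copy-paste.
print(shlex.join(cmd))
# To actually run it (requires docrawl on PATH):
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` instead of a single string avoids shell quoting problems when URLs or paths contain special characters.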

Options:

-o, --output <path>       Output directory or file path
-s, --single-file         Merge all pages into one Markdown file
    --domain              Crawl the whole hostname, not just the seed path
    --depth <n>           Maximum crawl depth
    --max-pages <n>       Maximum pages to process (default: 500)
    --concurrency <n>     Concurrent requests (default: 3)
    --delay <ms>          Delay between requests per worker (default: 500)
    --lang <code>         Preferred language, BCP 47
    --no-sitemap          Disable sitemap discovery
    --include <glob>      Include URL glob pattern, repeatable
    --exclude <glob>      Exclude URL glob pattern, repeatable
    --verbose             Verbose progress logging
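
The --concurrency and --delay defaults together put a floor on how long a large crawl takes. As a rough back-of-the-envelope sketch, the Python below estimates the time spent purely waiting between requests. It assumes pages are split evenly across workers and that each worker pauses for the full delay between consecutive requests. That is one reading of the option descriptions, not confirmed docrawl behavior, and `min_delay_seconds` is a hypothetical helper.

```python
import math


def min_delay_seconds(pages: int, concurrency: int = 3, delay_ms: int = 500) -> float:
    """Lower bound on wall-clock time spent waiting between requests.

    Assumes an even split of pages across workers and one full delay
    between consecutive requests; network and parsing time are ignored.
    """
    if pages <= 0:
        return 0.0
    requests_per_worker = math.ceil(pages / concurrency)
    return (requests_per_worker - 1) * delay_ms / 1000


# Default limits: 500 pages, 3 workers, 500 ms delay.
print(min_delay_seconds(500))
# A single polite worker crawling 200 pages.
print(min_delay_seconds(200, concurrency=1))
```

Real crawls take longer than this floor, since fetch and conversion time come on top of the delay.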

Parse a single URL:

docrawl parse <url> [options]

Examples:

docrawl parse https://docs.example.com/guide/intro

docrawl parse https://docs.example.com/guide/intro --json

Options:

-j, --json          Output full JSON response
    --lang <code>   Preferred language, BCP 47

By default, docrawl crawl writes one Markdown file per successful page and a manifest.json.

Example layout:

output/
├── getting-started/
│   ├── introduction.md
│   └── quickstart.md
└── manifest.json
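
A downstream script, for example a RAG ingestion step, can pick up this layout with a few lines of Python. The sketch below loads every Markdown file under the output directory along with manifest.json. The manifest schema is not described here, so it is treated as opaque JSON, and `load_crawl_output` is a hypothetical helper name.

```python
import json
from pathlib import Path
from typing import Any, Dict, Tuple


def load_crawl_output(output_dir: str) -> Tuple[Any, Dict[str, str]]:
    """Return (manifest, pages) for a per-page crawl output directory.

    `pages` maps each Markdown file's path, relative to the output
    directory, to its text. The manifest is returned as parsed JSON
    without interpreting its fields.
    """
    root = Path(output_dir)
    manifest = json.loads((root / "manifest.json").read_text(encoding="utf-8"))
    pages = {
        path.relative_to(root).as_posix(): path.read_text(encoding="utf-8")
        for path in sorted(root.rglob("*.md"))
    }
    return manifest, pages


# Usage, once a crawl has written ./output:
#   manifest, pages = load_crawl_output("output")
#   for rel_path, text in pages.items():
#       print(rel_path, len(text))
```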

Each Markdown file includes frontmatter with fields such as:

  • title
  • sourceUrl
  • finalUrl
  • canonicalUrl
  • crawledAt
  • depth
  • wordCount
  • contentHash
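
To use these fields downstream, the frontmatter has to be separated from the page body. The Python sketch below does this without third-party dependencies. It assumes the common convention of a block delimited by `---` lines holding flat `key: value` pairs, which is an assumption about the output format. The sample values and the helper name `split_frontmatter` are made up for illustration.

```python
from typing import Dict, Tuple


def split_frontmatter(text: str) -> Tuple[Dict[str, str], str]:
    """Split a Markdown document into (frontmatter fields, body).

    Handles only flat `key: value` lines between two `---` delimiters.
    Values are returned as strings; nested YAML is not supported.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    meta: Dict[str, str] = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            body = "\n".join(lines[i + 1:])
            return meta, body.lstrip("\n")
        # Split on the first colon only, so URLs in values stay intact.
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip().strip("\"'")
    return {}, text  # no closing delimiter: treat as plain Markdown


# Hypothetical sample; real files may carry more fields.
sample = """---
title: Introduction
sourceUrl: https://docs.example.com/guide/intro
depth: 1
wordCount: 42
---
# Introduction

Hello.
"""
meta, body = split_frontmatter(sample)
print(meta["title"], meta["sourceUrl"], int(meta["wordCount"]))
```

For anything beyond flat string fields, a full YAML parser is the safer choice.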

With --single-file, docrawl writes:

  • one merged Markdown file
  • one adjacent manifest file named like <name>.manifest.json

The merged file includes a table of contents and one section per successful page.

Example:

docrawl crawl https://docs.example.com --single-file --output ./context.md

Produces:

context.md
context.manifest.json
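
For RAG pipelines, the merged file usually needs to be split back into per-page chunks before embedding. The Python sketch below splits Markdown on headings of a given level while ignoring `#` lines inside fenced code blocks. The heading level that docrawl uses for page sections is not stated here, so `level` is a parameter you would need to check against real output, and `split_sections` is a hypothetical helper.

```python
from typing import List, Tuple


def split_sections(markdown: str, level: int = 1) -> List[Tuple[str, str]]:
    """Split Markdown into (heading, body) pairs at the given heading level.

    Text before the first matching heading is dropped. Lines inside
    fenced code blocks are never treated as headings.
    """
    prefix = "#" * level + " "
    sections: List[Tuple[str, List[str]]] = []
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        if not in_fence and line.startswith(prefix):
            sections.append((line[len(prefix):].strip(), []))
        elif sections:
            sections[-1][1].append(line)
    return [(title, "\n".join(body).strip()) for title, body in sections]


# Usage, once a crawl has written ./context.md:
#   with open("context.md", encoding="utf-8") as fh:
#       for title, body in split_sections(fh.read(), level=1):
#           print(title, len(body.split()))
```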

docrawl currently does not handle:

  • JavaScript-rendered SPAs that need browser execution
  • login-gated or authenticated content
  • asset downloading
  • robots.txt compliance
  • resumable crawls

  • incremental recrawls
  • full navigation reconstruction