Show HN: CLI for crawling documentation sites into Markdown with defuddle

wpnews.pro


[ARTICLE · art-20812] src=github.com ↗ pub=2026-06-03T20:08Z topic=ai-tools verified=true sentiment=↑ positive


Docrawl, a lightweight Node.js CLI tool for crawling documentation sites and converting them into Markdown, has been released. The tool supports static and server-rendered docs platforms like Docusaurus, VitePress, and MkDocs, and is designed to feed content into LLM contexts, RAG pipelines, and local knowledge bases without requiring a browser or JavaScript execution.

Published Jun 3, 2026 · 2 min read

docrawl is a lightweight Node.js CLI for crawling documentation sites and converting them into Markdown with defuddle.

It is built for static and server-rendered docs sites such as Docusaurus, VitePress, MkDocs, GitBook exports, and Obsidian Publish. It does not run a browser and does not execute page JavaScript.

docrawl is useful when you want to:

turn docs sites into Markdown for LLM context
build local knowledge bases
feed content into RAG pipelines
archive clean docs content without a browser dependency

Requires Node.js >= 20.

Run without installing:

npx docrawl --help

Install globally:

npm install -g docrawl

Then run:

docrawl --help

To work from a source checkout, install dependencies:

npm install

Build:

npm run build

Run the CLI from the project workspace:

npm run start -- --help

Run tests:

npm test

Crawl a documentation site:

docrawl crawl <url> [options]

Examples:

docrawl crawl https://docs.example.com/guide/

docrawl crawl https://docs.example.com/guide/ --max-pages 10 --depth 1 --verbose

docrawl crawl https://docs.example.com/guide/ --single-file --output ./context.md

docrawl crawl https://docs.example.com --domain --max-pages 200
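
The difference between the default path-scoped crawl and a `--domain` crawl can be pictured as a URL filter. The sketch below is an illustration only, not docrawl's implementation: `in_scope` is a hypothetical helper, and treating the seed path as a plain prefix is an assumption.

```python
from urllib.parse import urlsplit


def in_scope(seed: str, candidate: str, whole_domain: bool = False) -> bool:
    """Return True if `candidate` falls inside the crawl scope of `seed`.

    Assumed rule (not confirmed by the source): the hostname must match,
    and unless whole_domain is set the path must sit under the seed path.
    """
    s, c = urlsplit(seed), urlsplit(candidate)
    if s.hostname != c.hostname:
        return False  # never leave the seed's hostname
    if whole_domain:
        return True  # analogous to passing --domain
    # Normalise the seed path so "/guide" and "/guide/" behave the same.
    seed_path = s.path if s.path.endswith("/") else s.path + "/"
    return c.path == seed_path.rstrip("/") or c.path.startswith(seed_path)


seed = "https://docs.example.com/guide/"
print(in_scope(seed, "https://docs.example.com/guide/intro"))
print(in_scope(seed, "https://docs.example.com/api/ref"))
print(in_scope(seed, "https://docs.example.com/api/ref", whole_domain=True))
```

Under this simplified rule, a sibling path such as `/guidelines` stays out of scope because the comparison is made against `/guide/` with its trailing slash.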

Options:

-o, --output <path>    Output directory or file path
-s, --single-file      Merge all pages into one Markdown file
    --domain           Crawl the whole hostname, not just the seed path
    --depth <n>        Maximum crawl depth
    --max-pages <n>    Maximum pages to process (default: 500)
    --concurrency <n>  Concurrent requests (default: 3)
    --delay <ms>       Delay between requests per worker (default: 500)
    --lang <code>      Preferred language, BCP 47
    --no-sitemap       Disable sitemap discovery
    --include <glob>   Include URL glob pattern, repeatable
    --exclude <glob>   Exclude URL glob pattern, repeatable
    --verbose          Verbose progress logging
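
For a rough feel of how `--max-pages`, `--concurrency`, and `--delay` interact, the arithmetic below estimates a lower bound on crawl time. It assumes each worker waits the full delay between its own consecutive requests and ignores network and parsing time, so a real crawl would take longer. The function name `min_crawl_seconds` is hypothetical.

```python
import math


def min_crawl_seconds(pages: int, concurrency: int = 3, delay_ms: int = 500) -> float:
    """Lower bound on time spent in per-worker delays, in seconds.

    Defaults mirror the documented CLI defaults (concurrency 3, delay 500 ms).
    """
    # Pages are assumed to be spread evenly across the workers.
    requests_per_worker = math.ceil(pages / concurrency)
    # n consecutive requests on one worker are separated by n - 1 delays.
    waits = max(requests_per_worker - 1, 0)
    return waits * delay_ms / 1000


print(min_crawl_seconds(500))  # default --max-pages
print(min_crawl_seconds(10))   # a small test crawl
```

Under these assumptions, the default limit of 500 pages implies at least 83 seconds of deliberate waiting, while a 10-page test crawl spends about 1.5 seconds waiting.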

Parse a single page:

docrawl parse <url> [options]

Examples:

docrawl parse https://docs.example.com/guide/intro

docrawl parse https://docs.example.com/guide/intro --json

Options:

-j, --json         Output full JSON response
    --lang <code>  Preferred language, BCP 47

By default, docrawl crawl writes one Markdown file per successful page and a manifest.json file.

Example layout:

output/
├── getting-started/
│   ├── introduction.md
│   └── quickstart.md
└── manifest.json

Each Markdown file includes frontmatter with fields such as:

title
sourceUrl
finalUrl
canonicalUrl
crawledAt
depth
wordCount
contentHash
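
When feeding crawled pages into a pipeline, the frontmatter usually needs to be separated from the body. The sketch below assumes the frontmatter is a block of simple `key: value` lines between two `---` delimiters, which is the common Markdown convention but is not confirmed by the source. `split_frontmatter` and the sample values are hypothetical.

```python
def split_frontmatter(text: str):
    """Split a Markdown document into (fields, body).

    Handles only flat `key: value` pairs; nested YAML is out of scope.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no frontmatter at all
    fields = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            body = "\n".join(lines[i + 1:])
            return fields, body.lstrip("\n")
        # Split on the first colon only, so URLs in values stay intact.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip().strip('"')
    return {}, text  # no closing delimiter: treat as plain Markdown


sample = """---
title: Introduction
sourceUrl: https://docs.example.com/guide/intro
depth: 1
wordCount: 42
---
# Introduction

Hello.
"""

fields, body = split_frontmatter(sample)
print(fields["title"], fields["sourceUrl"])
```

All values come back as strings here; numeric fields such as depth and wordCount would need an explicit `int()` conversion.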

With --single-file, docrawl writes:

one merged Markdown file
one adjacent manifest file named like <name>.manifest.json

The merged file includes a table of contents and one section per successful page.

Example:

docrawl crawl https://docs.example.com --single-file --output ./context.md

Produces:

context.md
context.manifest.json
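
To make the merged layout concrete, here is a small sketch that assembles a table of contents followed by one section per page. The heading levels, the anchor format, and the helpers `slugify` and `merge_pages` are assumptions for illustration; the file docrawl actually writes may be structured differently.

```python
import re


def slugify(title: str) -> str:
    """Lowercase the title and collapse non-alphanumeric runs into hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def merge_pages(pages) -> str:
    """Build one Markdown document from (title, body) pairs."""
    toc = ["# Table of contents", ""]
    sections = []
    for title, body in pages:
        # One TOC entry and one section per page, in crawl order.
        toc.append(f"- [{title}](#{slugify(title)})")
        sections.append(f"## {title}\n\n{body.strip()}\n")
    return "\n".join(toc) + "\n\n" + "\n".join(sections)


merged = merge_pages([
    ("Introduction", "Welcome to the guide."),
    ("Quick Start", "Install, then run."),
])
print(merged)
```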

docrawl currently does not handle:

JavaScript-rendered SPAs that need browser execution
login-gated or authenticated content
asset downloading
robots.txt compliance
resumable crawls
incremental recrawls
full navigation reconstruction
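
Since robots.txt compliance is listed as unhandled, one option is to check a site's rules yourself before crawling. The sketch below uses Python's standard `urllib.robotparser` on robots.txt text that you have already fetched. The `is_allowed` helper and the sample rules are hypothetical, and this is a manual pre-check, not a docrawl feature.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits fetching `url`."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


rules = """User-agent: *
Disallow: /private/
"""

print(is_allowed(rules, "https://docs.example.com/guide/"))
print(is_allowed(rules, "https://docs.example.com/private/page"))
```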


Read original on github.com → github.com/artemnistuley/docrawl
