docrawl is a lightweight Node.js CLI for crawling documentation sites and converting them into Markdown with defuddle.
It is built for static and server-rendered docs sites such as Docusaurus, VitePress, MkDocs, GitBook exports, and Obsidian Publish. It does not run a browser and does not execute page JavaScript.
docrawl is useful when you want to:
- turn docs sites into Markdown for LLM context
- build local knowledge bases
- feed content into RAG pipelines
- archive clean docs content without a browser dependency
Requirements:
- Node.js >= 20
Run without installing:
npx docrawl --help
Install globally:
npm install -g docrawl
Then run:
docrawl --help
To work on docrawl from a source checkout, install dependencies:
npm install
Build:
npm run build
Run the CLI from the project workspace:
npm run start -- --help
Run tests:
npm test
docrawl crawl <url> [options]
Examples:
docrawl crawl https://docs.example.com/guide/
docrawl crawl https://docs.example.com/guide/ --max-pages 10 --depth 1 --verbose
docrawl crawl https://docs.example.com/guide/ --single-file --output ./context.md
docrawl crawl https://docs.example.com --domain --max-pages 200
Options:
-o, --output <path>    Output directory or file path
-s, --single-file      Merge all pages into one Markdown file
--domain               Crawl the whole hostname, not just the seed path
--depth <n>            Maximum crawl depth
--max-pages <n>        Maximum pages to process (default: 500)
--concurrency <n>      Concurrent requests (default: 3)
--delay <ms>           Delay between requests per worker (default: 500)
--lang <code>          Preferred language, BCP 47
--no-sitemap           Disable sitemap discovery
--include <glob>       Include URL glob pattern, repeatable
--exclude <glob>       Exclude URL glob pattern, repeatable
--verbose              Verbose progress logging
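
The difference between the default scope (the seed path) and --domain (the whole hostname) can be illustrated with a small TypeScript sketch. This is not docrawl's actual implementation: the function name inScope and the prefix-matching rule for the seed path are assumptions made for illustration.

```typescript
// Hypothetical helper: decide whether a discovered link belongs to the crawl.
// Assumption: without --domain, only URLs under the seed's path prefix are kept.
function inScope(seed: string, candidate: string, wholeDomain: boolean): boolean {
  const seedUrl = new URL(seed);
  // Resolve relative links against the seed URL.
  const url = new URL(candidate, seedUrl);
  if (url.hostname !== seedUrl.hostname) return false;
  if (wholeDomain) return true;
  // Treat the seed path as a directory prefix, e.g. "/guide/".
  const prefix = seedUrl.pathname.endsWith("/")
    ? seedUrl.pathname
    : seedUrl.pathname + "/";
  return url.pathname === seedUrl.pathname || url.pathname.startsWith(prefix);
}
```

Under these assumptions, a seed of https://docs.example.com/guide/ keeps /guide/intro but skips /api/ unless the whole-hostname mode is on.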
docrawl parse <url> [options]
Examples:
docrawl parse https://docs.example.com/guide/intro
docrawl parse https://docs.example.com/guide/intro --json
Options:
-j, --json       Output full JSON response
--lang <code>    Preferred language, BCP 47
By default, docrawl crawl writes one Markdown file per successful page and a manifest.json.
Example layout:
output/
├── getting-started/
│ ├── introduction.md
│ └── quickstart.md
└── manifest.json
Each Markdown file includes frontmatter with fields such as:
- title
- sourceUrl
- finalUrl
- canonicalUrl
- crawledAt
- depth
- wordCount
- contentHash
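
Downstream tools, such as a RAG ingestion step, may want to read these fields back. The TypeScript sketch below is one way to do that, assuming the frontmatter is a leading block between two "---" lines holding flat "key: value" pairs. That layout is an assumption, the helper name readFrontmatter is hypothetical, and the sample page is made up.

```typescript
// Assumption: frontmatter is a leading block between two "---" lines
// containing flat "key: value" pairs. Nested YAML is not handled.
function readFrontmatter(markdown: string): Record<string, string> {
  const lines = markdown.split(/\r?\n/);
  const fields: Record<string, string> = {};
  if (lines[0] !== "---") return fields; // no frontmatter block
  for (let i = 1; i < lines.length; i++) {
    const line = lines[i];
    if (line === "---") break; // end of frontmatter
    const colon = line.indexOf(":");
    if (colon === -1) continue; // skip lines that are not key/value pairs
    const key = line.slice(0, colon).trim();
    // Strip optional surrounding quotes from the value.
    const value = line.slice(colon + 1).trim().replace(/^(["'])(.*)\1$/, "$2");
    fields[key] = value;
  }
  return fields;
}

// Made-up sample page, for illustration only.
const sample = [
  "---",
  'title: "Introduction"',
  "sourceUrl: https://docs.example.com/guide/intro",
  "depth: 1",
  "---",
  "# Introduction",
].join("\n");

const meta = readFrontmatter(sample);
console.log(meta.title, meta.sourceUrl, meta.depth);
```

All values come back as strings, so numeric fields such as depth or wordCount would need converting. For anything beyond flat pairs, a real YAML parser is the safer choice.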
With --single-file, docrawl writes:
- one merged Markdown file
- one adjacent manifest file named like <name>.manifest.json
The merged file includes a table of contents and one section per successful page.
Example:
docrawl crawl https://docs.example.com --single-file --output ./context.md
Produces:
context.md
context.manifest.json
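
The naming pattern in this example can be expressed as a short TypeScript sketch, which may help when a script needs to locate the manifest next to a merged file. The function name manifestPathFor is hypothetical, and the handling of paths without an extension is an assumption; only the context.md case is confirmed by the example above.

```typescript
// Derive "<name>.manifest.json" from the merged output path, e.g.
// "./context.md" -> "./context.manifest.json". Illustrative only.
function manifestPathFor(outputPath: string): string {
  const slash = Math.max(outputPath.lastIndexOf("/"), outputPath.lastIndexOf("\\"));
  const dot = outputPath.lastIndexOf(".");
  // Only treat the dot as an extension separator if it is inside the file
  // name and not its first character (so ".hidden" keeps its full name).
  const stem = dot > slash + 1 ? outputPath.slice(0, dot) : outputPath;
  return `${stem}.manifest.json`;
}
```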
docrawl currently does not handle:
- JavaScript-rendered SPAs that need browser execution
- login-gated or authenticated content
- asset downloads
- robots.txt compliance
- resumable crawls
- incremental recrawls
- full navigation reconstruction