PageToMD – A CLI tool to turn web pages into clean Markdown for AI agents

PageToMD, a new CLI tool, converts web pages into clean Markdown optimized for AI agents, featuring NFC-normalized UTF-8 output, YAML frontmatter with metadata, and optional JavaScript rendering for SPAs. The tool supports pipx, uv, and pip installation, offers deterministic output for RAG ingestion, and includes a crawl mode for bulk conversion of linked sub-pages.

Convert any webpage URL into clean, LLM-ready Markdown with frontmatter.

AI-ready by default. Output is NFC-normalised UTF-8 with a single H1, a monotonic heading hierarchy, no zero-width junk, and no tracking chrome, so it drops straight into a vector store or LLM prompt.

Full-fidelity metadata. Every file ships with a YAML frontmatter block containing canonical URL, final URL after redirects, title, author, date, description, site name, language, and tool identity. No more "where did this Markdown come from?".

Static fast, JS-capable when needed. The default `httpx` fetcher is sub-second; the opt-in `playwright` extra or `--fetcher auto` handles SPA shells without bloating the install for everyone else.

Stable, scriptable CLI. Typer-built, with full env-var precedence (`PAGETOMD`), stable exit codes (`0` / `2` / `3` / `4` / `5` / `64` / `130`), structured logs (`--log-json`), and a `--no-fetched-at` switch for byte-deterministic output.

Not `pandoc` or `curl` + `sed`. `pandoc` doesn't fetch, doesn't strip boilerplate, and doesn't emit frontmatter. Hand-rolled `curl | html2md` pipelines re-invent extraction, mojibake handling, `robots.txt`, redirect caps, and atomic writes. `pagetomd` is one command for the whole pipeline.
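To make the "AI-ready" properties above more concrete, here is a minimal Python sketch of what NFC normalisation, zero-width stripping, and a heading-hierarchy check can look like. It is not pagetomd's implementation: the function names `clean_text` and `headings_are_monotonic`, the set of stripped code points, and the reading of "monotonic" as "exactly one H1 and no heading that jumps more than one level deeper" are all assumptions made for illustration.

```python
import unicodedata

# Invisible code points often left behind by web pages.
# Illustrative assumption only; pagetomd's actual list may differ.
ZERO_WIDTH = {"\u200b", "\u2060", "\ufeff"}


def clean_text(text: str) -> str:
    """Normalise to NFC, then drop the zero-width characters listed above."""
    normalised = unicodedata.normalize("NFC", text)
    return "".join(ch for ch in normalised if ch not in ZERO_WIDTH)


def headings_are_monotonic(levels: list[int]) -> bool:
    """Return True if there is exactly one H1 and no level is skipped going deeper.

    `levels` is the sequence of heading levels in document order,
    e.g. [1, 2, 3, 2] for H1, H2, H3, H2.
    """
    if levels.count(1) != 1:
        return False
    # Going back up any number of levels is fine; going down by more than one is not.
    return all(nxt - cur <= 1 for cur, nxt in zip(levels, levels[1:]))
```

A check like this can run over converted files before they are indexed, for example to confirm that a decomposed "e" plus combining accent has been folded into a single precomposed character.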
    pipx install pagetomd
    # optional: enable JS rendering for SPAs
    pipx inject pagetomd playwright && playwright install chromium

    uv tool install pagetomd
    # optional: enable JS rendering for SPAs
    uv tool install 'pagetomd[playwright]' && playwright install chromium

    # Core: no install required
    uv run --with pagetomd pagetomd https://example.com
    # With Playwright for SPA / JS-heavy pages (install Chromium once first)
    uv run --with playwright playwright install chromium
    uv run --with 'pagetomd[playwright]' pagetomd https://example.com --fetcher auto

    pip install pagetomd                 # core
    pip install 'pagetomd[playwright]'   # + SPA support

Basic usage:

    # Default: derives output filename from the page title
    pagetomd https://example.com/blog/post

    # Stream to stdout (pipe into LLMs, etc.)
    pagetomd https://example.com/blog/post -o -

    # Deterministic output (omits the fetched-at value), good for snapshot tests / RAG ingestion
    pagetomd https://example.com/blog/post --no-fetched-at -o post.md

    # Auto-detect SPA pages and fall back to headless Chromium
    pagetomd https://my-spa.example.com -o - --fetcher auto

`-o -` writes the Markdown to stdout. All logs go to stderr, so the stream is safe to pipe:

    pagetomd https://example.com/blog/post -o - | llm "summarise this article in five bullet points"

To convert a list of URLs, loop over a file with one URL per line:

    while read -r url; do
      pagetomd "$url"
    done < urls.txt

Each successful conversion exits `0`. A non-zero exit does not stop the loop, but the failure is visible on stderr (see Exit codes below).

Use `--crawl` to discover every linked sub-page under a seed URL and write one `.md` file per page into an output directory:

    pagetomd "https://docs.example.com/guide/" \
      --crawl --crawl-depth 2 \
      --fetcher auto --no-respect-robots \
      -o ./docs-output/

Scope: The seed is treated as the root of its own subtree. Only links whose URL lives under the seed are followed; siblings, parents, and external sites are skipped.
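As a rough illustration of the scope rule, the following Python sketch tests whether a link lives under a seed URL. The helper `in_scope` is hypothetical and is not taken from pagetomd; it assumes that scope is decided by scheme, host, and a path prefix, and that a seed without a trailing slash is scoped to its own path plus `/`.

```python
from urllib.parse import urlsplit


def in_scope(seed: str, link: str) -> bool:
    """Return True if `link` is the seed itself or lives under the seed's subtree."""
    seed_parts, link_parts = urlsplit(seed), urlsplit(link)

    # External sites (different scheme or host) are always out of scope.
    if (seed_parts.scheme, seed_parts.netloc) != (link_parts.scheme, link_parts.netloc):
        return False

    # Assumption: the in-scope prefix is the seed path with a trailing slash.
    prefix = seed_parts.path if seed_parts.path.endswith("/") else seed_parts.path + "/"
    return link_parts.path == seed_parts.path or link_parts.path.startswith(prefix)
```

Under these assumptions, a seed of `https://docs.example.com/guide/` accepts `https://docs.example.com/guide/intro` but rejects `https://docs.example.com/api/` and anything on another host.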
For a seed of `https://docs.example.com/guide/intro` the in-scope prefix is `https://docs.example.com/guide/intro/`; pass a trailing slash on the seed or use a "directory" URL like `/guide/` to scope the crawl one level higher.

Output structure: The on-disk layout mirrors the URL hierarchy under the seed, so two pages with the same final URL segment under different parents do not collide:

| URL | Output file (relative to `-o`) |
|---|---|
| The seed itself | `index.md` |
| `…/guide/intro` | `intro.md` |
| `…/guide/intro/` | `intro/index.md` |
| `…/guide/concepts/alerts` | `concepts/alerts.md` |
| `…/guide/concepts/alerts/` | `concepts/alerts/index.md` |

Each path segment is slugified independently, and Windows-reserved device names (`CON`, `PRN`, …) are escaped per segment.

Options:

- `--crawl-depth N`: BFS hop limit from the seed (default: `1`). `--crawl-depth 10` against a site that naturally ends at depth 3 simply stops when the queue empties; nothing is wasted.
- `--overwrite`: replace existing `.md` files (default: skip). At the end of a crawl, three lists are printed to stderr: pages skipped because the file already exists, pages where no content could be extracted (auth walls, thin nav stubs), and pages that failed with a fetch or conversion error, so you can handle each category appropriately.
- All other flags (`--fetcher`, `--no-verify-ssl`, `--user-agent`, `--retries`, …) apply to every page in the crawl. `--retries` honours `Retry-After` headers on 429/503 responses (capped at 5 minutes per attempt). A single fetcher context is reused across the whole crawl, so browser backends do not relaunch Chromium per page.

`pagetomd` has four ways to turn URLs into Markdown. Pick the one that matches your situation:

| I want to… | Use | Why |
|---|---|---|
| Convert a single static page (blog, docs, article) | `pagetomd URL` | Default `httpx` fetcher: fast, no extra deps. |
| Convert a page that needs JavaScript to render (React, Vue, Angular, Next.js) | `pagetomd URL --fetcher playwright` | Launches headless Chromium so the SPA actually renders. |
| Convert a page and I'm not sure if it needs JS | `pagetomd URL --fetcher auto` | Tries `httpx` first; falls back to Playwright if the page looks like an empty SPA shell or extraction comes back empty. |
| Crawl an entire site section into a folder of `.md` files | `pagetomd URL --crawl -o dir/` | BFS-walks every same-subtree link and writes one file per page. Combine with `--fetcher auto` if some pages are JS-rendered. |

`httpx` (default): A plain HTTP GET. Sub-second for most pages, handles retries with exponential backoff, honours `Retry-After` on 429/503, enforces `robots.txt`, and follows