
PageToMD – A CLI tool to turn web pages into clean Markdown for AI agents

PageToMD, a new CLI tool, converts web pages into clean Markdown optimized for AI agents, featuring NFC-normalized UTF-8 output, YAML frontmatter with metadata, and optional JavaScript rendering for SPAs. The tool supports pipx, uv, and pip installation, offers deterministic output for RAG ingestion, and includes a crawl mode for bulk conversion of linked sub-pages.

Published Jun 19, 2026 · 10 min read

Convert any webpage URL into clean, LLM-ready Markdown with frontmatter.

- AI-ready by default. Output is NFC-normalised UTF-8 with a single H1, a monotonic heading hierarchy, no zero-width junk, and no tracking chrome, so it drops straight into a vector store or LLM prompt.
- Full-fidelity metadata. Every file ships with a YAML frontmatter block containing canonical URL, final URL after redirects, title, author, date, description, site name, language, and tool identity. No more "where did this Markdown come from?".
- Static fast, JS-capable when needed. The default httpx fetcher is sub-second; the opt-in playwright extra (or --fetcher auto) handles SPA shells without bloating the install for everyone else.
- Stable, scriptable CLI. Typer-built, with full env-var precedence (PAGETOMD_*), stable exit codes (0/2/3/4/5/64/130), structured logs (--log-json), and a --no-fetched-at switch for byte-deterministic output.
- Not pandoc or curl + sed. pandoc doesn't fetch, doesn't strip boilerplate, and doesn't emit frontmatter. Hand-rolled curl | html2md pipelines re-invent extraction, mojibake handling, robots.txt, redirect caps, and atomic writes. pagetomd is one command for the whole pipeline.

With pipx:

pipx install pagetomd
pipx inject pagetomd playwright && playwright install chromium

With uv, as an installed tool:

uv tool install pagetomd
uv tool install 'pagetomd[playwright]' && playwright install chromium

With uv, as a one-off run:

uv run --with pagetomd pagetomd https://example.com
uv run --with playwright playwright install chromium
uv run --with 'pagetomd[playwright]' pagetomd https://example.com --fetcher auto

With pip:

pip install pagetomd                 # core
pip install 'pagetomd[playwright]'   # + SPA support

Convert a page to a file (the output path is derived from the title):

pagetomd https://example.com/blog/post

Print the Markdown to stdout:

pagetomd https://example.com/blog/post -o -

Omit fetched_at for byte-deterministic output:

pagetomd https://example.com/blog/post --no-fetched-at -o post.md

Convert a page that may need JavaScript rendering:

pagetomd https://my-spa.example.com -o - --fetcher auto

-o - writes the Markdown to stdout. All logs go to stderr, so the stream is safe to pipe:

pagetomd https://example.com/blog/post -o - | llm "summarise this article in five bullet points"

To convert a list of URLs, read them from a file, one per line:

while read -r url; do
  pagetomd "$url"
done < urls.txt

Each successful conversion exits 0; any non-zero exit leaves the loop running but is visible in stderr (see Exit codes below).

Use --crawl to discover every linked sub-page under a seed URL and write one .md file per page into an output directory:

pagetomd "https://docs.example.com/guide/" \
  --crawl --crawl-depth 2 \
  --fetcher auto --no-respect-robots \
  -o ./docs-output/

Scope: The seed is treated as the root of its own subtree. Only links whose URL lives under the seed are followed; siblings, parents, and external sites are skipped. For a seed of https://docs.example.com/guide/intro the in-scope prefix is https://docs.example.com/guide/intro/. To scope the crawl one level higher, pass a trailing slash on the seed (or use a "directory" URL like /guide/).
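The scope rule can be pictured as a prefix test on URL paths. The following is a minimal sketch under that reading, not the tool's actual code; the names in_scope and _with_slash are hypothetical, and query strings and fragments are ignored.

```python
from urllib.parse import urlsplit


def _with_slash(path: str) -> str:
    """Return the path with exactly one trailing slash appended if missing."""
    return path if path.endswith("/") else path + "/"


def in_scope(seed: str, url: str) -> bool:
    """True if url is the seed itself or lives under the seed's subtree.

    Assumption: scope is decided by scheme, host, and path prefix only.
    """
    s, u = urlsplit(seed), urlsplit(url)
    if (s.scheme, s.netloc) != (u.scheme, u.netloc):
        return False  # external site
    # Comparing slash-terminated paths keeps /guide/introduction
    # from matching a seed of /guide/intro.
    return _with_slash(u.path).startswith(_with_slash(s.path))
```

With a seed of https://docs.example.com/guide/intro, this accepts https://docs.example.com/guide/intro/setup and rejects the sibling https://docs.example.com/guide/concepts as well as the parent https://docs.example.com/guide/.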

Output structure: The on-disk layout mirrors the URL hierarchy under the seed, so two pages with the same final URL segment under different parents do not collide:

| URL | Output file (relative to -o) |
|---|---|
| The seed itself | index.md |
| …/guide/intro | intro.md |
| …/guide/intro/ | intro/index.md |
| …/guide/concepts/alerts | concepts/alerts.md |
| …/guide/concepts/alerts/ | concepts/alerts/index.md |

Each path segment is slugified independently, and Windows-reserved device names (CON, PRN, …) are escaped per segment.
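The mapping from URL to output file can be sketched as follows. This is an illustration of the layout described above, not the tool's implementation: the slug rules, the underscore prefix used to escape reserved names, and the function names slugify and output_path are all assumptions.

```python
import re
from urllib.parse import urlsplit

# Windows device names that cannot be used as file or directory names.
WINDOWS_RESERVED = (
    {"con", "prn", "aux", "nul"}
    | {f"com{i}" for i in range(1, 10)}
    | {f"lpt{i}" for i in range(1, 10)}
)


def slugify(segment: str) -> str:
    """Slugify one path segment (hypothetical rules: lowercase, safe chars only)."""
    slug = re.sub(r"[^a-z0-9._-]+", "-", segment.lower()).strip("-") or "page"
    # Assumed escaping scheme: prefix reserved names with an underscore.
    return "_" + slug if slug in WINDOWS_RESERVED else slug


def output_path(seed: str, url: str) -> str:
    """Return the output file for url, relative to the -o directory."""
    seed_root = urlsplit(seed).path.rstrip("/") + "/"
    path = urlsplit(url).path
    if path.rstrip("/") + "/" == seed_root:
        return "index.md"  # the seed itself
    if not path.startswith(seed_root):
        raise ValueError(f"{url} is outside the seed subtree")
    rel = path[len(seed_root):]
    segments = [slugify(s) for s in rel.split("/") if s]
    if rel.endswith("/"):
        # A trailing slash marks a "directory" page.
        return "/".join(segments + ["index.md"])
    segments[-1] += ".md"
    return "/".join(segments)
```

Because every segment is kept, …/guide/a/alerts and …/guide/b/alerts land in different directories and cannot overwrite each other.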

Options:

- --crawl-depth N: BFS hop limit from the seed (default: 1). --crawl-depth 10 against a site that naturally ends at depth 3 simply stops when the queue empties; nothing is wasted.
- --overwrite: replace existing .md files (default: skip). At the end of a crawl, three lists are printed to stderr: pages skipped because the file already exists, pages where no content could be extracted (auth walls, thin nav stubs), and pages that failed with a fetch or conversion error, so you can handle each category appropriately.
- All other flags (--fetcher, --no-verify-ssl, --user-agent, --retries, …) apply to every page in the crawl. --retries honours Retry-After headers on 429/503 responses (capped at 5 minutes per attempt).
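The depth limit is easiest to see in a small breadth-first traversal. The sketch below runs over an in-memory link graph rather than real pages; bfs_crawl and get_links are hypothetical names, and the real crawler's scope filtering and error handling are left out.

```python
from collections import deque
from typing import Callable, Iterable


def bfs_crawl(
    seed: str,
    get_links: Callable[[str], Iterable[str]],
    max_depth: int = 1,
) -> list[str]:
    """Visit pages breadth-first, following links at most max_depth hops from the seed."""
    seen = {seed}
    order = []
    queue = deque([(seed, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # hop limit reached: keep the page, do not expand its links
        for link in get_links(url):
            if link not in seen:  # deduplicate before queueing
                seen.add(link)
                queue.append((link, depth + 1))
    return order
```

A depth larger than the site needs costs nothing extra here: the loop ends as soon as the queue is empty, which matches the behaviour described for --crawl-depth 10 on a shallow site.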

A single fetcher context is reused across the whole crawl, so browser backends do not relaunch Chromium per page.

pagetomd has four ways to turn URLs into Markdown. Pick the one that matches your situation:

| I want to… | Use | Why |
|---|---|---|
| Convert a single static page (blog, docs, article) | pagetomd URL | Default httpx fetcher: fast, no extra deps. |
| Convert a page that needs JavaScript to render (React, Vue, Angular, Next.js) | pagetomd URL --fetcher playwright | Launches headless Chromium so the SPA actually renders. |
| Convert a page and I'm not sure if it needs JS | pagetomd URL --fetcher auto | Tries httpx first; falls back to Playwright if the page looks like an empty SPA shell or extraction comes back empty. |
| Crawl an entire site section into a folder of .md files | pagetomd URL --crawl -o dir/ | BFS-walks every same-subtree link and writes one file per page. Combine with --fetcher auto if some pages are JS-rendered. |

- **httpx** (default): A plain HTTP GET. Sub-second for most pages, handles retries with exponential backoff, honours Retry-After on 429/503, enforces robots.txt, and follows <meta http-equiv="refresh"> redirects. No JavaScript execution: if the server sends an empty <div id="root"></div> shell, that's all you get.
- **playwright**: Renders the page in headless Chromium, waits for network idle, then serialises the live DOM (including shadow roots). Use this when you know the page is a SPA. Requires the optional playwright extra (pip install 'pagetomd[playwright]') and a one-time playwright install chromium. Slower and heavier than httpx, but the only way to get content that lives behind a JS framework.
- **auto**: Fetches with httpx first, then inspects the result: if the <body> text is under 200 characters and the HTML contains SPA markers (data-reactroot, <div id="__next">, a "you need to enable javascript" noscript tag, etc.), it re-fetches with Playwright. A second safety net fires if httpx returned HTML that looked non-empty but the extractor still couldn't pull any content; Playwright gets a shot then too. If Playwright is unavailable, the page is counted as "empty" in the crawl summary rather than a hard failure. Best choice when you're pointed at an unfamiliar URL.
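The first check in auto mode (short body text plus an SPA marker) can be approximated like this. It is a rough sketch, not the tool's detector: the regex-based text extraction, the exact marker list, and the names body_text and looks_like_spa_shell are assumptions, and only the three markers named above are included.

```python
import re

# Markers named in the description above; the real list is longer ("etc.").
SPA_MARKERS = ("data-reactroot", 'id="__next"', "you need to enable javascript")
MIN_BODY_CHARS = 200


def body_text(html: str) -> str:
    """Crudely extract visible <body> text, skipping script, style, and noscript."""
    match = re.search(r"<body[^>]*>(.*?)</body>", html, re.I | re.S)
    body = match.group(1) if match else html
    body = re.sub(r"<(script|style|noscript)\b.*?</\1>", " ", body, flags=re.I | re.S)
    text = re.sub(r"<[^>]+>", " ", body)  # drop remaining tags
    return " ".join(text.split())  # collapse whitespace


def looks_like_spa_shell(html: str) -> bool:
    """True when the body is nearly empty AND the HTML carries an SPA marker."""
    lowered = html.lower()
    has_marker = any(marker in lowered for marker in SPA_MARKERS)
    return has_marker and len(body_text(html)) < MIN_BODY_CHARS
```

Requiring both conditions keeps short but genuinely static pages on the fast path, and keeps server-rendered React pages with real content from triggering a browser launch.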

Use the default single-page mode when you have a specific URL (or a short list piped through a while read loop). Use **--crawl** when you want every page under a URL prefix: it discovers links automatically, deduplicates, mirrors the URL hierarchy on disk, and reuses a single fetcher context so Playwright doesn't relaunch Chromium per page. See the crawl cookbook recipe for the full flag set.

Running pagetomd http://127.0.0.1:8765/blog.html --no-fetched-at -o - against the blog.html fixture prints (first ~15 lines shown):

---
url: http://127.0.0.1:8765/blog.html
final_url: http://127.0.0.1:8765/blog.html
title: Why We Rewrote Our Build System in Rust
author: Jane Doe
date: '2024-08-14'
description: A retrospective on migrating our monorepo build pipeline from Python to Rust, and what we learned along the way.
site_name: Example Engineering Blog
language: en
tool: pagetomd
tool_version: 0.4.0
---


Three years ago, our monorepo build pipeline was a sprawling Python application held together with shell scripts and prayer. ...

When fetched_at is enabled (the default), an extra fetched_at: '2026-06-15T12:34:56Z' line is included in the frontmatter. Fields whose value cannot be detected (e.g. language, author) are omitted from the YAML.
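For ingestion pipelines, the frontmatter usually needs to be separated from the body. The sketch below does this with the standard library only and a deliberately naive key: value parser; split_frontmatter is a hypothetical helper, and a real YAML parser is the safer choice for anything beyond flat string fields. Since undetected fields are omitted, callers should read keys with .get().

```python
def split_frontmatter(markdown: str) -> tuple[dict[str, str], str]:
    """Split a document into (frontmatter fields, body).

    Naive parsing: flat `key: value` lines only, surrounding quotes stripped.
    """
    lines = markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, markdown  # no frontmatter block
    try:
        end = lines.index("---", 1)  # closing delimiter
    except ValueError:
        return {}, markdown
    meta = {}
    for line in lines[1:end]:
        # Split on the first colon only, so URLs in values stay intact.
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip().strip("'\"")
    body = "\n".join(lines[end + 1:]).lstrip("\n")
    return meta, body
```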

A compact overview; see pagetomd --help for the full list.

| Flag | Default | Description |
|---|---|---|
| --output / -o | derived from title | Output path, or - for stdout. |
| --overwrite | false | Replace an existing destination file. |
| --follow-symlinks / --no-follow-symlinks | false | Allow writes to a symlinked destination. Off by default so --overwrite cannot be tricked into clobbering a file outside the intended directory via a symlink. |
| --fetcher | httpx | httpx, playwright, or auto. |
| --timeout | 30.0 | Per-request HTTP timeout (seconds). |
| --retries | 4 | Per-page retry attempts on transient failures (default 4 = up to 5 total attempts). Honours the server's Retry-After header on 429/503 responses, capped at 5 minutes; falls back to exponential backoff otherwise. |
| --user-agent | pagetomd/<ver> | Override the outbound User-Agent. |
| --no-verify-ssl | false | Disable TLS certificate verification (for corporate proxies that re-sign HTTPS). |
| --respect-robots / --no-respect-robots | true | Honour robots.txt (relaxed for loopback/RFC 1918). |
| --max-redirects | 10 | Cap on the redirect chain length. |
| --include-comments / --no-include-comments | false | Preserve HTML comments in the extracted document. |
| --include-images / --no-include-images | true | Keep image syntax in output. |
| --include-links / --no-include-links | true | Keep link URLs in output. |
| --heading-style | atx | atx (#) or setext (===). |
| --code-fences / --no-code-fences | true | Use fenced code blocks instead of indented ones. |
| --wide-tables | kv | Wide-table strategy: kv, html, or drop. |
| --no-fetched-at | false | Omit fetched_at for byte-deterministic output. |
| --log-level | info | debug, info, warning, error. |
| --log-json | false | Emit logs as JSON lines on stderr. |
| --debug | false | Shortcut for --log-level=debug + tracebacks on error. |
| --playwright-idle-ms | 500 | Extra wait (ms) after networkidle for late-firing scripts (Playwright fetcher only). |
| --crawl | false | Crawl all linked sub-pages under the seed URL's path prefix and write one .md file per page. Requires -o to be a directory. |
| --crawl-depth | 1 | Maximum BFS depth from the seed URL when --crawl is active. 0 = seed only. |
| --retry-failed / --no-retry-failed | true | After --crawl finishes, retry pages that failed in the initial pass once. |
| --version | | Print the installed version and exit. |
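The --retries behaviour (server-provided Retry-After capped at 5 minutes, exponential backoff otherwise) can be sketched as a single delay function. The 1-second base delay and the name retry_delay are assumptions; the source does not state the backoff parameters, and this sketch only handles the delta-seconds form of Retry-After.

```python
from typing import Optional

MAX_RETRY_AFTER = 300.0  # 5-minute cap stated in the flag description
BASE_DELAY = 1.0         # assumed base for the exponential backoff


def retry_delay(attempt: int, retry_after: Optional[str] = None) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        try:
            # Honour the server's hint, clamped to [0, MAX_RETRY_AFTER].
            return min(max(float(retry_after), 0.0), MAX_RETRY_AFTER)
        except ValueError:
            pass  # HTTP-date form is not parsed in this sketch
    return BASE_DELAY * (2 ** attempt)
```

With these assumed values, a 429 response carrying Retry-After: 3600 waits 300 seconds, while a response with no header on the third retry waits 4 seconds.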

Every flag has a PAGETOMD_<UPPER_NAME> equivalent. For example:

PAGETOMD_TIMEOUT=60 PAGETOMD_FETCHER=auto pagetomd https://example.com

CLI flags always override env vars; env vars override the built-in defaults.
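The precedence order can be expressed as a three-step lookup. This is an illustrative sketch rather than how the Typer-based CLI is wired; the resolve helper is hypothetical, and the rule that hyphens in flag names become underscores in the variable name is an assumption. The defaults are taken from the flag table.

```python
import os
from typing import Mapping, Optional

# A few built-in defaults, copied from the flag table.
DEFAULTS = {"timeout": "30.0", "fetcher": "httpx", "retries": "4"}


def resolve(
    name: str,
    cli_value: Optional[str],
    env: Mapping[str, str] = os.environ,
) -> str:
    """Resolve one setting: CLI flag, then PAGETOMD_* env var, then default."""
    if cli_value is not None:
        return cli_value  # an explicit flag always wins
    env_key = "PAGETOMD_" + name.upper().replace("-", "_")
    if env_key in env:
        return env[env_key]
    return DEFAULTS[name]
```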

| Code | Meaning |
|---|---|
| 0 | Success. |
| 1 | Unexpected internal error. |
| 2 | Fetch failure (DNS, HTTP, robots.txt, redirect cap). |
| 3 | Extraction or conversion failure (empty body, malformed HTML). |
| 4 | Output write failure (permissions, disk, atomic-rename clash). |
| 5 | Missing optional dependency (e.g. playwright not installed). |
| 64 | Usage or configuration error (bad flag, invalid value). |
| 130 | Interrupted by user (Ctrl-C). |
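Stable exit codes make batch scripts straightforward to write. The sketch below is one possible Python wrapper: it assumes pagetomd is on PATH and uses only the plain `pagetomd URL` invocation shown earlier; convert_all and run_pagetomd are hypothetical names, and the runner is injectable so the logic can be exercised without the tool installed.

```python
import subprocess
from typing import Callable, Iterable

# Meanings taken from the exit-code table.
EXIT_MEANINGS = {
    0: "success",
    1: "unexpected internal error",
    2: "fetch failure",
    3: "extraction or conversion failure",
    4: "output write failure",
    5: "missing optional dependency",
    64: "usage or configuration error",
    130: "interrupted by user",
}


def run_pagetomd(url: str) -> int:
    """Run the CLI for one URL and return its exit code (needs pagetomd on PATH)."""
    return subprocess.run(["pagetomd", url]).returncode


def convert_all(
    urls: Iterable[str],
    run: Callable[[str], int] = run_pagetomd,
) -> dict[str, str]:
    """Convert every URL, continuing past failures; return URL -> failure reason."""
    failures = {}
    for url in urls:
        code = run(url)
        if code != 0:
            failures[url] = EXIT_MEANINGS.get(code, f"unknown exit code {code}")
    return failures
```

A caller might retry only the URLs reported as fetch failures and install the Playwright extra when a missing-dependency failure appears.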

One paragraph plus a diagram of the pipeline:

URL ──► Fetcher ──► Extractor ──► Converter ──► Postprocess ──► Writer
       (httpx /     (BS4 clean    (markdownify    (NFC, heading   (atomic
        playwright)  + trafilatura) + GFM tables)  hierarchy,      file +
                                                  URL absolutise)  YAML)

The fetcher (httpx by default, playwright for SPAs) downloads the page with retries and robots.txt enforcement. The extractor runs a BeautifulSoup pre-clean pass (strip scripts/styles/nav/ads) then hands the cleaned tree to trafilatura to identify main content and harvest metadata. The converter renders the surviving HTML to Markdown via a customised markdownify subclass (ATX headings, fenced code blocks with language hints, GFM tables with wide-table fallbacks). The postprocessor enforces the AI-readiness contract (NFC, zero-width strip, monotonic heading hierarchy, absolute URLs). The writer prepends a YAML frontmatter block and writes atomically (or streams to stdout).
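Two of the postprocessing steps, NFC normalisation and zero-width stripping, fit in a few lines of standard-library Python. This is a sketch of the idea only: the set of characters the tool actually strips is not documented here, so the list below is an assumption, and normalise_text is a hypothetical name. The zero-width joiner and non-joiner are left alone in this sketch because they can be meaningful in emoji sequences and some scripts.

```python
import re
import unicodedata

# Assumed "junk" set: zero-width space, word joiner, and BOM / zero-width no-break space.
ZERO_WIDTH = re.compile("[\u200b\u2060\ufeff]")


def normalise_text(text: str) -> str:
    """Strip zero-width characters, then normalise to NFC."""
    return unicodedata.normalize("NFC", ZERO_WIDTH.sub("", text))
```

NFC matters for retrieval because the same visible word can otherwise be stored as different code point sequences, for example "é" as one code point or as "e" followed by a combining accent.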

pagetomd is a public-URL-only tool. It refuses to fetch private, loopback, link-local, multicast, reserved, or cloud-metadata addresses by default, and there is no flag to override that. Treat output files as having the same sensitivity as the URL they were fetched from.
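The address classes listed above map directly onto properties of Python's ipaddress module. The following is a minimal sketch of such a guard, not the tool's actual check; is_public_ip is a hypothetical name, and it only classifies IP literals. A real guard would also need to resolve hostnames and check every resolved address.

```python
import ipaddress


def is_public_ip(address: str) -> bool:
    """True if the IP literal is not private, loopback, link-local, multicast, or reserved."""
    ip = ipaddress.ip_address(address)
    return not (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local  # covers the 169.254.169.254 cloud-metadata address
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )
```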

CI enforces both a project-wide test coverage floor of 85% and a per-module floor of 90% (line + branch combined) on the four critical modules: extractor, postprocess, converter, and writer. These four carry the AI-readiness contract, so they get the strictest coverage bar.

git clone https://github.com/gs202/PageToMD.git
cd PageToMD
uv sync --extra dev --extra playwright
pre-commit install
uv run pytest

See CONTRIBUTING.md for the full contributor workflow.

Business Source License 1.1 — source-available, free for non-commercial use. Converts to MIT on 2030-06-16. See LICENSE for full terms.
