{"slug": "pagetomd-a-cli-tool-to-turn-web-pages-into-clean-markdown-for-ai-agents", "title": "PageToMD – A CLI tool to turn web pages into clean Markdown for AI agents", "summary": "PageToMD, a new CLI tool, converts web pages into clean Markdown optimized for AI agents, featuring NFC-normalized UTF-8 output, YAML frontmatter with metadata, and optional JavaScript rendering for SPAs. The tool supports pipx, uv, and pip installation, offers deterministic output for RAG ingestion, and includes a crawl mode for bulk conversion of linked sub-pages.", "body_md": "Convert any webpage URL into clean, LLM-ready Markdown with frontmatter.\n\n- **AI-ready by default.** Output is NFC-normalised UTF-8, single H1, monotonic heading hierarchy, no zero-width junk, no tracking chrome — drops straight into a vector store or LLM prompt.\n- **Full-fidelity metadata.** Every file ships with a YAML frontmatter block containing canonical URL, final URL after redirects, title, author, date, description, site name, language, and tool identity. No more \"where did this Markdown come from?\".\n- **Static fast, JS-capable when needed.** Default `httpx` fetcher is sub-second; opt-in `playwright` extra (or `--fetcher auto`) handles SPA shells without bloating the install for everyone else.\n- **Stable, scriptable CLI.** Typer-built, full env-var precedence (`PAGETOMD_*`), stable exit codes (`0`/`2`/`3`/`4`/`5`/`64`/`130`), structured logs (`--log-json`), and a `--no-fetched-at` switch for byte-deterministic output.\n- **Not** `pandoc` or `curl + sed`. `pandoc` doesn't fetch, doesn't strip boilerplate, and doesn't emit frontmatter. 
Hand-rolled `curl | html2md` pipelines re-invent extraction, mojibake handling, robots.txt, redirect caps, and atomic writes. `pagetomd` is one command for the whole pipeline.\n\n```\n# Install with pipx\npipx install pagetomd\n# optional: enable JS rendering for SPAs\npipx inject pagetomd playwright && playwright install chromium\n\n# Install with uv\nuv tool install pagetomd\n# optional: enable JS rendering for SPAs\nuv tool install 'pagetomd[playwright]' && playwright install chromium\n\n# Run with uv\n# Core — no install required\nuv run --with pagetomd pagetomd https://example.com\n\n# With Playwright for SPA / JS-heavy pages (install Chromium once first)\nuv run --with playwright playwright install chromium\nuv run --with 'pagetomd[playwright]' pagetomd https://example.com --fetcher auto\n\n# Install with pip\npip install pagetomd                 # core\npip install 'pagetomd[playwright]'   # + SPA support\n\n# Usage examples\n# Default: derives output filename from the page title\npagetomd https://example.com/blog/post\n\n# Stream to stdout (pipe into LLMs, etc.)\npagetomd https://example.com/blog/post -o -\n\n# Deterministic output (omits fetched_at — good for snapshot tests / RAG ingestion)\npagetomd https://example.com/blog/post --no-fetched-at -o post.md\n\n# Auto-detect SPA pages and fall back to headless Chromium\npagetomd https://my-spa.example.com -o - --fetcher auto\n```\n\n`-o -` writes the Markdown to stdout. 
All logs go to stderr, so the stream is safe to pipe:\n\n```\npagetomd https://example.com/blog/post -o - | llm \"summarise this article in five bullet points\"\n\n# Convert a list of URLs, one per line in urls.txt\nwhile read -r url; do\n  pagetomd \"$url\"\ndone < urls.txt\n```\n\nEach successful conversion exits `0`; any non-zero exit leaves the loop\nrunning but is visible in stderr (see [Exit codes](#exit-codes) below).\n\nUse `--crawl` to discover every linked sub-page under a seed URL and write\none `.md` file per page into an output directory:\n\n```\npagetomd \"https://docs.example.com/guide/\" \\\n  --crawl --crawl-depth 2 \\\n  --fetcher auto --no-respect-robots \\\n  -o ./docs-output/\n```\n\n**Scope:** The seed is treated as the root of its own subtree. Only links\nwhose URL lives *under* the seed are followed; siblings, parents, and\nexternal sites are skipped. For a seed of\n`https://docs.example.com/guide/intro` the in-scope prefix is\n`https://docs.example.com/guide/intro/` — pass a trailing slash on the\nseed (or use a \"directory\" URL like `/guide/`) to scope the crawl one\nlevel higher.\n\n**Output structure:** The on-disk layout mirrors the URL hierarchy under\nthe seed, so two pages with the same final URL segment under different\nparents do not collide:\n\n| URL | Output file (relative to `-o`) |\n|---|---|\n| The seed itself | `index.md` |\n| `…/guide/intro` | `intro.md` |\n| `…/guide/intro/` | `intro/index.md` |\n| `…/guide/concepts/alerts` | `concepts/alerts.md` |\n| `…/guide/concepts/alerts/` | `concepts/alerts/index.md` |\n\nEach path segment is slugified independently, and Windows-reserved device\nnames (`CON`, `PRN`, …) are escaped per segment.\n\n**Options:**\n\n- `--crawl-depth N` — BFS hop limit from the seed (default: `1`). `--crawl-depth 10` against a site that naturally ends at depth 3 simply stops when the queue empties; nothing is wasted.\n- `--overwrite` — replace existing `.md` files (default: skip). 
At the end of a crawl, three lists are printed to stderr: pages skipped because the file already exists, pages where no content could be extracted (auth walls, thin nav stubs), and pages that failed with a fetch or conversion error — so you can handle each category appropriately.\n- All other flags (`--fetcher`, `--no-verify-ssl`, `--user-agent`, `--retries`, …) apply to every page in the crawl. `--retries` honours `Retry-After` headers on 429/503 responses (capped at 5 minutes per attempt).\n\nA single fetcher context is reused across the whole crawl, so browser backends do not relaunch Chromium per page.\n\n`pagetomd` has four ways to turn URLs into Markdown. Pick the one that matches your situation:\n\n| I want to… | Use | Why |\n|---|---|---|\n| Convert a single static page (blog, docs, article) | `pagetomd URL` | Default `httpx` fetcher — fast, no extra deps. |\n| Convert a page that needs JavaScript to render (React, Vue, Angular, Next.js) | `pagetomd URL --fetcher playwright` | Launches headless Chromium so the SPA actually renders. |\n| Convert a page and I'm not sure if it needs JS | `pagetomd URL --fetcher auto` | Tries `httpx` first; falls back to Playwright if the page looks like an empty SPA shell or extraction comes back empty. |\n| Crawl an entire site section into a folder of `.md` files | `pagetomd URL --crawl -o dir/` | BFS-walks every same-subtree link and writes one file per page. Combine with `--fetcher auto` if some pages are JS-rendered. |\n\n- **`httpx`** (default) — A plain HTTP GET. Sub-second for most pages, handles retries with exponential backoff, honours `Retry-After` on 429/503, enforces `robots.txt`, and follows `<meta http-equiv=\"refresh\">` redirects. No JavaScript execution — if the server sends an empty `<div id=\"root\"></div>` shell, that's all you get.\n- **`playwright`** — Renders the page in headless Chromium, waits for network idle, then serialises the live DOM (including shadow roots). 
Use this when you *know* the page is a SPA. Requires the optional `playwright` extra (`pip install 'pagetomd[playwright]'`) and a one-time `playwright install chromium`. Slower and heavier than `httpx`, but the only way to get content that lives behind a JS framework.\n- **`auto`** — Fetches with `httpx` first, then inspects the result: if the `<body>` text is under 200 characters *and* the HTML contains SPA markers (`data-reactroot`, `<div id=\"__next\">`, a \"you need to enable javascript\" noscript tag, etc.), it re-fetches with Playwright. A second safety net fires if `httpx` returned HTML that *looked* non-empty but the extractor still couldn't pull any content — Playwright gets a shot then too. If Playwright is unavailable, the page is counted as \"empty\" in the crawl summary rather than a hard failure. Best choice when you're pointed at an unfamiliar URL.\n\nUse the **default single-page mode** when you have a specific URL (or a short list piped through a `while read` loop). Use **`--crawl`** when you want every page under a URL prefix — it discovers links automatically, deduplicates, mirrors the URL hierarchy on disk, and reuses a single fetcher context so Playwright doesn't relaunch Chromium per page. 
See the [crawl cookbook recipe](#crawl-an-entire-documentation-site) for the full flag set.\n\nRunning `pagetomd http://127.0.0.1:8765/blog.html --no-fetched-at -o -` against the `blog.html` fixture prints (first ~15 lines shown):\n\n```\n---\nurl: http://127.0.0.1:8765/blog.html\nfinal_url: http://127.0.0.1:8765/blog.html\ntitle: Why We Rewrote Our Build System in Rust\nauthor: Jane Doe\ndate: '2024-08-14'\ndescription: A retrospective on migrating our monorepo build pipeline from Python to Rust, and what we learned along the way.\nsite_name: Example Engineering Blog\nlanguage: en\ntool: pagetomd\ntool_version: 0.4.0\n---\n\n# Why We Rewrote Our Build System in Rust\n\nThree years ago, our monorepo build pipeline was a sprawling Python application held together with shell scripts and prayer. ...\n```\n\nWhen `fetched_at` is enabled (the default), an extra `fetched_at: '2026-06-15T12:34:56Z'` line is included in the frontmatter. Fields whose value cannot be detected (e.g. `language`, `author`) are omitted from the YAML.\n\nA compact overview — see `pagetomd --help` for the full list.\n\n| Flag | Default | Description |\n|---|---|---|\n| `--output / -o` | derived from title | Output path, or `-` for stdout. |\n| `--overwrite` | `false` | Replace an existing destination file. |\n| `--follow-symlinks / --no-follow-symlinks` | `false` | Allow writes to a symlinked destination. Off by default so `--overwrite` cannot be tricked into clobbering a file outside the intended directory via a symlink. |\n| `--fetcher` | `httpx` | `httpx`, `playwright`, or `auto`. |\n| `--timeout` | `30.0` | Per-request HTTP timeout (seconds). |\n| `--retries` | `4` | Per-page retry attempts on transient failures (default 4 = up to 5 total attempts). Honours the server's `Retry-After` header on 429/503 responses, capped at 5 minutes; falls back to exponential backoff otherwise. |\n| `--user-agent` | `pagetomd/<ver>` | Override the outbound `User-Agent`. 
|\n| `--no-verify-ssl` | `false` | Disable TLS certificate verification (for corporate proxies that re-sign HTTPS). |\n| `--respect-robots / --no-respect-robots` | `true` | Honour `robots.txt` (relaxed for loopback/RFC 1918). |\n| `--max-redirects` | `10` | Cap on the redirect chain length. |\n| `--include-comments / --no-include-comments` | `false` | Preserve HTML comments in the extracted document. |\n| `--include-images / --no-include-images` | `true` | Keep image syntax in output. |\n| `--include-links / --no-include-links` | `true` | Keep link URLs in output. |\n| `--heading-style` | `atx` | `atx` (`#`) or `setext` (`===`). |\n| `--code-fences / --no-code-fences` | `true` | Use fenced code blocks instead of indented ones. |\n| `--wide-tables` | `kv` | Wide-table strategy: `kv`, `html`, or `drop`. |\n| `--no-fetched-at` | `false` | Omit `fetched_at` for byte-deterministic output. |\n| `--log-level` | `info` | `debug`, `info`, `warning`, `error`. |\n| `--log-json` | `false` | Emit logs as JSON lines on stderr. |\n| `--debug` | `false` | Shortcut for `--log-level=debug` + tracebacks on error. |\n| `--playwright-idle-ms` | `500` | Extra wait (ms) after networkidle for late-firing scripts (Playwright fetcher only). |\n| `--crawl` | `false` | Crawl all linked sub-pages under the seed URL's path prefix and write one `.md` file per page. Requires `-o` to be a directory. |\n| `--crawl-depth` | `1` | Maximum BFS depth from the seed URL when `--crawl` is active. `0` = seed only. |\n| `--retry-failed` / `--no-retry-failed` | `true` | After `--crawl` finishes, retry pages that failed in the initial pass once. |\n| `--version` | — | Print the installed version and exit. |\n\nEvery flag has a `PAGETOMD_<UPPER_NAME>` equivalent. For example:\n\n```\nPAGETOMD_TIMEOUT=60 PAGETOMD_FETCHER=auto pagetomd https://example.com\n```\n\nCLI flags always override env vars; env vars override the built-in defaults.\n\n| Code | Meaning |\n|---|---|\n| `0` | Success. 
|\n| `1` | Unexpected internal error. |\n| `2` | Fetch failure (DNS, HTTP, robots.txt, redirect cap). |\n| `3` | Extraction or conversion failure (empty body, malformed HTML). |\n| `4` | Output write failure (permissions, disk, atomic-rename clash). |\n| `5` | Missing optional dependency (e.g. `playwright` not installed). |\n| `64` | Usage or configuration error (bad flag, invalid value). |\n| `130` | Interrupted by user (Ctrl-C). |\n\nOne paragraph plus a diagram of the pipeline:\n\n```\nURL ──► Fetcher ──► Extractor ──► Converter ──► Postprocess ──► Writer\n       (httpx /     (BS4 clean    (markdownify    (NFC, heading   (atomic\n        playwright)  + trafilatura) + GFM tables)  hierarchy,      file +\n                                                  URL absolutise)  YAML)\n```\n\nThe fetcher (`httpx` by default, `playwright` for SPAs) downloads the page with retries and robots.txt enforcement. The extractor runs a BeautifulSoup pre-clean pass (strip scripts/styles/nav/ads) then hands the cleaned tree to `trafilatura` to identify main content and harvest metadata. The converter renders the surviving HTML to Markdown via a customised `markdownify` subclass (ATX headings, fenced code blocks with language hints, GFM tables with wide-table fallbacks). The postprocessor enforces the AI-readiness contract (NFC, zero-width strip, monotonic heading hierarchy, absolute URLs). The writer prepends a YAML frontmatter block and writes atomically (or streams to stdout).\n\n`pagetomd` is a **public-URL-only** tool. It refuses to fetch private, loopback, link-local, multicast, reserved, or cloud-metadata addresses by default — and there is no flag to override that. 
Treat output files as having the same sensitivity as the URL they were fetched from.\n\nCI enforces both a project-wide test coverage floor of **85%** and a per-module floor of **90% (line + branch combined)** on the four critical modules: [`extractor`](/gs202/PageToMD/blob/main/src/pagetomd/extractor.py), [`converter`](/gs202/PageToMD/blob/main/src/pagetomd/converter.py), [`writer`](/gs202/PageToMD/blob/main/src/pagetomd/writer.py), and [`postprocess`](/gs202/PageToMD/blob/main/src/pagetomd/postprocess.py). These four carry the AI-readiness contract, so they get the strictest coverage bar.\n\n```\ngit clone https://github.com/gs202/PageToMD.git\ncd PageToMD\nuv sync --extra dev --extra playwright\npre-commit install\nuv run pytest\n```\n\nSee [CONTRIBUTING.md](/gs202/PageToMD/blob/main/CONTRIBUTING.md) for the full contributor workflow.\n\nBusiness Source License 1.1 — source-available, free for non-commercial use. Converts to MIT on 2030-06-16. See [LICENSE](/gs202/PageToMD/blob/main/LICENSE) for full terms.", "url": "https://wpnews.pro/news/pagetomd-a-cli-tool-to-turn-web-pages-into-clean-markdown-for-ai-agents", "canonical_source": "https://github.com/gs202/PageToMD", "published_at": "2026-06-19 09:01:49+00:00", "updated_at": "2026-06-19 09:07:55.077884+00:00", "lang": "en", "topics": ["developer-tools", "ai-tools", "large-language-models"], "entities": ["PageToMD", "Playwright", "httpx", "Typer", "pandoc", "curl", "sed", "Chromium"], "alternates": {"html": "https://wpnews.pro/news/pagetomd-a-cli-tool-to-turn-web-pages-into-clean-markdown-for-ai-agents", "markdown": "https://wpnews.pro/news/pagetomd-a-cli-tool-to-turn-web-pages-into-clean-markdown-for-ai-agents.md", "text": "https://wpnews.pro/news/pagetomd-a-cli-tool-to-turn-web-pages-into-clean-markdown-for-ai-agents.txt", "jsonld": "https://wpnews.pro/news/pagetomd-a-cli-tool-to-turn-web-pages-into-clean-markdown-for-ai-agents.jsonld"}}