{"slug": "learnings-from-crawling-technical-documentation", "title": "Learnings From Crawling Technical Documentation", "summary": "A team building an AI-first knowledge base called Morsel documented practical techniques for crawling technical documentation at scale, including scoping crawls with URL prefixes, rendering JavaScript before extracting links, and handling PDF content. The developers shared six key learnings from their work to help others avoid common pitfalls when processing documentation for use in AI agents or local knowledge bases.", "body_md": "This is the second post in the series about making technical documentation available for use in your AI agent or knowledge base, concerning our work on [Morsel, an AI-first knowledge base](/posts/ideation-and-product-ideas). Crawling technical documentation is useful for vendoring documentation in your software engineering projects - for example, to give local coding agents access to the documentation of the software you use, or to process it further, as we do.\n\nThis post is about the small gotchas we ran into when crawling at scale, so others doing similar work don’t hit the same problems.\n\n## Our approach\n\nOur approach is quite pragmatic. At a high level, we have a Python script that takes an entry page of technical documentation. We crawl that entry point, extract every link from it, and save the URL, the full content of the page, a crawled-at date, and a status field in an SQLite database. We then enqueue every extracted link to be crawled later. This lets us resume the crawl at any point and share the results as one self-contained file.\n\nDo you need help with **data science**?\nI can help and am available on a freelance basis :).\n\n[Send me an Email ↗ ](mailto:philip@heltweg.org)\n\n## Learnings\n\n### 1. Scoping and restricting the size of the crawl\n\nWe restrict the scope of the crawl using two parameters. 
One is a `--scope` parameter, which is a URL prefix we stay within - by default, the origin of the entry page. This prevents us from following links that lead outside the documentation we are crawling.\n\nWe also added an `--exclude` parameter. Any URL starting with that prefix is skipped, even if it is in scope. We added this because a lot of documentation has a default language (most often English) and then hosts translations under sub-paths - for example, `/de/` for German. This way we can exclude translated parts of the documentation.\n\n### 2. Let JavaScript render before extracting links\n\nBefore extracting links and content, let the page render fully and let JavaScript execute. We do this by waiting for `networkidle` using Playwright, which waits until there have been no network connections for at least 500ms. As described in the [earlier blog post](/posts/make-technical-documentation-available-for-local-ai-use), we also handle cookie and consent banners that would otherwise block the page content. We do this once at the entry URL, using an LLM to identify the correct dismiss button from the list of visible button labels, and then reuse the resulting session state for the rest of the crawl.\n\n### 3. Be careful with which links you extract\n\nYou can use `href` attributes to extract links from the rendered content. Make sure to filter out links that start with `mailto:` or `javascript:`, because those are technically valid URLs, but you do not want to crawl them (as they do not link to content).\n\n### 4. Handling content that is not HTML\n\nThe most relevant non-HTML content for documentation is PDF files, which we encountered fairly often. We enqueue them like HTML pages, but when we encounter a PDF we download it and extract all the text using Python. 
That way we have the title and text content in the same form as for HTML pages, can discard the PDF file, and save the text and title in the database.\n\n```python\nimport io\nimport httpx\nimport pypdf\n\ndef fetch_pdf_text(url: str) -> tuple[str, str, int]:\n    \"\"\"Download a PDF and return its title, text, and HTTP status code.\"\"\"\n    response = httpx.get(url, follow_redirects=True, timeout=30)\n    response.raise_for_status()\n    reader = pypdf.PdfReader(io.BytesIO(response.content))\n    # Fall back to the last URL segment when the PDF has no title metadata.\n    title = reader.metadata.title if reader.metadata and reader.metadata.title else url.split(\"/\")[-1]\n    # Pages without a text layer may yield nothing, so default to an empty string.\n    pages_text = [page.extract_text() or \"\" for page in reader.pages]\n    text = \"\\n\\n\".join(pages_text).strip()\n    return title, text, response.status_code\n```\n\n### 5. Normalize URLs\n\nWe normalize URLs to deduplicate them, since URLs with and without trailing slashes are often used interchangeably in docs. We use `urlparse` from Python’s `urllib.parse` to strip the fragment and remove trailing slashes:\n\n```python\nfrom urllib.parse import urlparse\n\ndef normalize_url(url: str) -> str:\n    \"\"\"Drop the fragment and any trailing slash on the path.\"\"\"\n    parsed = urlparse(url)\n    path = parsed.path\n    # Keep the root path \"/\" as-is; strip trailing slashes from longer paths,\n    # also when a query string follows the path.\n    if len(path) > 1 and path.endswith(\"/\"):\n        path = path.rstrip(\"/\") or \"/\"\n    return parsed._replace(path=path, fragment=\"\").geturl()\n```\n\n### 6. Make the script resumable and idempotent\n\nWe save the status of every page we crawl: either `rendered`, `error`, or not yet visited. This lets us rerun the script any number of times. We ignore already-`rendered` pages, retry `error` pages, and continue visiting previously unvisited pages. 
The script can be safely rerun at any time.\n\n## Full crawl loop\n\nThe full crawl loop looks roughly like this (with additional error handling, reporting, etc. removed):\n\n```\ninitialize DB and queue with entry URL\nretry any previously errored pages\n\nfor each URL in queue:\n    if URL ends with .pdf:\n        download PDF, extract title and text\n    else:\n        render page with Playwright (execute JS, dismiss cookie banners)\n        extract title, HTML, and text\n        find all links on the page\n        for each link: if in-scope, not excluded, and not yet visited -> enqueue\n\n    save result (title, text, status) to SQLite DB\n\nreport final counts: total pages, rendered, errors\n```\n\nAt the end, every page has been visited, fully rendered, and saved in an SQLite database file. That file is available for further processing - it can be made available to your local coding agent to search through the documentation, or exported to Markdown or any format you want. For example, in an [earlier blog post](/posts/make-technical-documentation-available-for-local-ai-use) I wrote about how we enrich individual pages by downloading and describing the images they contain.", "url": "https://wpnews.pro/news/learnings-from-crawling-technical-documentation", "canonical_source": "https://heltweg.org/posts/learnings-from-crawling-technical-documentation/", "published_at": "2026-05-13 00:00:00+00:00", "updated_at": "2026-05-26 08:42:38.328808+00:00", "lang": "en", "topics": ["ai-tools", "ai-infrastructure", "ai-agents"], "entities": ["Morsel"], "alternates": {"html": "https://wpnews.pro/news/learnings-from-crawling-technical-documentation", "markdown": "https://wpnews.pro/news/learnings-from-crawling-technical-documentation.md", "text": "https://wpnews.pro/news/learnings-from-crawling-technical-documentation.txt", 
"jsonld": "https://wpnews.pro/news/learnings-from-crawling-technical-documentation.jsonld"}}