Learnings From Crawling Technical Documentation

A team building an AI-first knowledge base called Morsel documented practical techniques for crawling technical documentation at scale, including scoping crawls with URL prefixes, rendering JavaScript before extracting links, and handling PDF content. The developers shared six key learnings from their work to help others avoid common pitfalls when processing documentation for use in AI agents or local knowledge bases.

This is the second post in a series about making technical documentation available for use in your AI agent or knowledge base. It covers our work on Morsel, an AI-first knowledge base (/posts/ideation-and-product-ideas).

Crawling technical documentation is really helpful for vendoring documentation in your software engineering projects - for example, so that local coding agents can use the documentation of the software you use, or just to do further processing on it, as we do. This post is about the small gotchas we ran into when crawling at scale, so others doing similar work don’t hit the same problems.

Our approach

Our approach is quite pragmatic. At a high level, we have a Python script that takes an entry page of technical documentation. We crawl that entry point, extract every link from it, and save the link, the full content of the page, a crawled-at date, and a status field in an SQLite database. We then enqueue every extracted link to be crawled in the future. This lets us resume the crawl later and share the results easily as one self-contained file.

Do you need help with data science? I can help and am available on a freelance basis. Send me an email: mailto:philip@heltweg.org

Learnings

1. Scoping and restricting the size of the crawl

We restrict the scope of the crawl using two parameters. One is a --scope parameter, which is a URL prefix we stay within - by default, the origin of the entry page. This prevents us from following links that lead outside the documentation we are crawling.

We also added an --exclude parameter. Any URL starting with that prefix is skipped, even if it is in scope. We added this because a lot of documentation has a default language (most often English) and then hosts translations under sub-paths - for example, /de/ for German. This way we can exclude translated parts of the documentation.

2. Let JavaScript render before extracting links

Before crawling for links and content, you should let the whole page render and let JavaScript execute. We do this by waiting for networkidle using Playwright, which waits until there have been no network connections for at least 500 ms.

As described in the earlier blog post (/posts/make-technical-documentation-available-for-local-ai-use), we also handle cookie and consent banners that would otherwise block the page content. We do this once at the entry URL, using an LLM to identify the correct dismiss button from the list of visible button labels. The resulting session state is then reused for the rest of the crawl.

3. Be careful with which links you extract

Of course, you can use href attributes to extract links from the rendered content. Make sure to filter out links that start with mailto: or javascript:, because those are technically valid URLs, but you do not want to crawl them as they do not link to content.

4. Handling content that is not HTML

Most relevant for documentation are PDF files, which we encountered quite often. We enqueue them the same way we enqueue HTML pages, but when we encounter a PDF we download it and extract all the text using Python. That way we have title and text content just as we do for HTML pages, can discard the PDF file, and save the text and title in the database.

```python
import io

import httpx
import pypdf


def fetch_pdf_text(url: str) -> tuple[str, str, int]:
    """Download a PDF and extract its text."""
    response = httpx.get(url, follow_redirects=True, timeout=30)
    response.raise_for_status()
    # Parse the PDF from memory so nothing has to be written to disk.
    reader = pypdf.PdfReader(io.BytesIO(response.content))
    # Prefer the title from the PDF metadata, fall back to the file name.
    title = (
        reader.metadata.title
        if reader.metadata and reader.metadata.title
        else url.split("/")[-1]
    )
    pages_text = [page.extract_text() or "" for page in reader.pages]
    text = "\n\n".join(pages_text).strip()
    return title, text, response.status_code
```

5. Normalize URLs

We normalize URLs to deduplicate them, since URLs with and without trailing slashes are often used interchangeably in docs. We use urlparse from Python’s urllib.parse to strip the fragment and remove trailing slashes:

```python
from urllib.parse import urlparse


def normalize_url(url: str) -> str:
    parsed = urlparse(url)
    # Drop the fragment: #section links point to the same page.
    normalized = parsed._replace(fragment="")
    result = normalized.geturl()
    # Strip trailing slashes, but keep the bare root path "/" intact.
    if result.endswith("/") and len(parsed.path) > 1:
        result = result.rstrip("/")
    return result
```

6. Make the script resumable and idempotent

We save the status of every page we crawl: either rendered, error, or not yet visited. This lets us rerun the script any number of times. We ignore already-rendered pages, retry error pages, and continue visiting previously unvisited pages. The script can be safely rerun at any time.

Full crawl loop

The full crawl loop looks roughly like this (with additional error handling, reporting, etc. removed):

```
initialize DB and queue with entry URL
retry any previously errored pages

for each URL in queue:
    if URL ends with .pdf:
        download PDF, extract title and text
    else:
        render page with Playwright (execute JS, dismiss cookie banners)
        extract title, HTML, and text
        find all links on the page
        for each link:
            if in-scope, not excluded, and not yet visited -> enqueue
    save result (title, text, status) to SQLite DB

report final counts: total pages, rendered, errors
```

At the end, every page has been visited, fully rendered, and saved in an SQLite database file. That file is available for further processing - it can be made available to your local coding agent to search through the documentation, or exported to Markdown or any format you want. For example, in an earlier blog post (/posts/make-technical-documentation-available-for-local-ai-use) I wrote about how we enrich individual pages by downloading and describing the images they contain.
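To make learnings 1 and 3 more concrete, here is a minimal sketch of how the scope, exclude, and link-scheme checks could be combined into one predicate. The function names (default_scope, should_enqueue) and the example URLs are hypothetical and not taken from our script. It assumes that --scope and --exclude are plain URL prefixes, as described above.

```python
from urllib.parse import urljoin, urlparse


def default_scope(entry_url: str) -> str:
    """Hypothetical helper: use the origin of the entry page as the default scope."""
    parsed = urlparse(entry_url)
    # The trailing slash keeps look-alike hosts (docs.example.com.evil.org) out of scope.
    return f"{parsed.scheme}://{parsed.netloc}/"


def should_enqueue(link: str, page_url: str, scope: str, exclude: str = "") -> bool:
    """Return True if a link found on page_url should be crawled."""
    # Resolve relative hrefs against the page they were found on.
    absolute = urljoin(page_url, link)
    # Only http(s) links lead to crawlable content; this drops mailto: and javascript:.
    if urlparse(absolute).scheme not in ("http", "https"):
        return False
    # Stay within the scope prefix.
    if not absolute.startswith(scope):
        return False
    # Skip excluded prefixes, even if they are in scope.
    if exclude and absolute.startswith(exclude):
        return False
    return True
```

With scope set to https://docs.example.com/ and exclude set to https://docs.example.com/de/, a link to /guide/install would be enqueued, while /de/guide, mailto: links, and links to other hosts would be skipped.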
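The resumable status tracking from learning 6 can be sketched with Python's built-in sqlite3 module. This is an illustration under assumptions, not our actual schema: the table name pages, the column names, the helper functions, and the status value 'pending' for not-yet-visited pages are all hypothetical.

```python
import sqlite3


def init_db(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) pages table if it does not exist yet."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS pages (
            url TEXT PRIMARY KEY,
            title TEXT,
            content TEXT,
            crawled_at TEXT,
            status TEXT NOT NULL DEFAULT 'pending'
        )
        """
    )


def enqueue(conn: sqlite3.Connection, url: str) -> None:
    """Add a URL as pending; URLs that are already known are left untouched."""
    conn.execute("INSERT OR IGNORE INTO pages (url) VALUES (?)", (url,))


def urls_to_crawl(conn: sqlite3.Connection) -> list[str]:
    """Return unvisited and previously failed URLs; rendered pages are skipped."""
    rows = conn.execute(
        "SELECT url FROM pages WHERE status IN ('pending', 'error') ORDER BY url"
    ).fetchall()
    return [row[0] for row in rows]


def save_result(conn, url, title, content, status) -> None:
    """Record the outcome of a crawl attempt ('rendered' or 'error')."""
    conn.execute(
        "UPDATE pages SET title = ?, content = ?, status = ?, "
        "crawled_at = datetime('now') WHERE url = ?",
        (title, content, status, url),
    )
```

Because the URL is the primary key and INSERT OR IGNORE never overwrites an existing row, rerunning the script against the same database file only picks up pages that are still pending or that errored last time.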