Improving Local Techdocs for Your AI Coding Agent

A team building a self-improving knowledge base called Morsel developed a method to structure technical documentation for use by local AI coding agents. The process classifies pages using rules and a local LLM, embeds content with a sentence transformer, and builds a knowledge graph combining explicit hyperlinks with semantic similarity edges. This approach allows AI agents to filter out navigation, legal, and changelog pages and focus on actual content.

This is the third post in the series about making technical documentation available for use in your AI agent or knowledge base, based on our work on Morsel, a knowledge base that improves itself using AI agents (/posts/ideation-and-product-ideas). In the first post (/posts/make-technical-documentation-available-for-local-ai-use) I described how we crawl documentation sites, clean the page content, and generate descriptions for images. In the second post (/posts/learnings-from-crawling-technical-documentation) I shared practical gotchas we ran into when crawling complete techdocs.

Here I want to describe what we do afterwards to structure the crawled documentation further and make it available in a more useful form, for example for your local AI coding agent. At a high level, we classify pages, embed them with a local model, and then build a knowledge graph that combines explicit hyperlinks with semantic similarity edges.

Classifying Pages

A lot of pages in technical documentation are not actual content. Many are purely navigation hubs: index pages that just link to other, more concrete pages. Others only show legal terms or changelogs. If you are doing further work on the data, you probably want to ignore most of those.

We start with a rule-based pass that covers as many pages as possible quickly, locally, and cheaply, without involving LLMs. It checks the URL against a set of patterns for each category:

```python
LEGAL_PATTERNS = ["/legal/", "/privacy", "/terms", "/eula", "/cookie"]


def classify_by_rules(url: str, title: str, content: str) -> str | None:
    if any(p in url.lower() for p in LEGAL_PATTERNS):
        return "legal"
    # ... same pattern for changelog, reference, etc.

    # Navigation: short content with mostly links.
    # The substring checked here was lost in extraction; "](" (the
    # markdown link marker) is an assumed reconstruction.
    if len(content.split()) < 200 and "](" in content:
        return "navigation"
    return None  # needs LLM classification
```

Everything the rules cannot classify gets sent to a local LLM.
We give the model the URL, title, the first 200 words of the page, and a list of its headings, then ask it to classify the page by primary intent. The classes are: conceptual, tutorial, how-to, and example for the main content pages (based on the Diátaxis framework for documentation, https://diataxis.fr/), as well as the structural classes (navigation, reference, legal, changelog) and a few others (broken, misc).

With this two-pass approach, the rule-based step handles the easy cases cheaply, and the LLM only sees the pages that actually need it. The result is that we can filter out legal, changelog, navigation, and reference pages when doing further work on the data, and focus only on actual content pages.

Embedding Pages

With pages classified, the next step is embedding them. We use a local sentence transformer model to avoid API costs and make the process faster. So far, this has worked well enough for this use case. If a page exceeds the token limit, we split it at heading boundaries and average the resulting chunk embeddings:

```python
import re

import numpy as np


def embed_page(content: str) -> list[float]:
    # `model` is the local sentence transformer, loaded elsewhere.
    # Split at markdown headings of level 1 to 3.
    chunks = re.split(r"(?m)^#{1,3} ", content)
    if len(chunks) == 1:
        return model.encode(content).tolist()
    embeddings = model.encode([chunk for chunk in chunks if chunk.strip()])
    avg = np.mean(embeddings, axis=0)
    # Re-normalize so the averaged vector has unit length.
    return (avg / np.linalg.norm(avg)).tolist()
```

We embed from the cleaned markdown rather than plain text, because headings, code blocks, and list structure carry semantic meaning that helps the model understand the page.

Building a Knowledge Graph

With the embeddings, we can build a graph that combines semantic similarity with the explicit hyperlinks in the documentation. We store two types of edges: link edges (explicit hyperlinks from page A to page B, stored as directed edges with no weight) and semantic edges (for pairs of pages with high embedding similarity, we store two directed edges, one in each direction, with the cosine similarity as the edge weight).
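The two edge types can be sketched as plain records. This is only an illustration under assumptions, not the Morsel implementation: the `Edge` type, the `link_edge` and `semantic_edges` helpers, and the page identifiers are hypothetical names, and embeddings are taken to be non-zero lists of floats.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Edge:
    """Hypothetical edge record; kind is "link" or "semantic"."""
    source: str
    target: str
    kind: str
    weight: Optional[float] = None  # link edges carry no weight


def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Assumes both vectors are non-zero and of equal length.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def link_edge(source: str, target: str) -> Edge:
    # A hyperlink from source to target: directed and unweighted.
    return Edge(source, target, "link")


def semantic_edges(page_a: str, page_b: str,
                   emb_a: list[float], emb_b: list[float]) -> list[Edge]:
    # Similarity is symmetric, so it is stored as two directed edges
    # that share the same cosine-similarity weight.
    weight = cosine_similarity(emb_a, emb_b)
    return [
        Edge(page_a, page_b, "semantic", weight),
        Edge(page_b, page_a, "semantic", weight),
    ]
```

Storing the symmetric relation as two directed edges means link edges and semantic edges can be traversed the same way, always from `source` to `target`.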
We only include actual content pages in the semantic graph; pages classified as navigation, legal, reference, etc. are excluded. Additionally, we set a configurable similarity threshold (currently 0.75) and only add edges between pages whose similarity exceeds it. Finally, we cap the number of neighbors per page (currently 20) so you don’t end up with a few massively connected hubs.

All of this (page data, classifications, embeddings, and graph edges) ends up in the same SQLite database.

The Result

The whole flow: you crawl the technical documentation as described in the second post (/posts/learnings-from-crawling-technical-documentation), extract the page content and describe images as in the first post (/posts/make-technical-documentation-available-for-local-ai-use), and then run these post-processing steps on the crawled data.

What you end up with is a completely self-contained, local SQLite database with the full documentation stored in a form your AI can use easily. Agents can query it with plain SQL, and models are quite good at that. They can use the classification to filter out noise and read only actual content pages, or for more fine-grained queries like “Show me all how-to pages related to topic X”. And they can use the embeddings to find semantically similar pages, or navigate the knowledge graph using both the explicit hyperlinks written by the documentation authors and the implicit similarity edges we derived from the content.

If you have built something similar or think this is interesting, I would love to hear about it. How are you using local documentation with your coding agents? What would you do differently?
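As a concrete illustration of the plain-SQL queries mentioned above, here is a small sketch an agent could run. The schema is an assumption made for this example (a `pages` table with a `page_class` column and an `edges` table), as are the sample rows and the helper names; the actual table layout is not described in this post.

```python
import sqlite3

# Hypothetical schema and sample data, held in memory for the example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pages (
    id INTEGER PRIMARY KEY, url TEXT, title TEXT,
    page_class TEXT, content TEXT
);
CREATE TABLE edges (
    source_id INTEGER, target_id INTEGER, kind TEXT, weight REAL
);
""")
conn.executemany("INSERT INTO pages VALUES (?, ?, ?, ?, ?)", [
    (1, "/docs/auth/setup", "Set up authentication", "how-to",
     "Configure auth tokens for your project."),
    (2, "/docs/auth/concepts", "How authentication works", "conceptual",
     "Auth tokens identify the caller."),
    (3, "/legal/terms", "Terms", "legal", "Terms of service."),
])
conn.executemany("INSERT INTO edges VALUES (?, ?, ?, ?)", [
    (1, 2, "link", None),      # explicit hyperlink, unweighted
    (1, 2, "semantic", 0.82),  # similarity stored in both directions
    (2, 1, "semantic", 0.82),
])


def how_to_pages(conn: sqlite3.Connection, topic: str) -> list[str]:
    """URLs of how-to pages whose content mentions the topic."""
    rows = conn.execute(
        "SELECT url FROM pages "
        "WHERE page_class = 'how-to' AND content LIKE ? ORDER BY url",
        (f"%{topic}%",),
    )
    return [row[0] for row in rows]


def semantic_neighbors(conn: sqlite3.Connection, page_id: int) -> list[tuple]:
    """(url, weight) of semantically similar pages, most similar first."""
    rows = conn.execute(
        "SELECT p.url, e.weight FROM edges e "
        "JOIN pages p ON p.id = e.target_id "
        "WHERE e.source_id = ? AND e.kind = 'semantic' "
        "ORDER BY e.weight DESC",
        (page_id,),
    )
    return rows.fetchall()
```

The substring match with `LIKE` stands in for "related to topic X" only to keep the example short; an embedding-based lookup would be the closer match to what the post describes.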