{"slug": "how-to-build-a-rag-knowledge-base-from-any-documentation-site-in-5-minutes", "title": "How to Build a RAG Knowledge Base from Any Documentation Site in 5 Minutes", "summary": "A developer built an automated extraction and chunking pipeline that converts any documentation site into clean, structured markdown ready for vector stores. The pipeline, available as the RAG Docs Extractor on Apify, removes navigation, sidebars, and other noise, and outputs chunks with precomputed token counts using cl100k_base encoding. It integrates with LangChain and ChromaDB to create a retrieval-augmented generation (RAG) knowledge base in minutes.", "body_md": "You want to feed documentation into your RAG pipeline, but web scraping gives you a mess of navigation, sidebars, cookie banners, and broken formatting mixed with actual content. You spend hours cleaning up HTML before you can even start building your knowledge base.\n\nI built an automated extraction + chunking pipeline that converts any documentation site into clean, structured markdown ready for your vector store.\n\nUsing the [RAG Docs Extractor](https://apify.com/ambitious_door/ragdocs-extractor) on Apify, you can crawl any docs site and get chunked output with a single API call:\n\n```\n{\n  \"startUrl\": \"https://fastapi.tiangolo.com/\",\n  \"maxPages\": 100,\n  \"chunkByHeading\": true\n}\n```\n\nEach chunk in the output looks like:\n\n```\n{\n  \"url\": \"https://fastapi.tiangolo.com/tutorial/first-steps/\",\n  \"title\": \"First Steps - FastAPI\",\n  \"heading\": \"Create a FastAPI instance\",\n  \"content\": \"## Create a FastAPI instance\\n\\nThe simplest FastAPI file could look like this...\\n\\n```python\\nfrom fastapi import FastAPI\\n\\napp = FastAPI()\\n```\",\n  \"token_count\": 245\n}\n```\n\nNotice the `token_count` field. It uses cl100k_base encoding (GPT-4 / modern embedding models), so you know exactly how many tokens each chunk costs before embedding.\n\nWith LangChain and 
ChromaDB:\n\n``` python\nfrom langchain_community.vectorstores import Chroma\nfrom langchain_openai import OpenAIEmbeddings\nfrom langchain.schema import Document\nimport json\n\n# Load the extracted chunks (from Apify dataset export)\nwith open(\"dataset.json\") as f:\n    chunks = json.load(f)\n\n# Convert to LangChain documents\ndocs = [\n    Document(\n        page_content=chunk[\"content\"],\n        metadata={\n            \"url\": chunk[\"url\"],\n            \"title\": chunk[\"title\"],\n            \"heading\": chunk.get(\"heading\", \"\"),\n            \"token_count\": chunk[\"token_count\"],\n        }\n    )\n    for chunk in chunks\n]\n\n# Create vector store\nvectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())\nprint(f\"Indexed {len(docs)} chunks\")\n```\n\nNo re-tokenization needed — the token counts are already computed.\n\n``` python\nfrom langchain_openai import ChatOpenAI\nfrom langchain.chains import RetrievalQA\n\nllm = ChatOpenAI(model=\"gpt-4\")\nqa = RetrievalQA.from_chain_type(\n    llm=llm,\n    retriever=vectorstore.as_retriever(search_kwargs={\"k\": 5}),\n)\n\nresult = qa.invoke(\"How do I add authentication to a FastAPI app?\")\nprint(result[\"result\"])\n```\n\nIf you just need to convert individual pages to markdown (no chunking), use [Website to Markdown](https://apify.com/ambitious_door/web-to-markdown) instead:\n\n```\n{\n  \"startUrl\": \"https://docs.python.org/3/library/asyncio.html\",\n  \"maxPages\": 1\n}\n```\n\nOutput is clean markdown with token counts. 
This is useful when you want to control your own chunking strategy or feed single pages into an LLM context window.\n\nUnder the hood, the extractor:\n\n- Removes `<nav>`, `<footer>`, `.sidebar`, `.cookie-banner`, `<script>`, `<style>`, and 20+ other noise selectors\n- Locates the main content via `<article>`, `<main>`, `.markdown-body`, `.prose`, etc.\n\nThe result is clean, structured content that's ready for any RAG pipeline.\n\nBoth are available on the Apify Store with pay-per-result pricing. No subscription needed.", "url": "https://wpnews.pro/news/how-to-build-a-rag-knowledge-base-from-any-documentation-site-in-5-minutes", "canonical_source": "https://dev.to/devtoolslab/how-to-build-a-rag-knowledge-base-from-any-documentation-site-in-5-minutes-3j76", "published_at": "2026-06-25 10:55:11+00:00", "updated_at": "2026-06-25 11:13:30.469459+00:00", "lang": "en", "topics": ["large-language-models", "developer-tools", "natural-language-processing", "ai-infrastructure"], "entities": ["Apify", "LangChain", "ChromaDB", "OpenAI", "FastAPI", "GPT-4", "RAG Docs Extractor", "Website to Markdown"], "alternates": {"html": "https://wpnews.pro/news/how-to-build-a-rag-knowledge-base-from-any-documentation-site-in-5-minutes", "markdown": "https://wpnews.pro/news/how-to-build-a-rag-knowledge-base-from-any-documentation-site-in-5-minutes.md", "text": "https://wpnews.pro/news/how-to-build-a-rag-knowledge-base-from-any-documentation-site-in-5-minutes.txt", "jsonld": "https://wpnews.pro/news/how-to-build-a-rag-knowledge-base-from-any-documentation-site-in-5-minutes.jsonld"}}