# How to Build a RAG Knowledge Base from Any Documentation Site in 5 Minutes

> Published: 2026-06-25 10:55:11+00:00

You want to feed documentation into your RAG pipeline, but web scraping gives you a mess of navigation, sidebars, cookie banners, and broken formatting mixed in with the actual content. You spend hours cleaning up HTML before you can even start building your knowledge base.

I built an automated extraction and chunking pipeline that converts any documentation site into clean, structured markdown ready for your vector store.

Using the [RAG Docs Extractor](https://apify.com/ambitious_door/ragdocs-extractor) on Apify, you can crawl any docs site and get chunked output with a single API call:

```
{
  "startUrl": "https://fastapi.tiangolo.com/",
  "maxPages": 100,
  "chunkByHeading": true
}
```

Each chunk in the output looks like:

```
{
  "url": "https://fastapi.tiangolo.com/tutorial/first-steps/",
  "title": "First Steps - FastAPI",
  "heading": "Create a FastAPI instance",
  "content": "## Create a FastAPI instance\n\nThe simplest FastAPI file could look like this...\n\n```python\nfrom fastapi import FastAPI\n\napp = FastAPI()\n```",
  "token_count": 245
}
```

Notice the `token_count` field. It uses the cl100k_base encoding (used by GPT-4 and OpenAI's current embedding models), so you know how many tokens each chunk costs before embedding, as long as your embedding model uses the same tokenizer.
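To make the budgeting point concrete, here is a minimal sketch that sums the precomputed `token_count` values and turns them into a rough cost estimate. The helper name `summarize_tokens`, the second sample chunk, and the price figure are all illustrative assumptions, not part of the extractor; substitute your provider's actual rate.

```python
from typing import Iterable


def summarize_tokens(chunks: Iterable[dict], usd_per_million_tokens: float) -> dict:
    """Sum the precomputed token_count fields and estimate embedding cost.

    Hypothetical helper: the extractor only supplies token_count per chunk;
    the aggregation and the pricing parameter are assumptions for illustration.
    """
    counts = [int(chunk["token_count"]) for chunk in chunks]
    total = sum(counts)
    return {
        "chunks": len(counts),
        "total_tokens": total,
        "largest_chunk": max(counts, default=0),
        "estimated_usd": total / 1_000_000 * usd_per_million_tokens,
    }


if __name__ == "__main__":
    sample = [
        {"heading": "Create a FastAPI instance", "token_count": 245},
        {"heading": "Hypothetical second chunk", "token_count": 755},
    ]
    # 0.02 is a placeholder price, not a quoted rate.
    print(summarize_tokens(sample, usd_per_million_tokens=0.02))
```

The `largest_chunk` value is also worth checking against your embedding model's input limit before you index anything.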
With LangChain and ChromaDB:

```python
import json

from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

# Load the extracted chunks (from Apify dataset export)
with open("dataset.json") as f:
    chunks = json.load(f)

# Convert to LangChain documents
docs = [
    Document(
        page_content=chunk["content"],
        metadata={
            "url": chunk["url"],
            "title": chunk["title"],
            # Chroma rejects None metadata values, so fall back to ""
            "heading": chunk.get("heading") or "",
            "token_count": chunk["token_count"],
        },
    )
    for chunk in chunks
]

# Create vector store
vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())
print(f"Indexed {len(docs)} chunks")
```

No re-tokenization needed: the token counts are already computed.

Then query the knowledge base:

```python
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4")

# Retrieve the 5 most similar chunks for each question
qa = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
)

result = qa.invoke("How do I add authentication to a FastAPI app?")
print(result["result"])
```

If you just need to convert individual pages to markdown (no chunking), use [Website to Markdown](https://apify.com/ambitious_door/web-to-markdown) instead:

```
{
  "startUrl": "https://docs.python.org/3/library/asyncio.html",
  "maxPages": 1
}
```

Output is clean markdown with token counts. This is useful when you want to control your own chunking strategy or feed single pages into an LLM context window.

Under the hood, the extractor: