How to Build a RAG Knowledge Base from Any Documentation Site in 5 Minutes

wpnews.pro


[ARTICLE · art-39193] src=dev.to ↗ pub=2026-06-25T10:55Z topic=large-language-models verified=true sentiment=↑ positive


A developer built an automated extraction and chunking pipeline that converts any documentation site into clean, structured markdown ready for vector stores. The pipeline, available as the RAG Docs Extractor on Apify, removes navigation, sidebars, and other noise, and outputs chunks with precomputed token counts using cl100k_base encoding. It integrates with LangChain and ChromaDB to create a retrieval-augmented generation (RAG) knowledge base in minutes.


You want to feed documentation into your RAG pipeline, but web scraping gives you a mess of navigation, sidebars, cookie banners, and broken formatting mixed with actual content. You spend hours cleaning up HTML before you can even start building your knowledge base.

I built an automated extraction + chunking pipeline that converts any documentation site into clean, structured markdown ready for your vector store.

Using the RAG Docs Extractor on Apify, you can crawl any docs site and get chunked output with a single API call:

{
  "startUrl": "https://fastapi.tiangolo.com/",
  "maxPages": 100,
  "chunkByHeading": true
}

Each chunk in the output looks like:

{
  "url": "https://fastapi.tiangolo.com/tutorial/first-steps/",
  "title": "First Steps - FastAPI",
  "heading": "Create a FastAPI instance",
  "content": "## Create a FastAPI instance\n\nThe simplest FastAPI file could look like this...\n\n```python\nfrom fastapi import FastAPI\n\napp = FastAPI()\n```",
  "token_count": 245
}

Notice the token_count field: it uses cl100k_base encoding (GPT-4 / modern embedding models), so you know exactly how many tokens each chunk costs before embedding.
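As a rough illustration of what the precomputed counts make possible, the sketch below sums token_count across chunks and flags any chunk that exceeds an embedding model's input limit. The limit, the price, the second sample record, and the function name summarize_tokens are assumptions for illustration, not values from the extractor; check your embedding provider's current figures.

```python
from typing import Dict, List

# Assumed values for illustration only; check your embedding provider.
MAX_INPUT_TOKENS = 8191          # hypothetical per-input token limit
PRICE_PER_MILLION_TOKENS = 0.02  # hypothetical price in USD


def summarize_tokens(chunks: List[Dict]) -> Dict:
    """Summarize embedding cost using each chunk's precomputed token_count."""
    total = sum(chunk["token_count"] for chunk in chunks)
    oversized = [c["url"] for c in chunks if c["token_count"] > MAX_INPUT_TOKENS]
    return {
        "chunks": len(chunks),
        "total_tokens": total,
        "estimated_cost_usd": total / 1_000_000 * PRICE_PER_MILLION_TOKENS,
        "oversized_urls": oversized,
    }


# The second record is made-up sample data, not real extractor output.
sample = [
    {"url": "https://fastapi.tiangolo.com/tutorial/first-steps/", "token_count": 245},
    {"url": "https://example.com/big-page", "token_count": 9000},
]
summary = summarize_tokens(sample)
print(summary["total_tokens"], summary["oversized_urls"])
```

Because no tokenizer runs here, a check like this stays cheap even on a large crawl.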

With LangChain and ChromaDB:

import json

from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

# Load the extractor output (a JSON array of chunk objects)
with open("dataset.json") as f:
    chunks = json.load(f)

# Wrap each chunk in a Document, keeping its source details as metadata
docs = [
    Document(
        page_content=chunk["content"],
        metadata={
            "url": chunk["url"],
            "title": chunk["title"],
            "heading": chunk.get("heading", ""),
            "token_count": chunk["token_count"],
        }
    )
    for chunk in chunks
]

# Embed the documents and index them in Chroma
vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())
print(f"Indexed {len(docs)} chunks")

No re-tokenization needed — the token counts are already computed.
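One place the precomputed counts help is batching embedding requests. The sketch below is a minimal greedy batcher, not part of the extractor: the function name batch_by_token_budget and the 600-token budget are hypothetical, and the demo chunks are made-up data.

```python
from typing import Dict, Iterator, List


def batch_by_token_budget(chunks: List[Dict], max_tokens: int) -> Iterator[List[Dict]]:
    """Yield consecutive batches whose summed token_count stays within max_tokens.

    A single chunk larger than max_tokens is yielded alone rather than dropped.
    """
    batch: List[Dict] = []
    used = 0
    for chunk in chunks:
        cost = chunk["token_count"]
        # Close the current batch if adding this chunk would exceed the budget
        if batch and used + cost > max_tokens:
            yield batch
            batch, used = [], 0
        batch.append(chunk)
        used += cost
    if batch:
        yield batch


# Made-up token counts for illustration
demo_chunks = [{"token_count": n} for n in (245, 300, 500, 120)]
batches = list(batch_by_token_budget(demo_chunks, max_tokens=600))
print([[c["token_count"] for c in b] for b in batches])
```

Chunk order is preserved, so batches still follow the order of the crawled pages.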

from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA

# Answer questions using the 5 most similar chunks as context
llm = ChatOpenAI(model="gpt-4")
qa = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
)

# The chain returns a dict; the answer text is under "result"
result = qa.invoke("How do I add authentication to a FastAPI app?")
print(result["result"])

If you just need to convert individual pages to markdown (no chunking), use Website to Markdown instead:

{
  "startUrl": "https://docs.python.org/3/library/asyncio.html",
  "maxPages": 1
}

Output is clean markdown with token counts. Good for when you want to control your own chunking strategy or feed single pages into an LLM context window.
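If you do take control of chunking, a heading-based splitter is a reasonable starting point. The sketch below is one possible approach, not the extractor's implementation: chunk_by_heading is a hypothetical helper that starts a new chunk at each markdown heading and ignores lines inside fenced code blocks, so a code comment beginning with # does not split a chunk.

```python
import re
from typing import Dict, List

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # code-fence marker, built indirectly so this example nests cleanly


def chunk_by_heading(markdown: str) -> List[Dict[str, str]]:
    """Split markdown into chunks, starting a new chunk at each heading."""
    chunks: List[Dict[str, str]] = []
    heading, lines, in_fence = "", [], False

    def flush() -> None:
        # Store the chunk collected so far, skipping empty ones
        content = "\n".join(lines).strip()
        if content:
            chunks.append({"heading": heading, "content": content})

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        # Lines inside a fenced code block are never treated as headings
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            heading, lines = match.group(2).strip(), [line]
        else:
            lines.append(line)
    flush()
    return chunks


doc = "\n".join([
    "# Intro",
    "Welcome.",
    "## Usage",
    FENCE + "python",
    "# a comment, not a heading",
    "print('hi')",
    FENCE,
])
chunks = chunk_by_heading(doc)
print([c["heading"] for c in chunks])
```

Text that appears before the first heading ends up in a chunk with an empty heading, which mirrors the optional heading field in the chunk format shown earlier.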

Under the hood, the extractor removes <nav>, <footer>, .sidebar, .cookie-banner, <script>, <style>, and 20+ other noise selectors, then looks for the main content in <article>, <main>, .markdown-body, .prose, etc.

The result is clean, structured content that's ready for any RAG pipeline.
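To make the noise-removal idea concrete, here is a minimal sketch using only Python's standard-library html.parser. It is not the extractor's code: NoiseStripper and extract_text are hypothetical names, only the six noise selectors named above are covered, and the depth counting assumes well-formed markup with matching closing tags.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple

# A small subset of noise selectors; the article mentions 20+ more.
NOISE_TAGS = {"nav", "footer", "script", "style"}
NOISE_CLASSES = {"sidebar", "cookie-banner"}
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # never have a closing tag


class NoiseStripper(HTMLParser):
    """Collect text while skipping everything inside noise elements."""

    def __init__(self) -> None:
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside a noise element
        self.parts: List[str] = []

    def handle_starttag(self, tag: str, attrs: List[Tuple[str, Optional[str]]]) -> None:
        if tag in VOID_TAGS:
            return
        classes = set((dict(attrs).get("class") or "").split())
        # Once inside a noise element, count nested tags so we know when it ends
        if self.skip_depth or tag in NOISE_TAGS or classes & NOISE_CLASSES:
            self.skip_depth += 1

    def handle_endtag(self, tag: str) -> None:
        if tag not in VOID_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data: str) -> None:
        if not self.skip_depth and data.strip():
            self.parts.append(data.strip())


def extract_text(html: str) -> str:
    parser = NoiseStripper()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


page = (
    "<html><body>"
    "<nav><a href='/'>Home</a></nav>"
    "<div class='sidebar'><ul><li>Menu</li></ul></div>"
    "<main><h1>First Steps</h1><p>Create an app.<br>Then run it.</p></main>"
    "<footer>Copyright</footer>"
    "<script>var x = 1;</script>"
    "</body></html>"
)
print(extract_text(page))
```

Real documentation pages often omit closing tags, so a production pipeline would more likely rely on a tolerant HTML parser and proper CSS selector matching than on this kind of depth counting.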

Both tools are available on the Apify Store with pay-per-result pricing. No subscription needed.


Read original on dev.to → dev.to/devtoolslab/how-to-build-a-rag-knowledge-…
