Two months ago, I was knee-deep in a project that sounded simple: build a system that could answer questions from our company’s internal documentation. We had hundreds of PDFs, Confluence pages, and READMEs. The goal was to let junior developers ask natural language questions and get accurate answers instantly.
I thought, “How hard can it be? I’ll just fine-tune a small LLM on our documents.”
Spoiler: it was that hard, and then some.
I spent two weeks collecting, cleaning, and chunking our documentation. I wrote a Hugging Face training script, rented a GPU, and fine-tuned a 7B parameter model. The result? A model that could recite our API docs verbatim but couldn’t answer a question like “Why does our auth flow fail for expired tokens?” without hallucinating.
Fine-tuning taught the model patterns in the text, but it didn’t give it the ability to retrieve specific facts. Plus, every time a document changed, I’d have to retrain. It was unsustainable.
Next, I tried Elasticsearch with a BM25 scorer. I’d split documents into chunks and search for keywords from the user’s question. The problem: natural language questions don’t map well to keywords. “How do I reset my password?” would match chunks about “reset” and “password”, but miss the critical steps for multi-factor auth. Recall was terrible.
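To show the failure mode in miniature, here's a toy sketch that scores chunks by plain word overlap. It's a deliberate simplification and not BM25 itself, and both chunks are hypothetical text I made up for the illustration.

```python
import re


def tokens(text: str) -> list[str]:
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def keyword_overlap(question: str, chunk: str) -> int:
    """Count the distinct question words that also appear in the chunk."""
    return len(set(tokens(question)) & set(tokens(chunk)))


question = "How do I reset my password?"

# Hypothetical chunks, invented for this illustration.
chunks = {
    "basic_reset": "To reset your password, click Forgot Password on the login page.",
    "mfa_step": (
        "If MFA is enabled, confirm the change with your authenticator code "
        "before the new credentials take effect."
    ),
}

scores = {name: keyword_overlap(question, text) for name, text in chunks.items()}
print(scores)
```

The MFA chunk shares no words with the question, so a purely lexical scorer has no reason to surface it, even though it's part of the answer.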
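After reading about retrieval-augmented generation (RAG), I realized the solution wasn’t to train the model on my data. It was to give the model a way to look up the right data at query time. The core idea: embed the document chunks once, retrieve the chunks most similar to each question, and pass them to the LLM as context.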
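To make the lookup step concrete, here's a toy sketch of retrieval at query time. The `embed` function is a stand-in I'm assuming for illustration: it builds a bag-of-words count vector, whereas a real embedding model captures meaning beyond shared words. The chunks are hypothetical, too.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks whose vectors are most similar to the question's."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


# Hypothetical documentation chunks, invented for this illustration.
chunks = [
    "Expired tokens are rejected by the auth service with a 401 response.",
    "The deploy script uploads build artifacts to the staging bucket.",
    "Refresh tokens can be exchanged for a new access token.",
]

top = retrieve("Why are expired tokens rejected?", chunks, k=1)
print(top)
```

The mechanics are the same in the real pipeline: vectors are computed once for the chunks, and each question only costs one embedding call plus a similarity search.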
I’ll walk you through a working prototype using Python, LangChain, OpenAI embeddings, and ChromaDB.
pip install chromadb openai tiktoken langchain langchain-community
For this example, I’ll use a small text file. In practice, you’d use a document loader from LangChain that matches your source format.
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load the raw text file as LangChain Document objects
loader = TextLoader("my_docs.txt")
documents = loader.load()

# Split into chunks of up to 500 characters, with up to 50 characters of overlap
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ".", "!"],
)
chunks = text_splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks")
The overlap reduces the chance that context is lost at chunk boundaries: up to 50 characters from the end of one chunk are repeated at the start of the next.
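Here's a minimal sketch of what overlap does, using a fixed-width character window. This is not how `RecursiveCharacterTextSplitter` works internally (it splits on separators first), and `chunk_with_overlap` is a hypothetical helper I'm using purely for illustration.

```python
def chunk_with_overlap(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into fixed-width windows that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    pieces = []
    start = 0
    while start < len(text):
        pieces.append(text[start:start + chunk_size])
        # Stop once a window reaches the end, so no redundant tail is emitted.
        if start + chunk_size >= len(text):
            break
        start += step
    return pieces


pieces = chunk_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, overlap=3)
for piece in pieces:
    print(piece)
```

Each window starts with the last three characters of the previous one, so text that sits right on a boundary still shows up with some surrounding context in at least one chunk.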
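from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# Requires the OPENAI_API_KEY environment variable to be set
embedding_model = OpenAIEmbeddings(model="text-embedding-ada-002")

# Embed every chunk and store the vectors on disk
vectordb = Chroma.from_documents(
    documents=chunks,
    embedding=embedding_model,
    persist_directory="./chroma_db",
)
# Needed on older Chroma versions; newer ones persist automatically
vectordb.persist()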
from langchain.chains import RetrievalQA
from langchain_community.chat_models import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Retrieve the 4 most similar chunks and pass them all to the LLM in one prompt
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectordb.as_retriever(search_kwargs={"k": 4}),
)

question = "How do I reset my password if I'm on a VPN?"
answer = qa_chain.invoke(question)
# invoke() returns a dict; the generated text is under the "result" key
print(answer["result"])
And that’s it. A working Q&A system in under 50 lines of code.
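If you're wondering what `chain_type="stuff"` does: it places all the retrieved chunks into a single prompt alongside the question. Here's a rough sketch of that idea. The prompt wording and the `build_stuffed_prompt` helper are my own assumptions for illustration, not LangChain's actual template.

```python
def build_stuffed_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Concatenate every retrieved chunk into one prompt, followed by the question."""
    context = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Use only the context below to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical retrieved chunks, invented for this illustration.
prompt = build_stuffed_prompt(
    "How do I reset my password?",
    ["  Click Forgot Password. ", "MFA users must confirm with a code."],
)
print(prompt)
```

This is also why `k` matters: every extra chunk makes the prompt longer, so a large `k` costs more tokens and can push you toward the model's context limit.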
RAG isn’t magic. It has its own pain points. The embedding model is one of them: you can swap in open-source models such as all-MiniLM-L6-v2 from Sentence Transformers, but they’re less accurate.

If I were starting over, I’d do two things differently. First, I’d start with a managed service that handles the embedding and retrieval infrastructure. For example, a platform like Interwest Info AI (https://ai.interwestinfo.com/) abstracts away the vector DB and chunking strategies: you just upload documents and get an API. That would have saved me two weeks of fiddling with ChromaDB quirks and scaling issues.
Second, I’d invest more time in evaluating retrieval quality before building the RAG pipeline. Create a small test set of 20 questions and manually verify which chunks should be retrieved. That tells you if your chunking and embedding model are up to par.
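A check like that can be as small as a hit-rate function. Here's a minimal sketch, assuming each chunk has a stable ID and that you've labeled which IDs should come back for each question. The IDs, the canned results, and `fake_retriever` are all hypothetical stand-ins for your real retriever.

```python
from typing import Callable


def hit_rate_at_k(
    test_set: list[tuple[str, set[str]]],
    retrieve_ids: Callable[[str, int], list[str]],
    k: int = 4,
) -> float:
    """Fraction of questions where at least one expected chunk ID is in the top k."""
    if not test_set:
        raise ValueError("test_set must not be empty")
    hits = 0
    for question, expected_ids in test_set:
        retrieved = set(retrieve_ids(question, k))
        if retrieved & expected_ids:
            hits += 1
    return hits / len(test_set)


# Hypothetical labeled questions: (question, IDs of chunks that should be retrieved).
test_set = [
    ("How do I reset my password?", {"auth-03"}),
    ("Why does auth fail for expired tokens?", {"auth-07", "auth-08"}),
]

# Canned results standing in for a real retriever.
canned = {
    "How do I reset my password?": ["auth-03", "faq-01"],
    "Why does auth fail for expired tokens?": ["deploy-02", "faq-01"],
}


def fake_retriever(question: str, k: int) -> list[str]:
    """Return the first k canned chunk IDs for the question."""
    return canned[question][:k]


score = hit_rate_at_k(test_set, fake_retriever, k=2)
print(score)
```

In the prototype above, the retriever could wrap `vectordb.similarity_search(question, k=k)` and read an ID from each document's metadata, assuming you stored one when you built the chunks.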
Building a document Q&A system from scratch taught me more about the trade-offs in retrieval than any blog post ever could. But now I’m curious: What’s your go-to approach for building a knowledge base chatbot? Are you DIY with LangChain, or do you use a SaaS platform? Let’s discuss in the comments.