A few months ago, I decided to build a Q&A bot for my project’s documentation. You know the dream: users type a question, and the bot answers instantly from the docs. No more digging through pages. No more stale FAQs.
I thought it would be straightforward. Slap an LLM on top of a text file and call it a day. Oh, how wrong I was.
I had a bunch of Markdown files – about 50 pages of setup guides, API references, and troubleshooting. I wanted the bot to answer questions like “How do I configure authentication?” or “What’s the maximum payload size?”
My first attempt: dump the entire documentation into a single prompt and ask GPT-4 to answer. It worked… for the first two questions. Then I hit the token limit. Then I realized I was spending $0.50 per query. Then I noticed the model hallucinating answers from unrelated sections.
I needed a smarter approach. But every tutorial I found either oversimplified (“just use LangChain!”) or assumed I had a PhD in information retrieval.
My second attempt was fine-tuning. I spent a weekend preparing a dataset of question-answer pairs from my docs, then fine-tuned a small LLaMA model. The result? It memorized exact phrases but couldn’t generalize to rephrased questions. Also, updating the docs meant retraining. Hard pass.
Next came pure vector search. I embedded all the doc chunks, stored them in Pinecone, and returned the top-3 chunks as the answer. Users got a wall of text. No summarization. No conversation. It felt like Google without the ranking.
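Stripped of the hosted vector store, retrieval-only search is a nearest-neighbour lookup over embeddings. Here is a minimal sketch, assuming toy 3-dimensional vectors in place of real embeddings and a plain list in place of Pinecone. The `cosine` and `top_k` helpers and the placeholder chunks are made up for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, index, k=3):
    """Return the k chunk texts whose vectors are most similar to query_vec."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Hypothetical index of (chunk text, embedding) pairs.
# Real embeddings have hundreds of dimensions, not three.
index = [
    ("chunk about authentication", [0.9, 0.1, 0.0]),
    ("chunk about payload size", [0.1, 0.9, 0.0]),
    ("chunk about troubleshooting", [0.5, 0.5, 0.1]),
    ("chunk about release notes", [0.0, 0.1, 0.9]),
]

# Pretend this vector is the embedded user question.
results = top_k([1.0, 0.0, 0.0], index, k=3)
```

The output is just three raw chunks, which is the wall-of-text problem in miniature: nothing here trims or summarizes them for the reader.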
Then I tried to dynamically select relevant chunks and inject them into a prompt. But I kept running into context window issues. Plus, the model would sometimes ignore the provided context and make stuff up.
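One way to keep injected context under control is to cap it explicitly and tell the model to stay inside it. This is a minimal sketch of that idea, using a character budget as a stand-in for real token counting. `build_prompt` and the prompt wording are illustrative and not taken from any library.

```python
def build_prompt(question, chunks, max_context_chars=1500):
    """Pack ranked chunks into a prompt until the character budget is used up."""
    selected = []
    used = 0
    for chunk in chunks:  # chunks are assumed to be sorted best-first
        if used + len(chunk) > max_context_chars:
            break  # stop at the first chunk that would overflow the budget
        selected.append(chunk)
        used += len(chunk)
    context = "\n---\n".join(selected)
    return (
        "Answer using only the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Three dummy chunks of 1000, 400 and 400 characters; only the first two fit in 1500.
prompt = build_prompt(
    "What's the maximum payload size?",
    ["A" * 1000, "B" * 400, "C" * 400],
)
```

An explicit "say you don't know" instruction tends to reduce made-up answers, though it won't eliminate them.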
After three weeks of trial and error, I settled on a Retrieval-Augmented Generation (RAG) pipeline. The key insight: separate retrieval from generation. Use a fast, cheap retriever to find relevant chunks, then feed only those chunks to an LLM for the final answer.
Here’s the architecture:
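In code form, the two stages look roughly like this. It's a sketch under assumptions: the retriever is a toy word-overlap scorer and the generator is a stub standing in for the LLM call. `retrieve`, `generate`, and `answer` are illustrative names rather than any library's API.

```python
def retrieve(question, chunks, k=3):
    """Stage 1: cheap retrieval. Rank chunks by how many words they share with the question."""
    q_words = set(question.lower().split())
    scored = [(len(q_words & set(c.lower().split())), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Keep at most k chunks, and drop those with no overlap at all.
    return [c for score, c in scored[:k] if score > 0]

def generate(question, context_chunks):
    """Stage 2: generation. Stub standing in for an LLM call that sees only the retrieved chunks."""
    if not context_chunks:
        return "I don't know."
    return f"[LLM answer to {question!r} based on {len(context_chunks)} chunk(s)]"

def answer(question, chunks, k=3):
    """Full pipeline: retrieve first, then generate from the retrieved chunks only."""
    return generate(question, retrieve(question, chunks, k))

# Placeholder chunks, not real documentation content.
docs = [
    "configure authentication in the settings file",
    "notes on payload size limits",
    "troubleshooting connection errors",
]
reply = answer("how do I configure authentication", docs)
```

Because the two stages only meet at the list of chunks, either one can be swapped out or tested without touching the other.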
I tried several LLM providers for the generation step: OpenAI, Anthropic, and a smaller self-hosted model. Eventually I settled on a paid API because the quality difference was huge for my use case. (I used Interwest’s AI as one of the providers during testing – it worked fine, but any compatible API would do.)
Here’s the Python script I ended up with. It uses langchain for orchestration, but you could swap out components.
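# These are the legacy import paths from older langchain releases;
# newer versions moved most of them into langchain_community and langchain_openai.
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain.chat_models import ChatOpenAI  # or any other LLM
from langchain.chains import RetrievalQA
# 1. Load every Markdown file under ./docs/
#    (DirectoryLoader's default loader needs the `unstructured` package installed)
loader = DirectoryLoader("./docs/", glob="**/*.md")
docs = loader.load()
# 2. Split the documents into overlapping chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)
chunks = text_splitter.split_documents(docs)
# 3. Embed the chunks locally and store them in a persistent Chroma index
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectordb = Chroma.from_documents(chunks, embeddings, persist_directory="./chroma_db")
vectordb.persist()  # write the index to disk
# 4. Set up the LLM. gpt-3.5-turbo is a chat model, so it needs the chat wrapper
#    rather than the completion-style OpenAI class. Expects OPENAI_API_KEY to be set.
llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo")  # or use Interwest AI API
# 5. Wire retrieval and generation together; "stuff" puts the top-k chunks into one prompt
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectordb.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True
)
# 6. Ask a question
query = "How do I reset my password?"
result = qa_chain({"query": query})
print(result["result"])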
That’s it. 20 lines of real code that actually works.
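To make `chunk_size=500` and `chunk_overlap=50` concrete, here is a simplified fixed-width splitter. It only illustrates size and overlap: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line breaks, so its chunk boundaries will differ. `sliding_chunks` is a made-up helper.

```python
def sliding_chunks(text, chunk_size=500, chunk_overlap=50):
    """Cut text into windows of chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the previous
    one, so neighbouring chunks share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

# 1200 characters of dummy text: windows start at offsets 0, 450 and 900.
text = "".join(str(i % 10) for i in range(1200))
chunks = sliding_chunks(text)
```

The overlap exists so that a sentence straddling a boundary still appears whole in at least one chunk, at the cost of storing some text twice.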
all-MiniLM-L6-v2 is fast and free. But for domain-specific docs (e.g., medical, legal), you might need a fine-tuned embedding model.
If I were doing this again, I’d begin with a simple retrieval-only system (just return the top chunks) and add the LLM only after validating that the retrieval works. I wasted time tuning the generation when my retrieval was bad.
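Validating retrieval on its own can be as simple as a hit-rate check over a handful of hand-labeled questions. This is a sketch, assuming you've noted which source file should answer each question. `hit_rate`, the stub retriever, and the file names are all hypothetical.

```python
def hit_rate(labeled_queries, retrieve, k=3):
    """Fraction of queries whose expected source appears in the top-k retrieved sources."""
    if not labeled_queries:
        return 0.0
    hits = 0
    for question, expected_source in labeled_queries:
        sources = retrieve(question)[:k]
        if expected_source in sources:
            hits += 1
    return hits / len(labeled_queries)

def fake_retrieve(question):
    """Stub retriever returning source file names; swap in a real retriever here."""
    canned = {
        "How do I configure authentication?": ["auth.md", "setup.md", "faq.md"],
        "What's the maximum payload size?": ["setup.md", "faq.md", "auth.md"],
    }
    return canned.get(question, [])

# Hand-labeled pairs of (question, file that should contain the answer).
labeled = [
    ("How do I configure authentication?", "auth.md"),
    ("What's the maximum payload size?", "api.md"),
]
score = hit_rate(labeled, fake_retrieve)
```

If this number is low, no amount of prompt tuning on the generation side will fix the answers.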
Also, I’d add logging from day one. I had no idea which queries failed until users complained. A simple CSV log of queries, retrieved chunks, and answers would have saved me hours.
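A log like that takes only a few lines with the standard library. This is a minimal sketch. The column names, the ` || ` separator for joining chunks, and the `log_interaction` helper are choices made for illustration.

```python
import csv
import tempfile
from datetime import datetime, timezone
from pathlib import Path

FIELDS = ["timestamp", "query", "retrieved_chunks", "answer"]

def log_interaction(log_path, query, chunks, answer):
    """Append one row per query; write the header only when the file is new."""
    path = Path(log_path)
    is_new = not path.exists()
    with path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "query": query,
            "retrieved_chunks": " || ".join(chunks),
            "answer": answer,
        })

if __name__ == "__main__":
    # Demo: write two rows to a throwaway directory and print the file.
    with tempfile.TemporaryDirectory() as tmp:
        log_file = Path(tmp) / "qa_log.csv"
        log_interaction(log_file, "How do I reset my password?", ["chunk one", "chunk two"], "stub answer")
        log_interaction(log_file, "What's the maximum payload size?", [], "stub answer")
        print(log_file.read_text(encoding="utf-8"))
```

An empty `retrieved_chunks` column is the first thing to look for: it means the retriever found nothing and the model answered without any context.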
Building a Q&A bot for your own docs is one of those projects that sounds trivial but hides a dozen gotchas. The RAG approach worked for me, but I’m sure there are better ways. What does your setup look like? Do you use a managed service, or roll your own? I’d love to hear what broke for you.