I Built a Q&A Bot for My Docs and Almost Gave Up (Here's What Worked)

A developer built a Retrieval-Augmented Generation (RAG) pipeline for a documentation Q&A bot after several failed attempts, in which direct LLM approaches ran into token limits, high costs, and hallucinations. The final solution separates retrieval from generation, using a fast embedding model to find relevant document chunks before feeding them to an LLM for answers. The developer implemented the system in about 20 lines of Python using LangChain, Chroma for vector storage, and HuggingFace embeddings.

A few months ago, I decided to build a Q&A bot for my project’s documentation. You know the dream: users type a question, and the bot answers instantly from the docs. No more digging through pages. No more stale FAQs. I thought it would be straightforward: slap an LLM on top of a text file and call it a day. Oh, how wrong I was.

I had a bunch of Markdown files: about 50 pages of setup guides, API references, and troubleshooting. I wanted the bot to answer questions like “How do I configure authentication?” or “What’s the maximum payload size?”

My first attempt: dump the entire documentation into a single prompt and ask GPT-4 to answer. It worked… for the first two questions. Then I hit the token limit. Then I realized I was spending $0.50 per query. Then I noticed the model hallucinating answers from unrelated sections.

I needed a smarter approach. But every tutorial I found either oversimplified (“just use LangChain”) or assumed I had a PhD in information retrieval.

So I tried fine-tuning. I spent a weekend preparing a dataset of question-answer pairs from my docs and fine-tuned a small LLaMA model. The result? It memorized exact phrases but couldn’t generalize to rephrased questions. Also, updating the docs meant retraining. Hard pass.

Next, I tried pure vector search. I embedded all the doc chunks, stored them in Pinecone, and returned the top-3 chunks as the answer. Users got a wall of text. No summarization. No conversation. It felt like Google without the ranking.

Then I tried to dynamically select relevant chunks and inject them into a prompt. But I kept running into context window issues. Plus, the model would sometimes ignore the provided context and make stuff up.

After three weeks of trial and error, I settled on a Retrieval-Augmented Generation (RAG) pipeline. The key insight: separate retrieval from generation. Use a fast, cheap retriever to find relevant chunks, then feed only those chunks to an LLM for the final answer.
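To make that split concrete, here is a toy sketch of the two stages using only the standard library. It is not the pipeline I shipped: the word-count cosine similarity stands in for a real embedding model, the sample chunks are invented for illustration, and the names `retrieve` and `build_prompt` are hypothetical. The generation step is left out; the point is that the LLM would only ever see the prompt built from the top chunks.

```python
import math
import re
from collections import Counter

# Invented sample chunks, for illustration only (not from real docs).
CHUNKS = [
    "Authentication is configured in config.yaml under the auth section.",
    "The maximum payload size for API requests is 10 MB.",
    "To install the CLI, run the installer script from the releases page.",
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question, chunks, k=2):
    """Stage 1: cheap retrieval. Rank chunks by similarity, keep the top k."""
    q_vec = Counter(tokenize(question))
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine(q_vec, Counter(tokenize(chunk))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, context_chunks):
    """Stage 2 input: a prompt containing only the retrieved chunks."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer using only the context below. "
        "If the answer is not there, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


question = "What is the maximum payload size?"
top_chunks = retrieve(question, CHUNKS, k=1)
print(build_prompt(question, top_chunks))
```

Because the retriever is a separate function, it can be swapped for an embedding-based one later without touching the prompt or the LLM call.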
Here’s the architecture: load the Markdown docs, split them into chunks, embed the chunks into a vector store, retrieve the top matches for each question, and pass only those matches to the LLM.

I tried several LLM providers for the generation step: OpenAI, Anthropic, and a smaller self-hosted model. Eventually I settled on a paid API because the quality difference was huge for my use case. I used Interwest’s AI (https://ai.interwestinfo.com/) as one of the providers during testing. It worked fine, but any compatible API would do.

Here’s the Python script I ended up with. It uses langchain for orchestration, but you could swap out components.

```python
# Note: these import paths match older LangChain releases; newer versions
# move several of them into separate packages such as langchain_community.
import os

from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain.chat_models import ChatOpenAI  # or any other LLM
from langchain.chains import RetrievalQA

# 1. Load documents
loader = DirectoryLoader("./docs/", glob="**/*.md")
docs = loader.load()

# 2. Split into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = text_splitter.split_documents(docs)

# 3. Create embeddings and vector store
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectordb = Chroma.from_documents(chunks, embeddings, persist_directory="./chroma_db")
vectordb.persist()

# 4. Set up the QA chain
# gpt-3.5-turbo is a chat model, so it goes through the chat wrapper
llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo")  # or use Interwest AI API
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectordb.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True,
)

# 5. Ask a question
query = "How do I reset my password?"
result = qa_chain({"query": query})
print(result["result"])
```

That’s it. 20 lines of real code that actually works.

A note on the embedding model: all-MiniLM-L6-v2 is fast and free. But for domain-specific docs (e.g., medical, legal), you might need a fine-tuned embedding model.

If I were starting over, I’d begin with a simple retrieval-only system (just return the top chunks) and add the LLM only after validating that the retrieval works. I wasted time tuning the generation when my retrieval was bad.
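One way to validate retrieval on its own is a small hit-rate check: for a handful of questions where you know which file holds the answer, see whether that file shows up in the top k results. The sketch below is an illustration, not something from my original setup. The function name `hit_rate_at_k`, the labeled questions, the file paths, and the stand-in retriever are all hypothetical.

```python
def hit_rate_at_k(retrieve_sources, labeled_queries, k=3):
    """Return (hit rate, missed questions).

    retrieve_sources: callable taking a question and returning a ranked
        list of source identifiers (e.g. file paths).
    labeled_queries: iterable of (question, expected_source) pairs.
    """
    hits = 0
    total = 0
    misses = []
    for question, expected_source in labeled_queries:
        total += 1
        if expected_source in retrieve_sources(question)[:k]:
            hits += 1
        else:
            misses.append(question)
    rate = hits / total if total else 0.0
    return rate, misses


# Hypothetical labeled set: question -> file that should be retrieved.
LABELED = [
    ("How do I reset my password?", "docs/troubleshooting.md"),
    ("How do I configure authentication?", "docs/setup.md"),
]

# Stand-in retriever with canned results so the example runs on its own.
FAKE_INDEX = {
    "How do I reset my password?": ["docs/troubleshooting.md", "docs/api.md"],
    "How do I configure authentication?": ["docs/api.md", "docs/faq.md"],
}


def fake_retrieve_sources(question):
    """Return the canned ranked sources for a question (empty if unknown)."""
    return FAKE_INDEX.get(question, [])


rate, misses = hit_rate_at_k(fake_retrieve_sources, LABELED, k=3)
print(f"hit rate@3: {rate:.2f}")
for missed in misses:
    print("missed:", missed)
```

With a real vector store, `retrieve_sources` could wrap something like `vectordb.similarity_search(question, k=3)` and collect each returned document's `metadata["source"]`. That assumes the loader populates a `source` field, which is worth checking in your own setup.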
Also, I’d add logging from day one. I had no idea which queries failed until users complained. A simple CSV log of queries, retrieved chunks, and answers would have saved me hours.

Building a Q&A bot for your own docs is one of those projects that sounds trivial but hides a dozen gotchas. The RAG approach worked for me, but I’m sure there are better ways. What does your setup look like? Do you use a managed service, or roll your own? I’d love to hear what broke for you.