# I Built a Q&A Bot for My Docs and Almost Gave Up (Here's What Worked)

> Source: <https://dev.to/__c1b9e06dc90a7e0a676b/i-built-a-qa-bot-for-my-docs-and-almost-gave-up-heres-what-worked-1kgj>
> Published: 2026-05-30 02:01:00+00:00

A few months ago, I decided to build a Q&A bot for my project’s documentation. You know the dream: users type a question, and the bot answers instantly from the docs. No more digging through pages. No more stale FAQs.

I thought it would be straightforward. Slap an LLM on top of a text file and call it a day. Oh, how wrong I was.

I had a bunch of Markdown files – about 50 pages of setup guides, API references, and troubleshooting. I wanted the bot to answer questions like “How do I configure authentication?” or “What’s the maximum payload size?”

My first attempt: dump the entire documentation into a single prompt and ask GPT-4 to answer. It worked… for the first two questions. Then I hit the token limit. Then I realized I was spending $0.50 per query. Then I noticed the model hallucinating answers from unrelated sections.
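The token-limit and cost problems can be estimated before sending anything. The sketch below is a back-of-the-envelope check, not the code from the original project: the four-characters-per-token ratio is a rough rule of thumb for English text, and the document size, context window, and price are placeholder assumptions to swap for your provider's real numbers.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token count; ~4 characters per token is only a rule of thumb."""
    return int(len(text) / chars_per_token)


def prompt_cost(tokens: int, usd_per_1k_tokens: float) -> float:
    """Cost of sending `tokens` input tokens at a given price per 1,000 tokens."""
    return tokens / 1000 * usd_per_1k_tokens


def fits_context(tokens: int, context_window: int, reserved_for_answer: int = 500) -> bool:
    """True if the prompt still leaves room for the answer in the context window."""
    return tokens + reserved_for_answer <= context_window


if __name__ == "__main__":
    docs_text = "x" * 120_000  # stand-in for the full docs (assumed size)
    tokens = estimate_tokens(docs_text)
    # Both the window size and the price below are hypothetical placeholders.
    print("tokens:", tokens)
    print("fits:", fits_context(tokens, context_window=8_192))
    print("cost per query (USD):", prompt_cost(tokens, usd_per_1k_tokens=0.03))
```

Because the whole corpus is resent with every question, the per-query cost scales with the size of the docs rather than with the size of the question.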

I needed a smarter approach. But every tutorial I found either oversimplified (“just use LangChain!”) or assumed I had a PhD in information retrieval.

Next, I tried fine-tuning. I spent a weekend preparing a dataset of question-answer pairs from my docs, then fine-tuned a small LLaMA model. The result? It memorized exact phrases but couldn’t generalize to rephrased questions. Also, updating the docs meant retraining. Hard pass.

Then I tried plain vector search. I embedded all the doc chunks, stored them in Pinecone, and returned the top-3 chunks as the answer. Users got a wall of text. No summarization. No conversation. It felt like Google without the ranking.

I tried to dynamically select relevant chunks and inject them into a prompt. But I kept running into context window issues. Plus, the model would sometimes ignore the provided context and make stuff up.
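Both problems in that attempt (overflowing the context window and the model ignoring the context) can be reduced at the prompt-building step. The following is a minimal sketch under stated assumptions, not the original code: `pack_chunks` and `build_prompt` are hypothetical helper names, the budget is counted in characters as a crude proxy for tokens, and the instruction wording is just one plausible phrasing.

```python
def pack_chunks(chunks: list[str], max_chars: int) -> list[str]:
    """Keep chunks in ranked order until the character budget is used up."""
    kept, used = [], 0
    for chunk in chunks:
        if used + len(chunk) > max_chars:
            break  # stop at the first chunk that would overflow the budget
        kept.append(chunk)
        used += len(chunk)
    return kept


def build_prompt(question: str, chunks: list[str], max_chars: int = 4000) -> str:
    """Build a prompt that restricts the model to the packed context."""
    context = "\n\n---\n\n".join(pack_chunks(chunks, max_chars))
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, reply \"I don't know.\"\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

An explicit "I don't know" escape hatch tends to reduce made-up answers, but it does not eliminate them, so it is worth checking answers against the retrieved chunks.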

After three weeks of trial and error, I settled on a Retrieval-Augmented Generation (RAG) pipeline. The key insight: **separate retrieval from generation**. Use a fast, cheap retriever to find relevant chunks, then feed only those chunks to an LLM for the final answer.
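To make the separation concrete, here is a dependency-free sketch of the two stages. It is an illustration only: word-count cosine similarity stands in for real embeddings, `generate` is a hypothetical callable that wraps whatever LLM API you use, and the sample chunks are made up.

```python
import math
import re
from collections import Counter
from typing import Callable


def tokenize(text: str) -> Counter:
    """Lowercased word counts; a crude stand-in for an embedding vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: list[str], k: int = 3) -> list[str]:
    """Stage 1: cheap retrieval of the k chunks most similar to the question."""
    q = tokenize(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, tokenize(c)), reverse=True)
    return ranked[:k]


def answer(question: str, chunks: list[str], generate: Callable[[str], str], k: int = 3) -> str:
    """Stage 2: the LLM sees only the retrieved chunks, never the whole corpus."""
    context = "\n\n".join(retrieve(question, chunks, k))
    return generate(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")


if __name__ == "__main__":
    sample_chunks = [  # made-up example content
        "To reset your password, open Settings and choose Reset.",
        "Authentication is configured in the config file.",
        "Install the package with pip.",
    ]
    # Echo the prompt instead of calling a real model.
    print(answer("How do I reset my password?", sample_chunks, generate=lambda p: p, k=1))
```

Because the two stages only meet at the `generate(prompt)` call, either side can be swapped (a different vector store, a different LLM provider) without touching the other.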

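Here’s the architecture: load the Markdown files, split them into chunks, embed the chunks into a vector store, retrieve the top-k chunks for each question, and pass only those chunks to the LLM.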

I tried several LLM providers for the generation step: OpenAI, Anthropic, and a smaller self-hosted model. Eventually I settled on a paid API because the quality difference was huge for my use case. (I used [Interwest’s AI](https://ai.interwestinfo.com/) as one of the providers during testing – it worked fine, but any compatible API would do.)

Here’s the Python script I ended up with. It uses `langchain` for orchestration, but you could swap out components.

``` python
# Imports use the legacy `langchain` paths; newer releases moved most of these
# classes into `langchain_community` and `langchain_openai`.
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain.chat_models import ChatOpenAI  # or any other chat LLM
from langchain.chains import RetrievalQA

# 1. Load documents (every Markdown file under ./docs/, recursively)
loader = DirectoryLoader("./docs/", glob="**/*.md")
docs = loader.load()

# 2. Split into chunks of up to 500 characters; the 50-character overlap
#    keeps text that straddles a boundary visible in both chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)
chunks = text_splitter.split_documents(docs)

# 3. Create embeddings and a vector store persisted to disk
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectordb = Chroma.from_documents(chunks, embeddings, persist_directory="./chroma_db")
vectordb.persist()

# 4. Set up the QA chain. gpt-3.5-turbo is a chat model, so it needs the
#    chat wrapper; the API key is read from the OPENAI_API_KEY env variable.
llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo")  # or use Interwest AI API
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # "stuff" pastes all retrieved chunks into one prompt
    retriever=vectordb.as_retriever(search_kwargs={"k": 3}),  # top-3 chunks
    return_source_documents=True
)

# 5. Ask a question
query = "How do I reset my password?"
result = qa_chain({"query": query})
print(result["result"])
```

That’s it. 20 lines of real code that actually works.
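If `chunk_size=500` and `chunk_overlap=50` feel abstract, the sketch below shows what the overlap does using plain fixed-size character windows. It is a simplification, not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers to break at separators such as paragraph and line boundaries, so its chunks will differ. `chunk_text` is a hypothetical helper name.

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Split text into fixed-size windows that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = "".join(str(i % 10) for i in range(1000))
    pieces = chunk_text(sample)
    # The tail of one chunk is repeated as the head of the next.
    print(len(pieces), pieces[0][-50:] == pieces[1][:50])
```

The overlap means a sentence cut by a chunk boundary still appears whole in at least one chunk, at the cost of storing and embedding some text twice.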

`all-MiniLM-L6-v2` is fast and free. But for domain-specific docs (e.g., medical, legal), you might need a fine-tuned embedding model.

If I did this again, I’d start with a simple retrieval-only system (just return the top chunks) and add the LLM only after validating that the retrieval works. I wasted time tuning the generation when my retrieval was bad.
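One way to validate retrieval on its own is a small hit-rate check: for a handful of questions, does a snippet you expect appear in any of the top-k chunks? This is a sketch under assumptions, not the original workflow: `hit_rate_at_k` is a hypothetical helper, `retrieve` is any function you supply that maps a question and `k` to a list of chunk texts, and the substring match is a deliberately crude relevance test.

```python
from typing import Callable


def hit_rate_at_k(
    test_cases: list[tuple[str, str]],
    retrieve: Callable[[str, int], list[str]],
    k: int = 3,
) -> float:
    """Fraction of questions whose expected snippet appears in any top-k chunk."""
    if not test_cases:
        return 0.0
    hits = 0
    for question, expected_snippet in test_cases:
        top_chunks = retrieve(question, k)
        # Case-insensitive substring match counts as a hit.
        if any(expected_snippet.lower() in chunk.lower() for chunk in top_chunks):
            hits += 1
    return hits / len(test_cases)
```

With a LangChain vector store, `retrieve` could be a thin wrapper that calls `similarity_search(question, k=k)` and returns each document’s `page_content`. If the hit rate is low, no amount of prompt tuning on the generation side will fix the answers.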

Also, I’d add logging from day one. I had no idea which queries failed until users complained. A simple CSV log of queries, retrieved chunks, and answers would have saved me hours.
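A log like that needs only the standard library. The sketch below is one possible shape, not the original code: `log_interaction`, the column names, and the `" ||| "` separator used to fit several chunks into one cell are all illustrative choices.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path

LOG_FIELDS = ["timestamp", "query", "retrieved_chunks", "answer"]


def log_interaction(log_path: str, query: str, retrieved_chunks: list[str], answer: str) -> None:
    """Append one row per question; write the header only for a new or empty file."""
    path = Path(log_path)
    write_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=LOG_FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "query": query,
            "retrieved_chunks": " ||| ".join(retrieved_chunks),
            "answer": answer,
        })
```

Calling `log_interaction("qa_log.csv", query, chunk_texts, answer_text)` right after each answer is generated gives you a file you can open in a spreadsheet and scan for questions where the retrieved chunks were clearly off-topic.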

Building a Q&A bot for your own docs is one of those projects that sounds trivial but hides a dozen gotchas. The RAG approach worked for me, but I’m sure there are better ways. What does your setup look like? Do you use a managed service, or roll your own? I’d love to hear what broke for you.
