Source: dev.to

I Built a Q&A Bot for My Docs and Almost Gave Up (Here's What Worked)

A developer built a Retrieval-Augmented Generation (RAG) pipeline for a documentation Q&A bot after multiple failed attempts, including token limits, high costs, and hallucination issues with direct LLM approaches. The final solution separates retrieval from generation, using a fast embedding model to find relevant document chunks before feeding them to an LLM for answers. The developer implemented the system in 20 lines of Python code using LangChain, Chroma for vector storage, and HuggingFace embeddings.

3 min read · Published May 30, 2026

A few months ago, I decided to build a Q&A bot for my project’s documentation. You know the dream: users type a question, and the bot answers instantly from the docs. No more digging through pages. No more stale FAQs.

I thought it would be straightforward. Slap an LLM on top of a text file and call it a day. Oh, how wrong I was.

I had a bunch of Markdown files – about 50 pages of setup guides, API references, and troubleshooting. I wanted the bot to answer questions like “How do I configure authentication?” or “What’s the maximum payload size?”

My first attempt: dump the entire documentation into a single prompt and ask GPT-4 to answer. It worked… for the first two questions. Then I hit the token limit. Then I realized I was spending $0.50 per query. Then I noticed the model hallucinating answers from unrelated sections.
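To see why stuffing everything into one prompt gets expensive fast, here is a rough back-of-the-envelope sketch. The four-characters-per-token ratio is a common rule of thumb for English text, not a real tokenizer count, and both the document size and the price per 1,000 tokens are placeholder assumptions; check your provider's current pricing.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; a real tokenizer will give a different number."""
    return int(len(text) / chars_per_token)


def estimate_prompt_cost(text: str, usd_per_1k_tokens: float) -> float:
    """Input-side cost in USD of sending `text` as a single prompt."""
    return estimate_tokens(text) / 1000 * usd_per_1k_tokens


# Hypothetical numbers: ~64,000 characters of docs at $0.03 per 1K input tokens.
docs_text = "x" * 64_000
print(estimate_tokens(docs_text))
print(round(estimate_prompt_cost(docs_text, usd_per_1k_tokens=0.03), 2))
```

With these placeholder numbers the estimate is about 16,000 tokens and roughly $0.48 per query, and that cost is paid again on every single question.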

I needed a smarter approach. But every tutorial I found either oversimplified (“just use LangChain!”) or assumed I had a PhD in information retrieval.

My second attempt was fine-tuning. I spent a weekend preparing a dataset of question-answer pairs from my docs, then fine-tuned a small LLaMA model. The result? It memorized exact phrases but couldn’t generalize to rephrased questions. Also, updating the docs meant retraining. Hard pass.

Next I tried plain vector search. I embedded all the doc chunks, stored them in Pinecone, and returned the top-3 chunks as the answer. Users got a wall of text. No summarization. No conversation. It felt like Google without the ranking.
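For readers who haven't seen it spelled out, the retrieval-only step boils down to ranking chunks by similarity to the query vector. The sketch below uses plain Python and toy three-dimensional vectors in place of real embeddings; the chunk ids are made up for illustration, and a vector database does the same thing at scale with approximate search.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], chunk_vecs: dict[str, list[float]], k: int = 3) -> list[str]:
    """Return the ids of the k chunks most similar to the query, best first."""
    ranked = sorted(
        chunk_vecs,
        key=lambda chunk_id: cosine_similarity(query_vec, chunk_vecs[chunk_id]),
        reverse=True,
    )
    return ranked[:k]


# Toy vectors standing in for real embeddings (hypothetical chunk ids).
chunks = {
    "auth.md#1": [0.9, 0.1, 0.0],
    "limits.md#1": [0.0, 0.2, 0.9],
    "setup.md#1": [0.5, 0.5, 0.1],
}
print(top_k([1.0, 0.0, 0.0], chunks, k=2))
```

This returns chunk ids only, which is exactly the limitation described above: the user still has to read the raw chunks themselves.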

I tried to dynamically select relevant chunks and inject them into a prompt. But I kept running into context window issues. Plus, the model would sometimes ignore the provided context and make stuff up.
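One common way to attack both problems is to pack chunks into the prompt only while a size budget allows it, and to tell the model explicitly to stay inside the context. The sketch below is an illustration, not the code from this attempt: the character budget, the separator, and the instruction wording are all assumptions, and an instruction like this reduces made-up answers without eliminating them.

```python
def build_prompt(question: str, chunks: list[str], max_context_chars: int = 2000) -> str:
    """Pack ranked chunks into a prompt until the character budget is used up."""
    selected = []
    used = 0
    for chunk in chunks:  # assumed to be sorted best-first
        if used + len(chunk) > max_context_chars:
            break  # stop at the first chunk that would overflow the budget
        selected.append(chunk)
        used += len(chunk)

    context = "\n---\n".join(selected)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

A character budget is a crude proxy for a token budget; swapping in a real tokenizer count would make the limit more accurate.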

After three weeks of trial and error, I settled on a Retrieval-Augmented Generation (RAG) pipeline. The key insight: separate retrieval from generation. Use a fast, cheap retriever to find relevant chunks, then feed only those chunks to an LLM for the final answer.
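The separation can be sketched as two plain callables, one for each stage, so either side can be swapped without touching the other. Everything here is illustrative: `answer`, `fake_retrieve`, and `fake_generate` are hypothetical names, and the stand-ins exist only so the sketch runs without an embedding model or an LLM API.

```python
from typing import Callable

Retriever = Callable[[str], list[str]]  # question -> relevant chunks
Generator = Callable[[str], str]        # prompt -> answer text


def answer(question: str, retrieve: Retriever, generate: Generator) -> str:
    """Retrieve first, then generate from only the retrieved chunks."""
    chunks = retrieve(question)
    context = "\n\n".join(chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return generate(prompt)


# Stand-ins so the sketch runs without any external service.
def fake_retrieve(question: str) -> list[str]:
    return ["Passwords can be reset from the account settings page."]


def fake_generate(prompt: str) -> str:
    return f"(model output for a prompt of {len(prompt)} characters)"


print(answer("How do I reset my password?", fake_retrieve, fake_generate))
```

Because the generator only ever sees the prompt, the cost per query depends on the size of the retrieved chunks rather than the size of the whole documentation set.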

Here’s the architecture:

I tried several LLM providers for the generation step: OpenAI, Anthropic, and a smaller self-hosted model. Eventually I settled on a paid API because the quality difference was huge for my use case. (I used Interwest’s AI as one of the providers during testing – it worked fine, but any compatible API would do.)

Here’s the Python script I ended up with. It uses langchain for orchestration, but you could swap out components.

import os
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain.llms import OpenAI  # or any other LLM
from langchain.chains import RetrievalQA

# Load every Markdown file under ./docs/
loader = DirectoryLoader("./docs/", glob="**/*.md")
docs = loader.load()

# Split the docs into ~500-character chunks with a 50-character overlap
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)
chunks = text_splitter.split_documents(docs)

# Embed the chunks and store them in a local Chroma index
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectordb = Chroma.from_documents(chunks, embeddings, persist_directory="./chroma_db")
vectordb.persist()

# Retrieve the top 3 chunks and "stuff" them into a single prompt for the LLM
llm = OpenAI(temperature=0, model="gpt-3.5-turbo")  # or use Interwest AI API
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectordb.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True
)

# Ask a question and print the generated answer
query = "How do I reset my password?"
result = qa_chain({"query": query})
print(result["result"])

That’s it. 20 lines of real code that actually works.
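Since the chain is created with return_source_documents=True, the result also carries the retrieved chunks, which makes it easy to show users where an answer came from. The helper below is a hypothetical addition, not part of the original script; it assumes each returned document exposes a `metadata` dict with a `source` entry, and `FakeDoc` is a stand-in so the sketch runs without LangChain installed.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDoc:
    """Stand-in with the same two attributes a LangChain Document exposes."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_sources(source_documents) -> list[str]:
    """Return the distinct `source` metadata values, in retrieval order."""
    seen = []
    for doc in source_documents:
        source = doc.metadata.get("source", "unknown")
        if source not in seen:
            seen.append(source)
    return seen


# With the script above, this would be:
#     print(format_sources(result["source_documents"]))
demo = [
    FakeDoc("...", {"source": "docs/auth.md"}),
    FakeDoc("...", {"source": "docs/auth.md"}),
    FakeDoc("...", {}),
]
print(format_sources(demo))
```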

all-MiniLM-L6-v2 is fast and free. But for domain-specific docs (e.g., medical, legal), you might need a fine-tuned embedding model.

If I were starting over, I’d begin with a simple retrieval-only system (just return the top chunks) and add the LLM only after validating that the retrieval works. I wasted time tuning the generation when my retrieval was bad.
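One lightweight way to validate retrieval before adding an LLM is a hit-rate check: write down a handful of questions with the file that should answer each one, then count how often that file appears in the top k results. This is a minimal sketch under assumptions; the test questions, file names, and the `retrieve(question, k)` signature are all hypothetical, and the fake retriever only exists to make the example runnable.

```python
from typing import Callable


def hit_rate_at_k(
    test_cases: list[tuple[str, str]],
    retrieve: Callable[[str, int], list[str]],
    k: int = 3,
) -> float:
    """Fraction of questions whose expected source shows up in the top k results."""
    if not test_cases:
        return 0.0
    hits = 0
    for question, expected_source in test_cases:
        if expected_source in retrieve(question, k):
            hits += 1
    return hits / len(test_cases)


# Hypothetical test set and a canned retriever standing in for the vector store.
cases = [
    ("How do I configure authentication?", "auth.md"),
    ("What's the maximum payload size?", "limits.md"),
]
canned = {
    "How do I configure authentication?": ["auth.md", "setup.md", "faq.md"],
    "What's the maximum payload size?": ["setup.md", "faq.md", "auth.md"],
}


def fake_retrieve(question: str, k: int) -> list[str]:
    return canned.get(question, [])[:k]


print(hit_rate_at_k(cases, fake_retrieve, k=3))
```

A low score here points at chunking or embeddings, which is worth knowing before spending any time on prompts.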

Also, I’d add logging from day one. I had no idea which queries failed until users complained. A simple CSV log of queries, retrieved chunks, and answers would have saved me hours.
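A CSV log like that takes only a few lines with the standard library. The sketch below is one possible shape, not the logging I actually shipped: the column order, the " | " separator between chunks, and the file name in the usage comment are assumptions.

```python
import csv
from datetime import datetime, timezone
from typing import TextIO


def log_interaction(out: TextIO, query: str, chunks: list[str], answer: str) -> None:
    """Write one row: UTC timestamp, query, retrieved chunks, answer."""
    writer = csv.writer(out)
    writer.writerow([
        datetime.now(timezone.utc).isoformat(),
        query,
        " | ".join(chunks),
        answer,
    ])


# Usage (file name is a placeholder):
# with open("qa_log.csv", "a", newline="", encoding="utf-8") as f:
#     log_interaction(f, query, retrieved_texts, answer_text)
```

Opening the file in append mode with newline="" lets the csv module handle quoting and line endings, so chunks that contain commas or line breaks still land in a single cell.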

Building a Q&A bot for your own docs is one of those projects that sounds trivial but hides a dozen gotchas. The RAG approach worked for me, but I’m sure there are better ways. What’s your setup look like? Do you use a managed service, or roll your own? I’d love to hear what broke for you.
