I Built a Q&A Bot for My Docs and Almost Gave Up (Here's What Worked)

Source: dev.to · Published: 2026-05-30T02:01Z · Topic: large-language-models

A developer built a Retrieval-Augmented Generation (RAG) pipeline for a documentation Q&A bot after multiple failed attempts, including token limits, high costs, and hallucination issues with direct LLM approaches. The final solution separates retrieval from generation, using a fast embedding model to find relevant document chunks before feeding them to an LLM for answers. The developer implemented the system in 20 lines of Python code using LangChain, Chroma for vector storage, and HuggingFace embeddings.

3 min read · 24 views · Published May 30, 2026

A few months ago, I decided to build a Q&A bot for my project’s documentation. You know the dream: users type a question, and the bot answers instantly from the docs. No more digging through pages. No more stale FAQs.

I thought it would be straightforward. Slap an LLM on top of a text file and call it a day. Oh, how wrong I was.

I had a bunch of Markdown files – about 50 pages of setup guides, API references, and troubleshooting. I wanted the bot to answer questions like “How do I configure authentication?” or “What’s the maximum payload size?”

My first attempt: dump the entire documentation into a single prompt and ask GPT-4 to answer. It worked… for the first two questions. Then I hit the token limit. Then I realized I was spending $0.50 per query. Then I noticed the model hallucinating answers from unrelated sections.

I needed a smarter approach. But every tutorial I found either oversimplified (“just use LangChain!”) or assumed I had a PhD in information retrieval.

My second attempt was fine-tuning. I spent a weekend preparing a dataset of question-answer pairs from my docs, then fine-tuned a small LLaMA model. The result? It memorized exact phrases but couldn’t generalize to rephrased questions. Also, updating the docs meant retraining. Hard pass.

Next I tried retrieval on its own. I embedded all the doc chunks, stored them in Pinecone, and returned the top-3 chunks as the answer. Users got a wall of text. No summarization. No conversation. It felt like Google without the ranking.

I tried to dynamically select relevant chunks and inject them into a prompt. But I kept running into context window issues. Plus, the model would sometimes ignore the provided context and make stuff up.
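To make the context-window problem concrete, here is a minimal sketch of one way to keep injected chunks inside a fixed budget and to tell the model to stay within them. It is not the code from that attempt: `pack_chunks`, `build_prompt`, and the 2000-character budget are hypothetical names and values, and character counts stand in for real token counts.

```python
def pack_chunks(ranked_chunks, max_chars=2000):
    """Greedily keep the best-ranked chunks that still fit in the budget.

    Assumes ranked_chunks is already sorted best-first. Characters are a
    rough stand-in for tokens; a real system would count tokens instead.
    """
    selected, used = [], 0
    for chunk in ranked_chunks:
        if used + len(chunk) > max_chars:
            continue  # this chunk would overflow the budget, try the next one
        selected.append(chunk)
        used += len(chunk)
    return selected


def build_prompt(question, chunks):
    """Build a prompt that asks the model to answer only from the context."""
    context = "\n\n---\n\n".join(chunks)
    return (
        "Answer using only the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


if __name__ == "__main__":
    ranked = ["a" * 1200, "b" * 900, "c" * 700]  # made-up chunk sizes
    kept = pack_chunks(ranked, max_chars=2000)
    print([len(c) for c in kept])  # prints [1200, 700]
```

The explicit "say you don't know" instruction reduces, but does not eliminate, the chance that the model ignores the context.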

After three weeks of trial and error, I settled on a Retrieval-Augmented Generation (RAG) pipeline. The key insight: separate retrieval from generation. Use a fast, cheap retriever to find relevant chunks, then feed only those chunks to an LLM for the final answer.
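The separation can be sketched without any framework. The example below is an illustration only: it swaps the embedding model for a toy word-count similarity, and `retrieve`, `answer`, and the `generate` callable are hypothetical names, with `generate` standing in for whatever LLM API is used.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count mappings."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question, chunks, k=3):
    """Stage 1: cheap retrieval. Return the k chunks most similar to the question."""
    q = Counter(tokenize(question))
    ranked = sorted(
        chunks, key=lambda c: cosine(q, Counter(tokenize(c))), reverse=True
    )
    return ranked[:k]


def answer(question, chunks, generate, k=3):
    """Stage 2: generation. `generate` is any callable mapping a prompt to text."""
    context = "\n\n".join(retrieve(question, chunks, k))
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return generate(prompt)


if __name__ == "__main__":
    # Made-up documentation snippets, for illustration only.
    docs = [
        "To configure authentication, set the API key in the settings file.",
        "The maximum payload size is listed in the API reference.",
        "Troubleshooting: restart the service if the port is in use.",
    ]
    # Stand-in for an LLM call: echo the prompt so the flow is visible.
    print(answer("How do I configure authentication?", docs, generate=lambda p: p, k=1))
```

Because the two stages only share a list of strings, either one can be replaced (a real embedding model for stage 1, a different LLM for stage 2) without touching the other.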

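Here’s the architecture: load the Markdown files, split them into overlapping chunks, embed each chunk, and store the vectors in a local database. At query time, retrieve the top 3 chunks and pass only those to the LLM.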

I tried several LLM providers for the generation step: OpenAI, Anthropic, and a smaller self-hosted model. Eventually I settled on a paid API because the quality difference was huge for my use case. (I used Interwest’s AI as one of the providers during testing – it worked fine, but any compatible API would do.)

Here’s the Python script I ended up with. It uses langchain for orchestration, but you could swap out components.

import os
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain.llms import OpenAI  # or any other LLM
from langchain.chains import RetrievalQA

# Load every Markdown file under ./docs/
loader = DirectoryLoader("./docs/", glob="**/*.md")
docs = loader.load()

# Split the docs into ~500-character chunks with a 50-character overlap
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)
chunks = text_splitter.split_documents(docs)

# Embed the chunks and store them in a local Chroma database
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectordb = Chroma.from_documents(chunks, embeddings, persist_directory="./chroma_db")
vectordb.persist()

# Retrieve the top 3 chunks and "stuff" them into a single prompt
llm = OpenAI(temperature=0, model="gpt-3.5-turbo")  # or use Interwest AI API
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectordb.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True
)

# Ask a question and print the generated answer
query = "How do I reset my password?"
result = qa_chain({"query": query})
print(result["result"])

That’s it. 20 lines of real code that actually works.
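The `chunk_size=500` and `chunk_overlap=50` settings in the script are easier to picture with a simplified splitter. The sketch below is a plain fixed-size sliding window, not what `RecursiveCharacterTextSplitter` actually does (it tries to break on separators such as paragraphs and line breaks first); `split_with_overlap` is a hypothetical helper used only to show how the two numbers interact.

```python
def split_with_overlap(text, chunk_size=500, chunk_overlap=50):
    """Split text into fixed-size character windows that overlap their neighbors."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = "".join(chr(97 + i % 26) for i in range(1200))  # 1200 made-up characters
    parts = split_with_overlap(sample)
    print([len(p) for p in parts])          # prints [500, 500, 300]
    print(parts[0][-50:] == parts[1][:50])  # prints True
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.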

all-MiniLM-L6-v2 is fast and free. But for domain-specific docs (e.g., medical, legal), you might need a fine-tuned embedding model.

If I were starting over, I’d begin with a simple retrieval-only system (just return the top chunks) and add the LLM only after validating that the retrieval works. I wasted time tuning the generation when my retrieval was bad.
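One way to validate retrieval before adding an LLM is to check how often the expected passage shows up in the top results. This is a sketch under assumptions: `hit_rate_at_k` and the test-case format (a question paired with a snippet that should appear in a retrieved chunk) are hypothetical, and `retriever` is any function that takes a question and `k` and returns a list of chunk strings.

```python
def hit_rate_at_k(test_cases, retriever, k=3):
    """Fraction of questions whose expected snippet appears in the top-k chunks."""
    if not test_cases:
        return 0.0
    hits = 0
    for question, expected_snippet in test_cases:
        top_chunks = retriever(question, k)
        if any(expected_snippet in chunk for chunk in top_chunks):
            hits += 1
    return hits / len(test_cases)


if __name__ == "__main__":
    # Fake retriever with made-up chunks, standing in for a real vector search.
    def fake_retriever(question, k):
        return ["Reset your password from the login page.", "Unrelated chunk."][:k]

    cases = [
        ("How do I reset my password?", "Reset your password"),
        ("What is the maximum payload size?", "maximum payload"),
    ]
    print(hit_rate_at_k(cases, fake_retriever))  # prints 0.5
```

With the script above, the retriever could plausibly be a thin wrapper that calls `vectordb.similarity_search(question, k=k)` and returns each document's `page_content`.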

Also, I’d add logging from day one. I had no idea which queries failed until users complained. A simple CSV log of queries, retrieved chunks, and answers would have saved me hours.
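A CSV log like that takes only a few lines with the standard library. This is a minimal sketch, not the logging the project ended up with: `log_interaction`, the column names, and the ` ||| ` separator between chunks are all assumptions.

```python
import csv
import os
from datetime import datetime, timezone


def log_interaction(path, query, chunks, answer):
    """Append one row per query: timestamp, query, retrieved chunks, answer."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if is_new:
            # Write the header only once, when the file is first created.
            writer.writerow(["timestamp", "query", "retrieved_chunks", "answer"])
        writer.writerow([
            datetime.now(timezone.utc).isoformat(),
            query,
            " ||| ".join(chunks),  # keep all retrieved chunks in a single cell
            answer,
        ])
```

With the script above, a call might look like `log_interaction("qa_log.csv", query, [d.page_content for d in result["source_documents"]], result["result"])`, since the chain was built with `return_source_documents=True`.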

Building a Q&A bot for your own docs is one of those projects that sounds trivial but hides a dozen gotchas. The RAG approach worked for me, but I’m sure there are better ways. What’s your setup look like? Do you use a managed service, or roll your own? I’d love to hear what broke for you.

Read original on dev.to → dev.to/__c1b9e06dc90a7e0a676b/i-built-a-qa-bot-f…
