A few months ago, my team hit a wall: our internal documentation had grown into a chaotic jungle of Confluence pages, Google Docs, and Slack threads. New hires took weeks just to find basic answers. I thought, "Let's build an AI chatbot." Simple, right?
Spoiler: It took two months of trial and error. But I learned a ton about retrieval-augmented generation (RAG) - and what actually makes it work in production.
We needed a system where a user could ask "How do I reset my VPN password?" and get a concise answer with a citation. Not a summary of our entire policy, not a hallucinated step - just the exact procedure.
I grabbed every doc, chunked them into 500-character pieces with 50-character overlap, generated embeddings with text-embedding-ada-002, and loaded them into a Pinecone index. Then I used GPT-4 to answer based on retrieved chunks.
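For reference, the naive fixed-size chunking step can be sketched roughly as below. This is a minimal illustration, not the exact code I ran: the function name `fixed_size_chunks` is made up for this post, and the embedding and Pinecone upload steps are left out.

```python
def fixed_size_chunks(text, size=500, overlap=50):
    """Split text into character windows of `size`, each overlapping the previous by `overlap`."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

Note that the window boundaries ignore sentences, tables, and lists entirely, which is where the trouble later in this post comes from.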
It worked - on the first query. Then the bill came. We had 50,000 pages of docs. Embedding generation alone cost hundreds of dollars. And GPT-4 usage? Let’s just say the CTO started asking questions.
I switched to LlamaIndex with sentence-transformers/all-MiniLM-L6-v2 for embeddings and a local 7B model via Ollama. No API costs! But accuracy dropped drastically. The local model couldn't follow complex instructions, and the embeddings failed to capture domain-specific jargon (like "VPN PAM" or "SAML SSO").
After banging my head against recall and latency, I realized the answer wasn't a better model - it was better retrieval.
Fixed chunk sizes don't work for technical docs. A fixed-size window can easily cut a table in half. I switched to semantic chunking based on spaCy sentence boundaries: sentences are merged into a chunk until adding the next one would push it past 250 tokens (approximated by whitespace-separated words in the code). This preserved context.
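```python
import spacy

# Small English pipeline; used here only for sentence segmentation.
nlp = spacy.load("en_core_web_sm")


def semantic_chunker(text, max_tokens=250):
    """Yield chunks made of whole sentences, each up to roughly max_tokens long."""
    doc = nlp(text)
    sentences = [sent.text for sent in doc.sents]
    chunk = []
    token_count = 0
    for sent in sentences:
        # Whitespace word count is a cheap stand-in for a real tokenizer.
        tokens = len(sent.split())
        if token_count + tokens > max_tokens and chunk:
            # Adding this sentence would overflow, so emit the current chunk first.
            yield " ".join(chunk)
            chunk = [sent]
            token_count = tokens
        else:
            chunk.append(sent)
            token_count += tokens
    if chunk:
        # Emit whatever is left after the last sentence.
        yield " ".join(chunk)
```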
Vector search is great for synonyms and concepts, but terrible for exact matches like "VPN password reset". BM25 catches exact keywords but misses semantic similarity. Together, they're gold.
I built a simple hybrid retriever:
```python
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer


class HybridRetriever:
    def __init__(self, chunks, embed_model_name="all-MiniLM-L6-v2"):
        self.chunks = chunks
        self.embedder = SentenceTransformer(embed_model_name)
        # Unit-length vectors make the dot product below a cosine similarity.
        self.embeddings = self.embedder.encode(
            chunks, show_progress_bar=True, normalize_embeddings=True
        )
        # Lowercase so BM25 keyword matching is case-insensitive.
        tokenized = [chunk.lower().split() for chunk in chunks]
        self.bm25 = BM25Okapi(tokenized)

    def retrieve(self, query, top_k=5):
        q_emb = self.embedder.encode([query], normalize_embeddings=True)
        vec_scores = np.dot(self.embeddings, q_emb.T).flatten()
        bm25_scores = self.bm25.get_scores(query.lower().split())
        # Min-max scale both score sets to [0, 1] so neither dominates the sum.
        vec_scores = (vec_scores - vec_scores.min()) / (vec_scores.max() - vec_scores.min() + 1e-8)
        bm25_scores = (bm25_scores - bm25_scores.min()) / (bm25_scores.max() - bm25_scores.min() + 1e-8)
        # Equal weighting of semantic and keyword scores.
        combined = 0.5 * vec_scores + 0.5 * bm25_scores
        top_indices = np.argsort(combined)[-top_k:][::-1]
        return [self.chunks[i] for i in top_indices]
```
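To make the score-fusion step in `retrieve` easier to reason about, here is a dependency-free toy version of the same idea. The helper names (`min_max`, `fuse`, `top_k`) and the `alpha` weight are my own for illustration. Unlike the class above, it handles a constant score list with an explicit check instead of a small epsilon.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; a constant (or empty) list maps to zeros."""
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    span = hi - lo
    if span == 0:
        return [0.0 for _ in scores]
    return [(s - lo) / span for s in scores]


def fuse(vec_scores, bm25_scores, alpha=0.5):
    """Weighted sum of normalized scores; alpha is the weight on the vector side."""
    v = min_max(vec_scores)
    b = min_max(bm25_scores)
    return [alpha * x + (1 - alpha) * y for x, y in zip(v, b)]


def top_k(scores, k):
    """Indices of the k highest scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]
```

With vector scores `[0.82, 0.80, 0.35]` and BM25 scores `[0.0, 7.5, 2.5]`, the second chunk ranks first after fusion even though it is slightly behind on vector similarity alone, because it is the only one with a strong keyword match. The 0.5/0.5 split is a starting point; the right weight probably depends on how jargon-heavy your queries are.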
Then I added a reranking step using a cross-encoder (cross-encoder/ms-marco-MiniLM-L-6-v2). This scored the retrieved chunks against the query and reordered them. It added ~200ms but doubled the answer quality.
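A reranking step along these lines might look like the sketch below. It is not my production code: `rerank` and `load_cross_encoder_scorer` are illustrative names, and the scorer is passed in as a plain callable so the logic can be tried without downloading a model. The only library calls assumed are `CrossEncoder(model_name)` and its `predict()` method from sentence-transformers, which takes a list of (query, passage) pairs and returns one score per pair.

```python
def rerank(query, chunks, score_pairs, top_n=3):
    """Reorder chunks by relevance to the query and keep the best top_n.

    score_pairs: callable that takes a list of (query, chunk) tuples
    and returns one relevance score per tuple.
    """
    if not chunks:
        return []
    scores = score_pairs([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [(chunk, float(score)) for chunk, score in ranked[:top_n]]


def load_cross_encoder_scorer(model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"):
    """Return a scoring callable backed by a sentence-transformers cross-encoder."""
    # Imported lazily so rerank() itself works without the dependency installed.
    from sentence_transformers import CrossEncoder

    model = CrossEncoder(model_name)
    return model.predict
```

In use, you would call `rerank(query, retriever.retrieve(query, top_k=20), load_cross_encoder_scorer())`, loading the scorer once at startup rather than per request. Retrieving more candidates than you finally keep gives the reranker something to work with.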
I spent weeks tuning embedding models and chunk sizes. What finally made the system usable was a simple fallback: if the top-ranked chunk's relevance score fell below a threshold, the bot replied "I couldn't find that in the docs. Try rephrasing or check the #it-help channel." Users preferred an honest "I don't know" over confident wrong answers.
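The fallback itself is only a few lines. Here is a minimal sketch, assuming the retrieval step hands back `(chunk, score)` pairs sorted best-first. The function names and the `generate` callable (a stand-in for whatever calls your LLM) are hypothetical, and the threshold has no sensible universal default, so it is a required argument you would calibrate on your own data.

```python
FALLBACK_MESSAGE = (
    "I couldn't find that in the docs. "
    "Try rephrasing or check the #it-help channel."
)


def select_context(scored_chunks, threshold):
    """Return the chunks that clear the threshold, or None if the best one does not.

    scored_chunks: list of (chunk, score) tuples, sorted best-first.
    """
    if not scored_chunks or scored_chunks[0][1] < threshold:
        return None
    return [chunk for chunk, score in scored_chunks if score >= threshold]


def answer(query, scored_chunks, generate, threshold):
    """Answer from retrieved context, or admit the docs don't cover the question."""
    context = select_context(scored_chunks, threshold)
    if context is None:
        # Skip the LLM call entirely rather than let it guess.
        return FALLBACK_MESSAGE
    return generate(query, context)
```

Skipping generation when nothing relevant was retrieved also saves a model call, which is a nice side effect.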
I should have started with a hybrid retriever from day one instead of chasing the "best" embedding model. Also, I wish I had set up a simple evaluation harness early - asking domain experts to label 100 queries and their ground-truth chunks. That would have saved weeks of guesswork.
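An evaluation harness like that does not need to be fancy. The sketch below is one way it could look, assuming each labeled query comes with the set of chunks an expert marked as correct. `evaluate_retriever` is a name I made up, and the two metrics (hit rate and mean reciprocal rank at k) are my choice of illustration rather than something we actually ran.

```python
def evaluate_retriever(retrieve, labeled_queries, k=5):
    """Score a retriever against expert-labeled queries.

    retrieve: callable (query, top_k) -> list of chunks, best first.
    labeled_queries: list of (query, set_of_relevant_chunks) pairs.
    Returns hit rate (share of queries with a relevant chunk in the top k)
    and MRR (mean of 1/rank of the first relevant chunk, 0 if none).
    """
    if not labeled_queries:
        raise ValueError("need at least one labeled query")
    hits = 0
    reciprocal_ranks = []
    for query, relevant in labeled_queries:
        results = retrieve(query, k)[:k]
        # 1-based rank of the first relevant result, or None if there is none.
        rank = next(
            (i for i, chunk in enumerate(results, start=1) if chunk in relevant),
            None,
        )
        if rank is None:
            reciprocal_ranks.append(0.0)
        else:
            hits += 1
            reciprocal_ranks.append(1.0 / rank)
    n = len(labeled_queries)
    return {"hit_rate": hits / n, "mrr": sum(reciprocal_ranks) / n}
```

Since `HybridRetriever.retrieve` returns chunk text, the chunk strings themselves can serve as labels, so `evaluate_retriever(retriever.retrieve, labeled_queries)` should work as-is. Rerunning this after every chunking or weighting change turns "it feels better" into a number you can compare.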
If you're building something similar, don't get seduced by the latest AI models. Your bottleneck is almost certainly retrieval, not generation.
For a production-ready version, I later simplified things by using a managed retrieval service (like the one at https://ai.interwestinfo.com/), but building it from scratch taught me more.
What's your experience with building internal AI tools? Did you hit the same chunking issues? Let me know in the comments!