A few months ago I was drowning in documentation. My team had written hundreds of pages about our internal microservices, configuration guides, and deployment procedures. Great, right? Except that nobody read them. The same questions popped up in Slack every week. "How do I reset the staging DB?" "What's the syntax for that webhook?"
I tried throwing a basic search index on top of the wiki. It was terrible. People would type "reset staging database" and get back a page about resetting production credentials. Context? Gone. Synonyms? Useless.
So I did what any developer would do: I spent two weekends building a RAG (Retrieval-Augmented Generation) system from scratch. Here’s what I learned, including the dead ends that wasted my time.
I started with the classic recipe: PDFs → text splitter → OpenAI embeddings → Pinecone. Simple. It worked... for one question. For everything else it returned irrelevant junk.
The problem was chunking. I used a fixed 512-token chunk size with no overlap. Sentences got chopped in half. Code blocks were ripped apart. The retrieval step found pieces of text that looked vector-similar but made no sense to the LLM.
I tried switching to a more advanced embedding model (text-embedding-3-large) and adding metadata filters. Still not great. The issue is that questions like "How do I reset staging DB?" require matching a verb (reset) and a noun (staging DB) with the relevant procedure. A single chunk rarely contained both the action and the target.
I also experimented with sliding window overlap and larger chunk sizes (1024 tokens). That helped a bit, but then the LLM would get distracted by too much context.
After reading a dozen blog posts and papers, I settled on a two-layer approach:
Here's the core retrieval function I ended up with:
import chromadb
from sentence_transformers import CrossEncoder
class HybridRetriever:
def __init__(self, collection, bm25_index):
self.collection = collection # Chroma collection
self.bm25 = bm25_index
self.reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
def retrieve(self, query, top_k=5):
dense_results = self.collection.query(
query_texts=[query],
n_results=top_k * 2
)
sparse_results = self.bm25.search(query, top_k * 2)
combined = {}
for doc_id, score in dense_results:
combined[doc_id] = combined.get(doc_id, 0) + (0.7 * score)
for doc_id, score in sparse_results:
combined[doc_id] = combined.get(doc_id, 0) + (0.3 * score)
candidates = sorted(combined, key=lambda x: combined[x], reverse=True)[:top_k]
texts = [self.collection.get(doc_id)['document'] for doc_id in candidates]
cross_scores = self.reranker.predict([(query, t) for t in texts])
final = sorted(zip(candidates, cross_scores), key=lambda x: x[1], reverse=True)
return [doc_id for doc_id, _ in final]
This hybrid approach finally gave me consistently relevant chunks. The cross-encoder reranker is slow but I only run it on the top 10 candidates, so it's tolerable.
I'd start with LangChain or LlamaIndex instead of rolling my own pipeline. They handle lots of edge cases (like splitting code blocks, handling tables) that I spent days debugging. Also, I'd invest earlier in a good evaluation set – without a dozen test queries you'll never know if your changes are actually improving things.
The system is now running in production for our team of 20. We get about 50 questions per day, and I'm still tweaking the reranker threshold. It's not perfect – it fails on really vague questions – but it cut our Slack repetitions by 70%.
What chunking strategies have you found effective for technical documentation? I'm still learning.