How I stopped dumping PDFs and started chatting with my documentation

A developer built a RAG (Retrieval-Augmented Generation) system to let their team chat with internal documentation, replacing a wiki search that failed to answer common questions. After experimenting with chunking, embedding models, and retrieval strategies, they settled on a hybrid approach combining dense and sparse retrieval with a cross-encoder reranker. The system reduced Slack repetitions by 70% for their team of 20.

A few months ago I was drowning in documentation. My team had written hundreds of pages about our internal microservices, configuration guides, and deployment procedures. Great, right? Except that nobody read them. The same questions popped up in Slack every week. "How do I reset the staging DB?" "What's the syntax for that webhook?"

I tried throwing a basic search index on top of the wiki. It was terrible. People would type "reset staging database" and get back a page about resetting production credentials. Context? Gone. Synonyms? Useless. So I did what any developer would do: I spent two weekends building a RAG (Retrieval-Augmented Generation) system from scratch. Here’s what I learned, including the dead ends that wasted my time.

I started with the classic recipe: PDFs → text splitter → OpenAI embeddings → Pinecone. Simple. It worked... for one question. For everything else it returned irrelevant junk. The problem was chunking. I used a fixed 512-token chunk size with no overlap. Sentences got chopped in half. Code blocks were ripped apart. The retrieval step found pieces of text that looked vector-similar but made no sense to the LLM.

I tried switching to a more advanced embedding model (`text-embedding-3-large`) and adding metadata filters. Still not great. The issue is that questions like "How do I reset staging DB?" require matching a verb (reset) and a noun (staging DB) with the relevant procedure. A single chunk rarely contained both the action and the target. I also experimented with sliding window overlap and larger chunk sizes (1024 tokens). That helped a bit, but then the LLM would get distracted by too much context.
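To make the chunking trade-off concrete, here is a minimal sketch of fixed-size chunking with a sliding-window overlap. It is not the splitter I actually ran: it splits on whitespace as a rough stand-in for a real tokenizer, and `chunk_words`, `size`, and `overlap` are names invented for this illustration.

```python
def chunk_words(text, size=512, overlap=64):
    """Split text into windows of `size` words.

    Each window shares `overlap` words with the previous one.
    Words are a crude stand-in for tokens (assumption for this sketch).
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

With `overlap=0` this behaves like my first attempt: a sentence that straddles a boundary ends up split across two chunks. A non-zero overlap repeats the tail of each chunk at the head of the next, which keeps more sentences whole at the cost of storing some text twice. It still knows nothing about code blocks or headings, so it would not fix those on its own.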
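After reading a dozen blog posts and papers, I settled on a two-layer approach: hybrid retrieval (dense embeddings plus BM25) to gather candidates, then a cross-encoder reranker to order them.

Here's the core retrieval function I ended up with:

```python
import chromadb
from sentence_transformers import CrossEncoder


def _min_max(scores):
    """Scale a {doc_id: score} dict to [0, 1] so dense and sparse scores are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc_id: (s - lo) / span for doc_id, s in scores.items()}


class HybridRetriever:
    def __init__(self, collection, bm25_index):
        self.collection = collection  # Chroma collection
        # bm25_index: any object whose search(query, k) returns (doc_id, score) pairs
        self.bm25 = bm25_index
        self.reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

    def retrieve(self, query, top_k=5):
        # Dense retrieval (embedding distance). Chroma returns a dict of
        # per-query lists, and a smaller distance means a closer match,
        # so negate the distances to get "higher is better" scores.
        res = self.collection.query(query_texts=[query], n_results=top_k * 2)
        dense = {i: -d for i, d in zip(res['ids'][0], res['distances'][0])}

        # Sparse retrieval (BM25)
        sparse = dict(self.bm25.search(query, top_k * 2))

        # Combine and deduplicate: weighted sum of the normalized scores
        combined = {}
        for doc_id, score in _min_max(dense).items():
            combined[doc_id] = combined.get(doc_id, 0) + 0.7 * score
        for doc_id, score in _min_max(sparse).items():
            combined[doc_id] = combined.get(doc_id, 0) + 0.3 * score

        # Rerank the top 2 * top_k fused candidates with the cross-encoder
        candidates = sorted(combined, key=combined.get, reverse=True)[:top_k * 2]
        if not candidates:
            return []
        # collection.get() may not preserve the requested order, so map by id
        got = self.collection.get(ids=candidates)
        text_by_id = dict(zip(got['ids'], got['documents']))
        cross_scores = self.reranker.predict([(query, text_by_id[d]) for d in candidates])
        final = sorted(zip(candidates, cross_scores), key=lambda x: x[1], reverse=True)
        return [doc_id for doc_id, _ in final[:top_k]]
```

This hybrid approach finally gave me consistently relevant chunks. The cross-encoder reranker is slow, but I only run it on the top 10 candidates, so it's tolerable.

If I were doing this again, I'd start with LangChain or LlamaIndex instead of rolling my own pipeline. They handle lots of edge cases (like splitting code blocks and handling tables) that I spent days debugging. Also, I'd invest earlier in a good evaluation set: without a dozen or so test queries you'll never know if your changes are actually improving things.

The system is now running in production for our team of 20. We get about 50 questions per day, and I'm still tweaking the reranker threshold. It's not perfect (it fails on really vague questions), but it cut repeated questions in our Slack by 70%.

What chunking strategies have you found effective for technical documentation?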
I'm still learning.
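
P.S. Since I mentioned an evaluation set: here is a minimal sketch of the kind of check I mean, assuming you can write down, for each test query, which documents should come back. The queries are the ones from the top of this post, but the doc ids, `hit_rate`, and `fake_retrieve` are all made up for illustration.

```python
def hit_rate(retrieve, eval_set, top_k=5):
    """Fraction of queries where at least one expected doc id is in the top_k results."""
    if not eval_set:
        return 0.0
    hits = 0
    for query, expected_ids in eval_set:
        returned = retrieve(query, top_k)
        if set(returned[:top_k]) & set(expected_ids):
            hits += 1
    return hits / len(eval_set)


# Hypothetical test queries and doc ids, for illustration only
eval_set = [
    ("How do I reset the staging DB?", {"runbook-staging-reset"}),
    ("What's the syntax for that webhook?", {"webhooks-reference"}),
]


def fake_retrieve(query, top_k):
    """Stand-in retriever so the sketch runs without a vector store."""
    return ["runbook-staging-reset", "prod-credentials"][:top_k]


score = hit_rate(fake_retrieve, eval_set)
```

Any callable that takes a query and a `top_k` and returns doc ids can be passed in place of `fake_retrieve`. Rerunning the same set after each chunking or weighting change gives a rough sense of whether the change helped or hurt.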