Building an AI Chatbot for Internal Docs: What Worked (and What Didn't)

A developer's team spent two months building an AI chatbot for internal documentation using retrieval-augmented generation, encountering high costs with GPT-4 and embedding generation on 50,000 pages of docs. Switching to a local 7B model via Ollama eliminated API costs but caused accuracy drops, particularly with domain-specific jargon. The solution required semantic chunking with spaCy, a hybrid retriever combining vector search and BM25, and a cross-encoder reranking step to improve retrieval precision.

A few months ago, my team hit a wall: our internal documentation had grown into a chaotic jungle of Confluence pages, Google Docs, and Slack threads. New hires took weeks just to find basic answers. I thought, "Let's build an AI chatbot." Simple, right? Spoiler: it took two months of trial and error. But I learned a ton about retrieval-augmented generation (RAG) - and what actually makes it work in production.

We needed a system where a user could ask "How do I reset my VPN password?" and get a concise answer with a citation. Not a summary of our entire policy, not a hallucinated step - just the exact procedure.

I grabbed every doc, chunked them into 500-character pieces with 50-character overlap, generated embeddings with text-embedding-ada-002, and loaded them into a Pinecone index. Then I used GPT-4 to answer based on retrieved chunks. It worked - on the first query. Then the bill came. We had 50,000 pages of docs. Embedding generation alone cost hundreds of dollars. And GPT-4 usage? Let's just say the CTO started asking questions.

I switched to LlamaIndex with sentence-transformers/all-MiniLM-L6-v2 for embeddings and a local 7B model via Ollama. No API costs. But accuracy dropped drastically. The local model couldn't follow complex instructions, and the embeddings failed to capture domain-specific jargon like "VPN PAM" or "SAML SSO".

After banging my head against recall and latency, I realized the answer wasn't a better model - it was better retrieval.

Fixed chunk sizes don't work for technical docs. A 200-word chunk might cut a table in half. I switched to semantic chunking using spaCy sentence boundaries, then merged consecutive sentences into chunks of up to roughly 250 tokens. This preserved context.
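
For comparison, here is a minimal sketch of the fixed-size splitting from the first attempt (500 characters, 50 overlapping). The function name `fixed_size_chunks` and the sample text are made up for illustration; this is not the exact code we ran. It shows why boundaries land wherever the character count says, regardless of sentences or tables.

```python
def fixed_size_chunks(text, size=500, overlap=50):
    """Split text into `size`-character windows that overlap by `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        # Stop once a window has reached the end of the text.
        if start + size >= len(text):
            break
    return chunks


# Hypothetical sample document: one sentence repeated many times.
doc = "To reset your VPN password, open the self-service portal. " * 20
chunks = fixed_size_chunks(doc)

# The tail of the first chunk shows the cut landing mid-sentence.
print(len(chunks), repr(chunks[0][-30:]))
```

The sentence-based chunker avoids these mid-sentence cuts.
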
```python
import spacy

nlp = spacy.load("en_core_web_sm")

def semantic_chunker(text, max_tokens=250):
    """Yield chunks built from whole sentences, each up to ~max_tokens long."""
    doc = nlp(text)
    sentences = [sent.text for sent in doc.sents]
    # Merge sentences into chunks of ~max_tokens.
    # "Tokens" here are whitespace-separated words, not model tokens.
    chunk = []
    token_count = 0
    for sent in sentences:
        tokens = len(sent.split())
        if token_count + tokens > max_tokens and chunk:
            # Adding this sentence would overflow: emit the chunk, start a new one.
            yield " ".join(chunk)
            chunk = [sent]
            token_count = tokens
        else:
            chunk.append(sent)
            token_count += tokens
    if chunk:
        yield " ".join(chunk)
```

Vector search is great for synonyms and concepts, but terrible for exact matches like "VPN password reset". BM25 catches exact keywords but misses semantic similarity. Together, they're gold. I built a simple hybrid retriever:

```python
from sentence_transformers import SentenceTransformer
from rank_bm25 import BM25Okapi
import numpy as np

class HybridRetriever:
    def __init__(self, chunks, embed_model_name="all-MiniLM-L6-v2"):
        # Materialize the chunks so a generator (e.g. semantic_chunker) can be passed in.
        self.chunks = list(chunks)
        self.embedder = SentenceTransformer(embed_model_name)
        # Vector index
        self.embeddings = self.embedder.encode(self.chunks, show_progress_bar=True)
        # BM25 index (lowercased so "VPN" and "vpn" match)
        tokenized = [chunk.lower().split() for chunk in self.chunks]
        self.bm25 = BM25Okapi(tokenized)

    def retrieve(self, query, top_k=5):
        # Vector scores: dot product of the query with every chunk embedding
        q_emb = self.embedder.encode([query])
        vec_scores = np.dot(self.embeddings, q_emb.T).flatten()
        # BM25 scores
        bm25_scores = self.bm25.get_scores(query.lower().split())
        # Min-max normalize both score sets to [0, 1] and combine (equal weight)
        vec_scores = (vec_scores - vec_scores.min()) / (vec_scores.max() - vec_scores.min() + 1e-8)
        bm25_scores = (bm25_scores - bm25_scores.min()) / (bm25_scores.max() - bm25_scores.min() + 1e-8)
        combined = 0.5 * vec_scores + 0.5 * bm25_scores
        # Indices of the top_k highest combined scores, best first
        top_indices = np.argsort(combined)[-top_k:][::-1]
        return [self.chunks[i] for i in top_indices]
```

Then I added a reranking step using a cross-encoder (cross-encoder/ms-marco-MiniLM-L-6-v2). This scored the retrieved chunks against the query and reordered them. It added ~200ms but doubled the answer quality.

I spent weeks tuning embedding models and chunk sizes.
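
The reranking step is easy to sketch without the model itself. Below is a minimal, hedged version: `rerank`, `overlap_score`, and the sample chunks are hypothetical names and data, and the word-overlap scorer is only a toy stand-in so the snippet runs without downloading anything. In a real pipeline, `score_fn` would wrap the cross-encoder (for example, the `CrossEncoder` class in sentence-transformers, which scores query/passage pairs).

```python
from typing import Callable, List, Tuple


def rerank(query: str, chunks: List[str],
           score_fn: Callable[[str, str], float],
           top_n: int = 3) -> List[Tuple[str, float]]:
    """Score every (query, chunk) pair and return the top_n pairs, best first."""
    scored = [(chunk, score_fn(query, chunk)) for chunk in chunks]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


def overlap_score(query: str, chunk: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query words found in the chunk."""
    q_words = set(query.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / max(len(q_words), 1)


# Hypothetical output of the hybrid retriever for one query.
candidates = [
    "Office Wi-Fi setup guide for new laptops.",
    "To reset your VPN password, open the self-service portal.",
    "VPN client installation steps for macOS.",
]
for chunk, score in rerank("reset VPN password", candidates, overlap_score):
    print(f"{score:.2f}  {chunk}")
```

Keeping the scorer pluggable also means the reranked scores can feed whatever confidence check comes after retrieval.
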
What finally made the system usable was a simple fallback: if the top-retrieved chunk scored below a confidence threshold, the bot replied "I couldn't find that in the docs. Try rephrasing or check the it-help channel." Users preferred an honest "I don't know" over confident wrong answers.

I should have started with a hybrid retriever from day one instead of chasing the "best" embedding model. Also, I wish I had set up a simple evaluation harness early - asking domain experts to label 100 queries and their ground-truth chunks. That would have saved weeks of guesswork.

If you're building something similar, don't get seduced by the latest AI models. Your bottleneck is almost certainly retrieval, not generation. For a production-ready version, I later simplified things by using a managed retrieval service like the one at https://ai.interwestinfo.com/, but building it from scratch taught me more.

What's your experience with building internal AI tools? Did you hit the same chunking issues? Let me know in the comments.
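
P.S. For anyone who wants a starting point, the fallback and the evaluation harness mentioned above are both small. This is a rough sketch under assumptions, not our production code: `pick_context`, `recall_at_k`, the 0.5 threshold, and the toy retriever are all hypothetical, and the right threshold depends on how your scores are scaled.

```python
FALLBACK = ("I couldn't find that in the docs. "
            "Try rephrasing or check the it-help channel.")


def pick_context(scored_chunks, threshold=0.5):
    """Return the best chunk, or None when even the best score is below threshold.

    scored_chunks: list of (chunk_text, score) pairs, in any order.
    """
    if not scored_chunks:
        return None
    best_chunk, best_score = max(scored_chunks, key=lambda pair: pair[1])
    return best_chunk if best_score >= threshold else None


def recall_at_k(labeled_queries, retrieve, k=5):
    """Fraction of queries whose expert-labeled chunk appears in the top k results.

    labeled_queries: list of (query, ground_truth_chunk) pairs.
    retrieve: callable (query, top_k) -> list of chunk strings.
    """
    if not labeled_queries:
        return 0.0
    hits = sum(1 for query, truth in labeled_queries
               if truth in retrieve(query, k))
    return hits / len(labeled_queries)


# Hypothetical two-document corpus and labels, just to exercise the functions.
docs = ["Reset your VPN password in the self-service portal.",
        "Expense reports are due on the 5th."]


def toy_retrieve(query, top_k):
    # Toy retriever: rank docs by how many lowercase words they share with the query.
    words = set(query.lower().split())
    ranked = sorted(docs, key=lambda d: len(words & set(d.lower().split())),
                    reverse=True)
    return ranked[:top_k]


labeled = [("vpn password reset", docs[0]), ("expense reports due", docs[1])]
print(recall_at_k(labeled, toy_retrieve, k=1))
print(pick_context([(docs[0], 0.2)]) or FALLBACK)
```

Any retriever with a `(query, top_k)` signature can be passed as `retrieve`, so the same labeled set can compare chunking strategies, embedding models, and hybrid weights side by side.
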