{"slug": "building-an-ai-chatbot-for-internal-docs-what-worked-and-what-didn-t", "title": "Building an AI Chatbot for Internal Docs: What Worked (and What Didn't)", "summary": "A developer's team spent two months building an AI chatbot for internal documentation using retrieval-augmented generation, encountering high costs with GPT-4 and embedding generation on 50,000 pages of docs. Switching to a local 7B model via Ollama eliminated API costs but caused accuracy drops, particularly with domain-specific jargon. The solution required semantic chunking with spaCy, a hybrid retriever combining vector search and BM25, and a cross-encoder reranking step to improve retrieval precision.", "body_md": "A few months ago, my team hit a wall: our internal documentation had grown into a chaotic jungle of Confluence pages, Google Docs, and Slack threads. New hires took weeks just to find basic answers. I thought, \"Let's build an AI chatbot.\" Simple, right?\n\n**Spoiler:** It took two months of trial and error. But I learned a ton about retrieval-augmented generation (RAG) - and what actually makes it work in production.\n\nWe needed a system where a user could ask \"How do I reset my VPN password?\" and get a concise answer with a citation. Not a summary of our entire policy, not a hallucinated step - just the exact procedure.\n\nI grabbed every doc, chunked them into 500-character pieces with 50-character overlap, generated embeddings with `text-embedding-ada-002`, and loaded them into a Pinecone index. Then I used GPT-4 to answer based on retrieved chunks.\n\nIt worked - on the first query. Then the bill came. We had 50,000 pages of docs. Embedding generation alone cost hundreds of dollars. And GPT-4 usage? Let’s just say the CTO started asking questions.\n\nI switched to `LlamaIndex` with `sentence-transformers/all-MiniLM-L6-v2` for embeddings and a local 7B model via Ollama. No API costs! But accuracy dropped drastically. 
The local model couldn't follow complex instructions, and the embeddings failed to capture domain-specific jargon (like \"VPN PAM\" or \"SAML SSO\").\n\nAfter banging my head against recall and latency, I realized the answer wasn't a better model - it was better retrieval.\n\nFixed chunk sizes don't work for technical docs. A fixed-size chunk might cut a table in half. I switched to semantic chunking using spaCy sentence boundaries, then merged sentences until adding another would push the chunk past roughly 250 tokens (approximated by whitespace-separated words in the code below). This preserved context.\n\n``` python\nimport spacy\n\n# Small English pipeline, used here only for sentence segmentation\nnlp = spacy.load(\"en_core_web_sm\")\n\ndef semantic_chunker(text, max_tokens=250):\n    doc = nlp(text)\n    sentences = [sent.text for sent in doc.sents]\n    # Merge whole sentences into chunks of ~max_tokens\n    chunk = []\n    token_count = 0\n    for sent in sentences:\n        # Approximate the token count with whitespace-separated words\n        tokens = len(sent.split())\n        # Flush the current chunk before this sentence would exceed the budget\n        if token_count + tokens > max_tokens and chunk:\n            yield \" \".join(chunk)\n            chunk = [sent]\n            token_count = tokens\n        else:\n            chunk.append(sent)\n            token_count += tokens\n    # Emit whatever is left over\n    if chunk:\n        yield \" \".join(chunk)\n```\n\nVector search is great for synonyms and concepts, but terrible for exact matches like \"VPN password reset\". BM25 catches exact keywords but misses semantic similarity. 
Together, they're gold.\n\nI built a simple hybrid retriever:\n\n``` python\nfrom sentence_transformers import SentenceTransformer\nfrom rank_bm25 import BM25Okapi\nimport numpy as np\n\nclass HybridRetriever:\n    def __init__(self, chunks, embed_model_name=\"all-MiniLM-L6-v2\"):\n        self.chunks = chunks\n        self.embedder = SentenceTransformer(embed_model_name)\n        # Vector index: unit-length embeddings, so a dot product is cosine similarity\n        self.embeddings = self.embedder.encode(\n            chunks, show_progress_bar=True, normalize_embeddings=True\n        )\n        # BM25 index: lowercase so keyword matching is case-insensitive\n        tokenized = [chunk.lower().split() for chunk in chunks]\n        self.bm25 = BM25Okapi(tokenized)\n\n    def retrieve(self, query, top_k=5):\n        # Vector scores (cosine similarity against every chunk)\n        q_emb = self.embedder.encode([query], normalize_embeddings=True)\n        vec_scores = np.dot(self.embeddings, q_emb.T).flatten()\n        # BM25 scores, with the query tokenized the same way as the index\n        bm25_scores = self.bm25.get_scores(query.lower().split())\n        # Min-max normalize each score set to [0, 1], then combine (equal weight)\n        vec_scores = (vec_scores - vec_scores.min()) / (vec_scores.max() - vec_scores.min() + 1e-8)\n        bm25_scores = (bm25_scores - bm25_scores.min()) / (bm25_scores.max() - bm25_scores.min() + 1e-8)\n        combined = 0.5 * vec_scores + 0.5 * bm25_scores\n        # Indices of the top_k highest combined scores, best first\n        top_indices = np.argsort(combined)[-top_k:][::-1]\n        return [self.chunks[i] for i in top_indices]\n```\n\nThen I added a reranking step using a cross-encoder (`cross-encoder/ms-marco-MiniLM-L-6-v2`). This scored the retrieved chunks against the query and reordered them. It added ~200ms of latency but doubled the answer quality.\n\nI spent weeks tuning embedding models and chunk sizes. What finally made the system usable was a simple fallback: if the top retrieved chunk scored below a confidence threshold, the bot replied \"I couldn't find that in the docs. Try rephrasing or check the #it-help channel.\" Users preferred an honest \"I don't know\" to confident wrong answers.\n\nI should have started with a hybrid retriever from day one instead of chasing the \"best\" embedding model. 
Also, I wish I had set up a simple evaluation harness early - asking domain experts to label 100 queries and their ground-truth chunks. That would have saved weeks of guesswork.\n\nIf you're building something similar, don't get seduced by the latest AI models. Your bottleneck is almost certainly retrieval, not generation.\n\n*For a production-ready version, I later simplified things by using a managed retrieval service (like the one at https://ai.interwestinfo.com/), but building it from scratch taught me more.*\n\nWhat's your experience with building internal AI tools? Did you hit the same chunking issues? Let me know in the comments!", "url": "https://wpnews.pro/news/building-an-ai-chatbot-for-internal-docs-what-worked-and-what-didn-t", "canonical_source": "https://dev.to/__c1b9e06dc90a7e0a676b/building-an-ai-chatbot-for-internal-docs-what-worked-and-what-didnt-4e26", "published_at": "2026-06-03 02:00:52+00:00", "updated_at": "2026-06-03 02:42:48.513314+00:00", "lang": "en", "topics": ["artificial-intelligence", "large-language-models", "ai-tools", "ai-infrastructure", "natural-language-processing"], "entities": ["Confluence", "Google Docs", "Slack", "Pinecone", "GPT-4", "LlamaIndex", "Ollama", "OpenAI"], "alternates": {"html": "https://wpnews.pro/news/building-an-ai-chatbot-for-internal-docs-what-worked-and-what-didn-t", "markdown": "https://wpnews.pro/news/building-an-ai-chatbot-for-internal-docs-what-worked-and-what-didn-t.md", "text": "https://wpnews.pro/news/building-an-ai-chatbot-for-internal-docs-what-worked-and-what-didn-t.txt", "jsonld": "https://wpnews.pro/news/building-an-ai-chatbot-for-internal-docs-what-worked-and-what-didn-t.jsonld"}}