# Building an AI Chatbot for Internal Docs: What Worked (and What Didn't)

> Source: <https://dev.to/__c1b9e06dc90a7e0a676b/building-an-ai-chatbot-for-internal-docs-what-worked-and-what-didnt-4e26>
> Published: 2026-06-03 02:00:52+00:00

A few months ago, my team hit a wall: our internal documentation had grown into a chaotic jungle of Confluence pages, Google Docs, and Slack threads. New hires took weeks just to find basic answers. I thought, "Let's build an AI chatbot." Simple, right?

**Spoiler:** It took two months of trial and error. But I learned a ton about retrieval-augmented generation (RAG) - and what actually makes it work in production.

We needed a system where a user could ask "How do I reset my VPN password?" and get a concise answer with a citation. Not a summary of our entire policy, not a hallucinated step - just the exact procedure.

I grabbed every doc, chunked them into 500-character pieces with 50-character overlap, generated embeddings with `text-embedding-ada-002`, and loaded them into a Pinecone index. Then I used GPT-4 to answer based on retrieved chunks.
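As a rough sketch, the fixed-size chunking step could look like the following. The function name `fixed_size_chunks` is illustrative rather than taken from the original pipeline, and the embedding and Pinecone upload calls are left out.

```python
def fixed_size_chunks(text, size=500, overlap=50):
    """Split text into windows of `size` characters.

    Each window shares its first `overlap` characters with the end of
    the previous one. Names and defaults are illustrative.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        # Stop once the current window reaches the end of the text.
        if start + size >= len(text):
            break
    return chunks
```

Note that the window boundaries ignore sentences, tables, and headings entirely, which is the weakness that shows up later.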

It worked - on the first query. Then the bill came. We had 50,000 pages of docs. Embedding generation alone cost hundreds of dollars. And GPT-4 usage? Let’s just say the CTO started asking questions.

I switched to `LlamaIndex` with `sentence-transformers/all-MiniLM-L6-v2` for embeddings and a local 7B model via Ollama. No API costs! But accuracy dropped drastically. The local model couldn't follow complex instructions, and the embeddings failed to capture domain-specific jargon (like "VPN PAM" or "SAML SSO").

After banging my head against recall and latency, I realized the answer wasn't a better model - it was better retrieval.

Fixed chunk sizes don't work for technical docs. A fixed-size window might cut a table in half. I switched to semantic chunking using spaCy sentence boundaries, then merged consecutive sentences until adding the next one would push the chunk past roughly 250 tokens (approximated in the code by whitespace-separated words). This preserved context.

``` python
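import spacy

# Small English pipeline; only its sentence segmentation is used here.
nlp = spacy.load("en_core_web_sm")

def semantic_chunker(text, max_tokens=250):
    """Yield chunks made of whole sentences, each up to ~max_tokens words.

    A single sentence longer than max_tokens is still emitted as its own chunk.
    """
    doc = nlp(text)
    sentences = [sent.text for sent in doc.sents]
    chunk = []
    token_count = 0
    for sent in sentences:
        # Whitespace word count as a cheap stand-in for a real tokenizer.
        tokens = len(sent.split())
        # Close the current chunk before this sentence would exceed the budget.
        if token_count + tokens > max_tokens and chunk:
            yield " ".join(chunk)
            chunk = [sent]
            token_count = tokens
        else:
            chunk.append(sent)
            token_count += tokens
    # Emit whatever is left over.
    if chunk:
        yield " ".join(chunk)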
```

Vector search is great for synonyms and concepts, but terrible for exact matches like "VPN password reset". BM25 catches exact keywords but misses semantic similarity. Together, they're gold.

I built a simple hybrid retriever:

``` python
from sentence_transformers import SentenceTransformer
from rank_bm25 import BM25Okapi
import numpy as np

class HybridRetriever:
    def __init__(self, chunks, embed_model_name="all-MiniLM-L6-v2"):
        self.chunks = chunks
        self.embedder = SentenceTransformer(embed_model_name)
        # Vector index: unit-length embeddings, so a dot product is cosine similarity
        self.embeddings = self.embedder.encode(
            chunks, show_progress_bar=True, normalize_embeddings=True
        )
        # BM25 index: lowercase so "VPN" and "vpn" count as the same term
        tokenized = [chunk.lower().split() for chunk in chunks]
        self.bm25 = BM25Okapi(tokenized)

    def retrieve(self, query, top_k=5):
        # Vector scores (cosine similarity against every chunk)
        q_emb = self.embedder.encode([query], normalize_embeddings=True)
        vec_scores = np.dot(self.embeddings, q_emb.T).flatten()
        # BM25 scores, tokenized the same way as the index
        bm25_scores = self.bm25.get_scores(query.lower().split())
        # Min-max normalize both to [0, 1] and combine (equal weight)
        vec_scores = (vec_scores - vec_scores.min()) / (vec_scores.max() - vec_scores.min() + 1e-8)
        bm25_scores = (bm25_scores - bm25_scores.min()) / (bm25_scores.max() - bm25_scores.min() + 1e-8)
        combined = 0.5 * vec_scores + 0.5 * bm25_scores
        # Indices of the top_k highest combined scores, best first
        top_indices = np.argsort(combined)[-top_k:][::-1]
        return [self.chunks[i] for i in top_indices]
```

Then I added a reranking step using a cross-encoder (`cross-encoder/ms-marco-MiniLM-L-6-v2`). This scored the retrieved chunks against the query and reordered them. It added ~200ms but doubled the answer quality.
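A minimal sketch of what that reranking step might look like. The helper names `rerank` and `load_cross_encoder_scorer` are hypothetical, and the scorer is passed in as a function so the reordering logic can be tried without downloading the model. `CrossEncoder.predict` from `sentence-transformers` takes a list of (query, passage) pairs and returns one score per pair.

```python
def rerank(query, chunks, score_fn, top_n=3):
    """Reorder chunks by relevance to the query, highest score first.

    score_fn takes a list of (query, chunk) pairs and returns one
    score per pair. Returns the top_n (chunk, score) tuples.
    """
    pairs = [(query, chunk) for chunk in chunks]
    scores = score_fn(pairs)
    ranked = sorted(zip(chunks, scores), key=lambda item: item[1], reverse=True)
    return [(chunk, float(score)) for chunk, score in ranked[:top_n]]


def load_cross_encoder_scorer(model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"):
    """Return a score_fn backed by a sentence-transformers CrossEncoder."""
    # Imported lazily: the model is downloaded on first use.
    from sentence_transformers import CrossEncoder

    model = CrossEncoder(model_name)
    return model.predict
```

One way to wire it in is to over-fetch from the hybrid retriever and let the cross-encoder pick the best few, for example `rerank(query, retriever.retrieve(query, top_k=20), load_cross_encoder_scorer())`. The `top_k=20` and `top_n=3` values are placeholders, not tuned numbers.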

I spent weeks tuning embedding models and chunk sizes. What finally made the system usable was a simple fallback: if the top retrieved chunk's relevance score fell below a threshold, the bot replied "I couldn't find that in the docs. Try rephrasing or check the #it-help channel." Users preferred an honest "I don't know" over confident wrong answers.
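The fallback itself is only a few lines. Here is a sketch, assuming the retriever or reranker returns `(chunk, score)` pairs sorted best first. The function name, the `generate_fn` callback, and the `0.3` threshold are all placeholders; a sensible threshold depends on the scoring model and is best chosen from labeled examples.

```python
FALLBACK_MESSAGE = (
    "I couldn't find that in the docs. "
    "Try rephrasing or check the #it-help channel."
)


def answer_or_fallback(query, scored_chunks, generate_fn, threshold=0.3):
    """Answer only when the best retrieved chunk looks relevant enough.

    scored_chunks: list of (chunk, score) tuples, sorted best first.
    generate_fn:   callable(query, list_of_chunks) -> answer string.
    threshold:     hypothetical cutoff; tune it for your scorer.
    """
    # Nothing retrieved, or the best match is too weak: admit it.
    if not scored_chunks or scored_chunks[0][1] < threshold:
        return FALLBACK_MESSAGE
    context = [chunk for chunk, _ in scored_chunks]
    return generate_fn(query, context)
```

Checking the score before calling the LLM also means the fallback path costs no generation tokens.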

I should have started with a hybrid retriever from day one instead of chasing the "best" embedding model. Also, I wish I had set up a simple evaluation harness early - asking domain experts to label 100 queries and their ground-truth chunks. That would have saved weeks of guesswork.
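Such a harness can be very small. The sketch below assumes each labeled query comes with a set of ground-truth chunk IDs and that the retriever can return chunk IDs in ranked order. The function name and data shapes are illustrative. It reports hit rate at k (the fraction of queries with at least one correct chunk in the top k) and mean reciprocal rank.

```python
def evaluate_retrieval(labeled_queries, retrieve_fn, k=5):
    """Score a retriever against expert-labeled queries.

    labeled_queries: list of (query, set_of_relevant_chunk_ids).
    retrieve_fn:     callable(query, k) -> list of chunk ids, best first.
    Returns a dict with hit rate@k and mean reciprocal rank (MRR).
    """
    hits = 0
    reciprocal_ranks = []
    for query, relevant in labeled_queries:
        retrieved = retrieve_fn(query, k)[:k]
        # 1-based rank of the first relevant chunk, or None if absent.
        rank = next(
            (i for i, chunk_id in enumerate(retrieved, start=1) if chunk_id in relevant),
            None,
        )
        if rank is None:
            reciprocal_ranks.append(0.0)
        else:
            hits += 1
            reciprocal_ranks.append(1.0 / rank)
    n = len(labeled_queries)
    if n == 0:
        return {"hit_rate": 0.0, "mrr": 0.0}
    return {"hit_rate": hits / n, "mrr": sum(reciprocal_ranks) / n}
```

Running this after every change to chunking, embeddings, or score weights turns each tweak into a number you can compare instead of a hunch.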

If you're building something similar, don't get seduced by the latest AI models. Your bottleneck is almost certainly retrieval, not generation.

*For a production-ready version, I later simplified things by using a managed retrieval service (like the one at https://ai.interwestinfo.com/), but building it from scratch taught me more.*

What's your experience with building internal AI tools? Did you hit the same chunking issues? Let me know in the comments!
