Building an AI Chatbot for Internal Docs: What Worked (and What Didn't)


Source: dev.to · Published: 2026-06-03 · Topic: artificial-intelligence


A developer's team spent two months building an AI chatbot for internal documentation using retrieval-augmented generation, encountering high costs with GPT-4 and embedding generation on 50,000 pages of docs. Switching to a local 7B model via Ollama eliminated API costs but caused accuracy drops, particularly with domain-specific jargon. The solution required semantic chunking with spaCy, a hybrid retriever combining vector search and BM25, and a cross-encoder reranking step to improve retrieval precision.


A few months ago, my team hit a wall: our internal documentation had grown into a chaotic jungle of Confluence pages, Google Docs, and Slack threads. New hires took weeks just to find basic answers. I thought, "Let's build an AI chatbot." Simple, right?

Spoiler: It took two months of trial and error. But I learned a ton about retrieval-augmented generation (RAG) - and what actually makes it work in production.

We needed a system where a user could ask "How do I reset my VPN password?" and get a concise answer with a citation. Not a summary of our entire policy, not a hallucinated step - just the exact procedure.

I grabbed every doc, chunked them into 500-character pieces with 50-character overlap, generated embeddings with text-embedding-ada-002, and loaded them into a Pinecone index. Then I used GPT-4 to answer based on retrieved chunks.
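
For reference, the fixed-size chunking in that first attempt can be sketched roughly as follows. This is a minimal illustration, not the original code: the function name fixed_size_chunks is hypothetical, and only the 500-character size and 50-character overlap come from the description above.

```python
def fixed_size_chunks(text, size=500, overlap=50):
    """Split text into windows of `size` characters.

    Each window shares `overlap` characters with the previous one.
    (Hypothetical helper; only the 500/50 defaults come from the article.)
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        # Stop once the window has reached the end of the text.
        if start + size >= len(text):
            break
    return chunks
```

Because the window boundaries ignore sentence and table boundaries, a chunk can start or end mid-sentence, which is the weakness the later sections run into.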

It worked - on the first query. Then the bill came. We had 50,000 pages of docs. Embedding generation alone cost hundreds of dollars. And GPT-4 usage? Let’s just say the CTO started asking questions.

I switched to LlamaIndex with sentence-transformers/all-MiniLM-L6-v2 for embeddings and a local 7B model via Ollama. No API costs! But accuracy dropped drastically. The local model couldn't follow complex instructions, and the embeddings failed to capture domain-specific jargon (like "VPN PAM" or "SAML SSO").

After banging my head against recall and latency, I realized the answer wasn't a better model - it was better retrieval.

Fixed chunk sizes don't work for technical docs. A 200-word chunk might cut a table in half. I switched to semantic chunking using spaCy sentence boundaries, then merged sentences into a chunk until the next sentence would push it past 250 tokens (approximated here by whitespace-separated words). This preserved context.

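import spacy

# Small English pipeline; used here only for sentence segmentation.
nlp = spacy.load("en_core_web_sm")

def semantic_chunker(text, max_tokens=250):
    """Yield chunks of whole sentences, each about max_tokens long at most."""
    doc = nlp(text)
    sentences = [sent.text for sent in doc.sents]
    chunk = []
    token_count = 0
    for sent in sentences:
        # Whitespace-separated words are a rough proxy for model tokens.
        tokens = len(sent.split())
        # Close the current chunk before this sentence pushes it over the limit.
        if token_count + tokens > max_tokens and chunk:
            yield " ".join(chunk)
            chunk = [sent]
            token_count = tokens
        else:
            chunk.append(sent)
            token_count += tokens
    # Emit whatever is left after the last sentence.
    if chunk:
        yield " ".join(chunk)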

Vector search is great for synonyms and concepts, but terrible for exact matches like "VPN password reset". BM25 catches exact keywords but misses semantic similarity. Together, they're gold.

I built a simple hybrid retriever:

from sentence_transformers import SentenceTransformer
from rank_bm25 import BM25Okapi
import numpy as np

class HybridRetriever:
    def __init__(self, chunks, embed_model_name="all-MiniLM-L6-v2"):
        self.chunks = chunks
        self.embedder = SentenceTransformer(embed_model_name)
        # Unit-length embeddings make the dot product below a cosine similarity.
        self.embeddings = self.embedder.encode(
            chunks, show_progress_bar=True, normalize_embeddings=True
        )
        # Lowercase so BM25 keyword matching is case-insensitive.
        tokenized = [chunk.lower().split() for chunk in chunks]
        self.bm25 = BM25Okapi(tokenized)

    def retrieve(self, query, top_k=5):
        q_emb = self.embedder.encode([query], normalize_embeddings=True)
        vec_scores = np.dot(self.embeddings, q_emb.T).flatten()
        bm25_scores = self.bm25.get_scores(query.lower().split())
        # Min-max normalize both score sets to [0, 1] so they are comparable.
        vec_scores = (vec_scores - vec_scores.min()) / (vec_scores.max() - vec_scores.min() + 1e-8)
        bm25_scores = (bm25_scores - bm25_scores.min()) / (bm25_scores.max() - bm25_scores.min() + 1e-8)
        # Equal-weight blend of semantic and keyword scores.
        combined = 0.5 * vec_scores + 0.5 * bm25_scores
        # Indices of the top_k highest combined scores, best first.
        top_indices = np.argsort(combined)[-top_k:][::-1]
        return [self.chunks[i] for i in top_indices]
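
To make the score fusion concrete, here is a small dependency-free sketch of the same min-max normalization and weighted sum, run on made-up scores for three chunks. The helper names and the alpha parameter are assumptions added for illustration; the retriever above fixes both weights at 0.5.

```python
def min_max(scores, eps=1e-8):
    """Rescale scores to roughly [0, 1]; eps avoids division by zero."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo + eps) for s in scores]

def fuse(vec_scores, bm25_scores, alpha=0.5):
    """Blend normalized vector and BM25 scores; alpha weights the vector side."""
    v = min_max(vec_scores)
    b = min_max(bm25_scores)
    return [alpha * x + (1 - alpha) * y for x, y in zip(v, b)]

def top_k(scores, k):
    """Return the indices of the k highest scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]

# Made-up scores: chunk 0 is the closest semantic match,
# chunk 1 is nearly as close and contains the exact keywords.
vec = [0.82, 0.80, 0.30]
bm25 = [1.0, 7.5, 0.0]

ranking = top_k(fuse(vec, bm25), 2)  # chunk 1 first, then chunk 0
```

With equal weights, chunk 1 moves ahead of chunk 0 because its keyword score is much stronger while its vector score is only slightly lower. Setting alpha=1.0 would reduce this to pure vector ranking.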

Then I added a reranking step using a cross-encoder (cross-encoder/ms-marco-MiniLM-L-6-v2). This scored the retrieved chunks against the query and reordered them. It added ~200ms but doubled the answer quality.
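
A reranking step along these lines might look like the sketch below. It is an illustration under assumptions, not the original code: the function names and the top_n parameter are hypothetical. The scorer is passed in as a callable so the ordering logic can be tried without downloading a model; the loader uses the CrossEncoder class from sentence-transformers, whose predict method scores a list of (query, passage) pairs.

```python
def rerank(query, chunks, score_pairs, top_n=3):
    """Reorder chunks by relevance to the query, best first.

    score_pairs is any callable that takes a list of (query, chunk)
    pairs and returns one score per pair (higher means more relevant).
    """
    if not chunks:
        return []
    scores = score_pairs([(query, chunk) for chunk in chunks])
    order = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    return [chunks[i] for i in order[:top_n]]

def load_cross_encoder_scorer(model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"):
    """Return a scorer backed by a sentence-transformers cross-encoder."""
    # Imported lazily so rerank() stays usable without the library installed.
    from sentence_transformers import CrossEncoder
    model = CrossEncoder(model_name)
    return model.predict
```

In use, the first-stage retriever would presumably be asked for more candidates than are finally needed (say 20), with rerank(query, candidates, load_cross_encoder_scorer()) trimming them to the best few. The candidate count is an assumption, not a figure from the article.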

I spent weeks tuning embedding models and chunk sizes. What finally made the system usable was a simple fallback: if the top-retrieved chunk had a confidence below a threshold, the bot replied "I couldn't find that in the docs. Try rephrasing or check the #it-help channel." Users preferred honest "I don't know" over confident wrong answers.
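
The fallback can be sketched in a few lines. The function name, the generate_answer callable, and the 0.3 threshold are placeholders for illustration; the article does not say which score or cutoff was used.

```python
FALLBACK_MESSAGE = (
    "I couldn't find that in the docs. "
    "Try rephrasing or check the #it-help channel."
)

def answer_or_fallback(scored_chunks, generate_answer, threshold=0.3):
    """Answer only when retrieval looks confident enough.

    scored_chunks: list of (chunk, score) pairs from the retriever.
    generate_answer: callable that turns a list of chunks into an answer.
    threshold: placeholder cutoff; tune it against labeled queries.
    """
    if not scored_chunks:
        return FALLBACK_MESSAGE
    best_score = max(score for _, score in scored_chunks)
    if best_score < threshold:
        return FALLBACK_MESSAGE
    return generate_answer([chunk for chunk, _ in scored_chunks])
```

One caveat: min-max-normalized scores always put the top chunk near 1.0, so a check like this needs a raw score, such as the cross-encoder's output, rather than the normalized hybrid score.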

I should have started with a hybrid retriever from day one instead of chasing the "best" embedding model. Also, I wish I had set up a simple evaluation harness early - asking domain experts to label 100 queries and their ground-truth chunks. That would have saved weeks of guesswork.
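
An evaluation harness of that kind does not need much code. The sketch below is one possible shape, assuming each labeled query comes with a set of ground-truth chunk IDs and that the retriever returns ranked chunk IDs (the HybridRetriever above returns chunk text, so it would need a thin wrapper). The function name and metric choice are assumptions.

```python
def evaluate(retrieve, labeled_queries, k=5):
    """Compute hit rate@k and mean reciprocal rank for a retriever.

    retrieve: callable mapping a query string to a ranked list of chunk IDs.
    labeled_queries: list of (query, set_of_relevant_chunk_ids) pairs.
    """
    if not labeled_queries:
        return {"hit_rate": 0.0, "mrr": 0.0}
    hits = 0
    reciprocal_ranks = []
    for query, relevant_ids in labeled_queries:
        ranked = retrieve(query)[:k]
        # 1-based rank of the first relevant chunk, or None if none was found.
        rank = next(
            (i + 1 for i, cid in enumerate(ranked) if cid in relevant_ids),
            None,
        )
        if rank is None:
            reciprocal_ranks.append(0.0)
        else:
            hits += 1
            reciprocal_ranks.append(1.0 / rank)
    n = len(labeled_queries)
    return {"hit_rate": hits / n, "mrr": sum(reciprocal_ranks) / n}
```

Running this after every change to chunking, weights, or reranking gives a number to compare instead of a gut feeling.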

If you're building something similar, don't get seduced by the latest AI models. Your bottleneck is almost certainly retrieval, not generation.

For a production-ready version, I later simplified things by using a managed retrieval service (like the one at https://ai.interwestinfo.com/), but building it from scratch taught me more.

What's your experience with building internal AI tools? Did you hit the same chunking issues? Let me know in the comments!
