How I stopped dumping PDFs and started chatting with my documentation


[ARTICLE · art-25858] src=dev.to ↗ pub=2026-06-13T02:00Z topic=large-language-models verified=true sentiment=↑ positive

A developer built a RAG (Retrieval-Augmented Generation) system to let their team chat with internal documentation, replacing a wiki search that failed to answer common questions. After experimenting with chunking, embedding models, and retrieval strategies, they settled on a hybrid approach combining dense and sparse retrieval with a cross-encoder reranker. The system reduced Slack repetitions by 70% for their team of 20.

3 min read · published Jun 13, 2026

A few months ago I was drowning in documentation. My team had written hundreds of pages about our internal microservices, configuration guides, and deployment procedures. Great, right? Except that nobody read them. The same questions popped up in Slack every week. "How do I reset the staging DB?" "What's the syntax for that webhook?"

I tried throwing a basic search index on top of the wiki. It was terrible. People would type "reset staging database" and get back a page about resetting production credentials. Context? Gone. Synonyms? Useless.

So I did what any developer would do: I spent two weekends building a RAG (Retrieval-Augmented Generation) system from scratch. Here’s what I learned, including the dead ends that wasted my time.

I started with the classic recipe: PDFs → text splitter → OpenAI embeddings → Pinecone. Simple. It worked... for one question. For everything else it returned irrelevant junk.

The problem was chunking. I used a fixed 512-token chunk size with no overlap. Sentences got chopped in half. Code blocks were ripped apart. The retrieval step found pieces of text that looked vector-similar but made no sense to the LLM.

I tried switching to a more advanced embedding model (text-embedding-3-large) and adding metadata filters. Still not great. The issue is that questions like "How do I reset staging DB?" require matching a verb (reset) and a noun (staging DB) with the relevant procedure. A single chunk rarely contained both the action and the target.

I also experimented with sliding window overlap and larger chunk sizes (1024 tokens). That helped a bit, but then the LLM would get distracted by too much context.
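The sliding-window idea can be sketched roughly as follows. This is a minimal illustration, not the exact code from this project: the function name `sliding_window_chunks`, the 64-token overlap, and the use of whitespace splitting in place of a real tokenizer are all assumptions made for the example.

```python
def sliding_window_chunks(tokens, chunk_size=512, overlap=64):
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


# Whitespace splitting stands in for a real tokenizer (assumption).
words = "To reset the staging DB run the reset job and then re-run migrations".split()
for chunk in sliding_window_chunks(words, chunk_size=6, overlap=2):
    print(" ".join(chunk))
```

Each chunk repeats the last `overlap` tokens of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk as long as it is shorter than the overlap. The trade-off is the one described above: more overlap and bigger chunks mean more duplicated context for the LLM to wade through.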

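After reading a dozen blog posts and papers, I settled on a two-layer approach: a hybrid first stage that merges dense (embedding) results with sparse (BM25) results, followed by a cross-encoder reranker that re-scores the merged candidates.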

Here's the core retrieval function I ended up with:

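import chromadb  # provides the Collection type passed in as `collection`
from sentence_transformers import CrossEncoder


def _min_max(scores):
    # Scale {doc_id: score} to [0, 1] so dense and BM25 scores are comparable.
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    return {d: (s - lo) / (hi - lo) if hi > lo else 1.0 for d, s in scores.items()}


class HybridRetriever:
    def __init__(self, collection, bm25_index):
        self.collection = collection  # Chroma collection
        self.bm25 = bm25_index  # assumed API: search(query, k) -> [(doc_id, score), ...]
        self.reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

    def retrieve(self, query, top_k=5):
        n_candidates = top_k * 2
        # Chroma returns a dict of parallel lists, one inner list per query text.
        dense = self.collection.query(query_texts=[query], n_results=n_candidates)
        # Distances are "lower is closer", so negate them to get scores.
        dense_scores = {d: -dist for d, dist in zip(dense['ids'][0], dense['distances'][0])}
        sparse_scores = dict(self.bm25.search(query, n_candidates))

        # Weighted fusion of the normalized scores: 0.7 dense, 0.3 sparse.
        combined = {}
        for doc_id, score in _min_max(dense_scores).items():
            combined[doc_id] = combined.get(doc_id, 0) + (0.7 * score)
        for doc_id, score in _min_max(sparse_scores).items():
            combined[doc_id] = combined.get(doc_id, 0) + (0.3 * score)

        # Keep the best fused candidates (10 with the default top_k) for reranking.
        ranked = sorted(combined, key=lambda x: combined[x], reverse=True)[:n_candidates]
        if not ranked:
            return []
        fetched = self.collection.get(ids=ranked)
        text_by_id = dict(zip(fetched['ids'], fetched['documents']))
        candidates = [doc_id for doc_id in ranked if doc_id in text_by_id]
        if not candidates:
            return []
        cross_scores = self.reranker.predict([(query, text_by_id[d]) for d in candidates])

        final = sorted(zip(candidates, cross_scores), key=lambda x: x[1], reverse=True)
        return [doc_id for doc_id, _ in final[:top_k]]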

This hybrid approach finally gave me consistently relevant chunks. The cross-encoder reranker is slow but I only run it on the top 10 candidates, so it's tolerable.
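The `bm25_index` object handed to `HybridRetriever` is not shown in the post; the only thing the retriever relies on is a `search(query, k)` method that yields `(doc_id, score)` pairs. Below is a minimal in-memory sketch of an index with that shape, using the standard Okapi BM25 scoring formula. The class name `TinyBM25`, the tokenizer regex, the `k1`/`b` defaults, and the sample doc ids are all hypothetical, and a production setup would more likely use an existing BM25 library.

```python
import math
import re
from collections import Counter


class TinyBM25:
    """A small in-memory Okapi BM25 index exposing search(query, k)."""

    def __init__(self, docs, k1=1.5, b=0.75):
        # docs maps doc_id -> text
        self.k1, self.b = k1, b
        self.term_counts = {d: Counter(self._tokenize(t)) for d, t in docs.items()}
        self.lengths = {d: sum(c.values()) for d, c in self.term_counts.items()}
        self.avg_len = sum(self.lengths.values()) / max(len(docs), 1)
        # Document frequency: in how many documents each term appears.
        self.doc_freq = Counter(term for c in self.term_counts.values() for term in c)

    @staticmethod
    def _tokenize(text):
        return re.findall(r"[a-z0-9_]+", text.lower())

    def _idf(self, term):
        n = len(self.term_counts)
        df = self.doc_freq.get(term, 0)
        return math.log(1 + (n - df + 0.5) / (df + 0.5))

    def search(self, query, k):
        scores = {}
        for term in set(self._tokenize(query)):
            idf = self._idf(term)
            for doc_id, counts in self.term_counts.items():
                tf = counts.get(term, 0)
                if tf == 0:
                    continue
                # Length normalization: long documents are penalized slightly.
                norm = 1 - self.b + self.b * self.lengths[doc_id] / self.avg_len
                gain = idf * tf * (self.k1 + 1) / (tf + self.k1 * norm)
                scores[doc_id] = scores.get(doc_id, 0.0) + gain
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Hypothetical documents keyed by the same ids used in the vector store.
docs = {
    "staging-reset": "Reset the staging database by running the reset job.",
    "prod-creds": "Rotate production credentials every quarter.",
    "webhooks": "Webhook payload syntax and retry rules.",
}
index = TinyBM25(docs)
print(index.search("reset staging database", k=2))
```

The doc ids in the sparse index need to match the ids stored in the Chroma collection, otherwise the fused scores refer to documents the retriever cannot fetch.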

If I were starting over, I'd begin with LangChain or LlamaIndex instead of rolling my own pipeline. They handle lots of edge cases (like splitting code blocks and handling tables) that I spent days debugging. I'd also invest earlier in a good evaluation set – without at least a dozen test queries you'll never know if your changes are actually improving things.
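An evaluation set can be as small as a list of `(query, expected_doc_id)` pairs plus two metrics. The sketch below is one possible shape for it, assuming a `retrieve(query)` callable that returns a ranked list of doc ids (as `HybridRetriever.retrieve` does). The function names, the sample queries' expected ids, and the stub retriever are hypothetical.

```python
def hit_rate_at_k(retrieve, test_cases, k=5):
    """Fraction of queries whose expected doc id appears in the top-k results."""
    if not test_cases:
        return 0.0
    hits = 0
    for query, expected_id in test_cases:
        if expected_id in list(retrieve(query))[:k]:
            hits += 1
    return hits / len(test_cases)


def mean_reciprocal_rank(retrieve, test_cases):
    """Average of 1/rank of the expected doc id (0 when it is not returned)."""
    if not test_cases:
        return 0.0
    total = 0.0
    for query, expected_id in test_cases:
        results = list(retrieve(query))
        if expected_id in results:
            total += 1.0 / (results.index(expected_id) + 1)
    return total / len(test_cases)


# Hypothetical expected doc ids; replace with real ones from your docs.
test_cases = [
    ("How do I reset the staging DB?", "staging-reset"),
    ("What's the syntax for that webhook?", "webhooks"),
]


def fake_retrieve(query):
    # Stand-in for a real retriever so the example runs on its own.
    if "staging" in query:
        return ["staging-reset", "prod-creds"]
    return ["prod-creds", "webhooks"]


print(hit_rate_at_k(fake_retrieve, test_cases, k=1))    # 0.5
print(mean_reciprocal_rank(fake_retrieve, test_cases))  # 0.75
```

Running both numbers before and after each change to chunking, fusion weights, or the reranker makes it clear whether the change helped or only felt like it did.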

The system is now running in production for our team of 20. We get about 50 questions per day, and I'm still tweaking the reranker threshold. It's not perfect – it fails on really vague questions – but it cut repeated questions in Slack by 70%.
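One plausible way to apply a reranker threshold is to drop candidates whose cross-encoder score falls below a cutoff, and to ask the user a follow-up question when nothing survives. The sketch below illustrates that idea only. The function names, the sample scores, and the threshold values are hypothetical, and the right cutoff depends on the score scale of the reranker model, so it is best tuned against an evaluation set.

```python
def filter_by_threshold(scored_docs, threshold):
    """Keep (doc_id, score) pairs whose reranker score is at least `threshold`."""
    kept = [(doc_id, score) for doc_id, score in scored_docs if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


def answer_or_clarify(scored_docs, threshold):
    """Return doc ids to pass to the LLM, or None to signal 'ask a follow-up'."""
    kept = filter_by_threshold(scored_docs, threshold)
    if not kept:
        return None
    return [doc_id for doc_id, _ in kept]


# Made-up scores for illustration only.
scored = [("staging-reset", 4.2), ("prod-creds", -1.3), ("webhooks", -6.0)]
print(answer_or_clarify(scored, threshold=0.0))  # ['staging-reset']
print(answer_or_clarify(scored, threshold=5.0))  # None
```

Returning `None` instead of weak matches gives vague questions a way to fail visibly rather than producing a confident answer built on irrelevant chunks.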

What chunking strategies have you found effective for technical documentation? I'm still learning.

Read original on dev.to → dev.to/__c1b9e06dc90a7e0a676b/how-i-stopped-dump…