RAG Pipeline for SRE Runbooks: 7 Vector Search Tips That Work

A developer building a RAG pipeline for SRE runbooks shares seven vector search tips based on production experience. Key recommendations include using domain-specific embedding models like BAAI/bge-small-en-v1.5, implementing metadata filtering before semantic search to reduce irrelevant results by ~60%, and employing hash-based change detection to keep vector stores fresh. The post also warns against common pitfalls such as splitting runbooks by fixed token count without respecting procedural boundaries and using Qdrant without authentication.

Originally published on kuryzhev.cloud

Your on-call engineer gets paged at 2 AM and your RAG system confidently surfaces a runbook from six months ago: deprecated after the last migration, full of references to services that no longer exist. The engineer follows it anyway. That's the failure mode nobody talks about when they say "we RAG-ified our runbooks."

Building a RAG pipeline for SRE runbooks that actually works in production means getting the embedding model, the index structure, the ingestion loop, and the retrieval quality all right at the same time. These seven tips are what I wish I'd known before our first on-call integration went sideways.

Generic embedding models misread SRE jargon: domain matters more than benchmark scores.

Terms like OOMKilled, CrashLoopBackOff, HighMemoryUsage, or your internal alert names are essentially invisible to models trained on general web text. They get embedded close to random technical noise rather than clustering with semantically related runbook content. I learned this after watching text-embedding-ada-002 confidently return a Kubernetes networking runbook for a PostgreSQL replication alert because both happened to mention "connection timeout."

My current preference is BAAI/bge-small-en-v1.5 via sentence-transformers>=2.7.0. It produces 384-dimensional vectors, runs about 5x faster than ada-002 at inference time, and handles technical prose significantly better in practice. A single t3.medium can push roughly 50 embed requests per second, which is more than enough for alert-driven RAG queries, though you'll need batching for bulk re-indexing. If you need a hosted option and ada-002 is already in your stack, it's usable, but use distance: Dot rather than Cosine in your Qdrant collection config for OpenAI vectors. The two metrics are not interchangeable in general; they agree only when vectors are unit-normalized.

One chunking detail that trips people up: don't split runbooks by fixed token count without respecting procedural step boundaries.
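
A minimal sketch of step-aware chunking, assuming runbook steps start on lines of the form "Step N:" (that marker format, the function name split_on_steps, and the word-count budget are illustrative assumptions, not part of the pipeline shown later). It packs whole steps into chunks and never cuts inside one; it counts whitespace-separated words as a rough stand-in for tokens and does not add overlap.

```python
import re

# Assumption: each procedural step begins a line with "Step <number>:".
# The zero-width lookahead splits *before* the marker so it stays in its chunk.
STEP_START = re.compile(r"(?im)^(?=step\s+\d+\s*:)")


def split_on_steps(text: str, max_words: int = 512) -> list[str]:
    """Pack whole steps into chunks of at most max_words words.

    A single step longer than max_words is kept intact as one oversized
    chunk rather than being cut in the middle.
    """
    units = [u.strip() for u in STEP_START.split(text) if u.strip()]
    chunks, current, count = [], [], 0
    for unit in units:
        n = len(unit.split())
        # Flush the current chunk if adding this step would exceed the budget.
        if current and count + n > max_words:
            chunks.append("\n".join(current))
            current, count = [], 0
        current.append(unit)
        count += n
    if current:
        chunks.append("\n".join(current))
    return chunks


runbook = (
    "Intro text\n"
    "Step 1: check pods\n"
    "Step 2: drain the node\n"
    "Step 3: verify"
)
chunks = split_on_steps(runbook, max_words=6)
```

With the tiny budget used here, the intro and step 1 share a chunk, and steps 2 and 3 each get their own, so no step is ever split across chunks.
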
Splitting "Step 3: drain the node" across two chunks destroys the procedural context the retriever needs. Use 512-token chunks with 64-token overlap as a starting point; the overlap preserves continuity across step boundaries without ballooning your index size.

Metadata filtering before semantic search cuts irrelevant results by ~60%: don't skip it.

A pure vector search across your entire runbook corpus will always surface some plausible-but-wrong results. The fix isn't a better model, it's filtering. Before the semantic ranking even runs, filter by structured metadata fields that you already have: alert_name, service, severity, on_call_team, and critically, last_updated. That last field is the one most teams forget to store, and it's what lets you warn engineers when the best matching runbook is eight months stale.

For the vector store itself, I use Qdrant (https://qdrant.tech/documentation/) in production. Version 1.9.x added native sparse+dense hybrid search via the sparse_vectors config, which gives you BM25 keyword matching combined with semantic similarity in a single query. That is genuinely useful when alert names are exact-match keywords. If you're evaluating alternatives: Weaviate v1.24+ has the generative-openai module built in, which is tempting, but it couples your retrieval and generation layers tightly and makes model swaps painful. Pinecone namespaces work well if you're already in that ecosystem and don't need hybrid search.

Watch out for: Qdrant's default Docker image ships with zero authentication enabled. Always set the QDRANT__SERVICE__API_KEY environment variable and keep port 6333 inside a private subnet. I've seen this misconfiguration in three separate internal tooling audits.

Hash-based change detection keeps your vector store fresh without re-embedding everything on every run.

The ingestion pipeline is where most RAG implementations get lazy and end up paying for it, either in stale data or in runaway embedding API costs.
The pattern I use: store a sha256 of each document's content in Redis. On every pipeline run, compare the current hash. If it matches, skip re-embedding entirely. Only new or changed content hits the embedding model.

For Git-based runbooks, enforce a path convention: docs/runbooks/{service}/{alert_name}.md. This lets you extract service and alert_name metadata directly from the file path without parsing file content, which is simpler and less error-prone. For Confluence, the REST API endpoint /wiki/rest/api/content?type=page&spaceKey=SRE works, and LangChain's ConfluenceLoader (requires atlassian-python-api>=3.41.0) gets you started fast. That said, I moved off it to a custom fetch: you get better metadata control and don't inherit LangChain's chunking decisions.

Here's the full ingestion pipeline with hash-based deduplication and Redis embedding cache:

```python
# rag_ingest.py: Runbook ingestion pipeline with hash-based deduplication
# Deps: qdrant-client>=1.9.0, sentence-transformers>=2.7.0, python-dotenv, redis, tiktoken

import os
import hashlib
import json
from pathlib import Path

from dotenv import load_dotenv
import redis
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
from sentence_transformers import SentenceTransformer

load_dotenv()

# --- Config ---
QDRANT_URL = os.getenv("QDRANT_URL", "http://localhost:6333")
QDRANT_API_KEY = os.getenv("QDRANT_API_KEY")
COLLECTION_NAME = "sre_runbooks"
EMBED_MODEL = "BAAI/bge-small-en-v1.5"  # 384-dim, fast, good on technical text
CHUNK_SIZE = 512        # target chunk size (counted in words below, as a token proxy)
CHUNK_OVERLAP = 64      # overlap to preserve step continuity
SCORE_THRESHOLD = 0.78  # minimum cosine similarity to surface a result (used at query time)

# --- Clients ---
redis_client = redis.Redis(host="localhost", port=6379, decode_responses=True)
qdrant = QdrantClient(url=QDRANT_URL, api_key=QDRANT_API_KEY)
model = SentenceTransformer(EMBED_MODEL)


def chunk_text(text: str, size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP) -> list[str]:
    """Split into overlapping windows on word boundaries.

    Note: this counts whitespace-separated words, not tokens, and does not
    detect step boundaries; the overlap only softens cuts that land mid-step.
    """
    words = text.split()
    chunks, i = [], 0
    while i < len(words):
        chunk = " ".join(words[i:i + size])
        chunks.append(chunk)
        i += size - overlap  # slide with overlap
    return chunks


def embed_with_cache(text: str) -> list[float]:
    """Return cached embedding or compute and store it."""
    key = f"emb:v1:{hashlib.sha256(text.encode()).hexdigest()}"
    cached = redis_client.get(key)
    if cached:
        return json.loads(cached)
    vector = model.encode(text, normalize_embeddings=True).tolist()
    redis_client.setex(key, 604800, json.dumps(vector))  # TTL: 7 days
    return vector


def ingest_runbook(filepath: Path):
    """Parse path for metadata, chunk content, upsert to Qdrant."""
    # Expected path: docs/runbooks/{service}/{alert_name}.md
    parts = filepath.parts
    service = parts[-2] if len(parts) >= 2 else "unknown"
    alert_name = filepath.stem  # filename without .md
    content = filepath.read_text(encoding="utf-8")
    doc_hash = hashlib.sha256(content.encode()).hexdigest()

    # Fast change detection via Redis: skip unchanged docs entirely
    hash_key = f"doc_hash:{filepath}"
    if redis_client.get(hash_key) == doc_hash:
        print(f"[SKIP] {filepath} (unchanged)")
        return

    chunks = chunk_text(content)
    points = []
    for idx, chunk in enumerate(chunks):
        vector = embed_with_cache(chunk)
        # Deterministic 64-bit ID from path + chunk index, so re-ingesting a
        # document overwrites its old points (Qdrant accepts unsigned 64-bit ints).
        point_id = int(hashlib.sha256(f"{filepath}:{idx}".encode()).hexdigest()[:16], 16)
        points.append(PointStruct(
            id=point_id,
            vector=vector,
            payload={
                "service": service,
                "alert_name": alert_name,
                "chunk_index": idx,
                "source_path": str(filepath),
                "doc_hash": doc_hash,
                "text": chunk,
            },
        ))

    if points:  # an empty file produces no chunks
        qdrant.upsert(collection_name=COLLECTION_NAME, points=points)
    redis_client.set(hash_key, doc_hash)  # update change-detection cache
    print(f"[OK] Ingested {len(points)} chunks from {filepath}")


def ensure_collection():
    """Create collection if it doesn't exist."""
    existing = [c.name for c in qdrant.get_collections().collections]
    if COLLECTION_NAME not in existing:
        qdrant.create_collection(
            collection_name=COLLECTION_NAME,
            vectors_config=VectorParams(size=384, distance=Distance.COSINE),
        )
        print(f"[INIT] Created collection: {COLLECTION_NAME}")


if __name__ == "__main__":
    ensure_collection()
    runbook_dir = Path("docs/runbooks")
    for md_file in runbook_dir.rglob("*.md"):
        ingest_runbook(md_file)
```

Surface runbook context automatically when an alert fires, not only when someone thinks to ask.

The real value of a RAG pipeline for SRE runbooks isn't a chat interface. It's injecting relevant procedure context into the incident notification itself, before the engineer even opens a terminal. The integration point is your Alertmanager or PagerDuty webhook. When a webhook fires, extract the alertname label (Alertmanager v2 path: .alerts[0].labels.alertname) and use it as the query string to your RAG endpoint. One PagerDuty-specific gotcha: webhook v3 sends event.data.title as the incident name. Map this field, not event.id, to your query. I've seen this wired wrong in three different integrations and the resulting queries return garbage.

Set a similarity score threshold of 0.78 with cosine distance as your starting point. Below that, return a "matched": false signal so your Slack notification can still fire, just without a runbook attachment. A "no confident match" message is far safer than surfacing a low-confidence wrong runbook. Return the top-3 chunks maximum; more than that and engineers stop reading them.
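
The field mapping described above can be sketched as one small helper that accepts either payload shape. This is an illustration under assumptions: the function name query_from_webhook is hypothetical, and only the two fields named in the text (Alertmanager's alerts[0].labels.alertname and PagerDuty v3's event.data.title) are read; real payloads carry many more fields.

```python
from typing import Optional


def query_from_webhook(payload: dict) -> Optional[str]:
    """Return the text to use as the RAG query, or None if nothing usable is found."""
    # Alertmanager v2 webhook: alerts[0].labels.alertname
    alerts = payload.get("alerts")
    if isinstance(alerts, list) and alerts and isinstance(alerts[0], dict):
        name = (alerts[0].get("labels") or {}).get("alertname")
        if name:
            return name

    # PagerDuty webhook v3: event.data.title (not event.id, which is an opaque ID)
    event = payload.get("event")
    if isinstance(event, dict):
        title = (event.get("data") or {}).get("title")
        if title:
            return title

    return None


alertmanager_payload = {"alerts": [{"labels": {"alertname": "HighMemoryUsage", "service": "api"}}]}
pagerduty_payload = {"event": {"id": "01ABC", "data": {"title": "HighMemoryUsage on api"}}}

am_query = query_from_webhook(alertmanager_payload)
pd_query = query_from_webhook(pagerduty_payload)
```

Returning None for unrecognized payloads lets the caller fall back to the "no confident match" path instead of querying with an ID or an empty string.
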

Here's the FastAPI query endpoint wired to an Alertmanager webhook payload:

```python
# rag_query.py: Query endpoint wired to Alertmanager webhook
# Receives alert payload, returns top-3 runbook chunks above threshold

import os

from fastapi import FastAPI, Request, HTTPException
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue
from sentence_transformers import SentenceTransformer

QDRANT_URL = os.getenv("QDRANT_URL", "http://localhost:6333")
QDRANT_API_KEY = os.getenv("QDRANT_API_KEY")
COLLECTION_NAME = "sre_runbooks"
SCORE_THRESHOLD = 0.78
TOP_K = 3

app = FastAPI()
qdrant = QdrantClient(url=QDRANT_URL, api_key=QDRANT_API_KEY)
model = SentenceTransformer("BAAI/bge-small-en-v1.5")


@app.post("/query/alert")
async def query_from_alert(request: Request):
    """
    Accepts Alertmanager webhook JSON. Extracts alertname + service label,
    runs filtered vector search. Returns top-K chunks or a no-match signal.
    """
    body = await request.json()
    try:
        # Alertmanager v2 webhook schema
        alert = body["alerts"][0]
        alert_name = alert["labels"]["alertname"]       # e.g. "HighMemoryUsage"
        service = alert["labels"].get("service", None)  # optional label
    except (KeyError, IndexError, TypeError):
        raise HTTPException(status_code=400, detail="Invalid Alertmanager payload")

    query_text = f"{alert_name} {service or ''}".strip()
    query_vector = model.encode(query_text, normalize_embeddings=True).tolist()

    # Pre-filter by alert_name metadata before semantic ranking
    search_filter = Filter(
        must=[FieldCondition(key="alert_name", match=MatchValue(value=alert_name))]
    ) if alert_name else None

    results = qdrant.search(
        collection_name=COLLECTION_NAME,
        query_vector=query_vector,
        query_filter=search_filter,
        limit=TOP_K,
        score_threshold=SCORE_THRESHOLD,  # drop low-confidence results
        with_payload=True,
    )

    if not results:
        # Fallback: no confident match. Slack still pages, just without runbook
        return {"matched": False, "alert_name": alert_name, "chunks": []}

    return {
        "matched": True,
        "alert_name": alert_name,
        "chunks": [
            {
                "text": r.payload["text"],
                "source": r.payload["source_path"],
                "score": round(r.score, 4),
                "chunk_index": r.payload["chunk_index"],
            }
            for r in results
        ],
    }
```

Example response:

```json
{
  "matched": true,
  "alert_name": "HighMemoryUsage",
  "chunks": [
    {"text": "Step 1: check OOMKilled pods with kubectl describe...", "source": "docs/runbooks/api/HighMemoryUsage.md", "score": 0.8912, "chunk_index": 2}
  ]
}
```

For Slack delivery, use Block Kit's section block with a mrkdwn text field to render the runbook chunk inline alongside the alert details. Include the source path and score so engineers immediately know where it came from and how confident the match is.

The silent failure mode is a RAG that returns plausible-but-wrong runbook steps with high confidence.

Most teams evaluate their RAG pipeline by asking "does the LLM answer look right?" That's the wrong question. You need to evaluate whether the retrieved chunks were actually the correct runbook sections before any LLM even sees them. A well-phrased wrong answer is worse than an obvious failure.

Build a golden dataset: 20-30 pairs of (alert_name, expected_runbook_section). Run recall@3 checks: does the correct chunk appear in the top 3 results? That's your baseline metric. For a more structured eval, the ragas library (https://docs.ragas.io/en/stable/) v0.1.x provides context_recall and answer_relevancy metrics. Note that ragas requires openai>=1.0.0 and makes separate LLM calls for scoring, so budget for that API cost in your eval pipeline; it's not free.

Run this eval gate on every significant change to the runbook corpus or after swapping embedding models. I caught a 15% recall drop after a Confluence space reorganization that changed page titles: the metadata-extracted alert_name fields shifted, and the pre-filter was excluding correct results. Without the eval gate, that would have silently degraded on-call for weeks.

Your vector store holds internal hostnames, escalation contacts, and credential patterns: treat it like production infrastructure.

This is the access control gap I see most often. Teams move runbooks into a vector DB, wire up a query API, and mark it "internal only" as if that's sufficient. Runbooks regularly contain things like internal service hostnames, credential rotation procedures, escalation phone trees, and network topology details. If a service account with access to your RAG query API is compromised, an attacker can enumerate your entire operational playbook through semantic search.

Enforce collection-level ACLs in Qdrant using per-collection API keys. In Weaviate, use RBAC to scope read access by team. Never expose the RAG query endpoint without authentication, even on an internal network. Lateral movement from a compromised service is a real threat model, not a theoretical one.

Watch out for: the Redis embedding cache also needs protection. Those cached vectors can be used to reconstruct approximate source text. Keep Redis on a private interface, set requirepass, and set appropriate bind directives.
I stopped treating the cache layer as "just an optimization" after reading about embedding inversion attacks. They're not academic anymore.

Also store last_updated as a metadata field on every point. Without it, you have no way to surface a staleness warning to the on-call engineer when the best matching runbook is months old. This is a cheap field to add and an expensive oversight to fix after the fact. For more on securing internal tooling pipelines, see the patterns we cover at kuryzhev.cloud (https://kuryzhev.cloud/).

Naive re-indexing pipelines multiply embedding costs fast: cache aggressively and schedule smart.

At first glance, embedding costs look trivial. Five hundred runbook pages at roughly 10 chunks each, priced at text-embedding-ada-002's $0.0001 per 1K tokens, works out to about $0.25 per full re-index. That sounds fine. But a naive pipeline that re-embeds everything on every CI merge, or that re-indexes when Confluence sends a webhook for a minor edit, turns that $0.25 into a daily charge. At scale with a self-hosted GPU model, it becomes compute time you're burning for no reason.

The fix is two-layered. First, the Redis embedding cache with key pattern emb:v1:{sha256(chunk_text)}: identical chunk content across different documents or pipeline runs hits the cache, not the model. Include a version prefix (v1) so that when you upgrade your embedding model, you can invalidate the entire cache cleanly by bumping to v2 without touching cache logic.

Second, schedule full re-indexes weekly. Run incremental re-indexing (changed documents only, via hash comparison) on every merge to main. This keeps the index current without re-embedding stable content.

One more cost lever: use gRPC instead of HTTP for Qdrant batch upserts. The default HTTP port is 6333, gRPC is 6334.
Switching to gRPC gives approximately 30% lower latency on batch operations. That is not a direct cost saving, but it shortens the wall-clock time of your ingestion job, which matters if you're paying for the compute that runs it.
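
The re-index cost figure quoted in this section can be reproduced with a few lines of arithmetic. This is a back-of-the-envelope sketch: the 512 tokens per chunk comes from the chunking starting point used earlier, while the 20 re-indexes per day and the 30-day month are hypothetical numbers chosen only to show how a naive pipeline scales.

```python
def full_reindex_cost(pages: int, chunks_per_page: int,
                      tokens_per_chunk: int, usd_per_1k_tokens: float) -> float:
    """Cost in USD of embedding every chunk of every page once."""
    total_tokens = pages * chunks_per_page * tokens_per_chunk
    return total_tokens / 1000 * usd_per_1k_tokens


# Figures from the text: 500 pages, ~10 chunks each, $0.0001 per 1K tokens.
# Assumption: every chunk is a full 512 tokens.
per_run = full_reindex_cost(500, 10, 512, 0.0001)

# Hypothetical naive pipeline: a full re-index on each of 20 merges per day.
NAIVE_RUNS_PER_DAY = 20
monthly_naive = per_run * NAIVE_RUNS_PER_DAY * 30

print(f"one full re-index:  ${per_run:.3f}")
print(f"naive, per 30 days: ${monthly_naive:.2f}")
```

Under these assumptions a single run costs roughly $0.26, in line with the "about $0.25" above, while the naive schedule lands near $154 over 30 days for content that mostly did not change.
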