RAG Pipeline for SRE Runbooks: 7 Vector Search Tips That Work


Originally published on kuryzhev.cloud

Your on-call engineer gets paged at 2 AM and your RAG system confidently surfaces a runbook from six months ago — deprecated after the last migration, full of references to services that no longer exist. The engineer follows it anyway. That's the failure mode nobody talks about when they say "we RAG-ified our runbooks." Building a RAG pipeline for SRE runbooks that actually works in production means getting the embedding model, the index structure, the ingestion loop, and the retrieval quality all right at the same time. These seven tips are what I wish I'd known before our first on-call integration went sideways.

Generic embedding models misread SRE jargon — domain matters more than benchmark scores.

Terms like OOMKilled, CrashLoopBackOff, HighMemoryUsage, or your internal alert names are essentially invisible to models trained on general web text. They get embedded close to random technical noise rather than clustering with semantically related runbook content. I learned this after watching text-embedding-ada-002 confidently return a Kubernetes networking runbook for a PostgreSQL replication alert because both happened to mention "connection timeout."

My current preference is BAAI/bge-small-en-v1.5 via sentence-transformers>=2.7.0. It produces 384-dimensional vectors, runs about 5x faster than ada-002 at inference time, and handles technical prose significantly better in practice. A single t3.medium can push roughly 50 embed requests per second, which is more than enough for alert-driven RAG queries, though you'll need batching for bulk re-indexing. If you need a hosted option and ada-002 is already in your stack, it's usable, but use distance: Dot in your Qdrant collection config for OpenAI vectors rather than Cosine. OpenAI embeddings are already unit-normalized, so Dot returns the same ranking and skips a redundant normalization step; the two metrics are not interchangeable for vectors that are not normalized.

One chunking detail that trips people up: don't split runbooks by fixed token count without respecting procedural step boundaries. Splitting "Step 3: drain the node" across two chunks destroys the procedural context the retriever needs. Use 512-token chunks with 64-token overlap as a starting point — the overlap preserves continuity across step boundaries without ballooning your index size.
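A minimal sketch of step-aware chunking, assuming runbooks mark procedure steps with lines that begin "Step N" (optionally behind a Markdown heading marker). The names split_on_steps and pack_steps are illustrative, not part of any library, and the size limit counts whitespace-separated words as a rough proxy for tokens.

```python
import re

# Assumed convention: a step starts on a line like "Step 3: ..." or "## Step 3: ..."
STEP_HEADER = re.compile(r"^(?:#{1,6}\s+)?Step\s+\d+\b", re.IGNORECASE | re.MULTILINE)


def split_on_steps(text: str) -> list[str]:
    """Return the preamble (if any) followed by one segment per step."""
    starts = [m.start() for m in STEP_HEADER.finditer(text)]
    if not starts:
        return [text.strip()] if text.strip() else []
    bounds = [0] + starts + [len(text)]
    segments = [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [s for s in segments if s]


def pack_steps(segments: list[str], max_words: int = 512) -> list[str]:
    """Greedily pack whole segments into chunks of at most max_words words.

    A single segment longer than max_words is kept intact as its own chunk
    rather than being cut in the middle of a step.
    """
    chunks, current, count = [], [], 0
    for seg in segments:
        n = len(seg.split())
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(seg)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Packing whole steps trades uniform chunk size for intact procedures. An oversized step still needs a fallback, for example the fixed-size window with overlap described above.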

Metadata filtering before semantic search cuts irrelevant results by ~60% — don't skip it.

A pure vector search across your entire runbook corpus will always surface some plausible-but-wrong results. The fix isn't a better model; it's filtering. Before the semantic ranking even runs, filter by structured metadata fields that you already have: alert_name, service, severity, on_call_team, and critically, last_updated. That last field is the one most teams forget to store, and it's what lets you warn engineers when the best matching runbook is eight months stale.
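As a rough illustration of the pre-filter, the sketch below builds a filter in the JSON shape that Qdrant's REST search body accepts under its filter field: a must list of key/match conditions. The helper name build_prefilter is hypothetical, and it assumes the payload keys match the metadata field names listed above.

```python
from typing import Optional


def build_prefilter(
    alert_name: str,
    service: Optional[str] = None,
    severity: Optional[str] = None,
    on_call_team: Optional[str] = None,
) -> dict:
    """Build an exact-match filter; fields left as None are not constrained."""
    fields = {
        "alert_name": alert_name,
        "service": service,
        "severity": severity,
        "on_call_team": on_call_team,
    }
    must = [
        {"key": key, "match": {"value": value}}
        for key, value in fields.items()
        if value
    ]
    return {"must": must}


if __name__ == "__main__":
    print(build_prefilter("HighMemoryUsage", service="payments"))
```

Every condition in must has to hold, so each extra label narrows the candidate set before vectors are compared. Leaving optional labels out keeps the filter from excluding runbooks whose metadata is incomplete.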

For the vector store itself, I use Qdrant in production. Version 1.9.x added native sparse+dense hybrid search via the sparse_vectors config, which gives you BM25 keyword matching combined with semantic similarity in a single query. That is genuinely useful when alert names are exact-match keywords. If you're evaluating alternatives: Weaviate v1.24+ has the generative-openai module built in, which is tempting, but it couples your retrieval and generation layers tightly and makes model swaps painful. Pinecone namespaces work well if you're already in that ecosystem and don't need hybrid search.

Watch out for: Qdrant's default Docker image ships with zero authentication enabled. Always set the QDRANT__SERVICE__API_KEY environment variable and keep port 6333 inside a private subnet. I've seen this misconfiguration in three separate internal tooling audits.

Hash-based change detection keeps your vector store fresh without re-embedding everything on every run.

The ingestion pipeline is where most RAG implementations get lazy and end up paying for it, either in stale data or in runaway embedding API costs. The pattern I use: store a sha256 of each document's content in Redis. On every pipeline run, compare the current hash. If it matches, skip re-embedding entirely. Only new or changed content hits the embedding model.
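The comparison step can be sketched independently of Redis. The hypothetical helper below takes the current documents and the previously stored hashes as plain dictionaries and returns what needs re-embedding. Reporting deleted paths is an addition not described above, included on the assumption that chunks from removed runbooks should eventually be purged from the index.

```python
import hashlib


def plan_reindex(
    current_docs: dict[str, str],
    stored_hashes: dict[str, str],
) -> tuple[dict[str, str], list[str]]:
    """Decide which documents to re-embed and which to remove.

    current_docs maps path -> file content; stored_hashes maps path -> the
    sha256 hex digest recorded on the previous run.
    Returns (to_embed, to_delete), where to_embed maps path -> new digest.
    """
    to_embed = {}
    for path, content in current_docs.items():
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if stored_hashes.get(path) != digest:  # new or changed document
            to_embed[path] = digest
    # Paths seen on the previous run that no longer exist in the corpus
    to_delete = sorted(set(stored_hashes) - set(current_docs))
    return to_embed, to_delete
```

Keeping the decision in a pure function makes it easy to unit test without a running Redis or Qdrant instance; the caller then applies the plan against the real stores.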

For Git-based runbooks, enforce a path convention: docs/runbooks/{service}/{alert_name}.md. This lets you extract service and alert_name metadata directly from the file path without parsing file content, which is simpler and less error-prone. For Confluence, the REST API endpoint /wiki/rest/api/content?type=page&spaceKey=SRE works, and LangChain's ConfluenceLoader (requires atlassian-python-api>=3.41.0) gets you started fast. That said, I moved off it to a custom fetch: you get better metadata control and don't inherit LangChain's chunking decisions.

Here's the full ingestion pipeline with hash-based deduplication and Redis embedding cache:


import os
import hashlib
import json
from pathlib import Path
from dotenv import load_dotenv
import redis
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct, Filter,
    FieldCondition, MatchValue
)
from sentence_transformers import SentenceTransformer

load_dotenv()

QDRANT_URL = os.getenv("QDRANT_URL", "http://localhost:6333")
QDRANT_API_KEY = os.getenv("QDRANT_API_KEY")
COLLECTION_NAME = "sre_runbooks"
EMBED_MODEL = "BAAI/bge-small-en-v1.5"   # 384-dim, fast, good on technical text
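CHUNK_SIZE = 512        # words per chunk (whitespace-split, a rough proxy for tokens;
                        # the model truncates inputs past 512 tokens, so lower this if chunks run long)
CHUNK_OVERLAP = 64      # words shared between consecutive chunks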
SCORE_THRESHOLD = 0.78  # minimum cosine similarity to surface a result

redis_client = redis.Redis(host="localhost", port=6379, decode_responses=True)
qdrant = QdrantClient(url=QDRANT_URL, api_key=QDRANT_API_KEY)
model = SentenceTransformer(EMBED_MODEL)

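def chunk_text(text: str, size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP) -> list[str]:
    """Split text into overlapping windows of whitespace-separated words.

    This is a fixed-size splitter: it does not detect step boundaries,
    and joining on spaces collapses the original line breaks.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")  # otherwise the loop never advances
    words = text.split()
    chunks, i = [], 0
    while i < len(words):
        chunk = " ".join(words[i:i + size])
        chunks.append(chunk)
        i += size - overlap  # advance by the stride, keeping `overlap` words of context
    return chunks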

def embed_with_cache(text: str) -> list[float]:
    """Return cached embedding or compute and store it."""
    key = f"emb:v1:{hashlib.sha256(text.encode()).hexdigest()}"
    cached = redis_client.get(key)
    if cached:
        return json.loads(cached)
    vector = model.encode(text, normalize_embeddings=True).tolist()
    redis_client.setex(key, 604800, json.dumps(vector))  # TTL: 7 days
    return vector

def ingest_runbook(filepath: Path):
    """Parse path for metadata, chunk content, upsert to Qdrant."""
    parts = filepath.parts
    service = parts[-2] if len(parts) >= 2 else "unknown"
    alert_name = filepath.stem  # filename without .md

    content = filepath.read_text(encoding="utf-8")
    doc_hash = hashlib.sha256(content.encode()).hexdigest()

    hash_key = f"doc_hash:{filepath}"
    if redis_client.get(hash_key) == doc_hash:
        print(f"[SKIP] {filepath} unchanged")
        return

    chunks = chunk_text(content)
    points = []
    for idx, chunk in enumerate(chunks):
        vector = embed_with_cache(chunk)
        # Use 16 hex chars (64 bits, the widest integer ID Qdrant accepts) to keep collisions negligible
        point_id = int(hashlib.sha256(f"{filepath}:{idx}".encode()).hexdigest()[:16], 16)
        points.append(PointStruct(
            id=point_id,
            vector=vector,
            payload={
                "service": service,
                "alert_name": alert_name,
                "chunk_index": idx,
                "source_path": str(filepath),
                "doc_hash": doc_hash,
                "text": chunk,
            }
        ))

    qdrant.upsert(collection_name=COLLECTION_NAME, points=points)
    redis_client.set(hash_key, doc_hash)  # update change-detection cache
    print(f"[OK] Ingested {len(points)} chunks from {filepath}")

def ensure_collection():
    """Create collection if it doesn't exist."""
    existing = [c.name for c in qdrant.get_collections().collections]
    if COLLECTION_NAME not in existing:
        qdrant.create_collection(
            collection_name=COLLECTION_NAME,
            vectors_config=VectorParams(size=384, distance=Distance.COSINE),
        )
        print(f"[INIT] Created collection: {COLLECTION_NAME}")

if __name__ == "__main__":
    ensure_collection()
    runbook_dir = Path("docs/runbooks")
    for md_file in runbook_dir.rglob("*.md"):
        ingest_runbook(md_file)

Surface runbook context automatically when an alert fires — not only when someone thinks to ask.

The real value of a RAG pipeline for SRE runbooks isn't a chat interface. It's injecting relevant procedure context into the incident notification itself, before the engineer even opens a terminal. The integration point is your Alertmanager or PagerDuty webhook. When a webhook fires, extract the alertname label (Alertmanager v2 path: .alerts[0].labels.alertname) and use it as the query string to your RAG endpoint.

One PagerDuty-specific gotcha: webhook v3 sends event.data.title as the incident name. Map this field, not event.id, to your query. I've seen this wired wrong in three different integrations and the resulting queries return garbage.
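The two field mappings described above can be combined into one small extractor. This is a sketch, assuming the webhook body has already been parsed into a dict; the function name extract_query is hypothetical and only the two paths mentioned in the text are handled.

```python
from typing import Optional


def extract_query(payload: dict) -> Optional[str]:
    """Pull the query string out of an Alertmanager or PagerDuty v3 webhook body.

    Alertmanager: .alerts[0].labels.alertname
    PagerDuty v3: .event.data.title (not .event.id, which is an opaque identifier)
    Returns None when neither shape is recognised.
    """
    alerts = payload.get("alerts")
    if isinstance(alerts, list) and alerts:
        name = alerts[0].get("labels", {}).get("alertname")
        if name:
            return name

    event = payload.get("event")
    if isinstance(event, dict):
        title = (event.get("data") or {}).get("title")
        if title:
            return title

    return None
```

Returning None instead of raising lets the caller decide whether an unrecognised payload should be rejected or simply forwarded without a runbook attachment.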

Set a similarity score threshold of 0.78 with cosine distance as your starting point. Below that, return a "matched": false signal so your Slack notification can still fire, just without a runbook attachment. A "no confident match" message is far safer than surfacing a low-confidence wrong runbook. Return the top-3 chunks maximum; more than that and engineers stop reading them.

Here's the FastAPI query endpoint wired to an Alertmanager webhook payload:


import os
from fastapi import FastAPI, Request, HTTPException
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue
from sentence_transformers import SentenceTransformer

QDRANT_URL = os.getenv("QDRANT_URL", "http://localhost:6333")
QDRANT_API_KEY = os.getenv("QDRANT_API_KEY")
COLLECTION_NAME = "sre_runbooks"
SCORE_THRESHOLD = 0.78
TOP_K = 3

app = FastAPI()
qdrant = QdrantClient(url=QDRANT_URL, api_key=QDRANT_API_KEY)
model = SentenceTransformer("BAAI/bge-small-en-v1.5")

@app.post("/query/alert")
async def query_from_alert(request: Request):
    """
    Accepts Alertmanager webhook JSON.
    Extracts alertname + service label, runs filtered vector search.
    Returns top-K chunks or a no-match signal.
    """
    body = await request.json()

    try:
        alert = body["alerts"][0]
        alert_name = alert["labels"]["alertname"]       # e.g. "HighMemoryUsage"
        service = alert["labels"].get("service", None)  # optional label
    except (KeyError, IndexError):
        raise HTTPException(status_code=400, detail="Invalid Alertmanager payload")

    query_text = f"{alert_name} {service or ''}".strip()
    query_vector = model.encode(query_text, normalize_embeddings=True).tolist()

    search_filter = Filter(
        must=[FieldCondition(key="alert_name", match=MatchValue(value=alert_name))]
    ) if alert_name else None

    results = qdrant.search(
        collection_name=COLLECTION_NAME,
        query_vector=query_vector,
        query_filter=search_filter,
        limit=TOP_K,
        score_threshold=SCORE_THRESHOLD,  # drop low-confidence results
        with_payload=True,
    )

    if not results:
        return {"matched": False, "alert_name": alert_name, "chunks": []}

    return {
        "matched": True,
        "alert_name": alert_name,
        "chunks": [
            {
                "text": r.payload["text"],
                "source": r.payload["source_path"],
                "score": round(r.score, 4),
                "chunk_index": r.payload["chunk_index"],
            }
            for r in results
        ],
    }

For Slack delivery, use Block Kit's section block with a mrkdwn text field to render the runbook chunk inline alongside the alert details. Include the source_path and score so engineers immediately know where it came from and how confident the match is.
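One way to assemble that message, sketched under a few assumptions: the chunk dictionaries carry the text, source, and score keys returned by the query endpoint, the helper name build_slack_blocks is made up for illustration, and each section's text is trimmed because Slack caps a section text field at 3000 characters.

```python
def build_slack_blocks(alert_name: str, chunks: list[dict], max_chars: int = 2900) -> list[dict]:
    """Build Block Kit section blocks for an alert and its matched runbook chunks."""

    def section(text: str) -> dict:
        return {"type": "section", "text": {"type": "mrkdwn", "text": text}}

    blocks = [section(f"*Alert:* `{alert_name}`")]

    if not chunks:
        # Mirrors the "matched": false case: notify without attaching a runbook
        blocks.append(section("_No confident runbook match. Escalate per team policy._"))
        return blocks

    for chunk in chunks:
        header = f"*Source:* `{chunk['source']}` (score {chunk['score']:.2f})\n"
        body = chunk["text"]
        budget = max_chars - len(header)
        if len(body) > budget:
            body = body[: budget - 1] + "…"  # stay under the per-section limit
        blocks.append(section(header + body))

    return blocks
```

The returned list goes in the blocks field of the Slack message payload. Putting source and score on the first line of each section keeps them visible even when the client collapses a long message.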

The silent failure mode is a RAG that returns plausible-but-wrong runbook steps with high confidence.

Most teams evaluate their RAG pipeline by asking "does the LLM answer look right?" That's the wrong question. You need to evaluate whether the retrieved chunks were actually the correct runbook sections before any LLM even sees them. A well-phrased wrong answer is worse than an obvious failure.

Build a golden dataset: 20-30 pairs of (alert_name, expected_runbook_section). Run recall@3 checks: does the correct chunk appear in the top 3 results? That's your baseline metric. For a more structured eval, the ragas library (v0.1.x) provides context_recall and answer_relevancy metrics. Note that ragas requires openai>=1.0.0 and makes separate LLM calls for scoring, so budget for that API cost in your eval pipeline; it's not free.

Run this eval gate on every significant change to the runbook corpus or after swapping embedding models. I caught a 15% recall drop after a Confluence space reorganization that changed page titles: the metadata-extracted alert_name fields shifted, and the pre-filter was excluding correct results. Without the eval gate, that would have silently degraded on-call for weeks.
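The recall@3 check needs very little code. The sketch below assumes each golden pair names the expected section by some stable identifier and that the retriever is wrapped as a callable returning ranked identifiers; both the function name and that identifier scheme are illustrative.

```python
from typing import Callable


def recall_at_k(
    golden: list[tuple[str, str]],
    retrieve: Callable[[str], list[str]],
    k: int = 3,
) -> float:
    """Fraction of golden queries whose expected section appears in the top k.

    golden holds (alert_name, expected_section_id) pairs; retrieve maps a
    query to a ranked list of section ids, best match first.
    """
    if not golden:
        raise ValueError("golden dataset is empty")
    hits = sum(1 for query, expected in golden if expected in retrieve(query)[:k])
    return hits / len(golden)


if __name__ == "__main__":
    # Stand-in retriever backed by a fixed ranking, for demonstration only
    fake_index = {
        "HighMemoryUsage": ["mem#2", "mem#1", "net#1"],
        "PgReplicationLag": ["net#3", "net#1", "pg#4", "pg#1"],
    }
    golden = [("HighMemoryUsage", "mem#1"), ("PgReplicationLag", "pg#1")]
    print(recall_at_k(golden, lambda q: fake_index.get(q, []), k=3))
```

In a CI gate, compare the result against the last accepted baseline and fail the run when it drops by more than a tolerance you choose.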

Your vector store holds internal hostnames, escalation contacts, and credential patterns — treat it like production infrastructure.

This is the access control gap I see most often. Teams move runbooks into a vector DB, wire up a query API, and mark it "internal only" as if that's sufficient. Runbooks regularly contain things like internal service hostnames, credential rotation procedures, escalation phone trees, and network topology details. If a service account with access to your RAG query API is compromised, an attacker can enumerate your entire operational playbook through semantic search.

Enforce collection-level ACLs in Qdrant using per-collection API keys. In Weaviate, use RBAC to scope read access by team. Never expose the RAG query endpoint without authentication, even on an internal network — lateral movement from a compromised service is a real threat model, not a theoretical one.

Watch out for: the Redis embedding cache also needs protection. Those cached vectors can be used to reconstruct approximate source text. Keep Redis on a private interface, set requirepass, and use appropriate bind directives. I stopped treating the cache layer as "just an optimization" after reading about embedding inversion attacks; they're not academic anymore.

Also store last_updated as a metadata field on every point. Without it, you have no way to surface a staleness warning to the on-call engineer when the best matching runbook is months old. This is a cheap field to add and an expensive oversight to fix after the fact. For more on securing internal tooling pipelines, see the patterns we cover at kuryzhev.cloud.
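A staleness check on that field might look like the following. It is a sketch that assumes last_updated is stored as an ISO 8601 string and that 180 days is an acceptable cutoff; both the helper name and the threshold are placeholders to adapt.

```python
from datetime import datetime, timezone
from typing import Optional


def staleness_warning(
    last_updated_iso: str,
    now: Optional[datetime] = None,
    max_age_days: int = 180,
) -> Optional[str]:
    """Return a warning string when a runbook is older than max_age_days, else None.

    Timestamps without an offset are treated as UTC. On Python versions before
    3.11, fromisoformat does not accept a trailing "Z", so store "+00:00".
    """
    updated = datetime.fromisoformat(last_updated_iso)
    if updated.tzinfo is None:
        updated = updated.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    age_days = (now - updated).days
    if age_days > max_age_days:
        return f"Runbook last updated {age_days} days ago; verify before following."
    return None
```

The warning can be prepended to the notification text so the engineer sees it before the steps themselves.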

Naive re-indexing pipelines multiply embedding costs fast — cache aggressively and schedule smart.

At first glance, embedding costs look trivial. Five hundred runbook pages at roughly 10 chunks each (about 512 tokens per chunk), priced at text-embedding-ada-002's $0.0001 per 1K tokens, works out to about $0.25 per full re-index. That sounds fine. But a naive pipeline that re-embeds everything on every CI merge, or that re-indexes when Confluence sends a webhook for a minor edit, turns that $0.25 into a daily charge. At scale with a self-hosted GPU model, it becomes compute time you're burning for no reason.
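The arithmetic is easy to reproduce and to rerun with your own numbers. The sketch below assumes 512 tokens per chunk (the chunk size used earlier) and, purely for illustration, 20 merges per day for the naive case.

```python
def reindex_cost_usd(
    pages: int,
    chunks_per_page: int,
    tokens_per_chunk: int,
    usd_per_1k_tokens: float,
) -> float:
    """Cost of embedding every chunk once at a per-1K-token price."""
    total_tokens = pages * chunks_per_page * tokens_per_chunk
    return total_tokens / 1000 * usd_per_1k_tokens


if __name__ == "__main__":
    full = reindex_cost_usd(500, 10, 512, 0.0001)  # 2,560,000 tokens
    merges_per_day = 20                            # hypothetical CI volume
    print(f"one full re-index: ${full:.2f}")
    print(f"naive pipeline per day: ${full * merges_per_day:.2f}")
    print(f"naive pipeline per 30 days: ${full * merges_per_day * 30:.2f}")
```

The per-run figure stays small; the multiplier from re-running it on every merge is what makes the naive pipeline expensive.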

The fix is two-layered. First, the Redis embedding cache with key pattern emb:v1:{sha256(chunk_text)}: identical chunk content across different documents or pipeline runs hits the cache, not the model. Include a version prefix (v1) so that when you upgrade your embedding model, you can invalidate the entire cache cleanly by bumping to v2 without touching cache logic. Second, schedule full re-indexes weekly. Run incremental re-indexing (changed documents only, via hash comparison) on every merge to main. This keeps the index current without re-embedding stable content.

One more cost lever: use gRPC instead of HTTP for Qdrant batch upserts. The default HTTP port is 6333, gRPC is 6334. Switching to gRPC gives approximately 30% lower latency on batch operations. That is not a cost saving directly, but it reduces the wall-clock time your ingestion job runs, which matters if you're paying for the compute that runs it.

