Originally published on kuryzhev.cloud
Your on-call engineer gets paged at 2 AM and your RAG system confidently surfaces a runbook from six months ago: deprecated after the last migration, full of references to services that no longer exist. The engineer follows it anyway. That's the failure mode nobody talks about when they say "we RAG-ified our runbooks." Building a RAG pipeline for SRE runbooks that actually works in production means getting the embedding model, the index structure, the ingestion loop, and the retrieval quality all right at the same time. These seven tips are what I wish I'd known before our first on-call integration went sideways.
Generic embedding models misread SRE jargon: domain matters more than benchmark scores.
Terms like OOMKilled, CrashLoopBackOff, HighMemoryUsage, or your internal alert names are essentially invisible to models trained on general web text. They get embedded close to random technical noise rather than clustering with semantically related runbook content. I learned this after watching text-embedding-ada-002 confidently return a Kubernetes networking runbook for a PostgreSQL replication alert because both happened to mention "connection timeout."
My current preference is BAAI/bge-small-en-v1.5 via sentence-transformers>=2.7.0. It produces 384-dimensional vectors, runs about 5x faster than ada-002 at inference time, and handles technical prose significantly better in practice. A single t3.medium can push roughly 50 embed requests per second, which is more than enough for alert-driven RAG queries, though you'll need batching for bulk re-indexing. If you need a hosted option and ada-002 is already in your stack, it's usable. OpenAI vectors come unit-normalized, so distance: Dot in your Qdrant collection config returns the same ranking as Cosine while skipping the normalization step.
One chunking detail that trips people up: don't split runbooks by fixed token count without respecting procedural step boundaries. Splitting "Step 3: drain the node" across two chunks destroys the procedural context the retriever needs. Use 512-token chunks with 64-token overlap as a starting point; the overlap preserves continuity across chunk boundaries without ballooning your index size.
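A minimal sketch of step-aware chunking, to make the idea concrete. It assumes steps are introduced by lines starting with "Step N" (optionally behind a Markdown heading marker), which is a hypothetical convention, and it uses word counts as a rough stand-in for tokens. The function names are illustrative, not part of any library.

```python
import re

# Assumed convention: a step starts on a line like "Step 3: ..." or "## Step 3: ..."
STEP_HEADING = re.compile(r"^(?:#{1,6}\s+)?Step\s+\d+\b", re.IGNORECASE)


def split_into_steps(text: str) -> list[str]:
    """Split runbook text into blocks, opening a new block at each 'Step N' line."""
    blocks: list[list[str]] = [[]]
    for line in text.splitlines():
        if STEP_HEADING.match(line.strip()) and blocks[-1]:
            blocks.append([])
        blocks[-1].append(line)
    # Drop blocks that contain only blank lines.
    return ["\n".join(b).strip() for b in blocks if any(l.strip() for l in b)]


def pack_steps(steps: list[str], max_words: int = 512) -> list[str]:
    """Greedily pack whole steps into chunks of at most max_words words.

    A single step longer than max_words is kept whole rather than cut.
    """
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for step in steps:
        n = len(step.split())
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(step)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

This variant trades the fixed overlap for whole-step chunks: a step is never split, so no overlap is needed to stitch it back together. If your runbooks have very long steps, you would likely still want a fallback splitter for those.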
Metadata filtering before semantic search cuts irrelevant results by ~60%, so don't skip it.
A pure vector search across your entire runbook corpus will always surface some plausible-but-wrong results. The fix isn't a better model; it's filtering. Before the semantic ranking even runs, filter by structured metadata fields that you already have: alert_name, service, severity, on_call_team, and critically, last_updated. That last field is the one most teams forget to store, and it's what lets you warn engineers when the best matching runbook is eight months stale.
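The staleness warning that last_updated enables could look roughly like this. It is a sketch under assumptions: last_updated is stored as an ISO 8601 timestamp with an explicit UTC offset, and the 180-day threshold and function name are hypothetical choices, not something prescribed by Qdrant or any other tool.

```python
from datetime import datetime, timezone
from typing import Optional

MAX_AGE_DAYS = 180  # hypothetical threshold; tune to your migration cadence


def staleness_warning(last_updated: str,
                      now: Optional[datetime] = None,
                      max_age_days: int = MAX_AGE_DAYS) -> Optional[str]:
    """Return a warning string if the runbook is older than max_age_days, else None.

    last_updated is assumed to look like "2024-01-15T00:00:00+00:00".
    """
    now = now or datetime.now(timezone.utc)
    updated = datetime.fromisoformat(last_updated)
    age_days = (now - updated).days
    if age_days > max_age_days:
        return f"Runbook last updated {age_days} days ago; verify before following."
    return None
```

Attaching the returned string to the notification keeps the runbook visible while telling the engineer how much to trust it.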
For the vector store itself, I use Qdrant in production. Version 1.9.x supports sparse vectors alongside dense ones via the sparse_vectors config, which lets you combine BM25-style keyword matching with semantic similarity. That is genuinely useful when alert names are exact-match keywords. If you're evaluating alternatives: Weaviate v1.24+ has the generative-openai module built in, which is tempting, but it couples your retrieval and generation layers tightly and makes model swaps painful. Pinecone namespaces work well if you're already in that ecosystem and don't need hybrid search.
Watch out for: Qdrant's default Docker image ships with zero authentication enabled. Always set the QDRANT__SERVICE__API_KEY environment variable and keep port 6333 inside a private subnet. I've seen this misconfiguration in three separate internal tooling audits.
Hash-based change detection keeps your vector store fresh without re-embedding everything on every run.
The ingestion pipeline is where most RAG implementations get lazy and end up paying for it, either in stale data or in runaway embedding API costs. The pattern I use: store a sha256 of each document's content in Redis. On every pipeline run, compare the current hash. If it matches, skip re-embedding entirely. Only new or changed content hits the embedding model.
For Git-based runbooks, enforce a path convention: docs/runbooks/{service}/{alert_name}.md. This lets you extract service and alert_name metadata directly from the file path without parsing file content, which is simpler and less error-prone. For Confluence, the REST API endpoint /wiki/rest/api/content?type=page&spaceKey=SRE works, and LangChain's ConfluenceLoader (requires atlassian-python-api>=3.41.0) gets you started fast. That said, I moved off it to a custom fetch: you get better metadata control and don't inherit LangChain's chunking decisions.
Here's the full ingestion pipeline with hash-based deduplication and Redis embedding cache:
import os
import hashlib
import json
from pathlib import Path

from dotenv import load_dotenv
import redis
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
from sentence_transformers import SentenceTransformer

load_dotenv()

QDRANT_URL = os.getenv("QDRANT_URL", "http://localhost:6333")
QDRANT_API_KEY = os.getenv("QDRANT_API_KEY")
COLLECTION_NAME = "sre_runbooks"
EMBED_MODEL = "BAAI/bge-small-en-v1.5"  # 384-dim, fast, good on technical text
CHUNK_SIZE = 512      # words per chunk, used here as a rough proxy for tokens
CHUNK_OVERLAP = 64    # word overlap to preserve continuity across chunks
SCORE_THRESHOLD = 0.78  # minimum cosine similarity to surface a result

redis_client = redis.Redis(host="localhost", port=6379, decode_responses=True)
qdrant = QdrantClient(url=QDRANT_URL, api_key=QDRANT_API_KEY)
model = SentenceTransformer(EMBED_MODEL)


def chunk_text(text: str, size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP) -> list[str]:
    """Split text into overlapping windows of whitespace-separated words.

    This is a simple sliding window: it never cuts inside a word, but it does
    not detect procedural step boundaries on its own.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    chunks, i = [], 0
    while i < len(words):
        chunk = " ".join(words[i:i + size])
        chunks.append(chunk)
        i += size - overlap  # slide with overlap
    return chunks


def embed_with_cache(text: str) -> list[float]:
    """Return cached embedding or compute and store it."""
    key = f"emb:v1:{hashlib.sha256(text.encode()).hexdigest()}"
    cached = redis_client.get(key)
    if cached:
        return json.loads(cached)
    vector = model.encode(text, normalize_embeddings=True).tolist()
    redis_client.setex(key, 604800, json.dumps(vector))  # TTL: 7 days
    return vector


def ingest_runbook(filepath: Path):
    """Parse path for metadata, chunk content, upsert to Qdrant."""
    parts = filepath.parts
    service = parts[-2] if len(parts) >= 2 else "unknown"
    alert_name = filepath.stem  # filename without .md
    content = filepath.read_text(encoding="utf-8")
    doc_hash = hashlib.sha256(content.encode()).hexdigest()
    hash_key = f"doc_hash:{filepath}"

    # Skip documents whose content hash has not changed since the last run.
    if redis_client.get(hash_key) == doc_hash:
        print(f"[SKIP] {filepath} unchanged")
        return

    chunks = chunk_text(content)
    points = []
    for idx, chunk in enumerate(chunks):
        vector = embed_with_cache(chunk)
        # Deterministic ID so re-ingesting the same chunk overwrites the old point.
        point_id = int(hashlib.sha256(f"{filepath}:{idx}".encode()).hexdigest()[:8], 16)
        points.append(PointStruct(
            id=point_id,
            vector=vector,
            payload={
                "service": service,
                "alert_name": alert_name,
                "chunk_index": idx,
                "source_path": str(filepath),
                "doc_hash": doc_hash,
                "text": chunk,
            }
        ))

    qdrant.upsert(collection_name=COLLECTION_NAME, points=points)
    redis_client.set(hash_key, doc_hash)  # update change-detection cache
    print(f"[OK] Ingested {len(points)} chunks from {filepath}")


def ensure_collection():
    """Create collection if it doesn't exist."""
    existing = [c.name for c in qdrant.get_collections().collections]
    if COLLECTION_NAME not in existing:
        qdrant.create_collection(
            collection_name=COLLECTION_NAME,
            vectors_config=VectorParams(size=384, distance=Distance.COSINE),
        )
        print(f"[INIT] Created collection: {COLLECTION_NAME}")


if __name__ == "__main__":
    ensure_collection()
    runbook_dir = Path("docs/runbooks")
    for md_file in runbook_dir.rglob("*.md"):
        ingest_runbook(md_file)
Surface runbook context automatically when an alert fires, not only when someone thinks to ask.
The real value of a RAG pipeline for SRE runbooks isn't a chat interface. It's injecting relevant procedure context into the incident notification itself, before the engineer even opens a terminal. The integration point is your Alertmanager or PagerDuty webhook. When a webhook fires, extract the alertname label (Alertmanager v2 path: .alerts[0].labels.alertname) and use it as the query string to your RAG endpoint.
One PagerDuty-specific gotcha: webhook v3 sends event.data.title as the incident name. Map this field, not event.id, to your query. I've seen this wired wrong in three different integrations, and the resulting queries return garbage.
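As a rough sketch of that mapping, the helper below pulls the title out of a parsed PagerDuty v3 webhook body and fails loudly when it is missing, instead of silently falling back to an ID. The function name is hypothetical, and the payload shape is assumed to be only what the paragraph above describes (event.data.title).

```python
def query_text_from_pagerduty(payload: dict) -> str:
    """Return event.data.title from a PagerDuty v3 webhook body as the RAG query."""
    try:
        title = payload["event"]["data"]["title"]
    except (KeyError, TypeError) as exc:
        raise ValueError("payload has no event.data.title") from exc
    if not isinstance(title, str) or not title.strip():
        raise ValueError("event.data.title is empty")
    return title.strip()
```

Raising on a missing title means a mis-wired integration shows up as an error in your logs rather than as a stream of low-quality matches.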
Set a similarity score threshold of 0.78 with cosine similarity as your starting point. Below that, return a "matched": false signal so your Slack notification can still fire, just without a runbook attachment. A "no confident match" message is far safer than surfacing a low-confidence wrong runbook. Return the top-3 chunks maximum; more than that and engineers stop reading them.
Here's the FastAPI query endpoint wired to an Alertmanager webhook payload:
import os

from fastapi import FastAPI, Request, HTTPException
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue
from sentence_transformers import SentenceTransformer

QDRANT_URL = os.getenv("QDRANT_URL", "http://localhost:6333")
QDRANT_API_KEY = os.getenv("QDRANT_API_KEY")
COLLECTION_NAME = "sre_runbooks"
SCORE_THRESHOLD = 0.78
TOP_K = 3

app = FastAPI()
qdrant = QdrantClient(url=QDRANT_URL, api_key=QDRANT_API_KEY)
model = SentenceTransformer("BAAI/bge-small-en-v1.5")


@app.post("/query/alert")
async def query_from_alert(request: Request):
    """
    Accepts Alertmanager webhook JSON.
    Extracts alertname + service label, runs filtered vector search.
    Returns top-K chunks or a no-match signal.
    """
    body = await request.json()
    try:
        alert = body["alerts"][0]
        alert_name = alert["labels"]["alertname"]  # e.g. "HighMemoryUsage"
        service = alert["labels"].get("service", None)  # optional label
    except (KeyError, IndexError):
        raise HTTPException(status_code=400, detail="Invalid Alertmanager payload")

    query_text = f"{alert_name} {service or ''}".strip()
    query_vector = model.encode(query_text, normalize_embeddings=True).tolist()

    search_filter = Filter(
        must=[FieldCondition(key="alert_name", match=MatchValue(value=alert_name))]
    ) if alert_name else None

    results = qdrant.search(
        collection_name=COLLECTION_NAME,
        query_vector=query_vector,
        query_filter=search_filter,
        limit=TOP_K,
        score_threshold=SCORE_THRESHOLD,  # drop low-confidence results
        with_payload=True,
    )

    if not results:
        return {"matched": False, "alert_name": alert_name, "chunks": []}

    return {
        "matched": True,
        "alert_name": alert_name,
        "chunks": [
            {
                "text": r.payload["text"],
                "source": r.payload["source_path"],
                "score": round(r.score, 4),
                "chunk_index": r.payload["chunk_index"],
            }
            for r in results
        ],
    }
For Slack delivery, use Block Kit's section block with a mrkdwn text field to render the runbook chunk inline alongside the alert details. Include the source_path and score so engineers immediately know where it came from and how confident the match is.
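One way to build that Slack message from the query endpoint's response, sketched with plain dictionaries. The helper names and the 2500-character truncation are assumptions made here; the truncation simply keeps each section comfortably under Slack's 3000-character limit for section text.

```python
MAX_CHUNK_CHARS = 2500  # assumed safety margin below Slack's section text limit


def _section(text: str) -> dict:
    """Build a Block Kit section block with a mrkdwn text field."""
    return {"type": "section", "text": {"type": "mrkdwn", "text": text}}


def build_slack_blocks(result: dict) -> list[dict]:
    """Turn the /query/alert response into a list of Block Kit blocks."""
    blocks = [_section(f"*Alert:* {result['alert_name']}")]
    if not result.get("matched"):
        # Still notify, but make the absence of a runbook explicit.
        blocks.append(_section("_No confident runbook match._"))
        return blocks
    for chunk in result["chunks"]:
        body = chunk["text"][:MAX_CHUNK_CHARS]
        blocks.append(_section(
            f"*Runbook:* {chunk['source']} (score {chunk['score']:.2f})\n{body}"
        ))
    return blocks
```

The returned list can be passed as the blocks field of a Slack message payload; how you post it (incoming webhook, bot token) is left to your existing notification path.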
The silent failure mode is a RAG that returns plausible-but-wrong runbook steps with high confidence.
Most teams evaluate their RAG pipeline by asking "does the LLM answer look right?" That's the wrong question. You need to evaluate whether the retrieved chunks were actually the correct runbook sections before any LLM even sees them. A well-phrased wrong answer is worse than an obvious failure.
Build a golden dataset: 20-30 pairs of (alert_name, expected_runbook_section). Run recall@3 checks: does the correct chunk appear in the top 3 results? That's your baseline metric. For a more structured eval, the ragas library (v0.1.x) provides context_recall and answer_relevancy metrics. Note that ragas requires openai>=1.0.0 and makes separate LLM calls for scoring, so budget for that API cost in your eval pipeline.
Run this eval gate on every significant change to the runbook corpus or after swapping embedding models. I caught a 15% recall drop after a Confluence space reorganization that changed page titles: the metadata-extracted alert_name fields shifted, and the pre-filter was excluding correct results. Without the eval gate, that would have silently degraded on-call for weeks.
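The recall@3 check is small enough to write by hand. The sketch below assumes your retriever can be wrapped as a function that takes an alert name and returns an ordered list of section identifiers; that wrapper and the identifier format are hypothetical and depend on how you store chunks.

```python
from typing import Callable


def recall_at_k(golden: list[tuple[str, str]],
                retrieve: Callable[[str], list[str]],
                k: int = 3) -> float:
    """Fraction of golden pairs whose expected section appears in the top-k results.

    golden:   (alert_name, expected_section_id) pairs
    retrieve: returns section ids for an alert name, best match first
    """
    if not golden:
        raise ValueError("golden dataset is empty")
    hits = sum(
        1 for alert_name, expected in golden
        if expected in retrieve(alert_name)[:k]
    )
    return hits / len(golden)
```

In CI, compare the returned value against the last known-good baseline and fail the job when it drops by more than a tolerance you are comfortable with.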
Your vector store holds internal hostnames, escalation contacts, and credential patterns, so treat it like production infrastructure.
This is the access control gap I see most often. Teams move runbooks into a vector DB, wire up a query API, and mark it "internal only" as if that's sufficient. Runbooks regularly contain things like internal service hostnames, credential rotation procedures, escalation phone trees, and network topology details. If a service account with access to your RAG query API is compromised, an attacker can enumerate your entire operational playbook through semantic search.
Enforce collection-level access in Qdrant with credentials scoped per collection (recent versions support JWT-based access control for this) rather than sharing one global API key. In Weaviate, use RBAC to scope read access by team. Never expose the RAG query endpoint without authentication, even on an internal network: lateral movement from a compromised service is a real threat model, not a theoretical one.
Watch out for: the Redis embedding cache also needs protection. Those cached vectors can be used to reconstruct approximate source text. Keep Redis on a private interface, set requirepass, and configure appropriate bind directives. I stopped treating the cache layer as "just an optimization" after reading about embedding inversion attacks, which are no longer purely academic.
Also store last_updated as a metadata field on every point. Without it, you have no way to surface a staleness warning to the on-call engineer when the best matching runbook is months old. This is a cheap field to add and an expensive oversight to fix after the fact. For more on securing internal tooling pipelines, see the patterns we cover at kuryzhev.cloud.
Naive re-indexing pipelines multiply embedding costs fast, so cache aggressively and schedule smart.
At first glance, embedding costs look trivial. Five hundred runbook pages at roughly 10 chunks each (512 tokens per chunk), priced at text-embedding-ada-002's $0.0001 per 1K tokens, works out to about $0.25 per full re-index. That sounds fine. But a naive pipeline that re-embeds everything on every CI merge, or that re-indexes when Confluence sends a webhook for a minor edit, pays that $0.25 many times a day. At scale with a self-hosted GPU model, it becomes compute time you're burning for no reason.
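To see how the numbers compound, here is the same arithmetic as a small calculation. The 512 tokens per chunk comes from the chunking settings earlier in this post; the figure of 20 full re-indexes per day is a made-up illustration of a busy repository, not a measurement.

```python
def reindex_cost_usd(pages: int, chunks_per_page: int,
                     tokens_per_chunk: int, usd_per_1k_tokens: float) -> float:
    """Cost of embedding every chunk once."""
    total_tokens = pages * chunks_per_page * tokens_per_chunk
    return total_tokens / 1000 * usd_per_1k_tokens


def monthly_cost_usd(cost_per_reindex: float, reindexes_per_day: int,
                     days: int = 30) -> float:
    """Cost of repeating a full re-index several times a day for a month."""
    return cost_per_reindex * reindexes_per_day * days


# 500 pages * 10 chunks * 512 tokens = 2,560,000 tokens per full re-index
full = reindex_cost_usd(500, 10, 512, 0.0001)   # roughly 0.256 USD
naive_month = monthly_cost_usd(full, 20)        # roughly 153.6 USD under the assumed 20 runs/day
```

The per-run cost stays small; it is the multiplier from triggering a full re-index on every merge or webhook that turns it into a real line item.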
The fix is two-layered. First, the Redis embedding cache with key pattern emb:v1:{sha256(chunk_text)}: identical chunk content across different documents or pipeline runs hits the cache, not the model. Include a version prefix (v1) so that when you upgrade your embedding model, you can invalidate the entire cache cleanly by bumping to v2 without touching cache logic. Second, schedule full re-indexes weekly. Run incremental re-indexing (changed documents only, via hash comparison) on every merge to main. This keeps the index current without re-embedding stable content.
One more cost lever: use gRPC instead of HTTP for Qdrant batch upserts. The default HTTP port is 6333, gRPC is 6334. Switching to gRPC gives approximately 30% lower latency on batch operations. That is not a cost saving directly, but it reduces the wall-clock time your ingestion job runs, which matters if you're paying for the compute that runs it.