Prompt Engineering Patterns for SRE Playbooks and Postmortems

An engineer proposes treating prompt engineering for site reliability engineering (SRE) workflows as reusable infrastructure, introducing three production-tested patterns: structured context injection for runbook generation, two-step postmortem synthesis, and LLM-as-reviewer for runbook auditing. The patterns rely on GPT-4o with JSON output enforcement and are designed to inject environment-specific context into LLM queries, making incident documentation actionable rather than generic. The engineer emphasizes versioning prompts like code and provides setup instructions for Python 3.11+, openai 1.30.0, and tiktoken 0.7.0.

Originally published on kuryzhev.cloud

Your runbook was written six months ago by someone who no longer works here. It's 3am, three payments-api pods are OOMKilling in a cascade, and the on-call engineer is staring at a Confluence page that references a Datadog dashboard that got renamed in February. This is the real failure mode nobody talks about in postmortems: the documentation debt that compounds silently until it explodes during a P1.

I've been in that seat. The thing that changed how I handle incident documentation wasn't a better wiki tool or a stricter postmortem template. It was treating prompt engineering for SRE workflows as reusable infrastructure, not as one-off ChatGPT queries.

Generic prompts fail in SRE contexts for a specific reason: they have no system topology, no severity framing, and no structured output contract. You ask "how do I fix an OOMKill?" and you get a textbook answer that has nothing to do with your 512Mi memory limit, your Redis connection pool, or your GKE 1.29 cluster. What you actually need is a prompt pattern that injects your environment's context and enforces a structured response you can act on immediately.
This post walks through three production-tested prompt engineering patterns for SRE playbooks: structured context injection for runbook generation, two-step postmortem synthesis, and LLM-as-reviewer for runbook auditing. These patterns are most powerful when maintained like code: versioned, reviewed, and updated after every incident.

Before you run any of this, get your environment straight. You'll need Python 3.11+ with two key libraries: `openai==1.30.0` and `tiktoken==0.7.0`. If you're on an older openai SDK, watch out: the v1.x release broke every legacy `openai.ChatCompletion.create` call from v0.28.x. I spent an embarrassing hour debugging that the first time I upgraded. Install the correct versions explicitly: `pip install openai==1.30.0 tiktoken==0.7.0`

For the model itself, you need GPT-4o access on your OpenAI API key. GPT-4o supports the `response_format={"type": "json_object"}` parameter we rely on for structured output, and its 128k context window handles roughly 90 pages of logs, which is enough for most incidents. Cost is approximately $5 per million input tokens as of mid-2024, so a two-step postmortem chain on a 50k-token log file runs about $0.35 per incident (roughly $0.25 of that is the first-pass input; the rest is the second pass and output tokens). At 100 incidents per month that's $35, which is negligible. But set `max_tokens` limits on every call anyway, because malformed inputs can balloon costs fast.

If you're in an air-gapped environment and can't send data to OpenAI, Ollama 0.1.32+ with `llama3:70b` is a viable alternative. Run `ollama pull llama3:70b`, but know that it requires roughly 40GB of disk and a minimum of 64GB RAM. Without that RAM headroom you'll hit swap thrashing and inference becomes unusable under load.

You'll also want at least one real or synthetic postmortem in Markdown or JSON, a sample runbook in plain text, and optionally a PagerDuty API token for pulling live incident metadata. Store your prompt templates under a versioned path. I use the convention `./prompts/sre/postmortem_synthesis_v2.txt`.
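To make the versioned-path convention concrete, here is a minimal sketch of picking the newest template for a given prompt. The `latest_prompt_path` helper and the `<name>_v<N>.txt` parsing rule are assumptions for illustration, inferred from the example filename above; they are not part of the original tooling.

```python
import re
from pathlib import Path

# Assumed naming convention: <name>_v<N>.txt, e.g. postmortem_synthesis_v2.txt
VERSION_RE = re.compile(r"^(?P<name>.+)_v(?P<version>\d+)\.txt$")


def latest_prompt_path(prompt_dir: Path, name: str) -> Path:
    """Return the highest-numbered template file for `name` (hypothetical helper)."""
    candidates = []
    for path in prompt_dir.glob(f"{name}_v*.txt"):
        match = VERSION_RE.match(path.name)
        # Compare versions numerically so v10 sorts above v2.
        if match and match.group("name") == name:
            candidates.append((int(match.group("version")), path))
    if not candidates:
        raise FileNotFoundError(f"no versioned template for {name!r} in {prompt_dir}")
    return max(candidates, key=lambda item: item[0])[1]
```

Something like `latest_prompt_path(Path("./prompts/sre"), "postmortem_synthesis").read_text()` would then load the current template. Pinning an explicit version in production and using "latest" only in development is probably the safer default.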
Version your prompts like you version your Terraform modules. They are infrastructure.

The core problem with generic LLM runbooks is that the model has no idea what your service actually looks like. This pattern solves that by building a structured context block (service name, dependencies, SLO thresholds, known failure modes, namespace) and injecting it as a YAML-style front-matter block before the user query. Pair this with role-priming in the system message and JSON schema enforcement in the prompt, and you get output that's immediately actionable rather than generically educational.

The function below sets up the role-primed system message, builds the context block from a service metadata dictionary, and enforces a strict JSON schema for the response. We use `response_format={"type": "json_object"}` on the API call to get syntactically valid JSON back (a response cut off by `max_tokens` can still be incomplete). We also run a token count before every API call using tiktoken with the `cl100k_base` encoding. GPT-4o's native encoding is `o200k_base`, so treat the count as an estimate. This check prevents the `openai.BadRequestError: 400 - context_length_exceeded` error that will otherwise ambush you when someone injects an entire application log as context.

Watch out for this: Never inject raw 10,000-line log files as context. The model attends poorly to tokens in the middle of a massive context window. Pre-filter to error-level events and key timestamps before injection. I learned this the hard way after getting a perfectly formatted runbook that addressed the wrong error entirely: the actual OOMKill signal was buried at line 4,300 and the model focused on a harmless warning at line 200.
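The pre-filtering advice above can be sketched roughly as follows. This is a minimal example, not the author's filter: the `prefilter_logs` name, the level keywords in `ERROR_PATTERN`, and the one-event-per-line log shape are all assumptions you would adapt to your own log format.

```python
import re

# Assumed log shape: one event per line, with a level or signal keyword somewhere in it.
ERROR_PATTERN = re.compile(r"\b(ERROR|FATAL|CRITICAL|OOMKilled|OOMKilling)\b")


def prefilter_logs(raw_logs: str, context_lines: int = 1) -> str:
    """Keep error-level lines plus a few neighbours for context (hypothetical helper)."""
    lines = raw_logs.splitlines()
    keep: set[int] = set()
    for i, line in enumerate(lines):
        if ERROR_PATTERN.search(line):
            start = max(0, i - context_lines)
            end = min(len(lines), i + context_lines + 1)
            keep.update(range(start, end))
    # Preserve original ordering so timestamps stay chronological.
    return "\n".join(lines[i] for i in sorted(keep))
```

Running the filter before any token counting or truncation means the budget is spent on the lines most likely to carry the incident signal, rather than on whatever happens to sit at the top of the file.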
```python
# sre_prompt_engine.py
# Prompt engineering patterns for SRE playbooks and postmortems
# Requires: openai==1.30.0, tiktoken==0.7.0, python 3.11+

import json
import tiktoken
from openai import OpenAI
from pathlib import Path

client = OpenAI()  # reads OPENAI_API_KEY from environment


# --- Token safety: never exceed 100k tokens of context ---
# Note: cl100k_base is not GPT-4o's native encoding (that is o200k_base),
# so these counts are estimates used only as a safety margin.
def count_tokens(text: str, model: str = "gpt-4o") -> int:
    enc = tiktoken.get_encoding("cl100k_base")
    return len(enc.encode(text))


def truncate_to_token_limit(text: str, limit: int = 80000) -> str:
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    if len(tokens) > limit:
        print(f"[WARN] Truncating context from {len(tokens)} to {limit} tokens")
        tokens = tokens[:limit]
    return enc.decode(tokens)


# --- Pattern 1: Structured Context Injection for Playbook Generation ---
def generate_playbook(service_context: dict, alert_description: str) -> dict:
    """
    Injects structured service metadata + alert into a role-primed prompt.
    Returns a JSON-structured runbook with immediate actions, rollback command, etc.
    """
    system_prompt = """You are a senior SRE at a company running Kubernetes 1.29 on GKE
with Datadog monitoring and PagerDuty alerting. You write precise, actionable runbooks.
Never suggest commands you cannot verify. Flag uncertain steps with [UNVERIFIED]."""

    # Build context block from service metadata
    context_block = f"""
## Service Context
- Service: {service_context['name']}
- Dependencies: {', '.join(service_context['dependencies'])}
- SLO: {service_context['slo_target']}% availability over 30 days
- Known failure modes: {json.dumps(service_context['known_failures'], indent=2)}
- Namespace: {service_context['namespace']}
"""

    user_prompt = f"""
{context_block}

## Active Alert
{alert_description}

Generate a runbook for this alert. Respond ONLY with valid JSON matching this schema:
{{
  "title": "string",
  "severity": "P1|P2|P3",
  "immediate_actions": ["string"],
  "diagnostic_commands": ["string"],
  "escalation_path": ["string"],
  "rollback_command": "string",
  "blast_radius": "string",
  "estimated_resolution_time": "string"
}}
"""

    token_count = count_tokens(system_prompt + user_prompt)
    print(f"[INFO] Sending {token_count} tokens to GPT-4o")

    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.2,  # low temp = consistent commands, less hallucination
        max_tokens=1500,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )

    return json.loads(response.choices[0].message.content)


# --- Example usage ---
if __name__ == "__main__":
    service_ctx = {
        "name": "payments-api",
        "dependencies": ["postgres-primary", "redis-cache", "stripe-gateway"],
        "slo_target": 99.9,
        "namespace": "production",
        "known_failures": [
            "OOMKill under 500 concurrent requests",
            "Redis connection pool exhaustion during deploy",
        ],
    }

    alert = (
        "OOMKilling detected on payments-api pods "
        "(3/5 pods restarted in last 10 minutes). Memory limit: 512Mi."
    )

    playbook = generate_playbook(service_ctx, alert)
    print(json.dumps(playbook, indent=2))

    # Save versioned output: treat generated runbooks as artifacts
    out_path = Path("./runbooks/generated/payments-api-oomkill.json")
    out_path.parent.mkdir(parents=True, exist_ok=True)  # avoid failing on a fresh checkout
    out_path.write_text(json.dumps(playbook, indent=2))
```

The `temperature=0.2` setting is non-negotiable for SRE use cases. I tested this extensively: at `temperature=0.7` the model starts inventing kubectl flags that don't exist. At 0.2 the output is not strictly deterministic, but it is far more consistent across runs. More on that in the verification section.

The single biggest mistake I see teams make with LLM postmortem generation is using one monolithic prompt for both timeline extraction and root cause analysis. One prompt, one job.
Splitting into a two-step chain improves RCA accuracy: the first pass extracts a clean chronological timeline from raw logs and Slack threads, and the second pass uses that structured timeline to synthesize the full postmortem. You're giving the model cleaner, denser signal at each step rather than asking it to do two cognitively distinct tasks simultaneously.

The two-step chain below enforces Google SRE postmortem format via explicit section headers in the prompt instruction. The first call extracts timeline events as a JSON array with `timestamp`, `event`, and `source` fields. The second call uses that output to write the full document with Summary, Timeline, Root Cause, Contributing Factors, Impact, and Action Items sections.

Critical security note: Never inject raw PagerDuty API responses or Datadog alert payloads directly into prompts sent to OpenAI. Strip PII, internal hostnames, and IP addresses first. Build a sanitization function and make it mandatory in your pipeline, not optional. I treat this the same way I treat secrets management: if it's not enforced in code, it will eventually be violated under incident pressure at 3am.

```python
# --- Pattern 2: Two-Step Postmortem Synthesis ---
def synthesize_postmortem(raw_logs: str, slack_thread: str) -> str:
    """
    Step 1: Extract timeline from raw data.
    Step 2: Synthesize full postmortem from timeline.
    Splitting into two calls improves RCA accuracy significantly.
    """
    # Sanitize before sending: strip internal IPs and hostnames.
    # In production: replace with your sanitization function.
    # (As written, this line only truncates; it does not sanitize.)
    safe_logs = truncate_to_token_limit(raw_logs, limit=60000)

    # Step 1: Timeline extraction. One job, clean output.
    step1_response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.1,  # even lower temp for factual extraction
        max_tokens=2000,
        messages=[{
            "role": "user",
            "content": f"""Extract a chronological incident timeline from the following logs and Slack thread.
Output as a JSON array of objects with fields: timestamp, event, source (logs|slack|alert).

Logs:
{safe_logs}

Slack thread:
{slack_thread}""",
        }],
    )
    timeline = step1_response.choices[0].message.content

    # Step 2: Full postmortem synthesis using the extracted timeline
    step2_response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.2,
        max_tokens=3000,
        messages=[{
            "role": "user",
            "content": f"""Using this incident timeline, write a Google SRE-format postmortem.
Use these exact section headers:
Summary
Timeline
Root Cause
Contributing Factors
Impact
Action Items

Timeline:
{timeline}""",
        }],
    )

    return step2_response.choices[0].message.content
```

If you store the postmortem as JSON (for example, by asking the second step for JSON output instead of prose sections), pipe it through jq to assert required fields exist before writing to Confluence or Notion. Something as simple as `jq 'has("root_cause") and has("action_items")'` will catch incomplete outputs before they become official documentation. The OOMKill diagnostic command I use to pull initial incident context before feeding it to the synthesizer: `kubectl get events --field-selector reason=OOMKilling -n production --sort-by='.lastTimestamp'`.

This is the pattern I use on every PR that touches a runbook. The idea is simple: put the LLM in a critic role, give it the existing runbook as context, and ask it to output a gap analysis as a numbered list with severity tags. The prompt explicitly includes a "devil's advocate" instruction: ask the model to simulate what breaks if each step is followed during a partial network outage. That single addition has caught more gaps than any human review I've seen.

```python
# validate_runbook.py
# Pattern 3: LLM-as-Reviewer, audit existing runbooks for gaps
# Outputs gap analysis suitable for GitHub PR comment or Confluence

import sys
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from environment

CRITIC_SYSTEM_PROMPT = """You are a senior SRE performing a runbook audit.
Your job is to find gaps, outdated commands, missing rollback steps, and unclear escalation paths.
Be specific. Reference line numbers where possible. Never hallucinate fixes; if you are unsure, say so."""


def audit_runbook(runbook_text: str, service_name: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.2,
        max_tokens=2000,
        messages=[
            {"role": "system", "content": CRITIC_SYSTEM_PROMPT},
            {"role": "user", "content": f"""
Audit this runbook for service: {service_name}

For each issue found, output a line in this format:
[SEVERITY] Line ~N: Description of issue. Suggested fix.

Severity levels: [CRITICAL] [WARN] [INFO]

Also answer: What breaks if this runbook is followed during a partial network outage?

Runbook:
---
{runbook_text}
---
"""},
        ],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    # Runbook path comes from the first CLI argument, with a default for local testing
    runbook_path = sys.argv[1] if len(sys.argv) > 1 else "./runbooks/payments-api-oomkill.md"

    with open(runbook_path) as f:
        runbook = f.read()

    result = audit_runbook(runbook, service_name="payments-api")
    print(result)

    # Write to file for GitHub Actions PR comment injection
    with open("./runbook_audit_output.txt", "w") as out:
        out.write(result)
```

A real audit output from this pattern looks like this:

[CRITICAL] Line ~12: kubectl delete pod used without --grace-period=0 flag. During OOMKill cascade this will hang. Use: kubectl delete pod