The Human-in-the-Loop SRE: Designing Automation Escalation Policies for AI-Assisted Operations

A developer proposes a formal automation autonomy spectrum for AI-assisted SRE operations, defining five levels from manual to autonomous with a critical constraint that no action may remain at Level 4 permanently. The framework is motivated by the 2021 Fastly CDN outage, which highlighted the speed asymmetry between automated propagation and human response. The spectrum ties autonomy level to confidence, blast radius, novelty, and regulatory context, requiring scheduled re-qualification reviews for fully autonomous actions.

On June 8, 2021, a valid customer configuration change triggered a latent software bug in Fastly's CDN and caused a global outage that took down the UK government website, the New York Times, Reddit, and hundreds of other major internet properties for approximately one hour. The triggering event was a configuration push. The propagation mechanism was automated. The time between the configuration being pushed and the global impact becoming visible was under a minute. The time required for human operators to identify the cause, disable the configuration, and restore most of the network was roughly forty-nine minutes.

The Fastly incident is not primarily a story about automation failure. It is a story about the speed asymmetry between automated propagation and human response, and about what happens when the automation layer between a human decision and its production consequence moves faster than the accountability layer designed to govern it.

This asymmetry is the defining operational challenge of AI-assisted SRE. The capability to automate incident detection, root cause hypothesis generation, and even remediation is now accessible at costs and latencies that were unavailable five years ago. The operational risk is not that this capability will be under-used. The risk is that it will be deployed without a rigorous escalation policy: a formal framework that defines exactly where automated execution ends and human judgement begins, under what conditions the boundary shifts, and how accountability is preserved for every action the AI takes on behalf of an operator who may not have been in the room when it was taken.

AI-assisted SRE operations do not exist at a single point on the autonomy spectrum. They exist across a range, and the appropriate position on that range is a function of confidence, blast radius, novelty, and regulatory context, not of how sophisticated the AI system is.
THE AUTOMATION AUTONOMY SPECTRUM
────────────────────────────────────────────────────────────────────────────
LEVEL 0: MANUAL
  AI generates no recommendations. Human observes raw telemetry and decides.
  Appropriate when: AI system is unavailable, untrusted, or context is
  outside AI training distribution entirely.

LEVEL 1: ASSISTED
  AI surfaces relevant context, correlated signals, and historical patterns.
  Human makes all decisions. AI does not recommend actions.
  Appropriate when: novel failure pattern; first occurrence of incident type;
  regulated change requiring documented human judgement.

LEVEL 2: SUPERVISED
  AI recommends specific actions with confidence scores. Human approves each
  action before execution. AI does not execute autonomously.
  Appropriate when: high blast radius; unfamiliar but not novel pattern;
  action is reversible but consequential.

LEVEL 3: CONDITIONAL AUTONOMOUS
  AI executes actions autonomously within pre-approved policy boundaries.
  Human is notified after execution. Human can abort within a defined window.
  Appropriate when: well-characterised failure pattern; low blast radius;
  action is fully reversible; pattern seen N times with consistent outcome.

LEVEL 4: AUTONOMOUS
  AI executes and verifies remediation without human notification unless
  verification fails. Audit trail maintained.
  Appropriate when: toil pattern fully characterised; action is idempotent;
  blast radius is bounded to a single service; recurrence rate justifies
  zero-latency response.
────────────────────────────────────────────────────────────────────────────
CRITICAL CONSTRAINT: No action may exist permanently at Level 4. Every
Level 4 automation must have a scheduled re-qualification review that
reassesses whether the failure pattern is still well-characterised and the
blast radius assumption still holds.
────────────────────────────────────────────────────────────────────────────

The critical constraint, that no action may exist permanently at Level 4, is not conservatism.
It is the engineering response to a specific failure mode: automation that was correctly calibrated at deployment time and has silently drifted out of calibration as the system evolved. An OOM restart automation that was safe when first deployed becomes unsafe the moment the underlying cause shifts from a memory leak to a data corruption event that is triggering the same symptom. The re-qualification review is the mechanism that catches this drift before it produces an incident.

Every escalation policy is built from four primitive triggers: confidence, blast radius, novelty, and regulatory context. Each trigger defines a condition under which the automation level must shift toward more human involvement (a lower level number on the spectrum), not less.

The confidence trigger fires when the AI system's confidence in its diagnosis or recommended action has fallen below a defined threshold. In the context of LLM-based operations (HolmesGPT, LiteLLM Proxy routing), confidence is expressed as a combination of model-reported token probability distributions and domain-specific heuristics applied to the recommendation output. A low-confidence diagnosis means the AI has identified a plausible pattern match but lacks sufficient corroborating signal to recommend action without human review. Executing actions based on low-confidence diagnoses is the operational equivalent of acting on a single data point in a monitoring dashboard: occasionally correct, reliably dangerous as a policy.

The blast radius trigger fires when the proposed action affects more infrastructure than the policy authorises for autonomous execution. Blast radius is assessed across three dimensions: service count (how many services are affected), traffic fraction (what percentage of user requests are served by the affected infrastructure), and reversibility (can the action be undone in under five minutes with a single command). High blast radius is not a disqualifying condition for automation. It is a condition that requires the automation level to shift to Level 2 (supervised) or a more restrictive level, regardless of confidence score.
The novelty trigger fires when the failure pattern does not match any pattern in the AI system's training corpus or historical incident database. Novelty is the most dangerous condition for autonomous execution because it is precisely the condition where the AI's pattern-matching capability provides the least value, and where a confident-sounding but incorrect recommendation carries the highest operational cost. Novelty detection is the hardest trigger to implement well, because it requires the AI system to accurately assess the boundaries of its own knowledge. A system that cannot reliably distinguish "I have seen this pattern and am confident" from "I have seen a superficially similar pattern and am extrapolating" should not be operating at Level 3 or Level 4.

The regulatory trigger fires when the proposed action would touch a regulated asset, require a documented change record, affect a system subject to NERC CIP, PCI-DSS, HIPAA, or equivalent obligations, or generate a compliance event. In regulated environments, no automated action may bypass the change management governance framework, regardless of confidence score or blast radius. This trigger is absolute. It does not have a confidence threshold exception. An AI system that correctly diagnoses a production issue with 99% confidence and proposes a remediation that would constitute an undocumented change to a regulated asset must escalate to Level 2 at minimum and generate a change record, even if the remediation would restore service faster without it. (The example policy in this post is stricter still and assigns regulated assets to Level 1.)

The escalation policy is an operational governance document, not a configuration file. It must be version-controlled, reviewed and approved by SRE leadership and compliance, and referenced in every AI-assisted automation's runtime configuration. Its authority derives from human review, not from the AI system that consults it.
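To make the four triggers concrete, here is a minimal Python sketch of how they might combine into a single autonomy level. It is an illustration under stated assumptions, not HolmesGPT's actual logic. The thresholds are borrowed from the example policy in this post. The `Assessment` type and its field names are hypothetical. Treating one to nine prior occurrences as supervised is an assumption, since the policy only names the first two occurrences and the ten-occurrence bar. Error budget gates and active-incident conditions are deliberately left out.

```python
from dataclasses import dataclass

# Thresholds copied from the example policy in this post; placeholders
# until calibrated against your own incident history.
AUTONOMOUS_CONFIDENCE = 0.85
SUPERVISED_CONFIDENCE = 0.60
MAX_REPLICA_FRACTION = 0.20
MIN_OCCURRENCES_FOR_AUTONOMY = 10


@dataclass(frozen=True)
class Assessment:
    """Hypothetical pre-execution assessment of one proposed action."""
    confidence: float        # composite score, 0.0 to 1.0
    services_affected: int
    replica_fraction: float  # fraction of replicas the action touches
    cross_namespace: bool
    prior_occurrences: int   # matching incidents in history
    regulated_asset: bool


def evaluate_triggers(a: Assessment) -> dict[str, int]:
    """Map each fired trigger to the highest autonomy level it still permits."""
    fired: dict[str, int] = {}
    if a.regulated_asset:
        fired["regulatory"] = 1  # absolute: no confidence exception
    if a.prior_occurrences == 0:
        fired["novelty"] = 1
    elif a.prior_occurrences < MIN_OCCURRENCES_FOR_AUTONOMY:
        fired["novelty"] = 2     # assumption: 1 to 9 occurrences stay supervised
    if a.confidence < SUPERVISED_CONFIDENCE:
        fired["confidence"] = 1
    elif a.confidence < AUTONOMOUS_CONFIDENCE:
        fired["confidence"] = 2
    if (a.services_affected > 1 or a.cross_namespace
            or a.replica_fraction > MAX_REPLICA_FRACTION):
        fired["blast_radius"] = 2
    return fired


def autonomy_level(a: Assessment) -> int:
    """The most restrictive fired trigger wins; no trigger means Level 4."""
    return min(evaluate_triggers(a).values(), default=4)
```

The design choice worth noting is that triggers only ever lower the level. A high confidence score cannot buy back autonomy that a regulatory or blast radius trigger has removed.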
ESCALATION POLICY: AI-ASSISTED INCIDENT RESPONSE
────────────────────────────────────────────────────────────────────────────
Service:         production-platform (all services)
AI System:       HolmesGPT + LiteLLM Proxy + Ollama / GitHub Models
Policy Version:  v1.3 | Approved: SRE Lead + VP Engineering
Last Reviewed:   2025-Q1 | Next Review: 2025-Q2
────────────────────────────────────────────────────────────────────────────
SECTION 1: AUTONOMOUS EXECUTION AUTHORISED (Level 4)

  Conditions required (ALL must be true):
    ✓ Confidence score ≥ 0.85 (model-reported + heuristic composite)
    ✓ Pattern seen ≥ 10 times in incident history with consistent outcome
    ✓ Blast radius: single service, single namespace, ≤ 20% of replicas
    ✓ Action is idempotent and fully reversible in ≤ 5 minutes
    ✓ No regulated asset in scope
    ✓ Error budget > 25% remaining (not in Tier 3 freeze)

  Authorised actions at Level 4:
    → Rolling restart of single stateless deployment (OOM, deadlock)
    → Scale-up of single HPA-managed deployment by ≤ 2 replicas
    → Certificate rotation on non-production workloads
    → Log pipeline gateway restart (telemetry outage, no production impact)

  Required logging: structured Splunk event per action (mandatory)
  Re-qualification: every 90 days or after any incident where autonomous
                    action was taken and outcome was suboptimal

SECTION 2: SUPERVISED EXECUTION (Level 2: Human Approval Required)

  Conditions triggering Level 2 (ANY is sufficient):
    ⚠ Confidence score 0.60–0.84
    ⚠ Blast radius: > 20% of replicas OR > 1 service OR cross-namespace
    ⚠ First or second occurrence of this failure pattern
    ⚠ Error budget between 25–75% (Tier 2 degraded)
    ⚠ Action affects shared infrastructure (Argo CD, Prometheus, Istio)

  Approval mechanism: Slack approval button with 10-minute timeout
  Timeout behaviour: escalate to on-call if no response in 10 minutes
  Required logging: recommendation + approval/rejection + outcome

SECTION 3: ASSISTED ONLY (Level 1: No Action Authorised)

  Conditions triggering Level 1 (ANY is sufficient):
    ✗ Confidence score < 0.60
    ✗ Novel failure pattern (no match in incident history)
    ✗ Regulated asset in scope (NERC CIP, PCI-DSS, HIPAA boundary)
    ✗ Error budget < 25% (Tier 3 freeze: deployment freeze active)
    ✗ Active P0 incident in progress (human incident commander owns scope)
    ✗ Multiple simultaneous incidents (blast radius assessment unreliable)

  AI role at Level 1: surface correlated signals, historical context only
  Human owns: diagnosis, action decision, execution, verification

SECTION 4: ACCOUNTABILITY CHAIN

  Every AI-assisted action must trace to one of:
    (a) Direct human approval (Level 2 Slack approval button)
    (b) This policy document (Level 4 autonomous execution)

  "The AI decided" is not a complete accountability chain.

  Policy document owner: SRE Lead
  Policy review and approval authority: SRE Lead + VP Engineering
────────────────────────────────────────────────────────────────────────────

The escalation policy document defines the governance rules. The escalation architecture implements those rules as runtime logic in the AI-assisted operations stack. The architecture shown here is specific to the HolmesGPT + LiteLLM Proxy + Ollama deployment pattern in a regulated on-premises environment.
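One small piece of that runtime logic, the Section 1 re-qualification rule (every 90 days, or after a suboptimal autonomous outcome), can be sketched in a few lines of Python. This is a hedged illustration only: the function names are hypothetical, and demoting an overdue Level 4 action to Level 2 is an assumption of this sketch rather than something the policy text prescribes.

```python
from datetime import date, timedelta

REQUALIFICATION_INTERVAL = timedelta(days=90)  # cadence from Section 1 of the policy


def requalification_due(last_review: date, today: date,
                        suboptimal_outcome_since_review: bool) -> bool:
    """True when a Level 4 automation must go back through review."""
    overdue = (today - last_review) >= REQUALIFICATION_INTERVAL
    return overdue or suboptimal_outcome_since_review


def effective_level(configured_level: int, review_due: bool) -> int:
    """Assumption, not from the policy text: an unreviewed Level 4 action
    is demoted to Level 2 (human approval) until the review completes."""
    if configured_level == 4 and review_due:
        return 2
    return configured_level
```

Evaluating the rule at decision time, rather than relying only on a calendar reminder, means an automation whose review has lapsed cannot keep acting on stale assumptions.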
    # HolmesGPT Escalation Policy ConfigMap
    # Consumed by HolmesGPT at runtime to determine autonomy level per action
    # Version-controlled in git; updated only via Argo CD sync (change record enforced)
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: holmesgpt-escalation-policy
      namespace: holmesgpt
      annotations:
        sre.internal/policy-version: "v1.3"
        sre.internal/approved-by: "sre-lead,vp-engineering"
        sre.internal/approved-date: "2025-03-15"
        sre.internal/next-review: "2025-06-15"
        sre.internal/review-enforced-by: "kyverno-policy/ai-ops-policy-review"
    data:
      escalation_policy.yaml: |
        confidence_thresholds:
          autonomous: 0.85
          supervised: 0.60
          assisted_only: 0.0
        blast_radius_limits:
          autonomous:
            max_replica_fraction: 0.20
            max_service_count: 1
            max_namespace_count: 1
            cross_namespace_allowed: false
            regulated_assets_allowed: false
        autonomous_actions_allowlist:
          - action: rolling_restart_stateless
            max_replicas_affected: 5
            requires_pdb_check: true
          - action: hpa_scale_up
            max_replica_delta: 2
            requires_current_below_sot: true
          - action: log_pipeline_restart
            namespaces: [monitoring, sre-platform]
            production_namespaces_blocked: true
        error_budget_gates:
          tier_3_freeze_blocks_autonomous: true
          tier_2_degrades_to_supervised: true
        regulatory_boundary:
          always_level_1_namespaces:
            - pci-zone
            - hipaa-zone
            - nerc-cip-zone
          always_level_1_labels:
            - "compliance.internal/regulated=true"
        novelty_detection:
          min_historical_occurrences_for_autonomous: 10
          similarity_threshold: 0.80
          unknown_pattern_forces_level_1: true
        approval_workflow:
          slack_channel: "sre-aiops-approvals"
          timeout_minutes: 10
          timeout_action: escalate_to_oncall
        audit:
          splunk_sourcetype: "sre:holmesgpt:decisions"
          log_all_recommendations: true
          log_operator_overrides: true
          override_feeds_prompt_review: true

The LiteLLM Proxy's model routing configuration is a first-class component of the escalation architecture. Routing to the right model at the right confidence tier is not a performance optimisation. It is a safety mechanism.
    # LiteLLM Proxy: Model Routing for Escalation Tiers
    # Smaller local models for low blast radius / routine patterns
    # Larger models with greater context window for high blast radius / novel patterns
    # On-premises models for regulated asset investigations (data sovereignty)
    model_list:
      # Tier 1: Routine investigation, local Ollama model
      # Low latency, no data egress, adequate for well-characterised patterns
      - model_name: holmesgpt-routine
        litellm_params:
          model: ollama/llama3.1:8b
          api_base: http://ollama.ai-ops.svc.cluster.local:11434
          timeout: 30
          max_tokens: 2048

      # Tier 2: Complex investigation, larger local model
      # Higher accuracy for multi-service correlation and novel patterns
      - model_name: holmesgpt-complex
        litellm_params:
          model: ollama/llama3.1:70b
          api_base: http://ollama.ai-ops.svc.cluster.local:11434
          timeout: 90
          max_tokens: 8192

      # Tier 3: High-stakes / novel pattern, GitHub Models
      # Largest context window for multi-service incident correlation
      # Data classification check required before routing: no PII, no regulated data
      - model_name: holmesgpt-highstakes
        litellm_params:
          model: github/gpt-4o
          api_base: https://models.inference.ai.azure.com
          api_key: "os.environ/GITHUB_MODELS_PAT"
          timeout: 120
          max_tokens: 16384

    router_settings:
      routing_strategy: custom
      routing_logic: |
        # Route by blast radius tier (header set by HolmesGPT pre-routing assessment)
        if blast_radius_tier == "low" and pattern_novelty == "known":
            return "holmesgpt-routine"
        elif blast_radius_tier == "high" or pattern_novelty == "novel":
            # Data classification gate before external model routing
            if data_contains_regulated_fields:
                return "holmesgpt-complex"  # Stay on-premises
            return "holmesgpt-highstakes"
        else:
            return "holmesgpt-complex"
      fallback_model: holmesgpt-complex  # Always fall back to on-premises
      fallback_on_status_codes: [429, 500, 503]

The operational risk of AI-assisted recommendations is not static. It evolves as the system changes and as the model's training distribution diverges from the current operational reality.
An AI recommendation quality feedback loop is the mechanism that makes this drift visible before it produces a damaging autonomous action.

    # Prometheus Recording Rules: AI Recommendation Quality Tracking
    # Measures whether HolmesGPT recommendations are operationally valuable
    # High override rate or low action rate = recommendation quality degrading
    groups:
      - name: holmesgpt.recommendation_quality
        rules:
          # Recommendation acceptance rate: fraction of recommendations that
          # operators acted on (approved or executed autonomously) versus
          # rejected or ignored
          - record: holmesgpt:recommendation_acceptance_rate:rate7d
            expr: |
              sum(rate(holmesgpt_recommendations_acted_on_total[7d]))
              /
              sum(rate(holmesgpt_recommendations_total[7d]))

          # Operator override rate: fraction of autonomous actions that were
          # manually reversed by an operator after execution
          # High rate = autonomous confidence thresholds are too permissive
          - record: holmesgpt:autonomous_override_rate:rate7d
            expr: |
              sum(rate(holmesgpt_autonomous_actions_reversed_total[7d]))
              /
              sum(rate(holmesgpt_autonomous_actions_total[7d]))

          # False positive rate: recommendations made but outcome was NOT
          # the recommended action resolving the incident
          - record: holmesgpt:false_positive_rate:rate7d
            expr: |
              sum(rate(holmesgpt_recommendations_outcome_mismatch_total[7d]))
              /
              sum(rate(holmesgpt_recommendations_acted_on_total[7d]))

          # Alert: recommendation quality degrading
          - alert: HolmesGPT_RecommendationQualityDegrading
            expr: |
              holmesgpt:autonomous_override_rate:rate7d > 0.15
              or
              holmesgpt:false_positive_rate:rate7d > 0.20
            for: 1d
            labels:
              severity: ticket
              domain: ai_ops_quality
            annotations:
              summary: >
                HolmesGPT recommendation quality below threshold.
                Override rate: {{ with query "holmesgpt:autonomous_override_rate:rate7d" }}
                {{ . | first | value | humanizePercentage }}{{ end }}.
                Action: review recent overrides, update prompt context,
                consider reducing autonomous confidence threshold.
              runbook: "https://wiki.internal/sre/runbooks/holmesgpt-quality-review"

          # Alert: recommendation volume causing alert fatigue risk
          # More than 3 recommendations per incident = cognitive overload signal
          - alert: HolmesGPT_RecommendationVolumeHigh
            expr: |
              sum(rate(holmesgpt_recommendations_total[1h]))
              /
              sum(rate(incidents_opened_total[1h])) > 3
            for: 30m
            labels:
              severity: ticket
            annotations:
              summary: >
                HolmesGPT generating > 3 recommendations per incident on average.
                Risk: alert fatigue causing operators to ignore recommendations.
                Action: tighten confidence floor or reduce recommendation scope.

The accountability chain principle, that every AI-assisted action must trace back to a human decision (either a direct approval or a policy that a human wrote and approved), is the operational implementation of the NIST AI Risk Management Framework's GOVERN function. The NIST AI RMF establishes four core functions for AI risk management: GOVERN (policies, accountability), MAP (risk identification), MEASURE (risk quantification), and MANAGE (risk response). Each function maps directly to components of the escalation policy architecture.

NIST AI RMF MAPPING: AI-ASSISTED SRE OPERATIONS
────────────────────────────────────────────────────────────────────────────
GOVERN: Accountability and Policy
  Who owns the AI system's outputs?
    → SRE Lead owns escalation policy; VP Engineering co-approves
  Who approves autonomous action boundaries?
    → Policy document with named approvers and review cadence
  How are accountability chains maintained?
    → Splunk audit trail: every recommendation, decision, and outcome
  SRE implementation: escalation policy document + approval workflow

MAP: Risk Identification
  What failure modes does the AI system face?
    → Confidence decay: model accuracy degrades as system evolves
    → Distribution shift: production patterns diverge from training data
    → Novel pattern extrapolation: confident recommendation on unfamiliar input
    → Blast radius miscalculation: action scope larger than assessed
  SRE implementation: four escalation triggers + novelty detection

MEASURE: Risk Quantification
  How do you measure AI recommendation quality over time?
    → Acceptance rate: fraction of recommendations acted on
    → Override rate: fraction of autonomous actions manually reversed
    → False positive rate: recommendations where predicted outcome was wrong
    → Confidence calibration: does 85% confidence actually mean 85% accuracy?
  SRE implementation: Prometheus quality recording rules + 7-day rolling metrics

MANAGE: Risk Response
  What happens when AI recommendation quality degrades?
    → Automatic downgrade of autonomous confidence threshold
    → Prompt context refresh from recent incident postmortems
    → Temporary suspension of Level 4 autonomy pending review
  SRE implementation: quality alert → runbook → policy review cadence
────────────────────────────────────────────────────────────────────────────

In regulated environments, the audit trail for AI-assisted actions is not optional. It is the documentary evidence that demonstrates human accountability over automated decisions: the record that answers the auditor's question, "Who authorised this change to your production system?"
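The accountability requirement can also be checked mechanically before an audit event is forwarded. The sketch below is a minimal Python illustration, not part of HolmesGPT or Splunk. It assumes events are dictionaries using the field names of the audit event structure in this post, written with underscores. The `approved_by` field and the function name are hypothetical additions for illustration.

```python
def accountability_source(event: dict) -> str:
    """Return the human decision an action traces to, or raise if none exists."""
    decision = event.get("escalation_decision", {})
    level = decision.get("autonomy_level")

    if level == 2:
        # (a) Direct human approval; "approved_by" is a hypothetical field.
        approver = decision.get("approved_by")
        if approver:
            return f"human approval by {approver}"
    elif level == 4:
        # (b) A named, versioned policy that humans reviewed and approved.
        version = decision.get("policy_version")
        authority = decision.get("policy_authority")
        if version and authority:
            return f"policy {version} ({authority})"

    # Anything else is the "the AI decided" case the policy rules out.
    raise ValueError(f"no accountability chain for autonomy level {level!r}")
```

Raising instead of logging a warning is deliberate in this sketch: an action that cannot name its authorising human or policy should fail the pipeline rather than appear in the audit trail as if it had been authorised.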
Splunk HEC Forwarder: HolmesGPT Decision Audit Trail
Every recommendation, escalation decision, and outcome → Splunk
This record is the accountability chain in documentary form
Splunk event structure (sourcetype: sre:holmesgpt:decisions); operator_satisfaction is populated by post-incident feedback:

    {
      "timestamp": "2025-04-15T14:23:07Z",
      "incident_id": "INC-20250415-0047",
      "alert_name": "KubePodOOMKilled",
      "service": "payments-api",
      "namespace": "production",
      "investigation": {
        "model_used": "holmesgpt-routine",
        "model_backend": "ollama/llama3.1:8b",
        "confidence_score": 0.91,
        "diagnosis": "Memory limit 2Gi exceeded by 847MB under high load...",
        "recommended_action": "rolling_restart_stateless",
        "blast_radius_assessment": {
          "services_affected": 1,
          "replica_fraction": 0.15,
          "reversible": true,
          "regulated_asset": false
        }
      },
      "escalation_decision": {
        "autonomy_level": 4,
        "policy_version": "v1.3",
        "triggers_evaluated": ["confidence", "blast_radius", "novelty", "regulatory"],
        "triggers_fired": [],
        "decision": "AUTONOMOUS_EXECUTE",
        "policy_authority": "holmesgpt-escalation-policy v1.3 (approved: sre-lead)"
      },
      "execution": {
        "action_taken": "rolling_restart_stateless",
        "execution_start": "2025-04-15T14:23:09Z",
        "verification_result": "HEALTHY",
        "mttr_seconds": 67,
        "operator_override": false
      },
      "quality_signals": {
        "prediction_matched_outcome": true,
        "error_budget_consumed_pct": 0.002,
        "operator_satisfaction": null
      }
    }

The policy_authority field in the escalation_decision block is the accountability chain closure. It names the specific policy document version and its human approvers. When an auditor asks who authorised the autonomous action, the answer is not "the AI decided". It is "the SRE Lead and VP Engineering approved escalation policy v1.3 on 2025-03-15, and this action fell within the boundaries of Section 1 of that policy."

A confidence score of 0.85 from a language model does not intrinsically mean that the recommendation is correct 85% of the time.
Language models are often poorly calibrated: they can express high confidence in incorrect outputs and low confidence in correct ones. The confidence threshold in the escalation policy must be calibrated against the AI system's actual historical accuracy, not against the model's self-reported certainty.

    -- Splunk SPL: Confidence Calibration Assessment
    -- Compares model-reported confidence bands against actual outcome accuracy
    -- Run monthly; output informs confidence threshold calibration in policy
    -- Assumes JSON fields are auto-extracted with dotted names and that
    -- booleans arrive as the strings "true" / "false"
    index=sre_holmesgpt sourcetype="sre:holmesgpt:decisions"
    | eval confidence_score = 'investigation.confidence_score',
           model_used = 'investigation.model_used',
           prediction_matched_outcome = if('quality_signals.prediction_matched_outcome'=="true", 1, 0),
           operator_override = if('execution.operator_override'=="true", 1, 0)
    | eval confidence_band = case(
        confidence_score >= 0.90, "90-100%",
        confidence_score >= 0.85, "85-89%",
        confidence_score >= 0.80, "80-84%",
        confidence_score >= 0.70, "70-79%",
        confidence_score >= 0.60, "60-69%",
        true(), "<60%")
    | stats count as total_recommendations,
            sum(prediction_matched_outcome) as correct_predictions,
            avg(prediction_matched_outcome) as empirical_accuracy,
            sum(operator_override) as operator_overrides
            by confidence_band, model_used
    | eval calibration_delta = empirical_accuracy - (tonumber(substr(confidence_band,1,2))/100),
           calibration_status = if(abs(calibration_delta) < 0.10, "CALIBRATED", "MISCALIBRATED")
    | table confidence_band, model_used, total_recommendations, empirical_accuracy,
            calibration_delta, calibration_status, operator_overrides
    | sort confidence_band
    -- If empirical accuracy at "85-89%" band is actually 0.71:
    -- The 0.85 autonomous threshold is accepting actions that are only
    -- correct 71% of the time. Raise threshold or re-evaluate model.

The Confidence Theatre antipattern → Using model-reported confidence scores as the primary autonomous execution gate without calibration against empirical outcome accuracy. A model that reports 0.92 confidence but is empirically correct 68% of the time is a dangerous basis for autonomous action. Calibration against historical outcomes must precede the deployment of any confidence-based gate.
The Policy-as-Default antipattern → Deploying the AI system with permissive defaults and planning to tighten the escalation policy "after we see how it performs in production." The escalation policy must be the first artefact produced, not a retroactive constraint on a system that is already taking autonomous actions. Permissive defaults in AI operations systems are not starting points; they are incident preconditions.

The Accountability Diffusion antipattern → Designing the system so that no single person is clearly accountable for an autonomous AI action. "The AI did it" is not an accountability chain. "The escalation policy approved by [names] on [date] authorised this class of action" is. In regulated environments, the inability to name a responsible human for a production change is itself a compliance finding.

The Alert Fatigue Transfer antipattern → Moving from a system that generates too many monitoring alerts to a system that generates too many AI recommendations. If HolmesGPT surfaces seven recommendations per incident, operators will start ignoring them at the same rate they ignore high-volume monitoring alerts. Recommendation volume should be governed by the same principles as alert volume: every recommendation must be actionable, and the threshold for surfacing should be higher than the threshold for suppressing.

The Permanent Level 4 antipattern → Classifying an autonomous action as Level 4 and never re-qualifying it. The re-qualification cadence is the mechanism that prevents a well-calibrated autonomous action from silently becoming a dangerous one as the system evolves. Every Level 4 action must carry a sre.internal/sot-next-review equivalent annotation and a Kyverno policy that generates a ticket when the date passes.

────────────────────────────────────────────────────────────────────────────
STAGE        AI-OPS ESCALATION STATE        NORTH STAR SIGNAL
────────────────────────────────────────────────────────────────────────────
Reactive     No AI-assisted operations.     All investigation is
             Operators work from raw        manual. MTTR limited
             telemetry only.                by human availability.

Defined      HolmesGPT deployed at          AI operating at Level
             Level 1 only. Escalation       1–2 only. Context
             policy drafted but not         surfacing measurably
             yet governing autonomous       reduces investigation
             action.                        time.

Measured     Escalation policy governs      Recommendation quality
             Level 3–4 boundaries.          metrics tracked. Confidence
             Audit trail in Splunk.         calibration assessed
             Quality metrics active.        monthly. Override rate
                                            below 15%.

Optimised    Confidence calibration         Level 4 actions cover
             cycle running quarterly.       top-5 toil remediations.
             Model routing by blast         MTTR for covered patterns
             radius operational.            < 5 minutes (automated).
             NIST AI RMF aligned.           Audit trail satisfies
                                            regulatory review.

Generative   Escalation policy published    Policy cited in industry
             as reference architecture.     guidance. Recommendation
             Feedback loop feeds            quality above 85%.
             prompt engineering cycle.      AI-ops layer itself
             AI-ops treated as a            has SLO and error budget.
             production service.
────────────────────────────────────────────────────────────────────────────

Draft your escalation policy document before configuring any autonomous action in HolmesGPT. Start with the accountability chain section: who owns the policy, who approves autonomous action boundaries, and what the change record looks like. A policy document that exists on paper but has not been approved by SRE leadership and VP Engineering is not a governance artefact; it is a draft. The approval is the governance act.

Run the Splunk confidence calibration query against your last 90 days of HolmesGPT decisions. If you do not yet have 90 days of data, start collecting it now at Level 1 only. Calibration data must precede autonomous execution boundaries. The calibration query is the empirical basis for your confidence thresholds; thresholds chosen without it are guesses with operational consequences.
Map every existing automated remediation to an autonomy level and a blast radius assessment. For each automation in your Class 1 (Reactive Remediation) category from the automation taxonomy post, assess: what is its blast radius under worst-case conditions, and what confidence mechanism governs when it executes? Automations with no explicit blast radius boundary and no confidence mechanism are operating at implicit Level 4 without a policy. Make the policy explicit before the next incident.

Configure the recommendation quality Prometheus rules and set a 30-day baseline. Even if you are operating at Level 1 only, begin measuring acceptance rate and false positive rate now. The first meaningful governance conversation about elevating to Level 3 or Level 4 should be anchored in empirical quality data, not in enthusiasm about the capability.

Add the four escalation triggers as literal fields to your HolmesGPT Splunk audit events. Every decision event should record: confidence_trigger_fired (true/false), blast_radius_trigger_fired (true/false), novelty_trigger_fired (true/false), regulatory_trigger_fired (true/false). Over time, this data reveals which triggers are governing your escalation decisions most frequently, and which failure modes your autonomous boundary is most exposed to.

"The risk in AI-assisted SRE is not that the automation will fail to act. The risk is that it will act confidently, at scale, on a pattern it has only partially understood, and that the human who approved the policy that authorised the action will not be reachable, will not remember what the policy said, or will not realise the policy applied to this situation. The escalation policy is not a constraint on AI capability. It is the engineering discipline that makes AI capability safe to deploy in systems where the cost of being confidently wrong is borne by users, not by the model."

The escalation policy governs how AI recommendations become actions.
The harder engineering problem is the quality of the recommendations themselves — specifically, how to evaluate LLM reliability for incident diagnosis with the same rigour that SRE applies to any other production dependency. The next post examines what it means to apply an SLO framework to an AI system: defining SLIs for recommendation accuracy, precision, and recall; setting error budgets for the AI-ops layer; and designing the automated quality gates that prevent a degrading LLM backend from silently undermining the operational decisions that depend on it.