99.9% Uptime Isn’t Enough: Rethinking SLOs for Probabilistic AI Systems

Your service is responding. Your users are furious. The monitor is green. Something is fundamentally broken about how we measure reliability when the output is the product.

We inherited our reliability playbook from a world where correct meant deterministic. It doesn’t anymore.

Somewhere right now, an on-call engineer is staring at a fully green dashboard while a Slack channel fills with user complaints. The LLM-powered feature is responding in 340ms. Availability is 99.97%. Error rate is 0.2%. Every SLO is met. And the product is quietly producing wrong, incoherent, or harmful output at scale.

This is the reliability gap of the AI era — and it’s not a monitoring gap. It’s a conceptual one. The SLO framework we use was designed for systems where failure is binary: a request either succeeds or it doesn’t. AI systems break that contract entirely. A response can be fast, successful, and completely useless at the same time.

Traditional SLOs rest on three pillars: availability (is the system up?), latency (is it fast?), and error rate (is it returning valid responses?). These emerged from web services, databases, and APIs — systems where the correctness of a response is largely structural. A 200 OK with a valid JSON body is a successful request. Full stop.

AI systems invalidate this at the application layer. A language model returning a 200 with syntactically valid JSON can simultaneously be hallucinating facts, ignoring instructions, producing biased outputs, leaking prior conversation context, or generating content that violates your usage policies. None of that registers in your existing SLO framework. None of it pages anyone.

Before redesigning SLOs, engineers need a taxonomy of AI-specific failures. They fall into categories that don’t map cleanly to HTTP status codes:

Notice that none of these are detectable by latency percentiles, error budgets, or uptime calculations. They are quality failures in a system that has no existing quality SLO.

What would it actually look like to define SLOs for probabilistic systems? The honest answer is that we need new metrics that accept output quality as a first-class engineering concern — not a product concern, not a QA concern, but something that wakes people up at 3am.

These aren’t hypothetical. Teams building production AI systems are starting to define commitments like these internally — even if they aren’t called SLOs yet. The challenge is measurement. Unlike latency, you can’t instrument quality with a stopwatch.

This is where most teams get stuck. They accept that quality matters, but reject quality SLOs because they seem unmeasurable at production scale. There are three practical approaches, each with real tradeoffs:

LLM-as-judge sampling. Run a lightweight judge model against a random sample of production traffic, typically 1–5%, sized so the measured failure rate is statistically meaningful. The judge evaluates responses against a rubric: Did the response follow the format? Is it factually grounded? Does it violate any policy? This gives you a continuous quality signal at manageable cost. The catch: the judge can be wrong too, and its own drift needs monitoring.
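A minimal sketch of the sampling side, under stated assumptions: hashing the request ID is one way (not the only way) to pick a stable slice of traffic, and `Verdict`, `Judge`, and `judged_failure_rate` are hypothetical names. The judge itself is passed in as a plain callable, so no particular model API is implied.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Tuple

SAMPLE_RATE = 0.02  # assumed: 2%, inside the 1-5% range mentioned above


def in_sample(request_id: str, rate: float = SAMPLE_RATE) -> bool:
    """Deterministically select roughly `rate` of requests by hashing the ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


@dataclass
class Verdict:
    """One rubric result per response (hypothetical shape)."""
    follows_format: bool
    grounded: bool
    policy_ok: bool

    @property
    def passed(self) -> bool:
        return self.follows_format and self.grounded and self.policy_ok


# Hypothetical judge signature: (prompt, response) -> Verdict.
Judge = Callable[[str, str], Verdict]


def judged_failure_rate(
    records: Iterable[Tuple[str, str, str]],
    judge: Judge,
    rate: float = SAMPLE_RATE,
) -> Tuple[int, Optional[float]]:
    """records yields (request_id, prompt, response) tuples.

    Returns (sampled_count, failure_rate); the rate is None if nothing was sampled.
    """
    sampled = failed = 0
    for request_id, prompt, response in records:
        if not in_sample(request_id, rate):
            continue
        sampled += 1
        if not judge(prompt, response).passed:
            failed += 1
    return sampled, (failed / sampled if sampled else None)
```

Because selection depends only on the request ID, the same request is always in or out of the sample, which makes a judged failure easy to reproduce later.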

Behavioral canaries. Maintain a golden set of input-output pairs with known expected behavior. Run these against your live system on a schedule — every deploy, every hour, every model provider update. When a canary fails, you have a concrete regression signal. This is the closest analog to a unit test that AI systems have. It doesn’t cover the full distribution, but it catches regressions reliably.
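A canary runner along these lines might look like the sketch below. It checks behavior with a predicate rather than an exact string match, on the assumption that model output varies between runs. The `Canary` type, the two example entries, and the `model` callable are all hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Canary:
    name: str
    prompt: str
    check: Callable[[str], bool]  # True when the output shows the expected behavior


def _has_status_key(output: str) -> bool:
    parsed = json.loads(output)  # raises on invalid JSON; the runner counts that as a failure
    return isinstance(parsed, dict) and "status" in parsed


# Illustrative golden set; real entries would come from your own product behavior.
GOLDEN_SET = [
    Canary("returns_json",
           'Reply with a JSON object containing the key "status".',
           _has_status_key),
    Canary("answers_arithmetic",
           "What is 2 + 2? Reply with the number only.",
           lambda out: out.strip() == "4"),
]


def run_canaries(model: Callable[[str], str], canaries: List[Canary]) -> List[str]:
    """Run every canary against the model and return the names of the failures."""
    failures = []
    for canary in canaries:
        try:
            ok = canary.check(model(canary.prompt))
        except Exception:
            ok = False  # a crash in the model call or in the check is a failure
        if not ok:
            failures.append(canary.name)
    return failures
```

A scheduler or deploy hook would call `run_canaries` and treat any non-empty result as a regression signal.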

User signal instrumentation. Implicit signals — regeneration requests, session abandonment, thumbs-down clicks, downstream action completion — are weak proxies for quality. Individually they’re noisy. As a composite metric smoothed over rolling windows, they become the most honest signal you have: real users voting with their behavior. The problem is lag. By the time user signals degrade, you’ve already shipped bad responses at scale.
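One way to turn those signals into a composite is sketched below. The weights are illustrative assumptions, not measured values, and downstream action completion is left out to keep the example short. `SessionSignals` and `RollingSignal` are hypothetical names.

```python
from collections import deque
from dataclasses import dataclass

# Assumed weights: illustrative only, to be tuned against known incidents.
WEIGHTS = {"regenerated": 0.3, "abandoned": 0.3, "thumbs_down": 0.4}


@dataclass
class SessionSignals:
    regenerated: bool
    abandoned: bool
    thumbs_down: bool


def dissatisfaction(signals: SessionSignals) -> float:
    """Weighted score in [0, 1]; higher means more signs of a bad response."""
    return sum(w for name, w in WEIGHTS.items() if getattr(signals, name))


class RollingSignal:
    """Mean dissatisfaction over the most recent `window` sessions."""

    def __init__(self, window: int = 1000):
        self.scores = deque(maxlen=window)  # oldest score drops off automatically

    def add(self, signals: SessionSignals) -> float:
        self.scores.append(dissatisfaction(signals))
        return sum(self.scores) / len(self.scores)
```

Smoothing over a window trades noise for lag: a larger `window` gives a steadier number but reacts more slowly, which is the same lag problem described above.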

The goal isn’t to throw away traditional SLOs — latency and availability still matter. It’s to layer quality SLOs on top, with their own error budgets and burn rate alerts. Here’s what that translation looks like in practice:

Google’s SRE model introduced the error budget — a quantified allowance for unreliability that teams spend against when shipping. If your error budget is 0.1% downtime per month, every outage burns from that budget. When it’s gone, you stop shipping features and focus on reliability.
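To make the arithmetic concrete, here is a small sketch. It assumes a 30-day month, and the function names are hypothetical.

```python
from typing import Iterable

MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(budget_fraction: float, days: int = 30) -> float:
    """Total allowed downtime in minutes for a window of `days` days."""
    return budget_fraction * days * MINUTES_PER_DAY


def remaining_budget_minutes(
    budget_fraction: float,
    outage_minutes: Iterable[float],
    days: int = 30,
) -> float:
    """Budget left after the listed outages; negative means it is overspent."""
    return downtime_budget_minutes(budget_fraction, days) - sum(outage_minutes)
```

With a 0.1% budget, `downtime_budget_minutes(0.001)` comes to about 43.2 minutes per 30-day month, so a single 30-minute outage leaves roughly 13 minutes.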

AI systems need a quality budget. The insight is the same: you are allowed to be imperfect, but imperfection is a finite, tracked resource. A team might define a quality budget as: “≤ 4% of responses may fail quality checks per rolling 7-day window.” When you’re burning that budget at 3x the baseline rate, that’s a quality incident. It pages someone. It blocks deploys.
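A burn-rate check for that example budget could be sketched as follows. One assumption to flag: "baseline rate" is read here as the budgeted failure rate (4%), so a burn rate of 1.0 uses up the budget exactly at the end of the window. The function names are hypothetical.

```python
import math

BUDGET_FAILURE_RATE = 0.04  # example budget above: 4% of responses per 7-day window
PAGE_BURN_RATE = 3.0        # example threshold above: page at 3x
WINDOW_HOURS = 7 * 24


def burn_rate(failed: int, total: int, budget: float = BUDGET_FAILURE_RATE) -> float:
    """Observed failure rate divided by the budgeted failure rate."""
    if total == 0:
        return 0.0
    return (failed / total) / budget


def is_quality_incident(failed: int, total: int) -> bool:
    """True when recent judged traffic burns budget at or above the paging threshold."""
    return burn_rate(failed, total) >= PAGE_BURN_RATE


def hours_until_exhausted(rate: float, window_hours: float = WINDOW_HOURS) -> float:
    """Hours a full budget lasts if the current burn rate holds."""
    return math.inf if rate <= 0 else window_hours / rate
```

At a sustained 3x burn, a full 7-day budget is gone in 56 hours, which is why that threshold is a reasonable point to page someone and block deploys.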

In a traditional system, an incident has a clear cause: a bad deploy, a saturated database, a misconfigured load balancer. There’s a stack trace. There’s a blast radius. There’s a rollback button.

An AI quality incident is structurally different. The “failure” is statistical: not every request is bad, but a meaningful percentage is. The cause might be a model provider update you didn’t control, a prompt change that introduced an edge case at the 95th percentile, a new user behavior pattern your evals never covered, or genuine model regression from an upstream weight update.

This means your runbooks need new sections. What does a quality incident even look like in your alerting system? What’s the first diagnostic step? How do you quantify blast radius when the failure is probabilistic? Who owns it — the ML team, the platform team, the product team? These questions sound organizational, but they’re really engineering problems that need engineering answers baked into your system design before the incident happens.
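For the blast-radius question, one possible approach is to put a confidence interval on the failure rate seen in a judged sample and scale it to total traffic. The sketch below uses the Wilson score interval, which is a choice made here rather than anything the text prescribes. It assumes the sample is random and the judge is reasonably accurate.

```python
import math
from typing import Tuple


def wilson_interval(failures: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a failure proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("need at least one sampled response")
    p = failures / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def estimated_bad_responses(
    failures: int, n: int, total_requests: int, z: float = 1.96
) -> Tuple[int, int]:
    """Scale the interval to total traffic: (low, high) estimate of bad responses."""
    low, high = wilson_interval(failures, n, z)
    return round(low * total_requests), round(high * total_requests)
```

Reporting a range instead of a single count keeps the incident summary honest about how small the judged sample was.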

Probabilistic systems can’t promise deterministic outcomes. That’s not a solvable problem — it’s a fundamental property of the technology. The goal of AI SLOs isn’t to eliminate uncertainty. It’s to characterize it, track it, budget for it, and build the organizational muscles to respond when it degrades faster than expected.

The teams shipping reliable AI products in 2026 aren’t the ones that got lucky with a stable model. They’re the ones that treated output quality as an engineering discipline with the same rigor they’d apply to database latency or API availability. They built evals before incidents, not after. They defined what “degraded” means before they needed to explain it to a VP at midnight.

Your 99.9% uptime SLO tells your users the service will respond. Your quality SLO tells them it will respond with something worth trusting. The first one is table stakes. The second one is the actual product promise — and right now, most teams are making it without any way to keep it.

99.9% Uptime Isn’t Enough: Rethinking SLOs for Probabilistic AI Systems was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

source

pub.towardsai.net (original article)
