Detecting API anomalies behind a 200 OK — with statistics, not AI

A developer built an anomaly detection system for API endpoints that return 200 OK but are actually broken, using per-endpoint rolling baselines and three-sigma thresholds instead of machine learning. The system tracks response size and time, flags anomalies when values exceed a combined threshold of statistical and practical significance, and requires two consecutive anomalous checks before alerting. An LLM is used only to generate human-readable explanations after detection, not to decide what is anomalous.

Most uptime monitors answer one question: is it up or down? But some of the worst incidents I've dealt with returned a perfectly happy 200 OK:

- an endpoint that started serving a cached error page
- a JSON API returning {"error": ...} with status 200
- a response that quietly got 10× slower
- a payload that dropped from 14 KB to 800 bytes because a backend started returning empty results

A plain up/down check sails straight past all of these. I wanted my monitor to notice "it's up, but it's wrong." Here's how I built that, and why I deliberately didn't reach for machine learning or the word "AI".

THE TEMPTATION, AND WHY I SKIPPED IT

The buzzword move is "AI-powered anomaly detection." But for per-endpoint metrics, ML is mostly overkill: you need training data, the model is opaque, and it's hard to explain why something fired. Plain statistics are simpler, cheaper, deterministic, and (importantly) explainable. So that's what I used.

ONE BASELINE PER ENDPOINT

The key decision: every endpoint is its own baseline. A CDN-cached 2 KB JSON response and a 500 KB HTML page have nothing in common, so a global threshold is meaningless. I track two signals per endpoint:

- response size (bytes)
- response time (ms)

For each, I keep a rolling baseline and ask: is the latest value weird for this endpoint?

THE MATH: ROLLING MEAN, STD, AND 3Σ

Standard stuff: flag a value when it's more than three standard deviations from the mean:

    |value − mean| > 3 · σ

The trick is computing it cheaply. I don't want to load an endpoint's entire history on every check. Mean and variance only need three running aggregates (count, sum, and sum of squares), which is a single SQL query:

    SELECT COUNT(*) AS n, SUM(value) AS s, SUM(value * value) AS q
    FROM signal
    WHERE endpoint_id = ? AND created_at >= ?; -- rolling window

Then:

    mean = s / n
    variance = max(0.0, q / n - mean * mean)
    std = variance ** 0.5

No history transfer, no model, just three numbers.

THE GUARDRAILS (WHERE MOST OF THE REAL WORK IS)

Raw 3σ is noisy.
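To make the aggregate arithmetic concrete, here is a minimal sketch that recomputes the mean and standard deviation from the three running aggregates, then applies an unguarded 3σ rule to a very stable endpoint. The sample values and the `baseline_from_aggregates` helper are made up for illustration; they are not taken from PingMon.

```python
import math


def baseline_from_aggregates(n, s, q):
    """Return (mean, std) from count, sum, and sum of squares."""
    mean = s / n
    # Clamp at zero: floating-point cancellation can push the variance slightly negative.
    variance = max(0.0, q / n - mean * mean)
    return mean, math.sqrt(variance)


# Hypothetical response sizes (bytes) for a very stable endpoint.
samples = [14000, 14010, 13990, 14005, 13995, 14000, 14010, 13990]

# The same three numbers the SQL query would return.
n = len(samples)
s = sum(samples)
q = sum(v * v for v in samples)

mean, std = baseline_from_aggregates(n, s, q)

latest = 14040  # only about 0.3% larger than usual
raw_flag = abs(latest - mean) > 3 * std
print(f"mean={mean:.1f} std={std:.2f} raw_flag={raw_flag}")
# prints: mean=14000.0 std=7.50 raw_flag=True
```

With a spread of only 7.5 bytes, a 40-byte wobble on a 14 KB payload trips the raw rule, even though nobody would call that an incident.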
The interesting part is stopping false positives:

    threshold = max(
        3 * std,           # statistical
        rel_floor * mean,  # e.g. +50% for size
        abs_floor,         # e.g. +500 ms, an absolute minimum
    )

A change has to be statistically and practically significant.

    def is_anomalous(value, mean, std, *, rel_floor, abs_floor, both_directions):
        threshold = max(3 * std, rel_floor * mean, abs_floor)
        delta = value - mean
        flagged = abs(delta) > threshold if both_directions else delta > threshold
        return flagged, "up" if delta > 0 else "down"

No flapping. One weird check isn't an incident. I require two anomalous checks in a row, in the same direction, before alerting.

Warm-up guard. Below a minimum sample count (I use ~50), I don't alert at all: there's no trustworthy baseline yet.

Together these turn a noisy 3σ trigger into something that only fires when an endpoint genuinely behaves unlike itself.

SO WHERE DOES THE "AI" COME IN?

Here's the line I care about: detection is statistics; AI only explains. Once the math flags something, I hand the numbers to a small LLM call to turn this:

    payload size dropped from ~14 KB to ~800 B (−94%), 2 checks in a row

into this:

    "This endpoint is likely returning an error or empty payload instead of its usual response — the body shrank by ~94% while still answering 200 OK."

The model writes the human sentence. It does not decide what's anomalous. I refuse to market a 3σ threshold as machine learning. Detection = math, explanation = language. Calling the whole thing "AI anomaly detection" would be a lie about which part is which.

TAKEAWAY

You don't need ML to catch "it's up but it's broken." A per-endpoint rolling baseline, a max(3σ, relative floor, absolute floor) threshold, a direction rule, and a two-in-a-row guard get you surprisingly far, and every alert stays fully explainable, which beats a black box when you're staring at it at 3 a.m.

This runs in PingMon (pingmon.de), the uptime monitor I'm building, but the technique is general: you can bolt it onto anything with a metric history.
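As a rough sketch of how the pieces could be bolted onto an existing metric history, the class below combines the three-way threshold, the warm-up guard, and the two-in-a-row rule. Every name here (`EndpointGuard`, `observe`, the default floors) is hypothetical, and it takes the baseline as arguments instead of querying a database. It also always checks both directions. It is one possible arrangement, not PingMon's actual code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EndpointGuard:
    """Alert only after consecutive same-direction anomalies on a warmed-up baseline."""

    rel_floor: float = 0.5     # relative floor: 50% of the mean (assumed default)
    abs_floor: float = 500.0   # absolute floor in the metric's unit (assumed default)
    min_samples: int = 50      # warm-up guard
    required_streak: int = 2   # "two in a row"
    _streak: int = 0
    _direction: Optional[str] = None

    def observe(self, value: float, mean: float, std: float, n: int) -> bool:
        """Return True when this check should raise an alert."""
        if n < self.min_samples:
            # No trustworthy baseline yet: never alert, and keep no streak.
            self._streak, self._direction = 0, None
            return False

        threshold = max(3 * std, self.rel_floor * mean, self.abs_floor)
        delta = value - mean
        if abs(delta) <= threshold:
            # A normal check breaks the streak.
            self._streak, self._direction = 0, None
            return False

        direction = "up" if delta > 0 else "down"
        if direction == self._direction:
            self._streak += 1
        else:
            # First anomaly, or the direction flipped: restart the streak.
            self._streak, self._direction = 1, direction
        return self._streak >= self.required_streak


guard = EndpointGuard()
# Payload size in bytes against a baseline of mean=14000, std=300, n=200 samples.
checks = [14100, 800, 900, 14050]
alerts = [guard.observe(v, mean=14000, std=300, n=200) for v in checks]
print(alerts)  # [False, False, True, False]
```

The first collapsed payload only starts a streak; the second one in the same direction raises the alert, and the return to normal resets it. As written, the guard keeps returning True for as long as the anomaly persists, so deduplicating repeat alerts is left to the caller.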
Happy to go deeper on the windowing or the per-tier cost controls in the comments. — Dario