{"slug": "detecting-api-anomalies-behind-a-200-ok-with-statistics-not-ai", "title": "Detecting API anomalies behind a 200 OK — with statistics, not AI", "summary": "A developer built an anomaly detection system for API endpoints that return 200 OK but are actually broken, using per-endpoint rolling baselines and three-sigma thresholds instead of machine learning. The system tracks response size and time, flags anomalies when values exceed a combined threshold of statistical and practical significance, and requires two consecutive anomalous checks before alerting. An LLM is used only to generate human-readable explanations after detection, not to decide what is anomalous.", "body_md": "Most uptime monitors answer one question: is it up or down? But some of the worst incidents I've dealt with returned a perfectly happy 200 OK:\n\nan endpoint that started serving a cached error page\n\na JSON API returning {\"error\": ...} with status 200\n\na response that quietly got 10× slower\n\na payload that dropped from 14 KB to 800 bytes because a backend started returning empty results\n\nA plain up/down check sails straight past all of these. I wanted my monitor to notice \"it's up, but it's wrong.\" Here's how I built that, and why I deliberately didn't reach for machine learning (or the word \"AI\").\n\nTHE TEMPTATION, AND WHY I SKIPPED IT\n\nThe buzzword move is \"AI-powered anomaly detection.\" But for per-endpoint metrics, ML is mostly overkill: you need training data, the model is opaque, and it's hard to explain why something fired. Plain statistics are simpler, cheaper, deterministic, and, importantly, explainable. So that's what I used.\n\nONE BASELINE PER ENDPOINT\n\nThe key decision: every endpoint gets its own baseline. A CDN-cached 2 KB JSON response and a 500 KB HTML page have nothing in common, so a global threshold is meaningless. 
I track two signals per endpoint:\n\nresponse size (bytes)\n\nresponse time (ms)\n\nFor each, I keep a rolling baseline and ask: is the latest value weird for this endpoint?\n\nTHE MATH: ROLLING MEAN, STD, AND 3σ\n\nStandard stuff: flag a value when it's more than three standard deviations from the mean:\n\n|value − mean| > 3 · σ\n\nThe trick is computing it cheaply. I don't want to load an endpoint's entire history on every check. Mean and variance only need three running aggregates (count, sum, and sum of squares), which fit in a single SQL query:\n\n    SELECT COUNT(*) AS n,\n           SUM(value) AS s,\n           SUM(value * value) AS q\n    FROM signal\n    WHERE endpoint_id = ?\n      AND created_at >= ?;  -- rolling window\n\nThen:\n\n    mean = s / n\n    # clamp at 0: rounding can push the difference slightly negative\n    variance = max(0.0, q / n - mean * mean)\n    std = variance ** 0.5\n\nNo history transfer, no model, just three numbers.\n\nTHE GUARDRAILS (WHERE MOST OF THE REAL WORK IS)\n\nRaw 3σ is noisy. The interesting part is stopping false positives:\n\n    threshold = max(3 * std,           # statistical\n                    rel_floor * mean,  # e.g. +50% for size\n                    abs_floor)         # e.g. +500 ms, an absolute minimum\n\nTaking the max means a change has to be both statistically and practically significant.\n\n    def is_anomalous(value, mean, std, *, rel_floor, abs_floor, both_directions):\n        # the deviation must clear the 3-sigma bar and both practical floors\n        threshold = max(3 * std, rel_floor * mean, abs_floor)\n        delta = value - mean\n        # both_directions=False flags increases only\n        flagged = abs(delta) > threshold if both_directions else delta > threshold\n        return flagged, (\"up\" if delta > 0 else \"down\")\n\nNo flapping. One weird check isn't an incident. I require two anomalous checks in a row, in the same direction, before alerting.\n\nWarm-up guard. 
Below a minimum sample count (I use ~50), I don't alert at all — there's no trustworthy baseline yet.\n\nTogether these turn a noisy 3σ trigger into something that only fires when an endpoint genuinely behaves unlike itself.\n\nSO WHERE DOES THE \"AI\" COME IN?\n\nHere's the line I care about: detection is statistics; AI only explains.\n\nOnce the math flags something, I hand the numbers to a small LLM call to turn this:\n\npayload size dropped from ~14 KB to ~800 B (−94%), 2 checks in a row\n\ninto this:\n\n\"This endpoint is likely returning an error or empty payload instead of its usual response — the body shrank by ~94% while still answering 200 OK.\"\n\nThe model writes the human sentence. It does not decide what's anomalous. I refuse to market a 3σ threshold as machine learning. Detection = math, explanation = language. Calling the whole thing \"AI anomaly detection\" would be a lie about which part is which.\n\nTAKEAWAY\n\nYou don't need ML to catch \"it's up but it's broken.\" A per-endpoint rolling baseline, a max(3σ, relative floor, absolute floor) threshold, a direction rule, and a two-in-a-row guard get you surprisingly far — and every alert stays fully explainable, which beats a black box when you're staring at it at 3 a.m.\n\nThis runs in PingMon (pingmon.de), the uptime monitor I'm building, but the technique is general — you can bolt it onto anything with a metric history. 
Happy to go deeper on the windowing or the per-tier cost controls in the comments.\n\n— Dario", "url": "https://wpnews.pro/news/detecting-api-anomalies-behind-a-200-ok-with-statistics-not-ai", "canonical_source": "https://dev.to/dario_le/detecting-api-anomalies-behind-a-200-ok-with-statistics-not-ai-39lf", "published_at": "2026-06-15 18:01:55+00:00", "updated_at": "2026-06-15 18:06:30.353307+00:00", "lang": "en", "topics": ["developer-tools", "machine-learning", "ai-tools"], "entities": ["PingMon"], "alternates": {"html": "https://wpnews.pro/news/detecting-api-anomalies-behind-a-200-ok-with-statistics-not-ai", "markdown": "https://wpnews.pro/news/detecting-api-anomalies-behind-a-200-ok-with-statistics-not-ai.md", "text": "https://wpnews.pro/news/detecting-api-anomalies-behind-a-200-ok-with-statistics-not-ai.txt", "jsonld": "https://wpnews.pro/news/detecting-api-anomalies-behind-a-200-ok-with-statistics-not-ai.jsonld"}}