Detecting API anomalies behind a 200 OK — with statistics, not AI


[ARTICLE · art-28350] src=dev.to ↗ pub=2026-06-15T18:01Z topic=developer-tools verified=true sentiment=↑ positive


A developer built an anomaly detection system for API endpoints that return 200 OK but are actually broken, using per-endpoint rolling baselines and three-sigma thresholds instead of machine learning. The system tracks response size and time, flags anomalies when values exceed a combined threshold of statistical and practical significance, and requires two consecutive anomalous checks before alerting. An LLM is used only to generate human-readable explanations after detection, not to decide what is anomalous.

4 min read · published Jun 15, 2026

Most uptime monitors answer one question: is it up or down? But some of the worst incidents I've dealt with returned a perfectly happy 200 OK:

an endpoint that started serving a cached error page

a JSON API returning {"error": ...} with status 200

a response that quietly got 10× slower

a payload that dropped from 14 KB to 800 bytes because a backend started returning empty results

A plain up/down check sails straight past all of these. I wanted my monitor to notice "it's up, but it's wrong." Here's how I built that, and why I deliberately didn't reach for machine learning (or the word "AI").

THE TEMPTATION, AND WHY I SKIPPED IT

The buzzword move is "AI-powered anomaly detection." But for per-endpoint metrics, ML is mostly overkill: you need training data, the model is opaque, and it's hard to explain why something fired. Plain statistics are simpler, cheaper, deterministic, and — importantly — explainable. So that's what I used.

ONE BASELINE PER ENDPOINT

The key decision: every endpoint is its own baseline. A CDN-cached 2 KB JSON response and a 500 KB HTML page have nothing in common, so a global threshold is meaningless. I track two signals per endpoint:

response size (bytes)

response time (ms)

For each, I keep a rolling baseline and ask: is the latest value weird for this endpoint?

THE MATH: ROLLING MEAN, STD, AND 3Σ

Standard stuff — flag a value when it's more than three standard deviations from the mean:

|value − mean| > 3 · σ

The trick is computing it cheaply. I don't want to load an endpoint's entire history on every check. Mean and variance only need three running aggregates — count, sum, and sum of squares — which is a single SQL query:

    SELECT COUNT(*) AS n,
           SUM(value) AS s,
           SUM(value * value) AS q
    FROM signal
    WHERE endpoint_id = ?
      AND created_at >= ?; -- rolling window

Then:

    mean = s / n
    variance = max(0.0, q / n - mean * mean)
    std = variance ** 0.5

No history transfer, no model, just three numbers.
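To make the aggregate trick concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `signal` table layout, the integer timestamps, and the `rolling_stats` helper name are assumptions for illustration, not PingMon's actual schema or code.

```python
import sqlite3


def rolling_stats(conn, endpoint_id, since):
    """Return (n, mean, std) for one endpoint from three SQL aggregates."""
    n, s, q = conn.execute(
        "SELECT COUNT(*), SUM(value), SUM(value * value) "
        "FROM signal WHERE endpoint_id = ? AND created_at >= ?",
        (endpoint_id, since),
    ).fetchone()
    if not n:  # empty window: COUNT(*) is 0 and SUM() is NULL
        return 0, None, None
    mean = s / n
    # Clamp at zero: floating-point rounding can push the difference slightly negative.
    variance = max(0.0, q / n - mean * mean)
    return n, mean, variance ** 0.5


# Hypothetical in-memory table with four samples for endpoint 1.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE signal (endpoint_id INTEGER, created_at INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO signal VALUES (?, ?, ?)",
    [(1, t, v) for t, v in enumerate([100.0, 102.0, 98.0, 100.0])],
)
n, mean, std = rolling_stats(conn, 1, 0)
print(n, mean, std)
```

One caveat worth knowing: the sum-of-squares formula can lose precision when the mean is very large relative to the spread, which is part of why the clamp to zero is there.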

THE GUARDRAILS (WHERE MOST OF THE REAL WORK IS)

Raw 3σ is noisy. The interesting part is stopping false positives:

    threshold = max(3 * std,           # statistical
                    rel_floor * mean,  # e.g. +50% for size
                    abs_floor)         # e.g. +500 ms, an absolute minimum

A change has to be statistically and practically significant.

    def is_anomalous(value, mean, std, *, rel_floor, abs_floor, both_directions):
        threshold = max(3 * std, rel_floor * mean, abs_floor)
        delta = value - mean
        flagged = abs(delta) > threshold if both_directions else delta > threshold
        return flagged, ("up" if delta > 0 else "down")
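A few worked calls show how the floors change the outcome. The function is repeated from above so the snippet runs on its own; the baseline numbers are hypothetical, and only the 50% and 500 ms floors come from the examples in the text.

```python
def is_anomalous(value, mean, std, *, rel_floor, abs_floor, both_directions):
    threshold = max(3 * std, rel_floor * mean, abs_floor)
    delta = value - mean
    flagged = abs(delta) > threshold if both_directions else delta > threshold
    return flagged, ("up" if delta > 0 else "down")


# Hypothetical size baseline: mean 14,000 B, std 300 B, 50% relative floor.
# threshold = max(900, 7000, 0) = 7000 B, checked in both directions.
size = dict(mean=14_000, std=300, rel_floor=0.5, abs_floor=0, both_directions=True)
small_dip = is_anomalous(13_000, **size)  # beyond 3 sigma, but inside the 50% floor
collapse = is_anomalous(800, **size)      # far beyond every floor

# Hypothetical latency baseline: mean 120 ms, std 10 ms, 500 ms absolute floor.
# threshold = max(30, 60, 500) = 500 ms, and only slowdowns count.
latency = dict(mean=120, std=10, rel_floor=0.5, abs_floor=500, both_directions=False)
slightly_slow = is_anomalous(400, **latency)
slow = is_anomalous(700, **latency)
faster = is_anomalous(20, **latency)

print(small_dip, collapse, slightly_slow, slow, faster)
```

The 13,000-byte response is more than three standard deviations from the mean, yet it is not flagged because it stays inside the relative floor. That is the "statistically and practically significant" rule at work.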

No flapping. One weird check isn't an incident. I require two anomalous checks in a row, in the same direction, before alerting.

Warm-up guard. Below a minimum sample count (I use ~50), I don't alert at all — there's no trustworthy baseline yet.

Together these turn a noisy 3σ trigger into something that only fires when an endpoint genuinely behaves unlike itself.
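The two-in-a-row rule and the warm-up guard could be sketched as a small piece of per-endpoint state. The names `StreakState`, `should_alert`, `MIN_SAMPLES`, and `REQUIRED_STREAK` are hypothetical, and resetting the streak during warm-up is my assumption rather than something the article specifies.

```python
from dataclasses import dataclass
from typing import Optional

MIN_SAMPLES = 50     # warm-up guard; the article uses roughly 50
REQUIRED_STREAK = 2  # consecutive anomalous checks, same direction


@dataclass
class StreakState:
    direction: Optional[str] = None  # "up", "down", or None when no streak is running
    count: int = 0


def should_alert(state, n_samples, flagged, direction):
    """Update the streak for one check and report whether to alert."""
    if n_samples < MIN_SAMPLES or not flagged:
        # No trustworthy baseline yet, or a normal check: the streak is broken.
        state.direction, state.count = None, 0
        return False
    if direction == state.direction:
        state.count += 1
    else:
        # First anomalous check, or the direction flipped: start a new streak.
        state.direction, state.count = direction, 1
    return state.count >= REQUIRED_STREAK


state = StreakState()
first = should_alert(state, 100, True, "down")    # one weird check: no alert
second = should_alert(state, 100, True, "down")   # two in a row: alert
flipped = should_alert(state, 100, True, "up")    # direction changed: streak restarts
warming = should_alert(StreakState(), 10, True, "down")  # below the sample minimum
print(first, second, flipped, warming)
```

As written, this keeps returning True for every further anomalous check in the same direction. Comparing with `==` instead of `>=` would fire once per streak, if that suits the alerting channel better.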

SO WHERE DOES THE "AI" COME IN?

Here's the line I care about: detection is statistics; AI only explains.

Once the math flags something, I hand the numbers to a small LLM call to turn this:

payload size dropped from ~14 KB to ~800 B (−94%), 2 checks in a row

into this:

"This endpoint is likely returning an error or empty payload instead of its usual response — the body shrank by ~94% while still answering 200 OK."

The model writes the human sentence. It does not decide what's anomalous. I refuse to market a 3σ threshold as machine learning. Detection = math, explanation = language. Calling the whole thing "AI anomaly detection" would be a lie about which part is which.
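A sketch of the hand-off, under stated assumptions: the detector formats its finding as plain text, and only that text goes into the prompt. The helper names, the prompt wording, and the decimal KB rounding are all hypothetical, and the actual LLM call is left out because the article doesn't say which model or API is used.

```python
def human_bytes(n):
    """Rough size label; uses 1 KB = 1000 B, which is an assumption."""
    return f"~{n / 1000:.0f} KB" if n >= 1000 else f"~{n:.0f} B"


def summarize(label, mean, value, streak):
    """Turn detector output into the one-line finding. Assumes a nonzero mean."""
    pct = (value - mean) / mean * 100
    verb = "dropped" if value < mean else "rose"
    return (
        f"{label} {verb} from {human_bytes(mean)} to {human_bytes(value)} "
        f"({pct:+.0f}%), {streak} checks in a row"
    )


def build_prompt(summary):
    """The model only explains; the anomaly decision was already made by the math."""
    return (
        "Explain in one sentence what this monitoring finding likely means. "
        "Do not judge whether it is an anomaly; that has already been decided.\n"
        f"Finding: {summary}"
    )


summary = summarize("payload size", 14_000, 800, 2)
prompt = build_prompt(summary)
print(summary)
```

Keeping the prompt builder separate from detection means the alert can still go out with the raw numeric line if the LLM call fails or is disabled.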

TAKEAWAY

You don't need ML to catch "it's up but it's broken." A per-endpoint rolling baseline, a max(3σ, relative floor, absolute floor) threshold, a direction rule, and a two-in-a-row guard get you surprisingly far — and every alert stays fully explainable, which beats a black box when you're staring at it at 3 a.m.

This runs in PingMon (pingmon.de), the uptime monitor I'm building, but the technique is general — you can bolt it onto anything with a metric history. Happy to go deeper on the windowing or the per-tier cost controls in the comments.

— Dario


