Most uptime monitors answer one question: is it up or down? But some of the worst incidents I've dealt with returned a perfectly happy 200 OK:
an endpoint that started serving a cached error page
a JSON API returning {"error": ...} with status 200
a response that quietly got 10× slower
a payload that dropped from 14 KB to 800 bytes because a backend started returning empty results
A plain up/down check sails straight past all of these. I wanted my monitor to notice "it's up, but it's wrong." Here's how I built that, and why I deliberately didn't reach for machine learning (or the word "AI").
THE TEMPTATION, AND WHY I SKIPPED IT
The buzzword move is "AI-powered anomaly detection." But for per-endpoint metrics, ML is mostly overkill: you need training data, the model is opaque, and it's hard to explain why something fired. Plain statistics are simpler, cheaper, deterministic, and — importantly — explainable. So that's what I used.
ONE BASELINE PER ENDPOINT
The key decision: every endpoint is its own baseline. A CDN-cached 2 KB JSON response and a 500 KB HTML page have nothing in common, so a global threshold is meaningless. I track two signals per endpoint:
response size (bytes)
response time (ms)
For each, I keep a rolling baseline and ask: is the latest value weird for this endpoint?
THE MATH: ROLLING MEAN, STD, AND 3Σ
Standard stuff — flag a value when it's more than three standard deviations from the mean:
|value − mean| > 3 · σ
The trick is computing it cheaply. I don't want to load an endpoint's entire history on every check. Mean and variance only need three running aggregates — count, sum, and sum of squares — which is a single SQL query:
SELECT COUNT(*) AS n,
       SUM(value) AS s,
       SUM(value * value) AS q
FROM signal
WHERE endpoint_id = ?
  AND created_at >= ?; -- rolling window
Then:
mean = s / n
variance = max(0.0, q / n - mean * mean)
std = variance ** 0.5
No history transfer, no model, just three numbers.
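A minimal sketch of that query wired into Python's built-in sqlite3 module. The table and column names follow the SQL above; the column types, the `baseline` helper, the one-hour window, and the sample values are assumptions for illustration only, not the actual PingMon code.

```python
import sqlite3
import time

def baseline(conn, endpoint_id, window_start):
    """Return (n, mean, std) for one endpoint from three SQL aggregates."""
    n, s, q = conn.execute(
        "SELECT COUNT(*), SUM(value), SUM(value * value) "
        "FROM signal WHERE endpoint_id = ? AND created_at >= ?",
        (endpoint_id, window_start),
    ).fetchone()
    if not n:  # empty window: SUM() is NULL, so there is no baseline yet
        return 0, None, None
    mean = s / n
    # Clamp at zero: floating-point rounding can push the difference
    # slightly negative when the values barely vary.
    variance = max(0.0, q / n - mean * mean)
    return n, mean, variance ** 0.5

# Hypothetical in-memory table with five made-up samples for endpoint 1.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE signal (endpoint_id INTEGER, value REAL, created_at REAL)")
now = time.time()
rows = [(1, v, now) for v in (100.0, 110.0, 90.0, 105.0, 95.0)]
conn.executemany("INSERT INTO signal VALUES (?, ?, ?)", rows)

n, mean, std = baseline(conn, 1, now - 3600)
print(n, mean, round(std, 2))  # 5 100.0 7.07
```

This works because the population variance equals the mean of the squares minus the square of the mean, so the three aggregates carry everything the check needs.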
THE GUARDRAILS (WHERE MOST OF THE REAL WORK IS)
Raw 3σ is noisy. The interesting part is stopping false positives:
threshold = max(3 * std,          # statistical
                rel_floor * mean, # e.g. +50% for size
                abs_floor)        # e.g. +500 ms, an absolute minimum
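A change has to be both statistically and practically significant: because the threshold is the largest of the three terms, a deviation only counts when it clears 3σ, the relative floor, and the absolute floor at once.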
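def is_anomalous(value, mean, std, *, rel_floor, abs_floor, both_directions):
    # The deviation must clear the statistical bar and both practical floors.
    threshold = max(3 * std, rel_floor * mean, abs_floor)
    delta = value - mean
    # With both_directions=False, only increases above the mean are flagged.
    flagged = abs(delta) > threshold if both_directions else delta > threshold
    return flagged, ("up" if delta > 0 else "down")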
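To see why the floors matter, here is a small worked example. The endpoint, its numbers, the floor values, and the `effective_threshold` helper are all made up for illustration; the helper only mirrors the max(...) expression above.

```python
def effective_threshold(mean, std, *, rel_floor, abs_floor):
    """Largest of the statistical bar and the two practical floors."""
    return max(3 * std, rel_floor * mean, abs_floor)

# Hypothetical CDN-cached endpoint: a ~2 KB body that barely varies.
mean, std = 2048.0, 2.0
raw = 3 * std  # 6 bytes: plain 3-sigma is extremely tight here
guarded = effective_threshold(mean, std, rel_floor=0.5, abs_floor=100.0)

value = 2060.0  # only 12 bytes larger than usual
print(abs(value - mean) > raw)      # True: raw 3-sigma would fire
print(abs(value - mean) > guarded)  # False: far below the 50% relative floor
```

A very stable endpoint has a tiny σ, so without the floors a 12-byte wobble would count as an anomaly.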
No flapping. One weird check isn't an incident. I require two anomalous checks in a row, in the same direction, before alerting.
Warm-up guard. Below a minimum sample count (I use ~50), I don't alert at all — there's no trustworthy baseline yet.
Together these turn a noisy 3σ trigger into something that only fires when an endpoint genuinely behaves unlike itself.
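The two-in-a-row rule and the warm-up guard can be sketched as a small piece of per-endpoint state. This is one possible shape, not the actual implementation: the `AlertGate` class, its field names, and the constants are assumptions (the sample count of 50 comes from the text above).

```python
from dataclasses import dataclass
from typing import Optional

MIN_SAMPLES = 50      # warm-up: no alerts below this many samples
REQUIRED_STREAK = 2   # consecutive anomalous checks needed to alert

@dataclass
class AlertGate:
    """Streak state for one signal of one endpoint (hypothetical helper)."""
    direction: Optional[str] = None
    streak: int = 0

    def update(self, flagged: bool, direction: str, n_samples: int) -> bool:
        """Record one check; return True when an alert should fire."""
        if n_samples < MIN_SAMPLES or not flagged:
            # Still warming up, or a normal check: reset the streak.
            self.direction, self.streak = None, 0
            return False
        if direction == self.direction:
            self.streak += 1
        else:
            # First anomaly, or the direction flipped: start a new streak.
            self.direction, self.streak = direction, 1
        return self.streak >= REQUIRED_STREAK

gate = AlertGate()
checks = [(True, "down"), (True, "up"), (True, "up"), (False, "up")]
print([gate.update(f, d, n_samples=120) for f, d in checks])
# [False, False, True, False]
```

As written, the gate keeps returning True on every anomalous check after the second one, so a real monitor would probably add deduplication on top.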
SO WHERE DOES THE "AI" COME IN?
Here's the line I care about: detection is statistics; AI only explains.
Once the math flags something, I hand the numbers to a small LLM call to turn this:
payload size dropped from ~14 KB to ~800 B (−94%), 2 checks in a row
into this:
"This endpoint is likely returning an error or empty payload instead of its usual response — the body shrank by ~94% while still answering 200 OK."
The model writes the human sentence. It does not decide what's anomalous. I refuse to market a 3σ threshold as machine learning. Detection = math, explanation = language. Calling the whole thing "AI anomaly detection" would be a lie about which part is which.
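A rough sketch of that hand-off: the detector's numbers are rendered into a one-line fact, and only that line goes into the prompt. The function names, the KB rounding (1 KB taken as 1000 bytes), and the prompt wording are assumptions; no LLM client is called here, since the post does not say which one is used.

```python
def human_bytes(n: float) -> str:
    """Rough size label, e.g. 14000 -> '~14 KB', 800 -> '~800 B'."""
    return f"~{n / 1000:.0f} KB" if n >= 1000 else f"~{n:.0f} B"

def describe_size_anomaly(mean: float, value: float, streak: int) -> str:
    """Render the detector's numbers as the one-line fact given to the model."""
    pct = (value - mean) / mean * 100
    verb = "dropped" if value < mean else "grew"
    return (f"payload size {verb} from {human_bytes(mean)} to {human_bytes(value)} "
            f"({pct:+.0f}%), {streak} checks in a row")

facts = describe_size_anomaly(mean=14_000, value=800, streak=2)

# Hypothetical prompt: the model is asked to explain, never to detect.
prompt = ("Explain this monitoring anomaly in one plain sentence. "
          "Do not judge whether it is an anomaly; that is already decided.\n"
          + facts)
print(facts)
# payload size dropped from ~14 KB to ~800 B (-94%), 2 checks in a row
```

Keeping the decision outside the prompt means a bad model answer can only produce a clumsy sentence, never a missed or invented alert.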
TAKEAWAY
You don't need ML to catch "it's up but it's broken." A per-endpoint rolling baseline, a max(3σ, relative floor, absolute floor) threshold, a direction rule, and a two-in-a-row guard get you surprisingly far — and every alert stays fully explainable, which beats a black box when you're staring at it at 3 a.m.
This runs in PingMon (pingmon.de), the uptime monitor I'm building, but the technique is general — you can bolt it onto anything with a metric history. Happy to go deeper on the windowing or the per-tier cost controls in the comments.
— Dario