I Built a System that Automatically Rolls Back ML Models Before they Ruin Your Production Data


There’s a specific kind of incident that every ML team eventually lives through.

You ship a new model version. The offline metrics looked great — better accuracy, better F1, everyone signs off. It goes to production. And then, slowly, something is wrong. Predictions drift. Error rates creep up. Maybe latency balloons because the new model is heavier. You don’t find out from a dashboard — you find out from a customer, or a downstream report that looks off, days later.

The root problem: deploying a model is usually a binary, all-or-nothing switch. Old model out, new model in, 100% of traffic, instantly. If the new one is bad, 100% of your users feel it, and rolling back is a frantic manual scramble.

I built a system that makes that switch safe and automatic. New models get only a small slice of live traffic. Their real-world error rate and latency are measured against the current production model continuously. And if the new model is worse, it’s rolled back automatically — before most of your users ever touch it.

Here’s how it works.

Instead of replacing the old model outright, you route a small percentage of live traffic — say 20% — to the new “canary” model, and keep 80% on the proven baseline. You watch both. If the canary holds up, you ramp it to 50%, then 100%, then promote it. If it doesn’t, you roll back. The 80% never noticed.

The name comes from “canary in a coal mine” — a small, expendable early warning. The canary model takes the risk so your whole user base doesn’t have to.

Simple idea. The engineering is in making the split instant, the measurement honest, and the decision automatic.

                       Prediction request
                              │
                  POST /predict/{deployment}
                              │
                     ┌────────┴────────┐
                     │  TrafficRouter  │──reads──▶ Redis (canary_traffic_pct)
                     └────────┬────────┘
                  80%         │         20%
                   ▼                     ▼
            [ v1 Baseline ]        [ v2 Canary ]
                   │                     │
                   └──── Prometheus ─────┘
                     (requests, errors, latency)
                              │
                  HealthChecker (every 30s)
                     ┌──────────┴──────────┐
                 CRITICAL?              HEALTHY?
                     ▼                     ▼
              Auto-rollback        Continue / Auto-promote
                              │
                    DeploymentEvent ─▶ PostgreSQL (audit trail)

FastAPI serves predictions. Redis holds the traffic split. Prometheus records what happened. A background health checker compares the two models and acts. PostgreSQL keeps the full audit trail. MinIO stores the model artifacts.

The single most important design decision: the traffic split is a Redis value, not a config file or an environment variable.

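    import random

    def route(self, deployment_config: dict) -> str:
        # Percentage of traffic (0-100) that should go to the canary.
        canary_pct = float(deployment_config.get("canary_traffic_pct", 0.0))
        # No canary configured, or split set to zero: stay on the baseline.
        if canary_pct <= 0.0 or not deployment_config.get("canary_model_id"):
            return "baseline"
        # Fully ramped: everything goes to the canary.
        if canary_pct >= 100.0:
            return "canary"
        # Otherwise pick the canary with probability canary_pct / 100.
        return "canary" if random.random() < (canary_pct / 100.0) else "baseline"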

This function runs on every prediction request, and the only I/O on that path is a single Redis hash read; the routing decision itself is pure computation. Because the split lives in Redis, changing it from 20% to 50% takes effect immediately, with no restart, no redeploy, and no dropped requests. You click “+10% Traffic” in the dashboard and the very next prediction is routed against the new ratio.
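As a rough sketch of how the split could live in Redis, assume one hash per deployment under a hypothetical key `deployment:{name}`. The key layout and the helper names `load_config` and `set_canary_pct` are my illustration, not taken from the repository. The `InMemoryHash` class only stands in for a Redis client so the example runs without a server; with redis-py you would pass `redis.Redis(decode_responses=True)` instead, whose `hgetall` and `hset` calls have the same shape.

```python
from typing import Dict, Protocol


class HashStore(Protocol):
    """The two Redis hash commands this sketch relies on."""

    def hgetall(self, name: str) -> Dict[str, str]: ...
    def hset(self, name: str, key: str, value: str) -> int: ...


class InMemoryHash:
    """Stand-in for a Redis client (assumption: used for illustration only)."""

    def __init__(self) -> None:
        self._data: Dict[str, Dict[str, str]] = {}

    def hgetall(self, name: str) -> Dict[str, str]:
        # Like Redis, a missing key yields an empty mapping.
        return dict(self._data.get(name, {}))

    def hset(self, name: str, key: str, value: str) -> int:
        fields = self._data.setdefault(name, {})
        is_new = key not in fields
        fields[key] = value
        return int(is_new)


def config_key(deployment: str) -> str:
    # Hypothetical key layout: one hash per deployment.
    return f"deployment:{deployment}"


def load_config(store: HashStore, deployment: str) -> Dict[str, str]:
    """The single hash read done on each prediction request."""
    return store.hgetall(config_key(deployment))


def set_canary_pct(store: HashStore, deployment: str, pct: float) -> float:
    """Write a new split, clamped to the 0-100 range, and return it."""
    clamped = max(0.0, min(100.0, float(pct)))
    store.hset(config_key(deployment), "canary_traffic_pct", str(clamped))
    return clamped


store = InMemoryHash()
store.hset(config_key("heart-prod"), "canary_model_id", "v2")
set_canary_pct(store, "heart-prod", 20)
set_canary_pct(store, "heart-prod", 30)  # e.g. a "+10% Traffic" click
current = load_config(store, "heart-prod")
```

Because every request re-reads the hash, the write in `set_canary_pct` is all it takes for the next request to see the new ratio.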

I verified the distribution is honest with 100,000 samples per setting: a 20% config routes 20.2% to canary, 50% routes 49.9%, 100% routes 100%. A missing or malformed config safely falls back to baseline — the production model is always the safe default.
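A distribution check of this kind can be sketched in a few lines. This is not the project's test suite: the `route` function below is a standalone copy of the routing logic that takes an explicit random generator so the run is repeatable, and the `try/except` is my assumption about one way a malformed value could fall back to baseline.

```python
import random


def route(deployment_config: dict, rng: random.Random) -> str:
    """Standalone copy of the routing rule, with an injectable RNG."""
    try:
        canary_pct = float(deployment_config.get("canary_traffic_pct", 0.0))
    except (TypeError, ValueError):
        # Assumption: a malformed split is treated as "no canary".
        return "baseline"
    if canary_pct <= 0.0 or not deployment_config.get("canary_model_id"):
        return "baseline"
    if canary_pct >= 100.0:
        return "canary"
    return "canary" if rng.random() < (canary_pct / 100.0) else "baseline"


def observed_canary_share(pct: float, samples: int = 100_000, seed: int = 0) -> float:
    """Route `samples` simulated requests and return the canary fraction."""
    rng = random.Random(seed)
    config = {"canary_traffic_pct": pct, "canary_model_id": "v2"}
    hits = sum(route(config, rng) == "canary" for _ in range(samples))
    return hits / samples


for pct in (20, 50, 100):
    print(f"configured {pct}% -> observed {observed_canary_share(pct):.1%}")
```

With 100,000 samples the standard deviation of the observed share at 20% is roughly 0.13 percentage points, so a result within a few tenths of a point of the target is what an unbiased router should produce.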

Offline accuracy is not what kills you in production. What kills you is error rate (the model throwing exceptions on real inputs it never saw in training) and latency (the new model being too slow under real load).

So every prediction records three things to Prometheus, labeled by model role:

    prediction_requests_total{deployment, model_version, model_role, status}   # counter
    prediction_latency_ms{deployment, model_version, model_role}               # histogram
    canary_traffic_pct{deployment}                                             # gauge

The histogram is the interesting one. Prometheus histograms store latency as cumulative buckets, not raw values — so to get a p95 back out, you interpolate across the buckets. That lets the health checker ask “what’s the canary’s p95 latency over the last few minutes?” and compare it directly against the baseline’s.
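To make the interpolation concrete, here is a minimal sketch of estimating a quantile from cumulative buckets. It uses linear interpolation inside the bucket that contains the target rank, which is the same general approach PromQL's `histogram_quantile` takes. The function name and the bucket bounds in the example are made up for illustration; the project's actual bucket layout is not shown in this article.

```python
import math
from typing import List, Tuple

# Each bucket is (upper_bound_ms, cumulative_count), sorted by upper bound,
# with a final +Inf bucket holding the total number of observations.
Bucket = Tuple[float, float]


def quantile_from_buckets(q: float, buckets: List[Bucket]) -> float:
    """Estimate the q-quantile (0 <= q <= 1) from cumulative histogram buckets."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    if not buckets or buckets[-1][1] == 0:
        return math.nan  # no observations in the window

    rank = q * buckets[-1][1]  # how many observations lie at or below the quantile
    prev_bound, prev_count = 0.0, 0.0
    for upper, cumulative in buckets:
        if cumulative >= rank:
            if math.isinf(upper):
                # The open-ended bucket has no upper edge to interpolate to,
                # so report the highest finite bound instead.
                return prev_bound
            in_bucket = cumulative - prev_count
            if in_bucket == 0:
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / in_bucket
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, cumulative
    return prev_bound


example = [(5.0, 10), (10.0, 60), (25.0, 90), (50.0, 99), (math.inf, 100)]
p95 = quantile_from_buckets(0.95, example)
```

In this example the 95th observation falls in the 25 to 50 ms bucket, which holds 9 observations, so the estimate is 25 + 25 × 5/9, about 38.9 ms. The estimate is only as fine-grained as the bucket boundaries, which is worth remembering when the rollback threshold is a 100 ms delta.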

This is the heart of the system. Every 30 seconds, a background task evaluates each running canary against its baseline over a 5-minute window and applies a simple, explicit rule set:

    CRITICAL  (→ auto-rollback):
      canary.error_rate > baseline.error_rate + 5%
      OR canary.p95_latency > baseline.p95_latency + 100ms
    DEGRADED  (→ warn, hold):
      same conditions at half the threshold
    HEALTHY   (→ auto-promote, if ≥100 canary requests and canary error ≤ baseline)
              (→ otherwise continue)

The thresholds are deltas against the baseline, not absolute numbers — because “a 3% error rate” means nothing without knowing the baseline is at 0.2%. What matters is whether the canary is worse than what you already have.
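The rule set is small enough to sketch as one pure function. This is my reading of the rules above, not the repository's code: I assume "+ 5%" means five percentage points, that error rates are expressed as fractions, and the names `WindowStats` and `evaluate` are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class WindowStats:
    """Metrics for one model over the evaluation window."""
    requests: int
    error_rate: float       # fraction, e.g. 0.05 means 5%
    p95_latency_ms: float


ERROR_DELTA = 0.05            # assumption: 5 percentage points
LATENCY_DELTA_MS = 100.0
MIN_PROMOTE_REQUESTS = 100


def evaluate(baseline: WindowStats, canary: WindowStats) -> Tuple[str, str]:
    """Return (status, action) for a canary compared against its baseline."""
    error_delta = canary.error_rate - baseline.error_rate
    latency_delta = canary.p95_latency_ms - baseline.p95_latency_ms

    if error_delta > ERROR_DELTA or latency_delta > LATENCY_DELTA_MS:
        return "CRITICAL", "rollback"
    # DEGRADED uses the same conditions at half the threshold.
    if error_delta > ERROR_DELTA / 2 or latency_delta > LATENCY_DELTA_MS / 2:
        return "DEGRADED", "hold"
    # Promote only with enough traffic and an error rate no worse than baseline.
    if canary.requests >= MIN_PROMOTE_REQUESTS and canary.error_rate <= baseline.error_rate:
        return "HEALTHY", "promote"
    return "HEALTHY", "continue"


baseline = WindowStats(requests=1000, error_rate=0.002, p95_latency_ms=10.0)
slow_canary = WindowStats(requests=200, error_rate=0.002, p95_latency_ms=120.0)
verdict = evaluate(baseline, slow_canary)
```

Keeping the decision free of I/O is what makes it easy to test each path in isolation with controlled metric inputs: the function only sees two small records and returns a verdict.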

If the verdict is CRITICAL and auto-rollback is enabled, the engine rolls the canary back on its own, logs an auto_rollback_triggered event, and the baseline keeps serving. No page, no human, no incident.

I tested all five decision paths in isolation with controlled metric inputs — critical-via-error, critical-via-latency, degraded, healthy-promote, healthy-continue — and each produces exactly the right call.

The end-to-end demo tells the whole story on the UCI Heart Disease dataset, and these are real numbers from an actual run:

Step 1–2 — Train two models. A strong baseline v1 (accuracy 0.852) and an intentionally weak canary v2 (accuracy 0.541 — underfit on purpose).

Step 3–4 — Create the heart-prod deployment with v1 as baseline, then start v2 as a canary at 20% traffic.

Step 5 — Send 200 live prediction requests. They split ~157 to baseline, ~43 to canary — right around the 20% target. A handful of malformed requests hit the canary and cause it to error.

Step 6 — The health check runs:

    Health Status: CRITICAL
    Baseline: error_rate=0.0%,  p95=8.8ms
    Canary:   error_rate=18.9%, p95=8.3ms
    Recommendation: ROLLBACK

Step 7 — The auto-decision engine acts:

    action=rolled_back; deployment status now: stable
    auto-rollback triggered: v2 removed, v1 continues as baseline

The whole arc — canary_started → health_check_failed → rolled_back → auto_rollback_triggered — is captured as an immutable event timeline in PostgreSQL. The canary was caught and removed in under a minute, and the baseline never stopped serving.

A prediction request does a lot, fast: read the Redis split, route to baseline or canary, load the model (cached in memory after the first hit), run inference with timing, record Prometheus metrics, and log the prediction to PostgreSQL — including a SHA-256 hash of the inputs rather than the raw features, so no PII is stored.
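One way to hash the inputs, sketched under the assumption that features arrive as a JSON-serializable dict: serialize them in a canonical form (sorted keys, fixed separators) so the same features always produce the same digest, then take the SHA-256. The function name `hash_features` and the example feature names are hypothetical.

```python
import hashlib
import json
from typing import Any, Dict


def hash_features(features: Dict[str, Any]) -> str:
    """Return a hex SHA-256 digest of the features in canonical JSON form."""
    # Sorted keys and fixed separators make the serialization independent
    # of the order in which the caller happened to build the dict.
    canonical = json.dumps(features, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


digest = hash_features({"age": 63, "chol": 233, "sex": 1})
same_digest = hash_features({"sex": 1, "chol": 233, "age": 63})
```

The digest still lets you spot repeated inputs or join a logged prediction back to a request you hold elsewhere, while the prediction log itself carries only the fixed-length hash.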

The model cache matters. The first request after startup takes ~5 seconds — it’s pulling the serialized model out of MinIO and deserializing it. Every request after that is served from memory in 2–3 milliseconds. The download happens once; the speed is permanent.
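The load-once behaviour can be sketched as a small in-memory cache keyed by model ID. This is an illustration, not the project's implementation: `ModelCache` is a hypothetical name, and the loader passed in stands for whatever downloads and deserializes the artifact from MinIO.

```python
import threading
from typing import Any, Callable, Dict, List


class ModelCache:
    """Load each model at most once, then serve it from memory."""

    def __init__(self, loader: Callable[[str], Any]) -> None:
        self._loader = loader              # slow path: fetch and deserialize
        self._models: Dict[str, Any] = {}
        self._lock = threading.Lock()

    def get(self, model_id: str) -> Any:
        model = self._models.get(model_id)
        if model is not None:
            return model                   # fast path: already in memory
        with self._lock:
            # Re-check under the lock so concurrent first requests
            # do not trigger the slow load twice.
            model = self._models.get(model_id)
            if model is None:
                model = self._loader(model_id)
                self._models[model_id] = model
            return model

    def evict(self, model_id: str) -> None:
        """Drop a model, e.g. after its canary has been rolled back."""
        with self._lock:
            self._models.pop(model_id, None)


load_calls: List[str] = []


def fake_loader(model_id: str) -> Dict[str, str]:
    # Stand-in for the artifact download; records each slow load.
    load_calls.append(model_id)
    return {"model_id": model_id}


cache = ModelCache(fake_loader)
first = cache.get("v1")
second = cache.get("v1")
```

The second `get` returns the same object without calling the loader again, which is the difference between the multi-second first request and the millisecond ones that follow.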

The Streamlit dashboard has four pages, all live over the REST API.

The Deployment History page is the one I’d show a skeptic. It’s the receipt: the system noticed a problem and fixed it, timestamped, with the reason attached.

MLflow is great at tracking experiments and registering model versions. But a registry tells you a model exists and what its offline metrics were. It doesn’t route live traffic, it doesn’t measure real-world error rate, and it doesn’t roll anything back. The registry is the “what.” This system is the “how do I ship it without getting hurt.”

You need both — and crucially, you need the canary and the health checker together. A canary without an automated decision engine is just slow manual testing: you’ve split the traffic, but a human still has to stare at graphs and decide. The automation is what makes it operationally real.

This is the third piece of a connected MLOps platform.

Together: fast feature serving on the front end, full traceability in the middle, and safe automated deployment at the edge. When a canary gets rolled back, you can trace why the model was bad all the way back to the data it learned from.

Everything is open source: github.com/Emart29/ml-canary-deploy

The demo (examples/heart_disease/demo.py) runs the entire story — train, deploy, observe, auto-rollback — against a real Postgres + Redis + MinIO stack in under a minute. There's a 10-command CLI and a 4-page Streamlit dashboard on top of it.

Next: streaming feature pipelines with Kafka — computing and serving features in real time as events arrive, instead of in scheduled batches.

I Built a System that Automatically Rolls Back ML Models Before they Ruin Your Production Data was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

