{"slug": "i-built-a-system-that-automatically-rolls-back-ml-models-before-they-ruin-your", "title": "I Built a System that Automatically Rolls Back ML Models Before they Ruin Your Production Data", "summary": "An ML engineer built a system that automatically rolls back machine learning models before they harm production data by routing a small percentage of live traffic to a canary model, continuously measuring its error rate and latency against the baseline, and triggering an automatic rollback if the canary underperforms. The system uses FastAPI, Redis, Prometheus, and PostgreSQL to enable instant traffic splits and automated decisions, preventing widespread user impact from faulty model deployments.", "body_md": "There’s a specific kind of incident that every ML team eventually lives through.\n\nYou ship a new model version. The offline metrics looked great — better accuracy, better F1, everyone signs off. It goes to production. And then, slowly, something is wrong. Predictions drift. Error rates creep up. Maybe latency balloons because the new model is heavier. You don’t find out from a dashboard — you find out from a customer, or a downstream report that looks off, days later.\n\nThe root problem: **deploying a model is usually a binary, all-or-nothing switch.** Old model out, new model in, 100% of traffic, instantly. If the new one is bad, 100% of your users feel it, and rolling back is a frantic manual scramble.\n\nI built a system that makes that switch safe and automatic. New models get only a small slice of live traffic. Their real-world error rate and latency are measured against the current production model continuously. And if the new model is worse, it’s **rolled back automatically** — before most of your users ever touch it.\n\nHere’s how it works.\n\nInstead of replacing the old model outright, you route a small percentage of live traffic — say 20% — to the new “canary” model, and keep 80% on the proven baseline. You watch both. 
If the canary holds up, you ramp it to 50%, then 100%, then promote it. If it doesn’t, you roll back. The 80% never noticed.\n\nThe name comes from “canary in a coal mine” — a small, expendable early warning. The canary model takes the risk so your whole user base doesn’t have to.\n\nSimple idea. The engineering is in making the split *instant*, the measurement *honest*, and the decision *automatic*.\n\n```\n                       Prediction request\n                              │\n                  POST /predict/{deployment}\n                              │\n                     ┌────────┴────────┐\n                     │  TrafficRouter  │──reads──▶ Redis (canary_traffic_pct)\n                     └────────┬────────┘\n                  80%         │         20%\n                   ▼                     ▼\n            [ v1 Baseline ]        [ v2 Canary ]\n                   │                     │\n                   └──── Prometheus ─────┘\n                     (requests, errors, latency)\n                              │\n                  HealthChecker (every 30s)\n                     ┌──────────┴──────────┐\n                 CRITICAL?              HEALTHY?\n                     ▼                     ▼\n              Auto-rollback        Continue / Auto-promote\n                              │\n                    DeploymentEvent ─▶ PostgreSQL (audit trail)\n```\n\nFastAPI serves predictions. Redis holds the traffic split. Prometheus records what happened. A background health checker compares the two models and acts. PostgreSQL keeps the full audit trail. 
MinIO stores the model artifacts.\n\nThe single most important design decision: **the traffic split is a Redis value, not a config file or an environment variable.**\n\n```python\nimport random\n\ndef route(self, deployment_config: dict) -> str:\n    # Method on the traffic router: picks which model role serves this request.\n    canary_pct = float(deployment_config.get(\"canary_traffic_pct\", 0.0))\n    # No canary configured, or 0% traffic: always serve the baseline.\n    if canary_pct <= 0.0 or not deployment_config.get(\"canary_model_id\"):\n        return \"baseline\"\n    if canary_pct >= 100.0:\n        return \"canary\"\n    # Otherwise sample uniformly so roughly canary_pct% of requests hit the canary.\n    return \"canary\" if random.random() < (canary_pct / 100.0) else \"baseline\"\n```\n\nThis function runs on every single prediction request, and it does no I/O beyond a single Redis hash read. Because the split lives in Redis, changing it from 20% to 50% takes effect **instantly** — no restart, no redeploy, no dropped requests. You click “+10% Traffic” in the dashboard and the very next prediction is routed against the new ratio.\n\nI verified the distribution is honest with 100,000 samples per setting: a 20% config routes 20.2% to canary, 50% routes 49.9%, 100% routes 100%. A missing or malformed config safely falls back to baseline — the production model is always the safe default.\n\nOffline accuracy is not what kills you in production. What kills you is **error rate** (the model throwing exceptions on real inputs it never saw in training) and **latency** (the new model being too slow under real load).\n\nSo every prediction records three things to Prometheus, labeled by model role:\n\n```\nprediction_requests_total{deployment, model_version, model_role, status}   # counter\nprediction_latency_ms{deployment, model_version, model_role}               # histogram\ncanary_traffic_pct{deployment}                                             # gauge\n```\n\nThe histogram is the interesting one. Prometheus histograms store latency as cumulative buckets, not raw values — so to get a p95 back out, you interpolate across the buckets. 
That lets the health checker ask “what’s the canary’s p95 latency over the last few minutes?” and compare it directly against the baseline’s.\n\nThis is the heart of the system. Every 30 seconds, a background task evaluates each running canary against its baseline over a 5-minute window and applies a simple, explicit rule set:\n\n```\nCRITICAL  (→ auto-rollback):\n  canary.error_rate > baseline.error_rate + 5%\n  OR canary.p95_latency > baseline.p95_latency + 100ms\nDEGRADED  (→ warn, hold):\n  same conditions at half the threshold\nHEALTHY   (→ auto-promote, if ≥100 canary requests and canary error ≤ baseline)\n          (→ otherwise continue)\n```\n\nThe thresholds are deltas against the baseline, not absolute numbers — because “a 3% error rate” means nothing without knowing the baseline is at 0.2%. What matters is whether the canary is *worse than what you already have*.\n\nIf the verdict is CRITICAL and auto-rollback is enabled, the engine rolls the canary back on its own, logs an auto_rollback_triggered event, and the baseline keeps serving. No page, no human, no incident.\n\nI tested all five decision paths in isolation with controlled metric inputs — critical-via-error, critical-via-latency, degraded, healthy-promote, healthy-continue — and each produces exactly the right call.\n\nThe end-to-end demo tells the whole story on the UCI Heart Disease dataset, and these are real numbers from an actual run:\n\n**Step 1–2** — Train two models. A strong baseline v1 (accuracy **0.852**) and an intentionally weak canary v2 (accuracy **0.541** — underfit on purpose).\n\n**Step 3–4** — Create the heart-prod deployment with v1 as baseline, then start v2 as a canary at **20% traffic**.\n\n**Step 5** — Send 200 live prediction requests. They split ~157 to baseline, ~43 to canary — right around the 20% target. 
A handful of malformed requests hit the canary and cause it to error.\n\n**Step 6** — The health check runs:\n\n```\nHealth Status: CRITICAL\nBaseline: error_rate=0.0%,  p95=8.8ms\nCanary:   error_rate=18.9%, p95=8.3ms\nRecommendation: ROLLBACK\n```\n\n**Step 7** — The auto-decision engine acts:\n\n```\naction=rolled_back; deployment status now: stable\nauto-rollback triggered: v2 removed, v1 continues as baseline\n```\n\nThe whole arc — canary_started → health_check_failed → rolled_back → auto_rollback_triggered — is captured as an immutable event timeline in PostgreSQL. The canary was caught and removed in under a minute, and the baseline never stopped serving.\n\nA prediction request does a lot, fast: read the Redis split, route to baseline or canary, load the model (cached in memory after the first hit), run inference with timing, record Prometheus metrics, and log the prediction to PostgreSQL — including a SHA-256 hash of the inputs rather than the raw features, so no PII is stored.\n\nThe model cache matters. The first request after startup takes ~5 seconds — it’s pulling the serialized model out of MinIO and deserializing it. Every request after that is served from memory in **2–3 milliseconds**. The download happens once; the speed is permanent.\n\nFour pages, all live over the REST API:\n\nThe Deployment History page is the one I’d show a skeptic. It’s the receipt: the system noticed a problem and fixed it, timestamped, with the reason attached.\n\nMLflow is great at tracking experiments and registering model versions. But a registry tells you a model *exists* and what its offline metrics *were*. It doesn’t route live traffic, it doesn’t measure real-world error rate, and it doesn’t roll anything back. The registry is the “what.” This system is the “how do I ship it without getting hurt.”\n\nYou need both — and crucially, you need the canary *and* the health checker together. 
A canary without an automated decision engine is just slow manual testing: you’ve split the traffic, but a human still has to stare at graphs and decide. The automation is what makes it operationally real.\n\nThis is the third piece of a connected MLOps platform:\n\nTogether: fast feature serving on the front end, full traceability in the middle, and safe automated deployment at the edge. When a canary gets rolled back, you can trace *why* the model was bad all the way back to the data it learned from.\n\nEverything is open source: **github.com/Emart29/ml-canary-deploy**\n\nThe demo (examples/heart_disease/demo.py) runs the entire story — train, deploy, observe, auto-rollback — against a real Postgres + Redis + MinIO stack in under a minute. There's a 10-command CLI and a 4-page Streamlit dashboard on top of it.\n\n*Next: streaming feature pipelines with Kafka — computing and serving features in real time as events arrive, instead of in scheduled batches.*\n\n[I Built a System that Automatically Rolls Back ML Models Before they Ruin Your Production Data](https://pub.towardsai.net/i-built-a-system-that-automatically-rolls-back-ml-models-before-they-ruin-your-production-data-189f4229d40a) was originally published in [Towards AI](https://pub.towardsai.net) on Medium, where people are continuing the conversation by highlighting and responding to this story.", "url": "https://wpnews.pro/news/i-built-a-system-that-automatically-rolls-back-ml-models-before-they-ruin-your", "canonical_source": "https://pub.towardsai.net/i-built-a-system-that-automatically-rolls-back-ml-models-before-they-ruin-your-production-data-189f4229d40a?source=rss----98111c9905da---4", "published_at": "2026-06-24 04:07:14+00:00", "updated_at": "2026-06-24 04:24:45.950084+00:00", "lang": "en", "topics": ["machine-learning", "mlops", "ai-infrastructure", "ai-safety", "ai-tools"], "entities": ["FastAPI", "Redis", "Prometheus", "PostgreSQL", "MinIO"], "alternates": {"html": 
"https://wpnews.pro/news/i-built-a-system-that-automatically-rolls-back-ml-models-before-they-ruin-your", "markdown": "https://wpnews.pro/news/i-built-a-system-that-automatically-rolls-back-ml-models-before-they-ruin-your.md", "text": "https://wpnews.pro/news/i-built-a-system-that-automatically-rolls-back-ml-models-before-they-ruin-your.txt", "jsonld": "https://wpnews.pro/news/i-built-a-system-that-automatically-rolls-back-ml-models-before-they-ruin-your.jsonld"}}