There's a specific kind of incident that every ML team eventually lives through.
You ship a new model version. The offline metrics looked great (better accuracy, better F1) and everyone signs off. It goes to production. And then, slowly, something is wrong. Predictions drift. Error rates creep up. Maybe latency balloons because the new model is heavier. You don't find out from a dashboard. You find out from a customer, or a downstream report that looks off, days later.
The root problem: deploying a model is usually a binary, all-or-nothing switch. Old model out, new model in, 100% of traffic, instantly. If the new one is bad, 100% of your users feel it, and rolling back is a frantic manual scramble.
I built a system that makes that switch safe and automatic. New models get only a small slice of live traffic. Their real-world error rate and latency are measured against the current production model continuously. And if the new model is worse, it's rolled back automatically, before most of your users ever touch it.
Here's how it works.
Instead of replacing the old model outright, you route a small percentage of live traffic, say 20%, to the new "canary" model, and keep 80% on the proven baseline. You watch both. If the canary holds up, you ramp it to 50%, then 100%, then promote it. If it doesn't, you roll back. The 80% never noticed.
The name comes from "canary in a coal mine": a small, expendable early warning. The canary model takes the risk so your whole user base doesn't have to.
Simple idea. The engineering is in making the split instant, the measurement honest, and the decision automatic.
                  Prediction request
                          |
              POST /predict/{deployment}
                          |
                  +-------+-------+
                  | TrafficRouter |---reads---> Redis (canary_traffic_pct)
                  +-------+-------+
                     80%  |  20%
                 +--------+--------+
                 v                 v
          [ v1 Baseline ]    [ v2 Canary ]
                 |                 |
                 +--- Prometheus --+   (requests, errors, latency)
                          |
              HealthChecker (every 30s)
                          |
               +----------+----------+
           CRITICAL?             HEALTHY?
               v                     v
         Auto-rollback    Continue / Auto-promote
                          |
           DeploymentEvent --> PostgreSQL (audit trail)
FastAPI serves predictions. Redis holds the traffic split. Prometheus records what happened. A background health checker compares the two models and acts. PostgreSQL keeps the full audit trail. MinIO stores the model artifacts.
The single most important design decision: the traffic split is a Redis value, not a config file or an environment variable.
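    import random

    def route(self, deployment_config: dict) -> str:
        # Percentage of traffic (0-100) that should reach the canary.
        canary_pct = float(deployment_config.get("canary_traffic_pct", 0.0))
        # No canary configured, or a 0% split: everything goes to the baseline.
        if canary_pct <= 0.0 or not deployment_config.get("canary_model_id"):
            return "baseline"
        if canary_pct >= 100.0:
            return "canary"
        # Otherwise pick the canary with probability canary_pct / 100.
        return "canary" if random.random() < (canary_pct / 100.0) else "baseline"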
This function runs on every single prediction request, and it does no I/O beyond a single Redis hash read. Because the split lives in Redis, changing it from 20% to 50% takes effect instantly: no restart, no redeploy, no dropped requests. You click "+10% Traffic" in the dashboard and the very next prediction is routed against the new ratio.
I verified the distribution is honest with 100,000 samples per setting: a 20% config routes 20.2% to canary, 50% routes 49.9%, 100% routes 100%. A missing or malformed config safely falls back to baseline, so the production model is always the safe default.
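A distribution check like the one described above can be reproduced in a few lines. This is a minimal sketch, not the project's test suite: `route` here is a standalone copy of the routing rule with an injectable random generator, and `canary_share`, the seed, and the model id `"v2"` are assumptions made for illustration.

```python
import random


def route(deployment_config: dict, rng: random.Random) -> str:
    """Standalone copy of the routing rule, with an injectable RNG for repeatability."""
    canary_pct = float(deployment_config.get("canary_traffic_pct", 0.0))
    if canary_pct <= 0.0 or not deployment_config.get("canary_model_id"):
        return "baseline"
    if canary_pct >= 100.0:
        return "canary"
    return "canary" if rng.random() < (canary_pct / 100.0) else "baseline"


def canary_share(pct: float, samples: int = 100_000, seed: int = 0) -> float:
    """Fraction of simulated requests routed to the canary at a given setting."""
    rng = random.Random(seed)
    # "v2" is a hypothetical canary id; any non-empty value enables the split.
    config = {"canary_traffic_pct": pct, "canary_model_id": "v2"}
    hits = sum(route(config, rng) == "canary" for _ in range(samples))
    return hits / samples


if __name__ == "__main__":
    for pct in (20.0, 50.0, 100.0):
        print(f"{pct:.0f}% configured -> {canary_share(pct):.1%} observed")
```

With 100,000 samples the observed share should land within a fraction of a percentage point of the configured value, which is the same order of deviation as the 20.2% and 49.9% figures quoted above.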
Offline accuracy is not what kills you in production. What kills you is error rate (the model throwing exceptions on real inputs it never saw in training) and latency (the new model being too slow under real load).
So every prediction records three things to Prometheus, labeled by model role:
    prediction_requests_total{deployment, model_version, model_role, status}   # counter
    prediction_latency_ms{deployment, model_version, model_role}               # histogram
    canary_traffic_pct{deployment}                                             # gauge
The histogram is the interesting one. Prometheus histograms store latency as cumulative buckets, not raw values, so to get a p95 back out, you interpolate across the buckets. That lets the health checker ask "what's the canary's p95 latency over the last few minutes?" and compare it directly against the baseline's.
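To make the interpolation step concrete, here is a rough sketch of estimating a quantile from cumulative buckets. It follows the same linear-interpolation idea as PromQL's `histogram_quantile`; the function name `quantile_from_buckets` and the bucket layout are assumptions for illustration, since the project's own implementation isn't shown here.

```python
import math


def quantile_from_buckets(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative histogram buckets.

    `buckets` holds (upper_bound, cumulative_count) pairs sorted by upper
    bound, ending with the catch-all (math.inf, total_count) bucket.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    if len(buckets) < 2 or buckets[-1][0] != math.inf:
        raise ValueError("need at least one finite bucket plus the +Inf bucket")

    total = buckets[-1][1]
    if total <= 0:
        return math.nan  # no observations in the window

    rank = q * total  # how many observations sit at or below the quantile
    prev_bound, prev_count = 0.0, 0.0
    for upper, count in buckets:
        if count >= rank:
            if upper == math.inf or count == prev_count:
                # Past the last finite bucket (or an empty bucket):
                # the best available estimate is the previous bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return prev_bound


if __name__ == "__main__":
    # Hypothetical latency buckets in milliseconds: (upper bound, cumulative count).
    latency_buckets = [(5.0, 50), (10.0, 90), (25.0, 98), (math.inf, 100)]
    print(quantile_from_buckets(0.95, latency_buckets))
```

In this made-up example the 95th observation falls in the 10 to 25 ms bucket, five-eighths of the way through it, so the estimate comes out at roughly 19.4 ms. The result is only as precise as the bucket boundaries, which is worth remembering when the rollback threshold is a 100 ms delta.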
This is the heart of the system. Every 30 seconds, a background task evaluates each running canary against its baseline over a 5-minute window and applies a simple, explicit rule set:
    CRITICAL (→ auto-rollback):  canary.error_rate  > baseline.error_rate  + 5%
                             OR  canary.p95_latency > baseline.p95_latency + 100ms
    DEGRADED (→ warn, hold):     same conditions at half the threshold
    HEALTHY  (→ auto-promote, if ≥100 canary requests and canary error ≤ baseline)
             (→ otherwise continue)
The thresholds are deltas against the baseline, not absolute numbers, because "a 3% error rate" means nothing without knowing the baseline is at 0.2%. What matters is whether the canary is worse than what you already have.
If the verdict is CRITICAL and auto-rollback is enabled, the engine rolls the canary back on its own, logs an auto_rollback_triggered event, and the baseline keeps serving. No page, no human, no incident.
I tested all five decision paths in isolation with controlled metric inputs (critical-via-error, critical-via-latency, degraded, healthy-promote, healthy-continue), and each produces exactly the right call.
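The rule set is small enough to sketch as a pure function. The version below is an illustration under stated assumptions, not the project's code: `WindowStats`, `evaluate`, and the constant names are hypothetical, error rates are expressed as fractions, and "+5%" is read as five percentage points.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowStats:
    """Aggregated metrics for one model over the evaluation window."""
    requests: int
    error_rate: float       # fraction, e.g. 0.05 means 5%
    p95_latency_ms: float


# Thresholds from the rule set above; the names are illustrative.
ERROR_DELTA = 0.05              # 5 percentage points (assumed interpretation)
LATENCY_DELTA_MS = 100.0
MIN_REQUESTS_TO_PROMOTE = 100


def evaluate(canary: WindowStats, baseline: WindowStats) -> tuple[str, str]:
    """Return (status, action) for a canary compared against its baseline."""
    error_gap = canary.error_rate - baseline.error_rate
    latency_gap = canary.p95_latency_ms - baseline.p95_latency_ms

    if error_gap > ERROR_DELTA or latency_gap > LATENCY_DELTA_MS:
        return "CRITICAL", "rollback"
    # DEGRADED uses the same comparisons at half the threshold.
    if error_gap > ERROR_DELTA / 2 or latency_gap > LATENCY_DELTA_MS / 2:
        return "DEGRADED", "hold"
    if (canary.requests >= MIN_REQUESTS_TO_PROMOTE
            and canary.error_rate <= baseline.error_rate):
        return "HEALTHY", "promote"
    return "HEALTHY", "continue"


if __name__ == "__main__":
    baseline = WindowStats(requests=157, error_rate=0.0, p95_latency_ms=8.8)
    canary = WindowStats(requests=43, error_rate=0.189, p95_latency_ms=8.3)
    print(evaluate(canary, baseline))
```

Keeping the decision free of I/O is what makes each of the five paths testable with hand-picked inputs: the caller gathers the metrics, and the function only compares them.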
The end-to-end demo tells the whole story on the UCI Heart Disease dataset, and these are real numbers from an actual run:
Steps 1-2. Train two models. A strong baseline v1 (accuracy 0.852) and an intentionally weak canary v2 (accuracy 0.541, underfit on purpose).
Steps 3-4. Create the heart-prod deployment with v1 as baseline, then start v2 as a canary at 20% traffic.
Step 5. Send 200 live prediction requests. They split ~157 to baseline, ~43 to canary, right around the 20% target. A handful of malformed requests hit the canary and cause it to error.
Step 6. The health check runs:
    Health Status: CRITICAL
    Baseline: error_rate=0.0%, p95=8.8ms
    Canary: error_rate=18.9%, p95=8.3ms
    Recommendation: ROLLBACK
Step 7. The auto-decision engine acts:
    action=rolled_back; deployment status now: stable
    auto-rollback triggered: v2 removed, v1 continues as baseline
The whole arc (canary_started → health_check_failed → rolled_back → auto_rollback_triggered) is captured as an immutable event timeline in PostgreSQL. The canary was caught and removed in under a minute, and the baseline never stopped serving.
A prediction request does a lot, fast: read the Redis split, route to baseline or canary, load the model (cached in memory after the first hit), run inference with timing, record Prometheus metrics, and log the prediction to PostgreSQL, including a SHA-256 hash of the inputs rather than the raw features, so no PII is stored.
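As an illustration of the input-hashing step, the sketch below serializes the features canonically and hashes the result. The helper name `hash_features` and the example feature names are assumptions; the article doesn't show how the project serializes inputs before hashing.

```python
import hashlib
import json


def hash_features(features: dict) -> str:
    """SHA-256 hex digest of a feature payload, stable across key order."""
    # Sorting keys and fixing separators gives one canonical byte string
    # per payload, so identical inputs always produce the same digest.
    canonical = json.dumps(features, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Hypothetical feature payload, for illustration only.
    print(hash_features({"age": 54, "chol": 239, "thalach": 160}))
```

The canonical serialization is the design choice that matters: without it, two requests with the same values in a different key order would be logged as different inputs.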
The model cache matters. The first request after startup takes ~5 seconds because it's pulling the serialized model out of MinIO and deserializing it. Every request after that is served from memory in 2 to 3 milliseconds. The download happens once; the speed is permanent.
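A load-once cache along these lines could look like the following. It is a sketch under assumptions: `ModelCache` and its `loader` callback are hypothetical names, and the loader stands in for whatever downloads the artifact from object storage and deserializes it.

```python
import threading
from typing import Any, Callable


class ModelCache:
    """Load each model at most once, then serve it from memory."""

    def __init__(self, loader: Callable[[str], Any]) -> None:
        self._loader = loader  # slow path: fetch the artifact and deserialize it
        self._models: dict[str, Any] = {}
        self._lock = threading.Lock()

    def get(self, model_id: str) -> Any:
        # Fast path: the model is already in memory.
        if model_id in self._models:
            return self._models[model_id]
        with self._lock:
            # Re-check inside the lock so concurrent first requests
            # trigger only one load.
            if model_id not in self._models:
                self._models[model_id] = self._loader(model_id)
            return self._models[model_id]


if __name__ == "__main__":
    cache = ModelCache(loader=lambda model_id: {"id": model_id})
    first = cache.get("v1")   # invokes the loader
    second = cache.get("v1")  # served from memory
    print(first is second)
```

The lock only guards the slow path, so steady-state requests pay for a dictionary lookup and nothing else.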
Four pages, all live over the REST API:
The Deployment History page is the one I'd show a skeptic. It's the receipt: the system noticed a problem and fixed it, timestamped, with the reason attached.
MLflow is great at tracking experiments and registering model versions. But a registry tells you a model exists and what its offline metrics were. It doesn't route live traffic, it doesn't measure real-world error rate, and it doesn't roll anything back. The registry is the "what." This system is the "how do I ship it without getting hurt."
You need both, and crucially, you need the canary and the health checker together. A canary without an automated decision engine is just slow manual testing: you've split the traffic, but a human still has to stare at graphs and decide. The automation is what makes it operationally real.
This is the third piece of a connected MLOps platform:
Together: fast feature serving on the front end, full traceability in the middle, and safe automated deployment at the edge. When a canary gets rolled back, you can trace why the model was bad all the way back to the data it learned from.
Everything is open source: github.com/Emart29/ml-canary-deploy
The demo (examples/heart_disease/demo.py) runs the entire story (train, deploy, observe, auto-rollback) against a real Postgres + Redis + MinIO stack in under a minute. There's a 10-command CLI and a 4-page Streamlit dashboard on top of it.
Next: streaming feature pipelines with Kafka, computing and serving features in real time as events arrive, instead of in scheduled batches.
I Built a System that Automatically Rolls Back ML Models Before they Ruin Your Production Data was originally published in Towards AI on Medium.