Drift Detection for LLM Routing: Catching Silent Model Degradation


It's 2am and I am staring at a routing layer I spent weeks tuning, running a thought experiment that will not let me sleep. The router is doing exactly what I built it to do. Nothing in my code would change, nothing in my config would change, and yet I can see, plain as day, the night this system goes confidently, repeatedly wrong while every line of it stays correct. The failure is already baked in. I just have not been bitten by it yet.

The setup is simple. I route incoming tasks across four capabilities: a fast cheap model, a slow expensive one, a retrieval tool, and a code-execution agent. Each task goes to one of them, and I watch a single binary signal: did the output pass the quality gate or not? Run that for a few thousand calls and the policy converges, the weights stabilize, the dispatcher learns which arm wins. For a while, life is good. And that good stretch is exactly the trap.

Here is the scenario that keeps me up. The fast cheap model gets silently updated by its vendor, and its accuracy on my tasks quietly collapses. My router has no idea. It is carrying a high historical success estimate for that arm, earned honestly over weeks of good performance, and it keeps routing there because three weeks ago that was the right call. The dispatcher would not be broken. It would be right about a world that no longer existed. It would be wrong because it remembered too well.

Multi-armed bandit routing, from Thompson Sampling to UCB to plain epsilon-greedy, rests on one quiet premise: each arm's true success rate is fixed. Pick the arm with the best estimate, keep nudging it as outcomes arrive, converge on the best option. Clean, and it works right up until the ground moves. In production LLM routing the ground moves constantly. Models get updated, prompts age, the external API you lean on degrades. The question was never whether drift would hit me. It was whether my routing layer would notice before my users did. By default, it wouldn't.
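To make the stationarity premise concrete, here is a minimal Thompson Sampling sketch using only the standard library. The arm names, the success rates, the seed, and the step counts are illustrative assumptions, not values from a real routing layer. It converges on the best arm and then keeps routing there after that arm's true rate collapses.

```python
import random

def thompson_route(successes, failures, rng):
    """Sample a success rate per arm from its Beta posterior; pick the highest."""
    samples = {
        arm: rng.betavariate(successes[arm] + 1, failures[arm] + 1)
        for arm in successes
    }
    return max(samples, key=samples.get)

def run(true_rates, steps, successes, failures, rng):
    """Route `steps` tasks, updating the counts in place; return picks per arm."""
    picks = {arm: 0 for arm in true_rates}
    for _ in range(steps):
        arm = thompson_route(successes, failures, rng)
        picks[arm] += 1
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return picks

# Hypothetical arms and rates, chosen only for illustration.
rates = {"fast_cheap": 0.85, "slow_expensive": 0.70,
         "retrieval": 0.65, "code_agent": 0.60}
successes = {arm: 0 for arm in rates}
failures = {arm: 0 for arm in rates}
rng = random.Random(7)

before = run(rates, 3000, successes, failures, rng)   # stationary world
rates["fast_cheap"] = 0.25                            # silent vendor update
after = run(rates, 100, successes, failures, rng)     # router has not noticed

print("picks before drift:", before)
print("picks in the 100 calls after drift:", after)
```

Nothing in the update rule is broken here. The posterior for the degraded arm simply carries thousands of old observations, so a hundred new ones barely move it.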

The cruel detail is the inertia. With a slow learning rate the running estimate remembers roughly the last twenty observations, which feels fast enough until you picture an arm that has served two thousand calls at 85 percent. A moving average with that much history behind it does not flinch when the truth drops to 30 percent. It takes dozens of fresh failures just to halve the gap, and far longer to close it, and every one of those failures is a task routed straight into the hole.

My first instinct was the obvious one, and it was wrong: turn up the learning rate, make the estimate forget faster. A faster learning rate makes every estimate jittery all the time, even on arms where nothing has changed. I would be trading one silent failure for a router that twitches at noise. That is not a fix, it is a different bug. What I actually needed was narrower. Not a shorter memory everywhere, but a way to forget on purpose, surgically, only on the one arm that had genuinely shifted, and only when a real shift had occurred. A tripwire per arm.
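The jitter can be put in numbers. The sketch below assumes the estimate is an exponential moving average, `est += lr * (reward - est)`, fed independent pass/fail rewards from an arm whose true rate never changes. The article does not name its estimator, so treat that as an assumption. Under it, the steady-state standard deviation of the estimate is `sqrt(lr / (2 - lr) * p * (1 - p))`.

```python
import math

def ema_steady_state_std(p, lr):
    """Std dev of an EMA estimate on i.i.d. Bernoulli(p) rewards after warm-up.

    Assumes the update est += lr * (reward - est); p is the arm's true,
    unchanging success rate and lr is the learning rate in (0, 1].
    """
    return math.sqrt(lr / (2.0 - lr) * p * (1.0 - p))

# A perfectly stable arm at p = 0.85, seen through two learning rates.
for lr in (0.05, 0.3):
    print(f"lr={lr}: estimate wobbles by about {ema_steady_state_std(0.85, lr):.3f}")
```

At a learning rate of 0.05 the estimate of a stable 0.85 arm wobbles by roughly 0.06; at 0.3 it wobbles by about 0.15, which is the same size as the gap between the best arm and the runner-up in many routing setups.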

The tool already existed, and it had existed since 2007. ADWIN, for adaptive windowing, published by Bifet and Gavaldà at SIAM SDM 2007, does exactly the surgical thing I was reaching for. It watches a single stream of binary outcomes and keeps a window over them. After every new result it asks one question: is there a point inside this window where the older stretch and the recent stretch look like two different distributions, too far apart to be the same thing wearing noise?

If no, the window just grows. Stable periods accumulate evidence, and a bigger window makes the test harder to fool, so it does not trip on ordinary variance. If yes, ADWIN declares drift, throws away everything before the split, keeps only the recent post-shift stretch, and tells you, so you reset the arm's estimate using only what survived.

That collapse is the whole idea. A fixed window can only notice a shift once it has been present for about half its length, so you are always looking backward at a horizon you had to guess in advance. ADWIN's window grows without bound while things are calm, building the power to resist false alarms, then collapses hard the instant a real shift lands. The window size is an answer the data gives you, not a knob you set and pray over.

For a bandit with several arms, I run one independent ADWIN per reward stream; they share nothing, because arms do not generally degrade at the same moment, and each one watching its own stream in isolation is not a simplification, it is the correct model.

A single sensitivity knob governs how eager the test is to fire, really the false-positive rate you will tolerate on a stable stream. Under the hood its tolerance band tightens when the split is balanced and is nearly impossible to trip when only one or two observations sit on one side, so a single fresh result never declares drift on its own; it carries a gentle penalty for checking every possible split point; and it tightens further on a low-variance arm that almost always succeeds, so real degradation on a near-perfect arm gets caught sooner rather than hiding in slack. The river library uses a variance-aware form of the bound that runs tighter for large windows and high-quality arms than the simpler version in the original 2007 paper; the two agree closely when the window is very small.

For routing I settled on a sensitivity of 0.002. With a window around a thousand observations that keeps spurious firings well under one per five hundred evaluations per arm, which across four arms at a hundred routing decisions an hour is a false drift event roughly once every thirty hours, low enough not to pollute the policy, high enough that I am not waiting weeks to catch the real thing.
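The grow-then-collapse behavior can be sketched in a few lines of plain Python. This is a deliberately simplified stand-in, not river's implementation: it stores every observation, checks every split on each update, and uses the Hoeffding-style threshold from the 2007 paper rather than a variance-aware one. The class name `SimpleAdwin` and the deterministic test stream are hypothetical.

```python
import math

class SimpleAdwin:
    """Simplified ADWIN-style detector: grow the window, cut it on drift."""

    def __init__(self, delta=0.002):
        self.delta = delta      # sensitivity: tolerated false-positive rate
        self.window = []        # raw observations, oldest first

    @property
    def mean(self):
        return sum(self.window) / len(self.window) if self.window else 0.0

    def update(self, x):
        """Add one observation; return True if older data was dropped."""
        self.window.append(x)
        drift = False
        while self._cut_once():
            drift = True
        return drift

    def _cut_once(self):
        n = len(self.window)
        if n < 2:
            return False
        total = sum(self.window)
        log_term = math.log(4.0 * n / self.delta)   # ln(4 / delta'), delta' = delta / n
        left_sum = 0.0
        for split in range(1, n):
            left_sum += self.window[split - 1]
            n0, n1 = split, n - split
            m = 1.0 / (1.0 / n0 + 1.0 / n1)         # small when the split is lopsided
            eps_cut = math.sqrt(log_term / (2.0 * m))
            if abs(left_sum / n0 - (total - left_sum) / n1) >= eps_cut:
                del self.window[:split]             # forget everything before the split
                return True
        return False

# Deterministic stream: about 0.86 success for 300 steps, then 0.25.
stream = [0 if i % 7 == 0 else 1 for i in range(300)]
stream += [1 if i % 4 == 0 else 0 for i in range(300, 500)]

detector = SimpleAdwin(delta=0.002)
drift_steps = [step for step, x in enumerate(stream) if detector.update(x)]
print("first drift at step", drift_steps[0], "| window mean now", round(detector.mean, 2))
```

The full-scan version costs time linear in the window on every update, which is fine for a sketch and too slow for a long-lived stream; the real algorithm compresses the window into exponentially sized buckets to avoid that.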

The original authors prove two guarantees, and both held up. On a stationary stream the odds of a false alarm stay bounded by the sensitivity setting. And once a real shift of a given size lands, ADWIN catches it within a number of observations that scales inversely with the square of the shift magnitude, so a big drop is caught fast and a subtle one takes disproportionately longer: halve the size of the shift and the detection delay roughly quadruples. Both bounds are tight.
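The inverse-square scaling can be read off the threshold directly. The derivation below assumes the Hoeffding-style cut from the 2007 paper (river's variance-aware bound differs), a long stable history so that $n_0 \gg n_1$, and an observed gap equal to the true shift $\epsilon$. It is a back-of-envelope estimate, not the formal guarantee.

```latex
% Cut threshold for a window of n = n_0 + n_1 observations, sensitivity \delta:
\epsilon_{\text{cut}} = \sqrt{\frac{1}{2m}\,\ln\frac{4n}{\delta}},
\qquad
\frac{1}{m} = \frac{1}{n_0} + \frac{1}{n_1}

% With n_0 \gg n_1 we have m \approx n_1, so a gap of size \epsilon trips the test once
n_1 \gtrsim \frac{\ln(4n/\delta)}{2\,\epsilon^{2}}

% Example: \delta = 0.002,\ n \approx 320,\ \epsilon = 0.60
n_1 \gtrsim \frac{\ln(640000)}{2 \times 0.36} \approx \frac{13.4}{0.72} \approx 19
```

Halving the shift to 0.30 under the same assumptions pushes the estimate to about 74 observations, four times as many.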

I did not want to trust any of this on faith, so I built a small synthetic run to validate the design before it ever touched a real reward path. Four arms, five hundred steps, a fixed seed so it reproduces. Three arms hold steady at roughly 0.70, 0.65, and 0.60. The fourth, capability A, starts strong at 0.85 and then, at step 300, drops off a cliff to 0.25, standing in for exactly the silent vendor update I had been losing sleep over.
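A reward generator for a run like this one fits in a dozen lines. The sketch below is an assumption-laden reconstruction, not the companion code: the arm names `B`, `C`, `D`, the seed value, zero-based step counting, and drawing a reward for every arm at every step are all choices made here for illustration.

```python
import random

# Success rates from the description above; arm names other than "A" are hypothetical.
BASE_RATES = {"A": 0.85, "B": 0.70, "C": 0.65, "D": 0.60}
DRIFT_STEP = 300            # assumed zero-based
POST_DRIFT_RATE_A = 0.25

def success_rate(arm, step):
    """True success probability of `arm` at `step`; only A ever changes."""
    if arm == "A" and step >= DRIFT_STEP:
        return POST_DRIFT_RATE_A
    return BASE_RATES[arm]

def synthetic_rewards(steps=500, seed=42):
    """One dict of 0/1 rewards per step, reproducible for a given seed."""
    rng = random.Random(seed)
    return [
        {arm: int(rng.random() < success_rate(arm, step)) for arm in BASE_RATES}
        for step in range(steps)
    ]

rewards = synthetic_rewards()
pre = sum(r["A"] for r in rewards[:DRIFT_STEP]) / DRIFT_STEP
post = sum(r["A"] for r in rewards[DRIFT_STEP:]) / (len(rewards) - DRIFT_STEP)
print(f"arm A empirical rate: {pre:.2f} before drift, {post:.2f} after")
```

A router would only observe the reward of the arm it actually picked at each step; generating all four up front just keeps the environment independent of the policy being tested.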

The first time I saw the log line appear, the feeling was disproportionate to a synthetic test. Around step 318, eighteen steps after the true shift, ADWIN fired on capability A. Its estimate dropped from about 0.85 to about 0.28 as the window collapsed from over three hundred observations down to roughly a dozen, dumping the stale high-accuracy history in one motion. A second, smaller event near step 412 was just the window settling onto the new regime. By the end, capability A's routing weight had fallen from near 0.40 before the drift to around 0.11, its honest post-drift share, while the three stable arms held near 0.25 to 0.29 each. The eighteen-step lag lines up with the guarantee: the shift here is 0.60, which puts the theoretical floor at only a handful of observations, and constants and warm-up account for the rest.

One design choice mattered more than the rest. The arm's estimate has to be decoupled from ADWIN's internal window: when drift fires, the monitor reads the fresh collapsed window's mean and adopts it as the new estimate, and that is the fast policy refresh the whole exercise exists to produce. The other thing I refused to give up was that no arm is ever zeroed out. A new arm's estimate is seeded at an uninformed 0.5, and a tiny floor on every weight keeps a sliver of exploration alive even for a failing arm, so the system keeps probing it and can notice the day it recovers. A degraded arm is not a dead arm. Sometimes the vendor ships a fix and you want to find out.
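One way to implement the floor is to reserve a fixed slice of probability for every arm and split the remainder in proportion to the estimates. The article does not say how estimates map to routing weights, so the proportional rule, the function name `routing_weights`, and the 0.02 floor below are assumptions.

```python
NEW_ARM_ESTIMATE = 0.5   # uninformed seed for an arm with no history

def routing_weights(estimates, floor=0.02):
    """Turn per-arm success estimates into routing probabilities.

    Every arm gets at least `floor`; the remaining mass is shared in
    proportion to the estimates, so the weights always sum to 1.
    """
    k = len(estimates)
    if floor * k >= 1.0:
        raise ValueError("floor is too large for this many arms")
    spare = 1.0 - floor * k
    total = sum(estimates.values())
    if total == 0.0:
        return {arm: 1.0 / k for arm in estimates}
    return {arm: floor + spare * est / total for arm, est in estimates.items()}

# A freshly degraded arm next to three stable ones, plus a brand-new arm.
estimates = {"A": 0.28, "B": 0.70, "C": 0.65, "D": 0.60, "new": NEW_ARM_ESTIMATE}
weights = routing_weights(estimates)
print({arm: round(w, 3) for arm, w in weights.items()})
```

Even an arm whose estimate falls all the way to zero keeps a weight equal to the floor, which is what lets the router discover a recovery.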

A clean synthetic result does not make ADWIN a hammer for every problem, and pretending otherwise is how you get burned somewhere else.

If an arm sees fewer than thirty observations an hour, the tolerance band is enormous, only a total collapse trips it, and the post-drift estimate is too noisy to trust anyway. Aggregate at a coarser granularity or reach for a Bayesian change-point detector with an informative prior.

If the drift is gradual, ADWIN is the wrong tool by design. It is built for abrupt shifts, a model update, an endpoint degrading, a sudden change in the prompt mix. For an arm whose success rate decays a percent a week as its world knowledge ages, the window grows slowly, the gap between old and recent stays narrow, and detection lags by months. A scheduled two-sample test on rolling seven-day buckets is what that job wants.
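For the gradual case, a scheduled comparison between two buckets can be as plain as a pooled two-proportion z-test. The sketch below uses only the standard library; the function name, the bucket sizes, and the counts in the example are hypothetical, and the normal approximation assumes each bucket holds at least a few dozen observations.

```python
import math

def two_proportion_z_test(pass_old, n_old, pass_new, n_new):
    """Pooled two-sample z-test on success rates; returns (z, two-sided p)."""
    p_old = pass_old / n_old
    p_new = pass_new / n_new
    pooled = (pass_old + pass_new) / (n_old + n_new)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_old + 1.0 / n_new))
    if se == 0.0:
        return 0.0, 1.0          # both buckets all-pass or all-fail
    z = (p_old - p_new) / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

# Hypothetical seven-day buckets: last week versus this week for one arm.
z, p_value = two_proportion_z_test(pass_old=850, n_old=1000, pass_new=800, n_new=1000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

Run on a schedule, a test like this compares whole buckets rather than reacting to each outcome, which suits a decline that only shows up over weeks.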

And if your non-stationarity is structural, if reward correlates with time of day or task type or session by design, ADWIN will fire constantly and churn your estimates into noise. That is not a routing system with drift detection. That is a contextual bandit in denial, and the fix is to model the context explicitly rather than detect it as drift.

You don't have to implement any of this yourself. ADWIN ships maintained in the river online-machine-learning library; the work is wiring one detector per arm into the reward path and letting a drift signal reset that arm's estimate to its post-collapse window. I kept the companion implementation small, one monitor, one synthetic run, an optional plot, and a test suite above 80 percent coverage, with river as the only real dependency, so the whole thing reproduces from a single command.
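The wiring might look like the sketch below. It is not the companion implementation: the class `PerArmDriftMonitor`, the `detector_factory` hook, and the 0.05 learning rate are hypothetical. It assumes a recent river release in which `river.drift.ADWIN` exposes `update(x)`, `drift_detected`, and `estimation`; the import is deferred so the monitor can be exercised with a stand-in detector when river is not installed.

```python
def river_adwin(delta=0.002):
    """Build a river ADWIN detector (assumes river is installed)."""
    from river import drift   # deferred so this module imports without river
    return drift.ADWIN(delta=delta)

class PerArmDriftMonitor:
    """One independent detector and one estimate per arm."""

    def __init__(self, arms, detector_factory=river_adwin, lr=0.05):
        self.detectors = {arm: detector_factory() for arm in arms}
        self.estimates = {arm: 0.5 for arm in arms}   # uninformed seed
        self.lr = lr

    def record(self, arm, passed):
        """Feed one quality-gate outcome; return True if drift fired on `arm`."""
        reward = 1.0 if passed else 0.0
        detector = self.detectors[arm]
        detector.update(reward)
        if detector.drift_detected:
            # Adopt the mean of the collapsed window as the new estimate.
            self.estimates[arm] = detector.estimation
            return True
        # No drift: ordinary slow update, decoupled from the detector's window.
        self.estimates[arm] += self.lr * (reward - self.estimates[arm])
        return False

# Usage, assuming river is available:
#   monitor = PerArmDriftMonitor(["fast", "slow", "retrieval", "code_agent"])
#   if monitor.record("fast", passed=False):
#       print("drift on fast, new estimate", monitor.estimates["fast"])
```

Passing the detector in as a factory keeps the arms from sharing state and makes it easy to swap in a different detector for low-traffic arms.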

But the wiring is not the lesson. The lesson is the one that kept me up at 2am: a router that learns is also a router that can be confidently, durably wrong the moment the world it learned stops being true. Convergence is not the finish line. A converged policy is a strong opinion about a fixed reality, and production has no fixed reality. A dispatcher that was right three weeks ago and wrong tonight is not malfunctioning. It is believing its own history a little too hard, and nothing in the loop is watching for the day that history stops being true.

So now something is. One tripwire per arm, quietly asking after every outcome whether the past still predicts the present, ready to forget on purpose the moment it doesn't. If you route anything across capabilities that can change underneath you, and in this field everything can, you want that tripwire in the loop before the page arrives, not after.

Bifet, A., & Gavaldà, R. (2007). Learning from time-changing data with adaptive windowing. In Proceedings of the 2007 SIAM International Conference on Data Mining (SDM 2007), pp. 443-448. Society for Industrial and Applied Mathematics. doi:10.1137/1.9781611972771.42

river Python library. Online machine learning in Python. BSD-3-Clause license. Source: github.com/online-ml/river. ADWIN implementation: river.drift.ADWIN
