Your Agent Didn't Break, It Drifted: Detecting Slow Decay in Autonomous Systems

A developer argues that autonomous agent systems suffer from "drift": a slow, silent degradation in output quality that standard monitoring fails to detect. They propose a solution combining continuous scoring via agent-eval and trace analysis with AgentLens to identify which component is causing the decay.

There is a specific kind of incident that no alert ever fires for, and it is the one I trust least. Nothing crashed. No exception, no 500, no failed health check. The agent ran every day, returned answers every time, and stayed green on every dashboard you own. And yet, over six weeks, it got measurably worse, and you found out from a customer, not a monitor.

That is drift, and it is the failure mode I think the industry is least prepared for. We have gotten good at catching the cliff: the agent throws, the tool 500s, the JSON won't parse, CI goes red. We are still terrible at catching the slope: answer quality bleeding out two percent a week while every system reports perfect health. Crashes are loud and self-announcing. Drift is silent by construction, and that silence is exactly why it wins.

Here is the opinion I will defend: drift is not an outlier problem, it's a baseline problem. You cannot detect decay by looking at any single run, because a single run looks completely fine. Drift only exists as a change in a distribution over time, so if you are not continuously scoring production and trending the score, you are structurally incapable of seeing it. Not unlucky. Incapable.

The thing that makes drift so disorienting is that it violates our deepest instinct: if the code didn't change, the behavior didn't change. For agents, that is just wrong. Your agent decays while your git history sits perfectly still. You may have pinned gpt-4o, but a pinned model name is not a pinned model: providers roll checkpoints and quietly re-tune behind a stable string.
Your prompt is byte-for-byte identical and your outputs shifted anyway.

Not one of these shows up in a code diff. Not one throws. Every one degrades what your users actually experience. This is why "we'll notice if it breaks" is a fantasy: the most expensive agent regressions don't break anything.

To detect drift you need two things: a baseline (what "normal" scored like over a trusted window) and a continuous signal, the same score computed the same way on live traffic. Drift is the gap between them, measured statistically, not by eyeball.

The naive version is a single threshold: "alert if quality drops below 0.8." That catches the cliff and misses the slope. A score that walks from 0.91 to 0.82 over five weeks never trips that floor, yet it has lost nearly a tenth of its quality. You are not looking for low; you are looking for moving, which is a different statistical question, and it needs the baseline.

This is where evaluation and observability stop being separate concerns and become one workflow, because you need both a thing that scores and a thing that remembers the route. I run agent-eval to score and gate the agent's output: deterministic checks where it can, a model-as-judge rubric where it must, and crucially it persists each verdict so a series of scores exists to trend at all. And I run AgentLens to capture the trace behind every scored run: every model and tool step, the resolved inputs the model actually saw after interpolation, the raw outputs that came back.

The pairing is the whole point: agent-eval tells you the score is drifting; AgentLens tells you which step started drifting. A drift alert with no trace behind it is just a number falling on a chart with no way to ask why, and "quality is down 6% this month, cause unknown" isn't an actionable signal, it's an anxiety generator.

Here is a drift detector over a rolling window of scored production runs.
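Before the detector, the floor-versus-slope arithmetic above can be made concrete. This is a minimal sketch, not part of agent-eval or AgentLens: the weekly values between 0.91 and 0.82 are invented for illustration, and the 5% tolerance and the names `breachesFloor` and `relativeChange` are assumptions of this example.

```ts
// Hypothetical weekly mean quality scores. Only the endpoints (0.91 and 0.82)
// come from the text; the intermediate values are made up for illustration.
const weeklyScores: number[] = [0.91, 0.89, 0.87, 0.86, 0.84, 0.82];

const ABSOLUTE_FLOOR = 0.8;     // the naive "alert if quality drops below 0.8" rule
const MAX_RELATIVE_DROP = 0.05; // assumed tolerance: 5% below the baseline score

// Naive check: fires only if some score falls under the fixed floor.
function breachesFloor(scores: number[], floor: number): boolean {
  return scores.some((s) => s < floor);
}

// Baseline check: fractional change from the first score to the last one.
// Returns NaN for an empty series.
function relativeChange(scores: number[]): number {
  const first = scores[0] ?? NaN;
  const last = scores[scores.length - 1] ?? NaN;
  return (last - first) / first;
}

const floorAlert: boolean = breachesFloor(weeklyScores, ABSOLUTE_FLOOR);
const change: number = relativeChange(weeklyScores);
const slopeAlert: boolean = change < -MAX_RELATIVE_DROP;
```

On these numbers the fixed floor stays silent, while the comparison against the baseline reports a drop of roughly 9.9%. The detector that follows goes further by comparing window means and checking the move against baseline noise.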
The scores come from agent-eval; each run's traceId points back into AgentLens so a flagged window is one click from the evidence:

```ts
import { queryScoredRuns } from "agent-eval";

interface ScoredRun {
  runId: string;
  traceId: string; // -> AgentLens: the full route that produced this score
  score: number;   // agent-eval rubric verdict, 0..1
  at: number;      // epoch ms
}

interface DriftReport {
  drifting: boolean;
  baselineMean: number;
  recentMean: number;
  deltaPct: number;         // how far recent has moved from baseline
  zScore: number;           // is the move bigger than normal run-to-run noise?
  sampleTraceIds: string[]; // worst recent runs, for AgentLens drill-in
}

function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function stdev(xs: number[], mu: number): number {
  return Math.sqrt(mean(xs.map((x) => (x - mu) ** 2)));
}

// Compare a recent window against a trusted baseline window.
// Drift = the recent mean has moved further than baseline NOISE explains.
function detectDrift(baseline: ScoredRun[], recent: ScoredRun[]): DriftReport {
  const baseScores = baseline.map((r) => r.score);
  const recentScores = recent.map((r) => r.score);

  const baselineMean = mean(baseScores);
  const recentMean = mean(recentScores);
  const baselineSd = stdev(baseScores, baselineMean) || 1e-9;

  // Standard error of the recent window's mean, scaled by baseline noise.
  // This asks: is this gap real, or just the sample size talking?
  const se = baselineSd / Math.sqrt(recentScores.length);
  const zScore = (recentMean - baselineMean) / se;
  const deltaPct = ((recentMean - baselineMean) / baselineMean) * 100;

  // Flag when quality dropped AND the drop is statistically meaningful.
  // z < -3 ~ a one-sided drop well outside normal run-to-run wobble.
  const drifting = zScore < -3 && deltaPct < -2;

  const sampleTraceIds = [...recent]
    .sort((a, b) => a.score - b.score)
    .slice(0, 5)
    .map((r) => r.traceId);

  return { drifting, baselineMean, recentMean, deltaPct, zScore, sampleTraceIds };
}

// Roll the windows forward continuously, not on deploy.
async function checkProductionDrift(): Promise
```