There is a specific kind of incident that no alert ever fires for, and it is the one I trust least. Nothing crashed. No exception, no 500, no failed health check. The agent ran every day, returned answers every time, and stayed green on every dashboard you own. And yet, over six weeks, it got measurably worse — and you found out from a customer, not a monitor.
That is drift, and it is the failure mode I think the industry is least prepared for. We have gotten good at catching the cliff: the agent throws, the tool 500s, the JSON won't parse, CI goes red. We are still terrible at catching the slope: answer quality bleeding out two percent a week while every system reports perfect health. Crashes are loud and self-announcing. Drift is silent by construction, and that silence is exactly why it wins.
Here is the opinion I will defend: drift is not an outlier problem, it's a baseline problem. You cannot detect decay by looking at any single run, because a single run looks completely fine. Drift only exists as a change in a distribution over time — so if you are not continuously scoring production and trending the score, you are structurally incapable of seeing it. Not unlucky. Incapable.
The thing that makes drift so disorienting is that it violates our deepest instinct: if the code didn't change, the behavior didn't change. For agents, that is just wrong. Your agent decays while your git history sits perfectly still:
gpt-4o
, but a pinned model name is not a pinned model — providers roll checkpoints and quietly re-tune behind a stable string. Your prompt is byte-for-byte identical and your outputs shifted anyway.Not one of these shows up in a code diff. Not one throws. Every one degrades what your users actually experience. This is why "we'll notice if it breaks" is a fantasy — the most expensive agent regressions don't break anything.
To detect drift you need two things: a baseline — what "normal" scored like over a trusted window — and a continuous signal, the same score computed the same way on live traffic. Drift is the gap between them, measured statistically, not by eyeball.
The naive version is a single threshold: "alert if quality drops below 0.8." That catches the cliff and misses the slope. A score that walks from 0.91 to 0.82 over five weeks never trips an absolute floor, yet it has lost nearly a tenth of its quality. You are not looking for low; you are looking for moving — a different statistical question, and it needs the baseline.
This is where evaluation and observability stop being separate concerns and become one workflow — because you need both a thing that scores and a thing that remembers the route. I run agent-eval to score and gate the agent's output: deterministic checks where it can, a model-as-judge rubric where it must, and crucially it persists each verdict so a series of scores exists to trend at all. And I run AgentLens to capture the trace behind every scored run — every model and tool step, the resolved inputs the model actually saw after interpolation, the raw outputs that came back. The pairing is the whole point: agent-eval tells you the score is drifting; AgentLens tells you which step started drifting. A drift alert with no trace behind it is just a number falling on a chart with no way to ask why — and "quality is down 6% this month, cause unknown" isn't an actionable signal, it's an anxiety generator.
Here is a drift detector over a rolling window of scored production runs. The scores come from agent-eval; each run's traceId
points back into AgentLens so a flagged window is one click from the evidence:
import { queryScoredRuns } from "agent-eval";
interface ScoredRun {
runId: string;
traceId: string; // -> AgentLens: the full route that produced this score
score: number; // agent-eval rubric verdict, 0..1
at: number; // epoch ms
}
interface DriftReport {
drifting: boolean;
baselineMean: number;
recentMean: number;
deltaPct: number; // how far recent has moved from baseline
zScore: number; // is the move bigger than normal run-to-run noise?
sampleTraceIds: string[]; // worst recent runs, for AgentLens drill-in
}
function mean(xs: number[]): number {
return xs.reduce((a, b) => a + b, 0) / xs.length;
}
function stdev(xs: number[], mu: number): number {
return Math.sqrt(mean(xs.map((x) => (x - mu) ** 2)));
}
// Compare a recent window against a trusted baseline window.
// Drift = the recent mean has moved further than baseline NOISE explains.
function detectDrift(baseline: ScoredRun[], recent: ScoredRun[]): DriftReport {
const baseScores = baseline.map((r) => r.score);
const recentScores = recent.map((r) => r.score);
const baselineMean = mean(baseScores);
const recentMean = mean(recentScores);
const baselineSd = stdev(baseScores, baselineMean) || 1e-9;
// Standard error of the recent window's mean, scaled by baseline noise.
// This asks: is this gap real, or just the sample size talking?
const se = baselineSd / Math.sqrt(recentScores.length);
const zScore = (recentMean - baselineMean) / se;
const deltaPct = ((recentMean - baselineMean) / baselineMean) * 100;
// Flag when quality dropped AND the drop is statistically meaningful.
// z < -3 ~ a one-sided drop well outside normal run-to-run wobble.
const drifting = zScore < -3 && deltaPct < -2;
const sampleTraceIds = [...recent]
.sort((a, b) => a.score - b.score)
.slice(0, 5)
.map((r) => r.traceId);
return { drifting, baselineMean, recentMean, deltaPct, zScore, sampleTraceIds };
}
// Roll the windows forward continuously, not on deploy.
async function checkProductionDrift(): Promise<DriftReport> {
const baseline = await queryScoredRuns({ from: "-30d", to: "-7d" });
const recent = await queryScoredRuns({ from: "-7d", to: "now" });
return detectDrift(baseline, recent);
}
Two design decisions carry the whole approach.
It compares against baseline noise, not an absolute floor. The zScore
is the entire trick. Every agent's scores wobble run-to-run — that is normal nondeterminism, not decay. By dividing the drop by the standard error of the recent window, you only fire when the move is bigger than the agent's own natural jitter. A 1% dip on a noisy agent is nothing; the same dip on a rock-steady one is a five-alarm signal. An absolute threshold cannot tell those apart.
It emits sampleTraceIds, not just a verdict. A boolean
drifting: true
is where most homegrown detectors stop, and it's why they get ignored — nobody can act on it. By attaching the five worst recent runs' trace IDs, the alert carries its own evidence: you open those AgentLens traces and read the resolved inputs and tool outputs that produced the low scores. That is the difference between "quality is down, somebody investigate" and "quality is down, and here is the retrieval step that started returning stale documents."One trap worth calling out, because it produces the most confusing drift incidents: a healthy aggregate can hide a brutal per-segment collapse. Your overall score holds at 0.90 while your Spanish-language traffic quietly craters from 0.88 to 0.61 — masked because it's only 8% of volume and the other 92% is fine. The aggregate is technically accurate and completely useless.
So slice the baseline along the dimensions that actually vary — language, tool path, user tier, intent — and run the same drift check per slice.
async function driftBySegment(segmentBy: (r: ScoredRun) => string) {
const baseline = await queryScoredRuns({ from: "-30d", to: "-7d" });
const recent = await queryScoredRuns({ from: "-7d", to: "now" });
const group = (runs: ScoredRun[]) => {
const m = new Map<string, ScoredRun[]>();
for (const r of runs) (m.get(segmentBy(r)) ?? m.set(segmentBy(r), []).get(segmentBy(r))!).push(r);
return m;
};
const baseGroups = group(baseline);
const recentGroups = group(recent);
for (const [seg, recentRuns] of recentGroups) {
const baseRuns = baseGroups.get(seg);
if (!baseRuns || recentRuns.length < 20) continue; // need signal to call it
const report = detectDrift(baseRuns, recentRuns);
if (report.drifting) {
console.warn(
`DRIFT [${seg}] ${report.baselineMean.toFixed(3)} -> ` +
`${report.recentMean.toFixed(3)} (${report.deltaPct.toFixed(1)}%) ` +
`traces: ${report.sampleTraceIds.join(", ")}`,
);
}
}
}
The most dangerous drift hides inside an average. Segmenting the baseline drags it into the light, and the per-segment trace IDs tell you, via AgentLens, exactly which step does badly on those inputs.
You don't need a statistics PhD or a platform team to start — you need a baseline and a trend:
The agents are not going to crash on their way down. They will keep answering, keep returning 200s, keep looking healthy, and get quietly worse until the decay is large enough for a human to notice — the most expensive possible detector. Score the output with agent-eval, keep the route with AgentLens, trend the two against a baseline, and you catch the slope while it's still two percent instead of explaining the cliff to a customer who found it first.