Your AI Agent Drifted Last Night and You Didn't Notice

An engineer has identified three distinct patterns of "agent drift"—gradual degradation in AI agent output quality that occurs without hard failures or schema violations, often going unnoticed for days until a customer complains. The developer argues that the hardest production failures are not crashes but quality degradation between eval checkpoints, requiring continuous runtime detection rather than just pre-deployment testing. After running production agents 24/7, the engineer documented drift patterns including stale context from outdated knowledge bases and API caches, as well as behavioral shifts from silent model updates or prompt injection attempts.

Your agent passed every test in CI. It ran fine in staging. Then it quietly started returning subtly wrong answers in production at 2 AM, and nobody noticed until a customer complained three days later. This is agent drift — the gradual degradation of agent output quality without any hard failures. No exceptions thrown. No schema violations. Just slowly worsening responses that slip past your monitoring. Here's my thesis: the hardest production failures aren't crashes — they're quality degradation that happens between your eval checkpoints. You need continuous runtime detection, not just pre-deployment testing. After running production agents 24/7, I've identified three distinct drift patterns: Your agent retrieves documents, knowledge bases, or API responses as context. That context decays: interface StalenessDetector { source: string; maxAgeMs: number; check: context: RetrievedContext = StalenessResult; } const stalenessChecks: StalenessDetector = { source: 'knowledge-base', maxAgeMs: 24 60 60 1000, // 24 hours check: ctx = { const age = Date.now - ctx.lastIndexedAt; const staleChunks = ctx.chunks.filter c = Date.now - c.sourceLastModified 7 24 60 60 1000 ; return { stale: staleChunks.length / ctx.chunks.length 0.3, staleFraction: staleChunks.length / ctx.chunks.length, oldestChunkAge: Math.max ...ctx.chunks.map c = Date.now - c.sourceLastModified , recommendation: staleChunks.length 0 ? ${staleChunks.length} chunks older than 7 days : 'fresh' }; } }, { source: 'api-response-cache', maxAgeMs: 60 60 1000, // 1 hour check: ctx = { const cached = ctx.apiResponses.filter r = r.fromCache ; const expired = cached.filter r = Date.now - r.cachedAt 60 60 1000 ; return { stale: expired.length 0, staleFraction: expired.length / Math.max cached.length, 1 , recommendation: expired.length 0 ? ${expired.length} cached API responses expired : 'fresh' }; } } ; The insidious part: stale context doesn't cause errors. Your agent happily generates confident answers based on outdated information. The output looks fine — it's just wrong. The agent's response patterns shift over time. Maybe the underlying model got a silent update. Maybe prompt injection attempts are subtly reshaping behavior. Maybe token distributions are shifting due to accumulated conversation context. interface DriftBaseline { dimension: string; expectedDistribution: { mean: number; stddev: number }; windowSize: number; } class BehavioralDriftDetector { private baselines: Map<string, DriftBaseline = new Map ; private observations: Map<string, number = new Map ; observe dimension: string, value: number : DriftAlert | null { const baseline = this.baselines.get dimension ; if baseline return null; const window = this.observations.get dimension || ; window.push value ; if window.length baseline.windowSize window.shift ; this.observations.set dimension, window ; if window.length < baseline.windowSize 0.5 return null; const currentMean = window.reduce a, b = a + b, 0 / window.length; const zScore = Math.abs currentMean - baseline.expectedDistribution.mean / baseline.expectedDistribution.stddev; if zScore 2.5 { return { dimension, severity: zScore 4 ? 'critical' : 'warning', currentMean, expectedMean: baseline.expectedDistribution.mean, zScore, message: ${dimension} drifted ${zScore.toFixed 1 } sigma from baseline }; } return null; } } The key insight: you're not evaluating individual outputs. You're evaluating the distribution of outputs over time. A single long response means nothing. A gradual increase in average response length across 100 runs? That's signal. Hallucination rates aren't constant. They vary with input complexity, context quality, and model state. The dangerous pattern: hallucination rate slowly climbs from 2% to 8% over a week, crossing your acceptable threshold without ever triggering a single hard failure. interface HallucinationCanary { name: string; detect: output: AgentOutput, groundTruth: GroundTruth = HallucinationSignal; } const canaries: HallucinationCanary = { name: 'entity-grounding', detect: output, truth = { const claimedEntities = extractEntities output.raw ; const groundedEntities = extractEntities truth.sourceDocuments ; const ungrounded = claimedEntities.filter e = groundedEntities.some g = semanticMatch e, g, 0.85 ; return { hallucinated: ungrounded.length 0, ungroundedEntities: ungrounded, groundingRate: 1 - ungrounded.length / Math.max claimedEntities.length, 1 }; } }, { name: 'numeric-consistency', detect: output, truth = { const claimedNumbers = extractNumericClaims output.raw ; const sourceNumbers = extractNumericClaims truth.sourceDocuments ; const inconsistent = claimedNumbers.filter claim = sourceNumbers.some src = src.entity === claim.entity && Math.abs src.value - claim.value / src.value < 0.05 ; return { hallucinated: inconsistent.length 0, inconsistentClaims: inconsistent, consistencyRate: 1 - inconsistent.length / Math.max claimedNumbers.length, 1 }; } } ; Detection is one half. The other half is what you do about it. Here's the runtime loop I've converged on: async function monitorAgentRun run: AgentRun : Promise<MonitorResult { // 1. Pre-execution: Check context freshness const stalenessResults = await checkStaleness run.context ; if stalenessResults.some r = r.stale { await refreshStaleContext run, stalenessResults ; } // 2. Post-execution: Lightweight drift check on every run const driftAlerts = trackDimensions run.output, { responseLength: estimateTokens run.output.raw , toolCalls: run.output.toolCallCount, latency: run.durationMs, confidenceProxy: run.output.metadata?.confidence ?? null } ; // 3. Sampled: Hallucination canary expensive, run on 10% sample let hallucinationResult = null; if Math.random < 0.1 && run.groundTruthAvailable { hallucinationResult = await runCanaries run.output, run.groundTruth ; } // 4. Alert on threshold breach if driftAlerts.some a = a.severity === 'critical' { await alertOncall 'agent-drift-critical', driftAlerts ; } return { stalenessResults, driftAlerts, hallucinationResult }; } Notice the layering: staleness checks are pre-execution you can fix stale context before the agent runs . Drift detection is post-execution and cheap runs on every invocation . Hallucination canaries are expensive and sampled. Three mistakes I made building this: 1. Alerting on individual outliers. An agent producing one long response isn't drift. I burned weeks chasing false positives before switching to windowed statistical detection. 2. Not versioning baselines. When you intentionally change agent behavior new prompt, new model , your baselines need to reset. Otherwise every intentional improvement triggers drift alerts. 3. Treating hallucination as binary. "The agent hallucinated" is useless. What did it hallucinate? Entities? Numbers? URLs? The category determines the fix. If you're running agents in production without drift monitoring, start here: Most teams discover drift through customer complaints. By the time a user says "your AI gave me wrong information," you've likely been serving degraded responses for days or weeks. The gap between "my agent works" and "my agent works reliably" is entirely about what happens between your evaluation checkpoints. Continuous monitoring isn't optional — it's the difference between running a demo and running a product. How are you detecting drift in your production agents? Or are you still finding out from users? I'm curious what signals have been most useful for early detection.