Your AI Agent Drifted Last Night and You Didn't Notice

wpnews.pro

Your agent passed every test in CI. It ran fine in staging. Then it quietly started returning subtly wrong answers in production at 2 AM, and nobody noticed until a customer complained three days later.

This is agent drift — the gradual degradation of agent output quality without any hard failures. No exceptions thrown. No schema violations. Just slowly worsening responses that slip past your monitoring.

Here's my thesis: the hardest production failures aren't crashes — they're quality degradation that happens between your eval checkpoints. You need continuous runtime detection, not just pre-deployment testing.

After running production agents 24/7, I've identified three distinct drift patterns: context staleness, behavioral drift, and a creeping hallucination rate.

The first pattern is context staleness. Your agent retrieves documents, knowledge-base entries, or API responses as context, and that context decays:

interface StalenessDetector {
  source: string;
  maxAgeMs: number;
  check: (context: RetrievedContext) => StalenessResult;
}

const stalenessChecks: StalenessDetector[] = [
  {
    source: 'knowledge-base',
    maxAgeMs: 24 * 60 * 60 * 1000, // 24 hours
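    check: (ctx) => {
      const now = Date.now();
      const chunkMaxAgeMs = 7 * 24 * 60 * 60 * 1000; // 7 days
      // Index age is compared against the 24-hour maxAgeMs declared above.
      const indexStale = now - ctx.lastIndexedAt > 24 * 60 * 60 * 1000;
      const staleChunks = ctx.chunks.filter(c =>
        now - c.sourceLastModified > chunkMaxAgeMs
      );
      // Guard against an empty retrieval: avoids 0/0 and Math.max() of nothing.
      const staleFraction = staleChunks.length / Math.max(ctx.chunks.length, 1);
      return {
        stale: indexStale || staleFraction > 0.3,
        staleFraction,
        oldestChunkAge: ctx.chunks.length > 0
          ? Math.max(...ctx.chunks.map(c => now - c.sourceLastModified))
          : 0,
        recommendation: staleChunks.length > 0
          ? `${staleChunks.length} chunks older than 7 days`
          : indexStale ? 'index older than 24 hours' : 'fresh'
      };
    }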
  },
  {
    source: 'api-response-cache',
    maxAgeMs: 60 * 60 * 1000, // 1 hour
    check: (ctx) => {
      const cached = ctx.apiResponses.filter(r => r.fromCache);
      const expired = cached.filter(r => Date.now() - r.cachedAt > 60 * 60 * 1000);
      return {
        stale: expired.length > 0,
        staleFraction: expired.length / Math.max(cached.length, 1),
        recommendation: expired.length > 0
          ? `${expired.length} cached API responses expired`
          : 'fresh'
      };
    }
  }
];

The insidious part: stale context doesn't cause errors. Your agent happily generates confident answers based on outdated information. The output looks fine — it's just wrong.
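To make that failure mode concrete, here is a minimal worked example of the knowledge-base rule with a fixed clock. It is a sketch, not the production detector: the `Chunk`, `RetrievedContext`, and `StalenessResult` shapes are assumptions inferred from the snippet above, and `checkKnowledgeBase` is a hypothetical helper that takes the current time as a parameter so the result is reproducible.

```typescript
// Hypothetical shapes: the article never defines these types, so the fields
// are assumptions inferred from how the detectors read them.
interface Chunk { sourceLastModified: number } // epoch milliseconds
interface RetrievedContext { lastIndexedAt: number; chunks: Chunk[] }
interface StalenessResult {
  stale: boolean;
  staleFraction: number;
  recommendation: string;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Same chunk-age rule as the knowledge-base detector, but `now` is a parameter
// so the result does not depend on when the check happens to run.
function checkKnowledgeBase(
  ctx: RetrievedContext,
  now: number,
  maxChunkAgeDays: number = 7,
  maxStaleFraction: number = 0.3
): StalenessResult {
  const staleCount = ctx.chunks.filter(
    c => now - c.sourceLastModified > maxChunkAgeDays * DAY_MS
  ).length;
  // An empty retrieval counts as 0% stale rather than NaN.
  const staleFraction = staleCount / Math.max(ctx.chunks.length, 1);
  return {
    stale: staleFraction > maxStaleFraction,
    staleFraction,
    recommendation: staleCount > 0
      ? `${staleCount} chunks older than ${maxChunkAgeDays} days`
      : 'fresh',
  };
}

// Fixed clock and four chunks aged 1, 3, 10, and 20 days.
const fixedNow = Date.UTC(2024, 0, 31);
const sampleContext: RetrievedContext = {
  lastIndexedAt: fixedNow - 2 * DAY_MS,
  chunks: [1, 3, 10, 20].map(days => ({ sourceLastModified: fixedNow - days * DAY_MS })),
};

// Two of four chunks exceed 7 days, so staleFraction is 0.5 and the context is flagged.
const stalenessResult = checkKnowledgeBase(sampleContext, fixedNow);
const emptyResult = checkKnowledgeBase({ lastIndexedAt: fixedNow, chunks: [] }, fixedNow);
```

Nothing in this context would raise an exception at retrieval time. The only signal is the computed fraction, which is why the check has to run explicitly before the agent does.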

The second pattern is behavioral drift: the agent's response patterns shift over time. Maybe the underlying model got a silent update. Maybe prompt injection attempts are subtly reshaping behavior. Maybe token distributions are shifting due to accumulated conversation context.

interface DriftBaseline {
  dimension: string;
  expectedDistribution: { mean: number; stddev: number };
  windowSize: number;
}

class BehavioralDriftDetector {
  private baselines: Map<string, DriftBaseline> = new Map();
  private observations: Map<string, number[]> = new Map();

  observe(dimension: string, value: number): DriftAlert | null {
    const baseline = this.baselines.get(dimension);
    if (!baseline) return null;

    const window = this.observations.get(dimension) || [];
    window.push(value);
    if (window.length > baseline.windowSize) window.shift();
    this.observations.set(dimension, window);

    if (window.length < baseline.windowSize * 0.5) return null;

    const currentMean = window.reduce((a, b) => a + b, 0) / window.length;
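    // A zero-variance baseline would divide by zero; skip it rather than report Infinity.
    const { mean, stddev } = baseline.expectedDistribution;
    if (stddev <= 0) return null;
    const zScore = Math.abs(currentMean - mean) / stddev;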

    if (zScore > 2.5) {
      return {
        dimension,
        severity: zScore > 4 ? 'critical' : 'warning',
        currentMean,
        expectedMean: baseline.expectedDistribution.mean,
        zScore,
        message: `${dimension} drifted ${zScore.toFixed(1)} sigma from baseline`
      };
    }
    return null;
  }
}

The key insight: you're not evaluating individual outputs. You're evaluating the distribution of outputs over time. A single long response means nothing. A gradual increase in average response length across 100 runs? That's signal.
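A small numeric sketch shows the difference between one outlier and a sustained shift. The helpers `fitBaseline` and `windowZScore` are hypothetical names, the response lengths are made-up illustration data, and the z-score uses the same formula as the detector above (window mean against baseline mean, in baseline standard deviations).

```typescript
interface Distribution { mean: number; stddev: number }

function average(values: number[]): number {
  return values.reduce((a, b) => a + b, 0) / values.length;
}

// Fit a baseline from a calibration sample using the sample (n - 1) standard deviation.
function fitBaseline(samples: number[]): Distribution {
  if (samples.length < 2) throw new Error('need at least two calibration samples');
  const m = average(samples);
  const variance =
    samples.reduce((acc, x) => acc + (x - m) * (x - m), 0) / (samples.length - 1);
  return { mean: m, stddev: Math.sqrt(variance) };
}

// Distance of the window mean from the baseline mean, in baseline standard deviations.
function windowZScore(values: number[], baseline: Distribution): number {
  return Math.abs(average(values) - baseline.mean) / baseline.stddev;
}

// Hypothetical response lengths (tokens) from a known-good period.
const calibration = [180, 200, 220, 190, 210, 200, 195, 205];
const lengthBaseline = fitBaseline(calibration); // mean 200, stddev sqrt(150), about 12.25

// One 400-token outlier among otherwise normal runs.
const oneOutlier = [200, 200, 200, 200, 200, 200, 200, 200, 200, 400];
// Every run 40 tokens longer than the baseline mean.
const sustainedShift = [240, 240, 240, 240, 240, 240, 240, 240, 240, 240];

const outlierZ = windowZScore(oneOutlier, lengthBaseline);   // about 1.63, below the 2.5 threshold
const shiftZ = windowZScore(sustainedShift, lengthBaseline); // about 3.27, above it
```

The single 400-token response moves the window mean by only 20 tokens, so it stays under the 2.5 threshold. The modest but consistent 40-token increase crosses it.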

The third pattern is a creeping hallucination rate. Hallucination rates aren't constant. They vary with input complexity, context quality, and model state. The dangerous pattern: the hallucination rate slowly climbs from 2% to 8% over a week, crossing your acceptable threshold without ever triggering a single hard failure.

interface HallucinationCanary {
  name: string;
  detect: (output: AgentOutput, groundTruth: GroundTruth) => HallucinationSignal;
}

const canaries: HallucinationCanary[] = [
  {
    name: 'entity-grounding',
    detect: (output, truth) => {
      const claimedEntities = extractEntities(output.raw);
      const groundedEntities = extractEntities(truth.sourceDocuments);
      const ungrounded = claimedEntities.filter(e => 
        !groundedEntities.some(g => semanticMatch(e, g, 0.85))
      );
      return {
        hallucinated: ungrounded.length > 0,
        ungroundedEntities: ungrounded,
        groundingRate: 1 - (ungrounded.length / Math.max(claimedEntities.length, 1))
      };
    }
  },
  {
    name: 'numeric-consistency',
    detect: (output, truth) => {
      const claimedNumbers = extractNumericClaims(output.raw);
      const sourceNumbers = extractNumericClaims(truth.sourceDocuments);
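      const inconsistent = claimedNumbers.filter(claim =>
        !sourceNumbers.some(src =>
          src.entity === claim.entity &&
          // 5% relative tolerance. Taking the absolute value of the source keeps
          // negative sources from matching everything; a zero source needs an exact match.
          (src.value === 0
            ? claim.value === 0
            : Math.abs(src.value - claim.value) / Math.abs(src.value) < 0.05)
        )
      );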
      return {
        hallucinated: inconsistent.length > 0,
        inconsistentClaims: inconsistent,
        consistencyRate: 1 - (inconsistent.length / Math.max(claimedNumbers.length, 1))
      };
    }
  }
];
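The canaries above judge one output at a time, so catching a climb from 2% to 8% needs an aggregate on top of them. The following is one possible sketch, not part of the original design: `RollingHallucinationRate` is a hypothetical class, and the window size of 50 and threshold of 5% are illustrative assumptions.

```typescript
interface RateSnapshot { rate: number; samples: number; breached: boolean }

// Tracks the fraction of sampled outputs that any canary flagged, over a sliding window.
class RollingHallucinationRate {
  private results: boolean[] = [];
  private windowSize: number;
  private threshold: number;

  constructor(windowSize: number, threshold: number) {
    this.windowSize = windowSize;
    this.threshold = threshold;
  }

  record(hallucinated: boolean): RateSnapshot {
    this.results.push(hallucinated);
    if (this.results.length > this.windowSize) this.results.shift();
    const hits = this.results.filter(r => r).length;
    const rate = hits / this.results.length;
    // Only judge a full window, so a single early hit cannot trip the alert.
    const breached = this.results.length === this.windowSize && rate > this.threshold;
    return { rate, samples: this.results.length, breached };
  }
}

const tracker = new RollingHallucinationRate(50, 0.05);
let snapshot: RateSnapshot = { rate: 0, samples: 0, breached: false };

// Healthy period: 1 flagged output in 50 sampled runs (2%).
for (let i = 0; i < 50; i++) snapshot = tracker.record(i === 0);
const healthy = snapshot;

// Degraded period: 4 flagged outputs in the next 50 sampled runs (8%).
for (let i = 0; i < 50; i++) snapshot = tracker.record(i % 13 === 0);
const degraded = snapshot;
```

In this example `healthy.breached` is false at a 2% rate and `degraded.breached` is true at 8%, even though no individual run failed hard.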

Detection is one half. The other half is what you do about it. Here's the runtime loop I've converged on:

async function monitorAgentRun(run: AgentRun): Promise<MonitorResult> {
  // 1. Pre-execution: Check context freshness
  const stalenessResults = await checkStaleness(run.context);
  if (stalenessResults.some(r => r.stale)) {
    await refreshStaleContext(run, stalenessResults);
  }

  // 2. Post-execution: Lightweight drift check on every run
  const driftAlerts = trackDimensions(run.output, {
    responseLength: estimateTokens(run.output.raw),
    toolCalls: run.output.toolCallCount,
    latency: run.durationMs,
    confidenceProxy: run.output.metadata?.confidence ?? null
  });

  // 3. Sampled: Hallucination canary (expensive, run on 10% sample)
  let hallucinationResult = null;
  if (Math.random() < 0.1 && run.groundTruthAvailable) {
    hallucinationResult = await runCanaries(run.output, run.groundTruth);
  }

  // 4. Alert on threshold breach
  if (driftAlerts.some(a => a.severity === 'critical')) {
    await alertOncall('agent-drift-critical', driftAlerts);
  }

  return { stalenessResults, driftAlerts, hallucinationResult };
}

Notice the layering: staleness checks are pre-execution (you can fix stale context before the agent runs). Drift detection is post-execution and cheap (runs on every invocation). Hallucination canaries are expensive and sampled.

Three mistakes I made building this:

1. Alerting on individual outliers. An agent producing one long response isn't drift. I burned weeks chasing false positives before switching to windowed statistical detection.

2. Not versioning baselines. When you intentionally change agent behavior (new prompt, new model), your baselines need to reset. Otherwise every intentional improvement triggers drift alerts.

3. Treating hallucination as binary. "The agent hallucinated" is useless. What did it hallucinate? Entities? Numbers? URLs? The category determines the fix.
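For the second mistake, one way to version baselines is to key them by the agent configuration that produced them. This is a minimal sketch under assumptions: `AgentVersion`, `VersionedBaselines`, and the model and prompt identifiers are hypothetical names, and a real system would likely persist the store instead of holding it in memory.

```typescript
interface Distribution { mean: number; stddev: number }
interface AgentVersion { modelId: string; promptVersion: string }

// Baselines are stored per (model, prompt, dimension), so changing the prompt
// or the model starts from an empty baseline instead of inheriting the old one.
class VersionedBaselines {
  private store = new Map<string, Distribution>();

  private key(version: AgentVersion, dimension: string): string {
    return [version.modelId, version.promptVersion, dimension].join('|');
  }

  set(version: AgentVersion, dimension: string, baseline: Distribution): void {
    this.store.set(this.key(version, dimension), baseline);
  }

  // Returns undefined for a version with no baseline yet, which the caller
  // should treat as "still calibrating" rather than as drift.
  get(version: AgentVersion, dimension: string): Distribution | undefined {
    return this.store.get(this.key(version, dimension));
  }
}

const baselines = new VersionedBaselines();
const v1: AgentVersion = { modelId: 'model-a', promptVersion: 'prompt-v1' };
const v2: AgentVersion = { modelId: 'model-a', promptVersion: 'prompt-v2' };

baselines.set(v1, 'responseLength', { mean: 200, stddev: 12 });

const oldBaseline = baselines.get(v1, 'responseLength'); // the stored v1 baseline
const newBaseline = baselines.get(v2, 'responseLength'); // undefined: recalibrate before alerting
```

This pairs naturally with the detector shown earlier, which already returns `null` when a dimension has no baseline, so a freshly deployed prompt collects observations quietly until a new baseline is fitted.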

If you're running agents in production without drift monitoring, start here:

Most teams discover drift through customer complaints. By the time a user says "your AI gave me wrong information," you've likely been serving degraded responses for days or weeks.

The gap between "my agent works" and "my agent works reliably" is entirely about what happens between your evaluation checkpoints. Continuous monitoring isn't optional — it's the difference between running a demo and running a product.

How are you detecting drift in your production agents? Or are you still finding out from users? I'm curious what signals have been most useful for early detection.

Source: dev.to (original article)
