You Can't Reproduce Your Agent's Bugs—That's Why You Can't Fix Them

wpnews.pro

Here is a bug report I have received, in some form, at every company running agents in production:

"The agent gave a customer a wrong refund amount yesterday around 2pm. Can you look into it?"

Here is how that investigation goes when your stack isn't built for it: you find the timestamp, re-run the same prompt, and it works perfectly. Correct refund, every time. You change nothing; it keeps being right. Eventually you write "could not reproduce — will monitor," which is a professional way of saying you gave up.

This is the failure I think is most under-discussed in the whole agent space. Not hallucination, not drift, not cost. Irreproducibility. A bug you cannot reproduce is a bug you cannot fix, cannot test, and cannot prove you've fixed. Agents are, by nature, the most irreproducible software most of us have ever shipped.

The opinion I'll defend: reproducibility is the precondition for every other quality practice you claim to have. Your evals, regression tests, and CI gates all assume you can take a real failure and run it again on demand. If you can't, that apparatus is built on sand.

For a normal backend bug, reproduction is mostly free: same request, same row, same code, same bug. We've built an instinct that says if I run it again with the same inputs, I'll see the same thing. For agents that instinct is wrong in four independent ways, each enough to break reproduction alone:

Only the first is about the model's randomness. The other three are about inputs you failed to capture — which means the fix is mostly engineering discipline, not a model problem. You can't make production deterministic, but you can make a run replayable, and those are very different goals.

The instinct is to chase determinism: temperature: 0, pin every version, freeze the world. That's a trap. A temperature-zero agent is often a worse agent, and you still haven't captured the tool outputs, so you still can't reproduce a past failure. Determinism is something you'd impose on all of production forever. Replayability is a property you attach to each run as it happens, and it's strictly more powerful: it reconstructs that specific failure no matter how nondeterministic production was.

To replay a run you must have captured, when it happened: the resolved input (the exact bytes the model saw after templating), every tool call's raw output (what the APIs returned then), and the execution parameters (model id and version, temperature, seed, system prompt).
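To make "captured, when it happened" concrete, here is a minimal sketch of recording tool outputs at call time. The RunRecord shape and the recorded wrapper are names I'm making up for illustration; they are not the AgentLens schema, and the deep copy assumes your tool inputs and outputs are JSON-serializable.

```typescript
// Hypothetical shapes for illustration only; not the AgentLens trace schema.
interface ToolCall { name: string; input: unknown; output: unknown }

interface RunRecord {
  resolvedPrompt: string; // exact text the model saw after templating
  params: { model: string; temperature: number; seed?: number; systemPrompt: string };
  toolCalls: ToolCall[];  // raw outputs, in the order the calls happened
}

type Tool = (input: unknown) => Promise<unknown>;

// Wrap a live tool so every call is appended to the record as it happens.
function recorded(name: string, tool: Tool, record: RunRecord): Tool {
  return async (input: unknown) => {
    const output = await tool(input);
    // Deep-copy via JSON so later mutation by the agent can't alter the record.
    // Assumes input and output are JSON-serializable.
    record.toolCalls.push(JSON.parse(JSON.stringify({ name, input, output })));
    return output;
  };
}
```

The point of wrapping at the call site is that the record is written at the moment the API answered, not reconstructed afterwards from logs that may have dropped or truncated the payload.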

This is the seam where the two tools I lean on operate as one unit, because reproduction needs both a record and a verdict. AgentLens captures the trace — every model and tool step, the resolved inputs, the raw outputs, the parameters — the raw material a replay is rebuilt from. agent-eval is the other half: it takes that captured run, re-executes it under pinned conditions, and scores whether the bug is present. AgentLens makes the failure replayable; agent-eval makes the replay a pass/fail test you can gate on. A trace with no scorer is an archive you read by hand; a scorer with no trace is grading a prompt you've already lost.

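import { getTrace } from "agentlens";
import { evaluate, assert } from "agent-eval";

// runAgent is your own agent entry point and is not shown in this article.
// The signature below is inferred from how it is called further down; it is
// declared here only so the listing type-checks.
declare function runAgent(
  prompt: string,
  opts: {
    model: string;
    temperature: number;
    seed?: number;
    toolResolver: (name: string, input: unknown) => unknown;
  },
): Promise<string>;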

interface ReplayBundle {
  resolvedPrompt: string;                 // exact bytes the model saw at 2pm
  params: { model: string; temperature: number; seed?: number };
  toolReplays: Record<string, unknown>;   // call signature -> raw output THEN
}

// Pull a real failure out of AgentLens into a self-contained replay bundle.
async function bundleFromTrace(traceId: string): Promise<ReplayBundle> {
  const trace = await getTrace(traceId);
  const model = trace.steps.find((s) => s.kind === "model");
  if (!model) throw new Error(`no model step in ${traceId}`);

  // Freeze each tool's output by call signature, so a replay returns
  // yesterday's values instead of hitting today's moved-on world.
  const toolReplays: Record<string, unknown> = {};
  for (const s of trace.steps.filter((s) => s.kind === "tool")) {
    toolReplays[`${s.name}:${JSON.stringify(s.input)}`] = s.output;
  }
  return {
    resolvedPrompt: model.input,                                 // RESOLVED input
    params: { model: model.model, temperature: model.temperature, seed: model.seed },
    toolReplays,
  };
}

// Re-run the agent against the captured reality. Tools are stubbed to replay
// recorded outputs, so the ONLY thing that can vary is the agent itself.
async function replay(b: ReplayBundle): Promise<string> {
  return runAgent(b.resolvedPrompt, {
    ...b.params,
    toolResolver: (name: string, input: unknown) =>
      b.toolReplays[`${name}:${JSON.stringify(input)}`],
  });
}

// Turn the reproduced failure into a permanent regression eval.
async function lockAsRegression(traceId: string, mustNotContain: string[]) {
  const b = await bundleFromTrace(traceId);
  const output = await replay(b);
  return evaluate({
    input: b.resolvedPrompt,
    output,
    checks: [
      assert.notContains(mustNotContain),  // the bogus refund amount, forbidden forever
      assert.judge({ criterion: "refund amount matches the tool result", threshold: 0.8 }),
    ],
    metadata: { sourceTraceId: traceId },  // provenance back to the real incident
  });
}

Two decisions do all the work. Tool outputs are replayed, not re-fetched: the toolResolver hands the agent yesterday's recorded responses instead of calling the live API. If your replay re-queries the database, you're testing a world that has changed, and any "fix" you observe might just be the data moving again. Pinning tool outputs isolates the one variable you want to study. And the resolved prompt is replayed, not the template: the subtly-wrong retrieved document or the unusual account state lived in the resolved input, and nowhere else.
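One hardening worth considering for the toolResolver: a plain JSON.stringify key depends on property order, a missing recording silently returns undefined, and two identical calls with different outputs collapse into one entry. The sketch below is one way to tighten that, under my own assumptions; stableStringify and strictResolver are hypothetical helpers, not part of AgentLens or agent-eval.

```typescript
// Hypothetical helpers for illustration; not part of AgentLens or agent-eval.
interface RecordedCall { name: string; input: unknown; output: unknown }

// Serialize with object keys sorted, so {a:1,b:2} and {b:2,a:1} give the same key.
// Undefined values are written as null.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(stableStringify).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const body = Object.keys(obj)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${stableStringify(obj[k])}`)
      .join(",");
    return `{${body}}`;
  }
  return JSON.stringify(value) ?? "null";
}

// Build a resolver that replays recorded outputs and fails loudly on a miss.
function strictResolver(calls: RecordedCall[]) {
  const queues = new Map<string, unknown[]>();
  for (const c of calls) {
    const key = `${c.name}:${stableStringify(c.input)}`;
    const queue = queues.get(key) ?? [];
    queue.push(c.output);
    queues.set(key, queue);
  }
  return (name: string, input: unknown): unknown => {
    const key = `${name}:${stableStringify(input)}`;
    const queue = queues.get(key);
    // A miss means the replay left the recorded path; surface it, don't hide it.
    if (!queue || queue.length === 0) throw new Error(`no recorded output for ${key}`);
    // Serve repeated identical calls in recorded order, then keep the last one.
    return queue.length > 1 ? queue.shift() : queue[0];
  };
}
```

Throwing on a miss is a deliberate choice: an agent that asks for a tool call it never made in the original run has already diverged, and that is information you want, not an undefined you want swallowed.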

A subtlety I won't skip: if the failure happened on a high-temperature reasoning path, replaying once might reproduce it or might not, because you'll roll a different path. For nondeterministic failures a single replay isn't a reproduction; it's one sample.

The honest technique is to replay the bundle N times and measure the failure rate. A bug in 8 of 50 replays is reproduced — you've proven it's real and quantified it — even though no single run is guaranteed to show it. Your fix isn't "the replay passed once"; it's "the rate across 50 replays went from 16% to 0%."

async function reproductionRate(traceId: string, isBug: (o: string) => boolean, n = 50) {
  const b = await bundleFromTrace(traceId);
  const runs = await Promise.all(Array.from({ length: n }, () => replay(b)));
  const hits = runs.filter(isBug).length;
  return { rate: hits / n, hits, n }; // e.g. { rate: 0.16, hits: 8, n: 50 }
}
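A caution on reading "0 of 50": a clean batch bounds the failure rate, it doesn't prove the rate is zero. The sketch below shows the arithmetic, assuming replays are independent draws with a fixed failure probability, which is my simplification and not something the tooling guarantees. Both function names are hypothetical.

```typescript
// Probability that all n replays pass if the true failure rate is still p.
// Assumes independent replays with a fixed failure probability.
function probAllPass(p: number, n: number): number {
  return Math.pow(1 - p, n);
}

// Largest true failure rate still consistent with 0 failures in n replays
// at the given confidence (one-sided bound: 1 - (1 - confidence)^(1/n)).
function zeroFailureUpperBound(n: number, confidence = 0.95): number {
  return 1 - Math.pow(1 - confidence, 1 / n);
}

// If the bug were still at 16%, a clean batch of 50 would be very unlikely:
console.log(probAllPass(0.16, 50));     // roughly 0.00016
// But 0 of 50 only rules out rates above about 5.8% at 95% confidence:
console.log(zeroFailureUpperBound(50)); // roughly 0.058
```

So 50 clean replays are strong evidence the 16% bug is gone, and weak evidence against a 2% bug. If the failure you're chasing is rare, size n accordingly.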

This reframes the bug-fixing loop. You don't fix a failure and eyeball it once; you capture it, establish its rate, change the agent, prove the rate dropped, and keep the replay in your suite so it can never silently climb back. agent-eval runs the bundle; the AgentLens trace behind it tells you, when a replay does fail, which step diverged from the recorded path.
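"Which step diverged" can be as simple as walking the recorded and replayed step lists side by side. This is a minimal sketch under my own assumptions: the Step shape is hypothetical, and comparing inputs with JSON.stringify is a simplification that is sensitive to property order.

```typescript
// Hypothetical step shape for illustration; not the AgentLens trace schema.
interface Step { kind: "model" | "tool"; name: string; input: unknown }

// Index of the first step where the replay left the recorded path, or -1 if
// the two sequences match step for step.
function firstDivergence(recorded: Step[], replayed: Step[]): number {
  const shared = Math.min(recorded.length, replayed.length);
  for (let i = 0; i < shared; i++) {
    const a = recorded[i];
    const b = replayed[i];
    const sameInput = JSON.stringify(a.input) === JSON.stringify(b.input);
    if (a.kind !== b.kind || a.name !== b.name || !sameInput) return i;
  }
  // One run stopped early or kept going: the divergence is where the shorter ends.
  return recorded.length === replayed.length ? -1 : shared;
}
```

The index is a starting point for reading the trace, not a diagnosis: the step that first differs is where the symptom appears, and the cause is often in the model output just before it.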

You don't need production to be deterministic. You need every run to be reconstructable:

The agents will keep producing failures you can't explain from the symptom alone. The difference between a team that fixes them and one that closes tickets with "could not reproduce" isn't model quality or prompt skill — it's whether the run left behind enough of itself to be run again. Capture the trace with AgentLens, replay-and-score it with agent-eval, and "could not reproduce" stops being a sentence you're allowed to write.

source & further reading

dev.to (original article)
