# You Can't Reproduce Your Agent's Bugs—That's Why You Can't Fix Them

> Source: <https://dev.to/saurav_bhattacharya/you-cant-reproduce-your-agents-bugs-thats-why-you-cant-fix-them-223i>
> Published: 2026-06-24 01:05:27+00:00

Here is a bug report I have received, in some form, at every company running agents in production:

"The agent gave a customer a wrong refund amount yesterday around 2pm. Can you look into it?"

Here is how that investigation goes when your stack isn't built for it: you find the timestamp, re-run the same prompt, and it works perfectly. Correct refund, every time. You change nothing; it keeps being right. Eventually you write "could not reproduce — will monitor," which is a professional way of saying you gave up.

This is the failure I think is most under-discussed in the whole agent space. Not hallucination, not drift, not cost. **Irreproducibility.** A bug you cannot reproduce is a bug you cannot fix, cannot test, and cannot prove you've fixed. Agents are, by nature, the most irreproducible software most of us have ever shipped.

The opinion I'll defend: **reproducibility is the precondition for every other quality practice you claim to have.** Your evals, regression tests, and CI gates all assume you can take a real failure and run it again on demand. If you can't, that apparatus is built on sand.

For a normal backend bug, reproduction is mostly free: same request, same row, same code, same bug. We've built an instinct that says *if I run it again with the same inputs, I'll see the same thing.* For agents that instinct is wrong in four independent ways, each enough to break reproduction alone:

Only the *first* is about the model's randomness. The other three are about **inputs you failed to capture** — which means the fix is mostly engineering discipline, not a model problem. You can't make production deterministic, but you can make a run *replayable*, and those are very different goals.

The instinct is to chase determinism: `temperature: 0`

, pin every version, freeze the world. That's a trap. A temperature-zero agent is often a worse agent, and you still haven't captured the tool outputs, so you *still* can't reproduce a past failure. Determinism is something you'd impose on all of production forever. Replayability is a property you attach to each run as it happens, and it's strictly more powerful: it reconstructs *that specific failure* no matter how nondeterministic production was.

To replay a run you must have captured, when it happened: the **resolved input** (the exact bytes the model saw after templating), every **tool call's raw output** (what the APIs returned *then*), and the **execution parameters** (model id and version, temperature, seed, system prompt).

This is the seam where the two tools I lean on operate as one unit, because reproduction needs both a record and a verdict. **AgentLens captures the trace** — every model and tool step, the resolved inputs, the raw outputs, the parameters — the raw material a replay is rebuilt from. **agent-eval** is the other half: it takes that captured run, re-executes it under pinned conditions, and *scores* whether the bug is present. AgentLens makes the failure *replayable*; agent-eval makes the replay *a pass/fail test you can gate on*. A trace with no scorer is an archive you read by hand; a scorer with no trace is grading a prompt you've already lost.

``` js
import { getTrace } from "agentlens";
import { evaluate, assert } from "agent-eval";

interface ReplayBundle {
  resolvedPrompt: string;                 // exact bytes the model saw at 2pm
  params: { model: string; temperature: number; seed?: number };
  toolReplays: Record<string, unknown>;   // call signature -> raw output THEN
}

// Pull a real failure out of AgentLens into a self-contained replay bundle.
async function bundleFromTrace(traceId: string): Promise<ReplayBundle> {
  const trace = await getTrace(traceId);
  const model = trace.steps.find((s) => s.kind === "model");
  if (!model) throw new Error(`no model step in ${traceId}`);

  // Freeze each tool's output by call signature, so a replay returns
  // yesterday's values instead of hitting today's moved-on world.
  const toolReplays: Record<string, unknown> = {};
  for (const s of trace.steps.filter((s) => s.kind === "tool")) {
    toolReplays[`${s.name}:${JSON.stringify(s.input)}`] = s.output;
  }
  return {
    resolvedPrompt: model.input,                                 // RESOLVED input
    params: { model: model.model, temperature: model.temperature, seed: model.seed },
    toolReplays,
  };
}

// Re-run the agent against the captured reality. Tools are stubbed to replay
// recorded outputs, so the ONLY thing that can vary is the agent itself.
async function replay(b: ReplayBundle): Promise<string> {
  return runAgent(b.resolvedPrompt, {
    ...b.params,
    toolResolver: (name: string, input: unknown) =>
      b.toolReplays[`${name}:${JSON.stringify(input)}`],
  });
}

// Turn the reproduced failure into a permanent regression eval.
async function lockAsRegression(traceId: string, mustNotContain: string[]) {
  const b = await bundleFromTrace(traceId);
  const output = await replay(b);
  return evaluate({
    input: b.resolvedPrompt,
    output,
    checks: [
      assert.notContains(mustNotContain),  // the bogus refund amount, forbidden forever
      assert.judge({ criterion: "refund amount matches the tool result", threshold: 0.8 }),
    ],
    metadata: { sourceTraceId: traceId },  // provenance back to the real incident
  });
}
```

Two decisions do all the work. **Tool outputs are replayed, not re-fetched:** the `toolResolver`

hands the agent yesterday's recorded responses instead of calling the live API. If your replay re-queries the database, you're testing a world that has changed, and any "fix" you observe might just be the data moving again — pinning tool outputs isolates the one variable you want to study. And **the resolved prompt is replayed, not the template:** the subtly-wrong retrieved document or the unusual account state lived in the resolved input, and nowhere else.

A subtlety I won't skip: if the failure happened *on* a high-temperature reasoning path, replaying once might reproduce it or might not, because you'll roll a different path. For nondeterministic failures a single replay isn't a reproduction; it's one sample.

The honest technique is to replay the bundle *N* times and measure the failure *rate*. A bug in 8 of 50 replays is reproduced — you've proven it's real and quantified it — even though no single run is guaranteed to show it. Your fix isn't "the replay passed once"; it's "the rate across 50 replays went from 16% to 0%."

``` js
async function reproductionRate(traceId: string, isBug: (o: string) => boolean, n = 50) {
  const b = await bundleFromTrace(traceId);
  const runs = await Promise.all(Array.from({ length: n }, () => replay(b)));
  const hits = runs.filter(isBug).length;
  return { rate: hits / n, hits, n }; // e.g. { rate: 0.16, hits: 8, n: 50 }
}
```

This reframes the bug-fixing loop. You don't fix a failure and eyeball it once; you capture it, establish its rate, change the agent, prove the rate dropped, and keep the replay in your suite so it can never silently climb back. agent-eval runs the bundle; the AgentLens trace behind it tells you, when a replay *does* fail, which step diverged from the recorded path.

You don't need production to be deterministic. You need every run to be *reconstructable*:

The agents will keep producing failures you can't explain from the symptom alone. The difference between a team that fixes them and one that closes tickets with "could not reproduce" isn't model quality or prompt skill — it's whether the run left behind enough of itself to be run again. Capture the trace with AgentLens, replay-and-score it with agent-eval, and "could not reproduce" stops being a sentence you're allowed to write.
