You Can't Reproduce Your Agent's Bugs—That's Why You Can't Fix Them

A developer argues that irreproducibility is the most under-discussed failure in the agent space, as bugs that cannot be reproduced cannot be fixed or tested. They advocate for replayability over determinism, using tools like AgentLens and agent-eval to capture traces and score replays, enabling reliable debugging and regression testing.

Here is a bug report I have received, in some form, at every company running agents in production: "The agent gave a customer a wrong refund amount yesterday around 2pm. Can you look into it?" Here is how that investigation goes when your stack isn't built for it: you find the timestamp, re-run the same prompt, and it works perfectly. Correct refund, every time. You change nothing; it keeps being right. Eventually you write "could not reproduce — will monitor," which is a professional way of saying you gave up. This is the failure I think is most under-discussed in the whole agent space. Not hallucination, not drift, not cost. Irreproducibility. A bug you cannot reproduce is a bug you cannot fix, cannot test, and cannot prove you've fixed. Agents are, by nature, the most irreproducible software most of us have ever shipped. The opinion I'll defend: reproducibility is the precondition for every other quality practice you claim to have. Your evals, regression tests, and CI gates all assume you can take a real failure and run it again on demand. If you can't, that apparatus is built on sand. For a normal backend bug, reproduction is mostly free: same request, same row, same code, same bug. We've built an instinct that says if I run it again with the same inputs, I'll see the same thing. For agents that instinct is wrong in four independent ways, each enough to break reproduction alone: Only the first is about the model's randomness. The other three are about inputs you failed to capture — which means the fix is mostly engineering discipline, not a model problem. You can't make production deterministic, but you can make a run replayable , and those are very different goals. The instinct is to chase determinism: temperature: 0 , pin every version, freeze the world. That's a trap. A temperature-zero agent is often a worse agent, and you still haven't captured the tool outputs, so you still can't reproduce a past failure. Determinism is something you'd impose on all of production forever. Replayability is a property you attach to each run as it happens, and it's strictly more powerful: it reconstructs that specific failure no matter how nondeterministic production was. To replay a run you must have captured, when it happened: the resolved input the exact bytes the model saw after templating , every tool call's raw output what the APIs returned then , and the execution parameters model id and version, temperature, seed, system prompt . This is the seam where the two tools I lean on operate as one unit, because reproduction needs both a record and a verdict. AgentLens captures the trace — every model and tool step, the resolved inputs, the raw outputs, the parameters — the raw material a replay is rebuilt from. agent-eval is the other half: it takes that captured run, re-executes it under pinned conditions, and scores whether the bug is present. AgentLens makes the failure replayable ; agent-eval makes the replay a pass/fail test you can gate on . A trace with no scorer is an archive you read by hand; a scorer with no trace is grading a prompt you've already lost. js import { getTrace } from "agentlens"; import { evaluate, assert } from "agent-eval"; interface ReplayBundle { resolvedPrompt: string; // exact bytes the model saw at 2pm params: { model: string; temperature: number; seed?: number }; toolReplays: Record<string, unknown ; // call signature - raw output THEN } // Pull a real failure out of AgentLens into a self-contained replay bundle. async function bundleFromTrace traceId: string : Promise<ReplayBundle { const trace = await getTrace traceId ; const model = trace.steps.find s = s.kind === "model" ; if model throw new Error no model step in ${traceId} ; // Freeze each tool's output by call signature, so a replay returns // yesterday's values instead of hitting today's moved-on world. const toolReplays: Record<string, unknown = {}; for const s of trace.steps.filter s = s.kind === "tool" { toolReplays ${s.name}:${JSON.stringify s.input } = s.output; } return { resolvedPrompt: model.input, // RESOLVED input params: { model: model.model, temperature: model.temperature, seed: model.seed }, toolReplays, }; } // Re-run the agent against the captured reality. Tools are stubbed to replay // recorded outputs, so the ONLY thing that can vary is the agent itself. async function replay b: ReplayBundle : Promise<string { return runAgent b.resolvedPrompt, { ...b.params, toolResolver: name: string, input: unknown = b.toolReplays ${name}:${JSON.stringify input } , } ; } // Turn the reproduced failure into a permanent regression eval. async function lockAsRegression traceId: string, mustNotContain: string { const b = await bundleFromTrace traceId ; const output = await replay b ; return evaluate { input: b.resolvedPrompt, output, checks: assert.notContains mustNotContain , // the bogus refund amount, forbidden forever assert.judge { criterion: "refund amount matches the tool result", threshold: 0.8 } , , metadata: { sourceTraceId: traceId }, // provenance back to the real incident } ; } Two decisions do all the work. Tool outputs are replayed, not re-fetched: the toolResolver hands the agent yesterday's recorded responses instead of calling the live API. If your replay re-queries the database, you're testing a world that has changed, and any "fix" you observe might just be the data moving again — pinning tool outputs isolates the one variable you want to study. And the resolved prompt is replayed, not the template: the subtly-wrong retrieved document or the unusual account state lived in the resolved input, and nowhere else. A subtlety I won't skip: if the failure happened on a high-temperature reasoning path, replaying once might reproduce it or might not, because you'll roll a different path. For nondeterministic failures a single replay isn't a reproduction; it's one sample. The honest technique is to replay the bundle N times and measure the failure rate . A bug in 8 of 50 replays is reproduced — you've proven it's real and quantified it — even though no single run is guaranteed to show it. Your fix isn't "the replay passed once"; it's "the rate across 50 replays went from 16% to 0%." js async function reproductionRate traceId: string, isBug: o: string = boolean, n = 50 { const b = await bundleFromTrace traceId ; const runs = await Promise.all Array.from { length: n }, = replay b ; const hits = runs.filter isBug .length; return { rate: hits / n, hits, n }; // e.g. { rate: 0.16, hits: 8, n: 50 } } This reframes the bug-fixing loop. You don't fix a failure and eyeball it once; you capture it, establish its rate, change the agent, prove the rate dropped, and keep the replay in your suite so it can never silently climb back. agent-eval runs the bundle; the AgentLens trace behind it tells you, when a replay does fail, which step diverged from the recorded path. You don't need production to be deterministic. You need every run to be reconstructable : The agents will keep producing failures you can't explain from the symptom alone. The difference between a team that fixes them and one that closes tickets with "could not reproduce" isn't model quality or prompt skill — it's whether the run left behind enough of itself to be run again. Capture the trace with AgentLens, replay-and-score it with agent-eval, and "could not reproduce" stops being a sentence you're allowed to write.