You Can't Reproduce Your Agent's Bugs—That's Why You Can't Fix Them A developer argues that irreproducibility is the most under-discussed failure in the agent space, as bugs that cannot be reproduced cannot be fixed or tested. They advocate for replayability over determinism, using tools like AgentLens and agent-eval to capture traces and score replays, enabling reliable debugging and regression testing. Here is a bug report I have received, in some form, at every company running agents in production: "The agent gave a customer a wrong refund amount yesterday around 2pm. Can you look into it?" Here is how that investigation goes when your stack isn't built for it: you find the timestamp, re-run the same prompt, and it works perfectly. Correct refund, every time. You change nothing; it keeps being right. Eventually you write "could not reproduce — will monitor," which is a professional way of saying you gave up. This is the failure I think is most under-discussed in the whole agent space. Not hallucination, not drift, not cost. Irreproducibility. A bug you cannot reproduce is a bug you cannot fix, cannot test, and cannot prove you've fixed. Agents are, by nature, the most irreproducible software most of us have ever shipped. The opinion I'll defend: reproducibility is the precondition for every other quality practice you claim to have. Your evals, regression tests, and CI gates all assume you can take a real failure and run it again on demand. If you can't, that apparatus is built on sand. For a normal backend bug, reproduction is mostly free: same request, same row, same code, same bug. We've built an instinct that says if I run it again with the same inputs, I'll see the same thing. For agents that instinct is wrong in four independent ways, each enough to break reproduction alone: Only the first is about the model's randomness. The other three are about inputs you failed to capture — which means the fix is mostly engineering discipline, not a model problem. You can't make production deterministic, but you can make a run replayable , and those are very different goals. The instinct is to chase determinism: temperature: 0 , pin every version, freeze the world. That's a trap. A temperature-zero agent is often a worse agent, and you still haven't captured the tool outputs, so you still can't reproduce a past failure. Determinism is something you'd impose on all of production forever. Replayability is a property you attach to each run as it happens, and it's strictly more powerful: it reconstructs that specific failure no matter how nondeterministic production was. To replay a run you must have captured, when it happened: the resolved input the exact bytes the model saw after templating , every tool call's raw output what the APIs returned then , and the execution parameters model id and version, temperature, seed, system prompt . This is the seam where the two tools I lean on operate as one unit, because reproduction needs both a record and a verdict. AgentLens captures the trace — every model and tool step, the resolved inputs, the raw outputs, the parameters — the raw material a replay is rebuilt from. agent-eval is the other half: it takes that captured run, re-executes it under pinned conditions, and scores whether the bug is present. AgentLens makes the failure replayable ; agent-eval makes the replay a pass/fail test you can gate on . A trace with no scorer is an archive you read by hand; a scorer with no trace is grading a prompt you've already lost. js import { getTrace } from "agentlens"; import { evaluate, assert } from "agent-eval"; interface ReplayBundle { resolvedPrompt: string; // exact bytes the model saw at 2pm params: { model: string; temperature: number; seed?: number }; toolReplays: Record