{"slug": "you-can-t-reproduce-your-agent-s-bugs-that-s-why-you-can-t-fix-them", "title": "You Can't Reproduce Your Agent's Bugs—That's Why You Can't Fix Them", "summary": "A developer argues that irreproducibility is the most under-discussed failure in the agent space, as bugs that cannot be reproduced cannot be fixed or tested. They advocate for replayability over determinism, using tools like AgentLens and agent-eval to capture traces and score replays, enabling reliable debugging and regression testing.", "body_md": "Here is a bug report I have received, in some form, at every company running agents in production:\n\n\"The agent gave a customer a wrong refund amount yesterday around 2pm. Can you look into it?\"\n\nHere is how that investigation goes when your stack isn't built for it: you find the timestamp, re-run the same prompt, and it works perfectly. Correct refund, every time. You change nothing; it keeps being right. Eventually you write \"could not reproduce — will monitor,\" which is a professional way of saying you gave up.\n\nThis is the failure I think is most under-discussed in the whole agent space. Not hallucination, not drift, not cost. **Irreproducibility.** A bug you cannot reproduce is a bug you cannot fix, cannot test, and cannot prove you've fixed. Agents are, by nature, the most irreproducible software most of us have ever shipped.\n\nThe opinion I'll defend: **reproducibility is the precondition for every other quality practice you claim to have.** Your evals, regression tests, and CI gates all assume you can take a real failure and run it again on demand. If you can't, that apparatus is built on sand.\n\nFor a normal backend bug, reproduction is mostly free: same request, same row, same code, same bug. We've built an instinct that says *if I run it again with the same inputs, I'll see the same thing.* For agents that instinct is wrong in four independent ways, each enough to break reproduction alone:\n\nOnly the *first* is about the model's randomness. The other three are about **inputs you failed to capture** — which means the fix is mostly engineering discipline, not a model problem. You can't make production deterministic, but you can make a run *replayable*, and those are very different goals.\n\nThe instinct is to chase determinism: `temperature: 0`\n\n, pin every version, freeze the world. That's a trap. A temperature-zero agent is often a worse agent, and you still haven't captured the tool outputs, so you *still* can't reproduce a past failure. Determinism is something you'd impose on all of production forever. Replayability is a property you attach to each run as it happens, and it's strictly more powerful: it reconstructs *that specific failure* no matter how nondeterministic production was.\n\nTo replay a run you must have captured, when it happened: the **resolved input** (the exact bytes the model saw after templating), every **tool call's raw output** (what the APIs returned *then*), and the **execution parameters** (model id and version, temperature, seed, system prompt).\n\nThis is the seam where the two tools I lean on operate as one unit, because reproduction needs both a record and a verdict. **AgentLens captures the trace** — every model and tool step, the resolved inputs, the raw outputs, the parameters — the raw material a replay is rebuilt from. **agent-eval** is the other half: it takes that captured run, re-executes it under pinned conditions, and *scores* whether the bug is present. AgentLens makes the failure *replayable*; agent-eval makes the replay *a pass/fail test you can gate on*. A trace with no scorer is an archive you read by hand; a scorer with no trace is grading a prompt you've already lost.\n\n``` js\nimport { getTrace } from \"agentlens\";\nimport { evaluate, assert } from \"agent-eval\";\n\ninterface ReplayBundle {\n  resolvedPrompt: string;                 // exact bytes the model saw at 2pm\n  params: { model: string; temperature: number; seed?: number };\n  toolReplays: Record<string, unknown>;   // call signature -> raw output THEN\n}\n\n// Pull a real failure out of AgentLens into a self-contained replay bundle.\nasync function bundleFromTrace(traceId: string): Promise<ReplayBundle> {\n  const trace = await getTrace(traceId);\n  const model = trace.steps.find((s) => s.kind === \"model\");\n  if (!model) throw new Error(`no model step in ${traceId}`);\n\n  // Freeze each tool's output by call signature, so a replay returns\n  // yesterday's values instead of hitting today's moved-on world.\n  const toolReplays: Record<string, unknown> = {};\n  for (const s of trace.steps.filter((s) => s.kind === \"tool\")) {\n    toolReplays[`${s.name}:${JSON.stringify(s.input)}`] = s.output;\n  }\n  return {\n    resolvedPrompt: model.input,                                 // RESOLVED input\n    params: { model: model.model, temperature: model.temperature, seed: model.seed },\n    toolReplays,\n  };\n}\n\n// Re-run the agent against the captured reality. Tools are stubbed to replay\n// recorded outputs, so the ONLY thing that can vary is the agent itself.\nasync function replay(b: ReplayBundle): Promise<string> {\n  return runAgent(b.resolvedPrompt, {\n    ...b.params,\n    toolResolver: (name: string, input: unknown) =>\n      b.toolReplays[`${name}:${JSON.stringify(input)}`],\n  });\n}\n\n// Turn the reproduced failure into a permanent regression eval.\nasync function lockAsRegression(traceId: string, mustNotContain: string[]) {\n  const b = await bundleFromTrace(traceId);\n  const output = await replay(b);\n  return evaluate({\n    input: b.resolvedPrompt,\n    output,\n    checks: [\n      assert.notContains(mustNotContain),  // the bogus refund amount, forbidden forever\n      assert.judge({ criterion: \"refund amount matches the tool result\", threshold: 0.8 }),\n    ],\n    metadata: { sourceTraceId: traceId },  // provenance back to the real incident\n  });\n}\n```\n\nTwo decisions do all the work. **Tool outputs are replayed, not re-fetched:** the `toolResolver`\n\nhands the agent yesterday's recorded responses instead of calling the live API. If your replay re-queries the database, you're testing a world that has changed, and any \"fix\" you observe might just be the data moving again — pinning tool outputs isolates the one variable you want to study. And **the resolved prompt is replayed, not the template:** the subtly-wrong retrieved document or the unusual account state lived in the resolved input, and nowhere else.\n\nA subtlety I won't skip: if the failure happened *on* a high-temperature reasoning path, replaying once might reproduce it or might not, because you'll roll a different path. For nondeterministic failures a single replay isn't a reproduction; it's one sample.\n\nThe honest technique is to replay the bundle *N* times and measure the failure *rate*. A bug in 8 of 50 replays is reproduced — you've proven it's real and quantified it — even though no single run is guaranteed to show it. Your fix isn't \"the replay passed once\"; it's \"the rate across 50 replays went from 16% to 0%.\"\n\n``` js\nasync function reproductionRate(traceId: string, isBug: (o: string) => boolean, n = 50) {\n  const b = await bundleFromTrace(traceId);\n  const runs = await Promise.all(Array.from({ length: n }, () => replay(b)));\n  const hits = runs.filter(isBug).length;\n  return { rate: hits / n, hits, n }; // e.g. { rate: 0.16, hits: 8, n: 50 }\n}\n```\n\nThis reframes the bug-fixing loop. You don't fix a failure and eyeball it once; you capture it, establish its rate, change the agent, prove the rate dropped, and keep the replay in your suite so it can never silently climb back. agent-eval runs the bundle; the AgentLens trace behind it tells you, when a replay *does* fail, which step diverged from the recorded path.\n\nYou don't need production to be deterministic. You need every run to be *reconstructable*:\n\nThe agents will keep producing failures you can't explain from the symptom alone. The difference between a team that fixes them and one that closes tickets with \"could not reproduce\" isn't model quality or prompt skill — it's whether the run left behind enough of itself to be run again. Capture the trace with AgentLens, replay-and-score it with agent-eval, and \"could not reproduce\" stops being a sentence you're allowed to write.", "url": "https://wpnews.pro/news/you-can-t-reproduce-your-agent-s-bugs-that-s-why-you-can-t-fix-them", "canonical_source": "https://dev.to/saurav_bhattacharya/you-cant-reproduce-your-agents-bugs-thats-why-you-cant-fix-them-223i", "published_at": "2026-06-24 01:05:27+00:00", "updated_at": "2026-06-24 01:43:49.527494+00:00", "lang": "en", "topics": ["ai-agents", "developer-tools", "ai-safety"], "entities": ["AgentLens", "agent-eval"], "alternates": {"html": "https://wpnews.pro/news/you-can-t-reproduce-your-agent-s-bugs-that-s-why-you-can-t-fix-them", "markdown": "https://wpnews.pro/news/you-can-t-reproduce-your-agent-s-bugs-that-s-why-you-can-t-fix-them.md", "text": "https://wpnews.pro/news/you-can-t-reproduce-your-agent-s-bugs-that-s-why-you-can-t-fix-them.txt", "jsonld": "https://wpnews.pro/news/you-can-t-reproduce-your-agent-s-bugs-that-s-why-you-can-t-fix-them.jsonld"}}