Your Agent's Retries Are Double-Charging Your Users (and Every Eval Is Green)

A developer warns that agent retries can cause double execution of side effects like billing or email, leading to customer overcharges. They argue that side-effect safety is a Tier 1 evaluation problem requiring idempotency keys derived from intent, not model-generated. The developer recommends using tools like agent-eval and AgentLens to detect and prevent such failures.

Your agent calls a tool. The tool times out at the network layer but actually succeeds on the server. Your harness sees no response, so it retries. Now charge customer ran twice, send email fired twice, and create ticket left two tickets. The model did nothing wrong. Every eval you have is green. And a customer just got billed $198 for a $99 plan. This is the failure mode nobody puts in a demo, because demos don't retry and demos don't have side effects that matter. Production has both. If your agent takes actions — not just generates text — then retry safety is not a nice-to-have, it is the difference between an autonomous system and a liability with a scheduler. I want to argue two things. First: side-effect safety is a Tier 1 evaluation problem , not a prompt problem. Second: you cannot even see this class of bug without a trace of what the agent actually did, which is where the eval story and the observability story become the same story. The instinct is to make the agent smarter. "Tell it to check whether the charge already went through before retrying." Please don't. You are asking a non-deterministic component to enforce an invariant that must hold every single time . The model will comply 95% of the time and the other 5% is a chargeback. Retries don't come from the model anyway. They come from your harness, your HTTP client, your queue's at-least-once delivery, a Kubernetes pod restart mid-execution. The agent's "reasoning" is nowhere near the retry. So no amount of judging the agent's output tells you whether the effect happened once or twice. This is exactly why I think about evidence on an independence axis , not a cost axis. Evidence is only worth what the agent couldn't forge: Double-execution lives entirely in Tier 1. It's binary, it's forgery-proof, and it's the 80% of production incidents that never needed an LLM to catch. The correct architecture makes double-execution impossible , then evaluates that the invariant held. The key insight: the idempotency key is derived by the harness from the intent, not minted by the model on each attempt. js import { createHash } from "node:crypto"; type ToolCall = { tool: string; args: Record<string, unknown ; runId: string }; // The key is a pure function of intent — identical across retries, // because the AGENT never generates it. The harness does. function idempotencyKey call: ToolCall : string { const canonical = JSON.stringify { tool: call.tool, args: call.args, runId: call.runId, // one logical action per run, not per attempt } ; return createHash "sha256" .update canonical .digest "hex" ; } async function executeOnce call: ToolCall, sideEffect: key: string = Promise<unknown { const key = idempotencyKey call ; // Tier 1 proof, checked BEFORE the effect: has this exact intent run? const prior = await ledger.get key ; if prior?.status === "committed" { return { key, result: prior.result, replayed: true }; // no second charge } await ledger.put key, { status: "in flight" } ; const result = await sideEffect key ; // pass key downstream to Stripe et al. await ledger.put key, { status: "committed", result } ; return { key, result, replayed: false }; } Now the retry is safe by construction : a second attempt with the same intent replays the recorded result instead of firing the effect again. But — and this is the part people skip — being safe is not the same as knowing you're safe. You still have to prove it in your evals. Here's where I'll stop describing generic hygiene and tell you what I actually run: agent-eval to score and gate the output, and AgentLens to capture the trace it scores against. They ship as a unit for a reason I only appreciated after getting burned. agent-eval owns the Tier 1 gate. After every run it asserts the invariant against ground truth the agent could not author: js import { evaluate } from "agent-eval"; const report = await evaluate run, { checks: // Tier 1: externally observable, unforgeable proof. { id: "single-charge", tier: 1, run: async { trace } = { const key = trace.toolCalls.find t = t.tool === "charge customer" ?.idempotencyKey; const hits = await stripe.charges.list { metadata: { key } } ; return { pass: hits.data.length === 1, detail: charges=${hits.data.length} }; }}, // Tier 2: statistical signal vs a baseline the agent didn't set. { id: "retry-rate", tier: 2, run: { trace } = { const retries = trace.toolCalls.filter t = t.replayed .length; return { pass: retries <= baseline.p95, detail: replays=${retries} }; }}, , } ; Notice trace.toolCalls and t.idempotencyKey . Where does that come from? AgentLens. It records every model step and every tool step — resolved inputs, the idempotency key the harness derived, the raw provider response, whether the attempt was a replay or a fresh effect. Without that trace, the "single-charge" check has nothing to read. The agent's own summary "I charged the customer once" is exactly the self-report you must not trust — it's shared-substrate, the agent authored it. That's the whole thesis of pairing them. Tier 1+2 only mean something if they run over data the agent didn't get to write . AgentLens produces the unforgeable substrate; agent-eval renders the verdict. One captures how the agent got there, the other decides whether "there" was correct — and critically, Tier 1+2 run over trajectories in real time at roughly $0, while the judge stays offline for the subjective tail where it belongs. You do not need a smarter model to stop double-charging customers. You need: Reserve the model-as-judge for the genuinely subjective 20% — "was this refund fair?" — and label its output opinion, not evidence . The retry storm quietly draining your customers' cards is not in that 20%. It never was. It's the most catchable bug you have, sitting in Tier 1, waiting for you to look at the trace.