Your Agent's Retries Are Double-Charging Your Users (and Every Eval Is Green) A developer warns that agent retries can cause double execution of side effects like billing or email, leading to customer overcharges. They argue that side-effect safety is a Tier 1 evaluation problem requiring idempotency keys derived from intent, not model-generated. The developer recommends using tools like agent-eval and AgentLens to detect and prevent such failures. Your agent calls a tool. The tool times out at the network layer but actually succeeds on the server. Your harness sees no response, so it retries. Now charge customer ran twice, send email fired twice, and create ticket left two tickets. The model did nothing wrong. Every eval you have is green. And a customer just got billed $198 for a $99 plan. This is the failure mode nobody puts in a demo, because demos don't retry and demos don't have side effects that matter. Production has both. If your agent takes actions — not just generates text — then retry safety is not a nice-to-have, it is the difference between an autonomous system and a liability with a scheduler. I want to argue two things. First: side-effect safety is a Tier 1 evaluation problem , not a prompt problem. Second: you cannot even see this class of bug without a trace of what the agent actually did, which is where the eval story and the observability story become the same story. The instinct is to make the agent smarter. "Tell it to check whether the charge already went through before retrying." Please don't. You are asking a non-deterministic component to enforce an invariant that must hold every single time . The model will comply 95% of the time and the other 5% is a chargeback. Retries don't come from the model anyway. They come from your harness, your HTTP client, your queue's at-least-once delivery, a Kubernetes pod restart mid-execution. The agent's "reasoning" is nowhere near the retry. So no amount of judging the agent's output tells you whether the effect happened once or twice. This is exactly why I think about evidence on an independence axis , not a cost axis. Evidence is only worth what the agent couldn't forge: Double-execution lives entirely in Tier 1. It's binary, it's forgery-proof, and it's the 80% of production incidents that never needed an LLM to catch. The correct architecture makes double-execution impossible , then evaluates that the invariant held. The key insight: the idempotency key is derived by the harness from the intent, not minted by the model on each attempt. js import { createHash } from "node:crypto"; type ToolCall = { tool: string; args: Record