Your Agent's Retries Are Double-Charging Your Users (and Every Eval Is Green)

wpnews.pro

Your agent calls a tool. The tool times out at the network layer but actually succeeds on the server. Your harness sees no response, so it retries. Now charge_customer

ran twice, send_email

fired twice, and create_ticket

left two tickets. The model did nothing wrong. Every eval you have is green. And a customer just got billed $198 for a $99 plan.

This is the failure mode nobody puts in a demo, because demos don't retry and demos don't have side effects that matter. Production has both. If your agent takes actions — not just generates text — then retry safety is not a nice-to-have, it is the difference between an autonomous system and a liability with a scheduler.

I want to argue two things. First: side-effect safety is a Tier 1 evaluation problem, not a prompt problem. Second: you cannot even see this class of bug without a trace of what the agent actually did, which is where the eval story and the observability story become the same story.

The instinct is to make the agent smarter. "Tell it to check whether the charge already went through before retrying." Please don't. You are asking a non-deterministic component to enforce an invariant that must hold every single time. The model will comply 95% of the time and the other 5% is a chargeback.

Retries don't come from the model anyway. They come from your harness, your HTTP client, your queue's at-least-once delivery, a Kubernetes pod restart mid-execution. The agent's "reasoning" is nowhere near the retry. So no amount of judging the agent's output tells you whether the effect happened once or twice.

This is exactly why I think about evidence on an independence axis, not a cost axis. Evidence is only worth what the agent couldn't forge:

Double-execution lives entirely in Tier 1. It's binary, it's forgery-proof, and it's the 80% of production incidents that never needed an LLM to catch.

The correct architecture makes double-execution impossible, then evaluates that the invariant held. The key insight: the idempotency key is derived by the harness from the intent, not minted by the model on each attempt.

import { createHash } from "node:crypto";

type ToolCall = { tool: string; args: Record<string, unknown>; runId: string };

// The key is a pure function of intent — identical across retries,
// because the AGENT never generates it. The harness does.
function idempotencyKey(call: ToolCall): string {
  const canonical = JSON.stringify({
    tool: call.tool,
    args: call.args,
    runId: call.runId, // one logical action per run, not per attempt
  });
  return createHash("sha256").update(canonical).digest("hex");
}

async function executeOnce(call: ToolCall, sideEffect: (key: string) => Promise<unknown>) {
  const key = idempotencyKey(call);

  // Tier 1 proof, checked BEFORE the effect: has this exact intent run?
  const prior = await ledger.get(key);
  if (prior?.status === "committed") {
    return { key, result: prior.result, replayed: true }; // no second charge
  }

  await ledger.put(key, { status: "in_flight" });
  const result = await sideEffect(key); // pass key downstream to Stripe et al.
  await ledger.put(key, { status: "committed", result });

  return { key, result, replayed: false };
}

Now the retry is safe by construction: a second attempt with the same intent replays the recorded result instead of firing the effect again. But — and this is the part people skip — being safe is not the same as knowing you're safe. You still have to prove it in your evals.

Here's where I'll stop describing generic hygiene and tell you what I actually run: agent-eval to score and gate the output, and AgentLens to capture the trace it scores against. They ship as a unit for a reason I only appreciated after getting burned.

agent-eval owns the Tier 1 gate. After every run it asserts the invariant against ground truth the agent could not author:

import { evaluate } from "agent-eval";

const report = await evaluate(run, {
  checks: [
    // Tier 1: externally observable, unforgeable proof.
    { id: "single-charge", tier: 1, run: async ({ trace }) => {
        const key = trace.toolCalls.find(t => t.tool === "charge_customer")?.idempotencyKey;
        const hits = await stripe.charges.list({ metadata: { key } });
        return { pass: hits.data.length === 1, detail: `charges=${hits.data.length}` };
    }},
    // Tier 2: statistical signal vs a baseline the agent didn't set.
    { id: "retry-rate", tier: 2, run: ({ trace }) => {
        const retries = trace.toolCalls.filter(t => t.replayed).length;
        return { pass: retries <= baseline.p95, detail: `replays=${retries}` };
    }},
  ],
});

Notice trace.toolCalls

and t.idempotencyKey

. Where does that come from? AgentLens. It records every model step and every tool step — resolved inputs, the idempotency key the harness derived, the raw provider response, whether the attempt was a replay or a fresh effect. Without that trace, the "single-charge" check has nothing to read. The agent's own summary ("I charged the customer once") is exactly the self-report you must not trust — it's shared-substrate, the agent authored it.

That's the whole thesis of pairing them. Tier 1+2 only mean something if they run over data the agent didn't get to write. AgentLens produces the unforgeable substrate; agent-eval renders the verdict. One captures how the agent got there, the other decides whether "there" was correct — and critically, Tier 1+2 run over trajectories in real time at roughly $0, while the judge stays offline for the subjective tail where it belongs.

You do not need a smarter model to stop double-charging customers. You need:

Reserve the model-as-judge for the genuinely subjective 20% — "was this refund fair?" — and label its output opinion, not evidence. The retry storm quietly draining your customers' cards is not in that 20%. It never was. It's the most catchable bug you have, sitting in Tier 1, waiting for you to look at the trace.

source & further reading

dev.to — original article จาก chatbot ธรรมดา สู่ AI ที่ทำงานแทนเราได้ — เราเติมอะไรเข้าไปบ้าง? Tell your AI agent if the ground is sinking: measured ground-motion with SibFly + LangChain My agent dry-ran fine in staging 100 times — then wrecked production on the first real run

Your Agent's Retries Are Double-Charging Your Users (and Every Eval Is Green)

Run your AI side-project on zahid.host