# Your AI Agent Will Double-Charge on a Lost Response

> Source: <https://dev.to/0012303/your-ai-agent-will-double-charge-on-a-lost-response-5eed>
> Published: 2026-06-15 18:21:13+00:00

If your agent calls a tool that charges a card, and the transport drops the *response*, your agent didn't fail safely. It double-charged the customer, and it has no idea.

That's the whole bug. The money already moved. The agent never heard "ok," so it did what every well-behaved retry loop does: it tried again. Same prompt, same tool, same arguments. A second charge.

**TL;DR**

Open any agent framework and you'll find retry logic. Exponential backoff. Jitter. A max-attempts ceiling. All of it built for one failure mode: *the request didn't arrive.*

That's a fine default. It's also the wrong default the moment a tool has a side effect.

That logic is correct for reads. If `GET /reviews?page=4`

times out, retrying is free and obviously right. Read it again, no harm.

It is quietly wrong for writes. There are two different ways a tool call can fail, and they look identical to the caller:

From the agent's seat, both look like the same thing: a tool call with no result. A timeout. A dropped socket. A 502 from a proxy that already forwarded your POST upstream. The agent cannot tell case 1 from case 2 by looking at the failure. The information it needs is on the *other* side of the wire, and that's exactly the side it couldn't reach.

So backoff doesn't help you here. Backoff decides *when* to retry. It never decides *whether the side effect already fired.* That second question is the only one that matters for a `charge`

, a `send_email`

, a `create_refund`

, a `POST /orders`

. The contrarian bit, said plainly: **a write retry is a question about semantics, not about the network.** Tuning the network knobs harder just makes you double-charge on a slower, more polite schedule.

Distributed systems people have three delivery guarantees, and the names are worth getting right because agent docs use them loosely.

An idempotency ledger buys you the middle one cleanly: at-most-once for the *side effect*, on top of at-least-once *attempts*, at the boundary where the ledger sits. The attempts can fire as often as the network forces them to. The side effect fires once, because the second attempt finds a recorded result and replays it instead of re-running. The catch, which I unpack later: if the side effect lives on the *other* side of a wire you don't own, the boundary that matters is the provider's, not yours.

Reads stay at-least-once. Writes with a real side effect move to at-most-once. That's the whole design decision.

This is not my invention. It's the same mechanism Stripe ships in its public API. Their words:

"Stripe's idempotency works by saving the resulting status code and body of the first request made for any given idempotency key, regardless of whether it succeeds or fails. Subsequent requests with the same key return the same result, including

`500`

errors."

Source:[Stripe API docs, idempotent requests].

Read that twice. They save the *result*, status and body, and replay it. They don't re-run the charge. An AI agent's write tool needs the exact same contract, and most of them don't have it yet.

I've written before about resuming a scraper that died at row 12,000 without re-writing rows, about conditional GET to skip re-downloading unchanged pages, and about an agent re-reading every page it already saw. Those are all about **reading or rewriting your own data**: making a *resume* clean, making a *read* cheap.

This is a different animal. This is about a tool call with an **external** side effect you do not own and cannot undo by truncating a file: a payment, an email, a refund, an order placed in someone else's system. You can't "resume" a charge by checking which rows you already wrote. The money is gone the instant the side effect fires. The fix isn't a file offset. It's a key that recognizes "I already did this exact action" and hands back the original answer.

And here's the part that decides *where* the key goes. If the side effect is external and you don't own it (Stripe charging a card), the dedup has to happen on the *callee's* side. The provider has to see your key, recognize the repeat, and refuse to charge again. A ledger sitting in front of your own process can't help with the lost-response case: the remote charge already fired, your ledger recorded nothing, and the retry walks right past it into a second charge. That's exactly the opening bug. For your *own* side effects (a row you write, a job you enqueue, a service you control end to end), a ledger you own is the whole fix, because you control the boundary the key is checked at. Keep those two cases apart; the demo below collapses them on purpose, into one process, to make the mechanism visible. Same family as retry hygiene, completely different failure and completely different fix.

Here's a self-contained simulation. No network, no dependencies, just `hashlib`

and `json`

from the standard library, so you can run it in five seconds and watch the numbers. A toy `PaymentAPI`

has a real side effect (a balance and a call counter). We run 100 orders at $19.99. The transport "loses the response" on every 5th call, so 20 of the 100 calls get retried.

The naive runtime retries by just calling `charge`

again. The ledger runtime keys each logical action and replays the recorded result on a retry.

```
"""
at-most-once tool calls for AI agents: naive retry vs idempotency ledger.

Deterministic, stdlib-only (hashlib, json). No network, no external deps.
Run:  python3 idempotency_ledger_demo.py

Scenario: an agent calls a write tool (charge a card) 100 times. The transport
loses the RESPONSE on every 5th call -- the side effect already happened, the
agent just never heard back. The naive runtime retries the action; the ledger
runtime replays the recorded result instead of re-running the side effect.
"""

import hashlib
import json

class PaymentAPI:
    """Toy external service with a REAL side effect (balance + call counter)."""

    def __init__(self):
        self.balance_cents = 0
        self.side_effect_calls = 0  # every real charge increments this

    def charge(self, order_id, amount_cents):
        # This is the side effect. It runs on EVERY call -- that's the danger.
        self.side_effect_calls += 1
        self.balance_cents += amount_cents
        return {"order_id": order_id, "charged_cents": amount_cents, "status": "ok"}

def idem_key(workflow_id, step, args):
    """Stable key for one logical action. Same inputs -> same key, always."""
    payload = json.dumps([workflow_id, step, args], sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()[:16]

def run_naive(orders, lost_every):
    """Naive runtime: on a lost response, just retry the call. Double-spend."""
    api = PaymentAPI()
    duplicate_charges = 0
    for i, (order_id, amount) in enumerate(orders, start=1):
        api.charge(order_id, amount)          # first attempt: side effect fires
        if i % lost_every == 0:
            # response was lost -> retry -> side effect fires AGAIN
            api.charge(order_id, amount)
            duplicate_charges += 1
    return api, duplicate_charges

def run_ledger(orders, lost_every):
    """Ledger runtime: key the action; replay recorded result on retry."""
    api = PaymentAPI()
    ledger = {}  # idem_key -> recorded result (in-memory; see caveat in article)
    duplicate_charges = 0

    def call_once(workflow_id, step, order_id, amount):
        key = idem_key(workflow_id, step, [order_id, amount])
        if key in ledger:
            return ledger[key], True   # replay: no side effect
        result = api.charge(order_id, amount)
        ledger[key] = result           # record BEFORE the response can be lost
        return result, False

    for i, (order_id, amount) in enumerate(orders, start=1):
        call_once("wf-checkout", "charge", order_id, amount)
        if i % lost_every == 0:
            _, replayed = call_once("wf-checkout", "charge", order_id, amount)
            if not replayed:
                duplicate_charges += 1  # would mean a real double-spend
    return api, duplicate_charges

def main():
    N = 100
    PRICE_CENTS = 1999      # $19.99
    LOST = 5                # response lost on every 5th call -> 20 retries
    orders = [(f"order-{i:03d}", PRICE_CENTS) for i in range(N)]
    expected_cents = N * PRICE_CENTS
    retries = N // LOST

    naive_api, naive_dup = run_naive(orders, LOST)
    ledger_api, ledger_dup = run_ledger(orders, LOST)

    def dollars(c):
        return c / 100

    print(f"Scenario: {N} orders @ $19.99, response lost on every {LOST}th "
          f"(so {retries} retries)\n")
    print(f"NAIVE    orders={N}  side_effect_calls={naive_api.side_effect_calls:>4}  "
          f"balance=${dollars(naive_api.balance_cents):>8.2f}  "
          f"expected=${dollars(expected_cents):>8.2f}  duplicate_charges={naive_dup}")
    print(f"LEDGER   orders={N}  side_effect_calls={ledger_api.side_effect_calls:>4}  "
          f"balance=${dollars(ledger_api.balance_cents):>8.2f}  "
          f"expected=${dollars(expected_cents):>8.2f}  duplicate_charges={ledger_dup}")

    overcharge = naive_api.balance_cents - expected_cents
    print(f"\nNAIVE overcharged customers by ${dollars(overcharge):.2f} "
          f"({naive_dup} duplicate charges).")
    print(f"LEDGER overcharge: $0.00 ({ledger_dup} duplicate charges).  "
          f"Same retries, zero double-spend.")

    # Willison gate: assert the numbers, so the demo can't silently drift.
    assert naive_dup == retries
    assert naive_api.side_effect_calls == N + retries
    assert naive_api.balance_cents == expected_cents + retries * PRICE_CENTS
    assert ledger_dup == 0
    assert ledger_api.side_effect_calls == N
    assert ledger_api.balance_cents == expected_cents
    print("\nAll asserts passed (deterministic).")

if __name__ == "__main__":
    main()
```

Run it. This is the exact output on my machine, copied straight from stdout, not retyped:

```
Scenario: 100 orders @ $19.99, response lost on every 5th (so 20 retries)

NAIVE    orders=100  side_effect_calls= 120  balance=$ 2398.80  expected=$ 1999.00  duplicate_charges=20
LEDGER   orders=100  side_effect_calls= 100  balance=$ 1999.00  expected=$ 1999.00  duplicate_charges=0

NAIVE overcharged customers by $399.80 (20 duplicate charges).
LEDGER overcharge: $0.00 (0 duplicate charges).  Same retries, zero double-spend.

All asserts passed (deterministic).
```

Same 20 retries in both runs. The ledger didn't retry *less*. It retried just as much, and still landed on the correct $1,999.00. The naive runtime sailed past it to $2,398.80 and never threw an error, because nothing *errored*. Every charge "succeeded." That's the part that makes this bug nasty: it's invisible until a customer emails you.

The whole mechanism is in `call_once`

. It's three lines of logic with one ordering rule that's easy to get wrong.

**1. Derive a stable key for the logical action.**

`idem_key("wf-checkout", "charge", [order_id, amount])`

hashes the workflow, the step, and the arguments. The word that matters is **2. Look before you leap.** If the key is already in the ledger, return the recorded result and *do not* touch the side effect. That's the replay. This is the same contract as Stripe returning the saved status and body.

**3. Record before the response can be lost.** Look at the order in

`call_once`

: we call `api.charge(...)`

, then immediately `ledger[key] = result`

, not after the response makes it back to the agent. Because the whole point is that the response That's it. No backoff change, no new framework. A dict in the demo; in production, a row with a unique constraint.

I run scrapers and data tools in production: 32 published actors, **2,190 lifetime runs** as of June 2026 (raw lifetime counter on my Apify profile, `apify.com/knotless_cadence`

; the Trustpilot one alone is past 962 runs). None of those charge a card. So why am I writing about payments?

Because at that volume you stop believing the happy path. Over thousands of runs, "the request finished but the acknowledgement got lost" stops being a textbook edge case and becomes a Tuesday. Proxies hang after forwarding. A worker gets OOM-killed between doing the work and writing "done." A 200 arrives for a body that never got read. The operational lesson that 2,190 runs beat into me isn't "add retries"; every framework has retries. It's **"a retry without a notion of identity is a bet that nothing irreversible happened on the last attempt,"** and on a long enough timeline that bet loses. For reads I lose nothing. The day an agent points that same naive retry at a `charge`

, the bet costs real money.

That's the bridge to agents. We're now wiring LLMs directly to write tools (`charge`

, `refund`

, `send`

, `book`

) and handing them the same naive retry loop that was always lurking under the reads. The blast radius just changed from "re-downloaded a page" to "billed a human twice."

I'd be lying if I sold you exactly-once. This is at-*most*-once, and it has sharp edges. Here's the honest list.

**It's at-most-once, not exactly-once, and only when the record is atomic with the effect.** Look at `call_once`

: it calls `api.charge(...)`

, *then* writes `ledger[key] = result`

. Those are two steps. If the process crashes in that gap, after the charge fired but before the ledger write lands, the retry finds no key and charges again. So the toy actually demonstrates the *replay* mechanism (key hit, recorded result, no re-run), not a crash-proof atomic commit. The at-most-once guarantee holds only if the record commits *atomically with* the side effect; the charge-then-write window is the exact hole where double-charge still lives, and it's the boundary between at-most-once and exactly-once. In production you close it with a two-phase write (reserve a pending row *before* the call, finalize it *after*) or by pushing the key down to a callee that dedups for you. At-most-once means you accept "maybe zero" to guarantee "never two." That's the right trade for money. It is not free, and it is not automatic.

**The key must be deterministic and stable, or none of this works.** I said it above; it's worth its own bullet because it's the #1 way people break this in practice. An LLM that regenerates its tool arguments on retry (re-sampling, re-formatting, adding a fresh `request_id`

) produces a new key and walks right past the ledger. Pin the key upstream, before the model can wobble it.

**Concurrency needs an atomic check-and-record.** My demo is single-threaded, so the `if key in ledger`

/ `ledger[key] = result`

gap is safe. In real life two retries can race into that gap simultaneously and both miss. You need an atomic operation: a unique constraint in Postgres, a conditional put, `INSERT ... ON CONFLICT DO NOTHING`

. Stripe is candid about this exact corner, and it's worth quoting because it's the failure people forget: *"If incoming parameters fail validation, or the request conflicts with another request that's executing concurrently, we don't save the idempotent result... You can retry these requests."* The race is real; handle it at the storage layer.

**The ledger must be persisted and pruned.** An in-memory dict dies with the process, and your dedup history dies with it, so a retry after a restart double-charges. Persist it. It also grows forever, so prune it on a TTL. Stripe prunes keys after 24 hours: *"You can remove keys from the system automatically after they're at least 24 hours old. We generate a new request if a key is reused after the original is pruned."* Pick a TTL longer than your worst retry window. Too short and a late retry sails past a pruned key.

**A too-coarse key fails the other direction: false dedup.** The determinism bullet warns about a key that's too *fragile* (different every attempt -> miss -> double-charge). The mirror image is just as real: a key that's too *coarse*. If you build it from `(amount, SKU)`

instead of the logical action, two genuinely different charges (the same customer buying the same item twice on purpose) collide on one key. The second one hits the ledger, looks like a duplicate, and gets silently swallowed. Now you've *lost* a legitimate charge. The key has to be unique per logical action (a stable order or checkout id), not per value. Stability and uniqueness both matter, and they cut in opposite directions.

**Don't record a transient failure as a final result.** Stripe saves the result "regardless of whether it succeeds or fails, including `500`

errors," and that's right for *their* boundary, where the key maps to one HTTP exchange. In your own ledger it's a trap if you're not deliberate. If the first attempt hit a timeout or a 500 that *didn't* actually charge, and you record that failure as the result, every retry until the TTL expires replays the frozen error instead of trying again. You've turned a transient failure into a permanent one. Decide per error class what counts as "final": a terminal result (charged, or hard-declined) gets recorded; a retryable transient one does not, so a fresh attempt can still run.

None of these are reasons to skip the ledger. They're reasons to build the *minimum* version correctly: a key that's both stable and unique per action, an atomic check-and-record, only terminal results recorded, persistence, a TTL.

If your agent has any tool with an irreversible side effect (payment, email, refund, an external POST), do three things this week.

`Idempotency-Key`

header (Stripe and many others do, so use theirs, don't reinvent it). For an external side effect this is what actually stops the double-charge, because the dedup happens where the charge happens.Backoff and jitter stay. They were never the problem. They just can't see the side effect, and the side effect is the whole game.

*Here's my open question, and I don't have a clean answer: where should the idempotency key actually live for an LLM agent? At the model layer (the agent commits to a key before it ever calls the tool), at the runtime layer (the framework derives it from the tool name + args), or only at the API boundary (let Stripe-style services own it and treat your own tools as unsafe)? I've shipped the runtime-layer version. I suspect the model-layer version is more correct and more fragile. If you've wired at-most-once tool calls into a real agent, I want to hear where you put the key, and what broke.*

*Follow for the next teardown from production, and if you've watched an agent double-fire a write tool, tell me what the side effect was and how you caught it. I read every comment.*

*AI-disclosure: drafted with an AI writing assistant, edited by a human. The Python above was run on my machine before publishing (Python 3, stdlib only); the output block is copied verbatim from stdout and the asserts pass deterministically.*
