Your AI Agent Will Double-Charge on a Lost Response

A developer warns that AI agents with retry logic can double-charge customers when a network failure drops the response to a write operation like a payment. The agent, unable to distinguish between a failed request and a lost response, retries the same side effect, causing duplicate charges. The solution is to use idempotency keys, as Stripe does, to ensure side effects fire at most once.

If your agent calls a tool that charges a card, and the transport drops the response, your agent didn't fail safely. It double-charged the customer, and it has no idea.

That's the whole bug. The money already moved. The agent never heard "ok," so it did what every well-behaved retry loop does: it tried again. Same prompt, same tool, same arguments. A second charge.

Open any agent framework and you'll find retry logic. Exponential backoff. Jitter. A max-attempts ceiling. All of it built for one failure mode: the request didn't arrive. That's a fine default. It's also the wrong default the moment a tool has a side effect.

That logic is correct for reads. If GET /reviews?page=4 times out, retrying is free and obviously right. Read it again, no harm. It is quietly wrong for writes.

There are two different ways a tool call can fail, and they look identical to the caller:

1. The request never arrived, so the side effect never fired.
2. The request arrived and the side effect fired, but the response was lost on the way back.

From the agent's seat, both look like the same thing: a tool call with no result. A timeout. A dropped socket. A 502 from a proxy that already forwarded your POST upstream. The agent cannot tell case 1 from case 2 by looking at the failure. The information it needs is on the other side of the wire, and that's exactly the side it couldn't reach.

So backoff doesn't help you here. Backoff decides when to retry. It never decides whether the side effect already fired. That second question is the only one that matters for a charge, a send_email, a create_refund, a POST /orders.

The contrarian bit, said plainly: a write retry is a question about semantics, not about the network. Tuning the network knobs harder just makes you double-charge on a slower, more polite schedule.

Distributed systems people have three delivery guarantees, and the names are worth getting right because agent docs use them loosely.

- At-least-once: retry until something acknowledges, so the action can run more than once.
- At-most-once: the action runs zero times or one time, never two.
- Exactly-once: the action runs one time, no more and no less.

An idempotency ledger buys you the middle one cleanly: at-most-once for the side effect, on top of at-least-once attempts, at the boundary where the ledger sits.
The attempts can fire as often as the network forces them to. The side effect fires once, because the second attempt finds a recorded result and replays it instead of re-running. The catch, which I unpack later: if the side effect lives on the other side of a wire you don't own, the boundary that matters is the provider's, not yours.

Reads stay at-least-once. Writes with a real side effect move to at-most-once. That's the whole design decision.

This is not my invention. It's the same mechanism Stripe ships in its public API. Their words:

"Stripe's idempotency works by saving the resulting status code and body of the first request made for any given idempotency key, regardless of whether it succeeds or fails. Subsequent requests with the same key return the same result, including 500 errors."

Source: Stripe API docs, idempotent requests.

Read that twice. They save the result, status and body, and replay it. They don't re-run the charge. An AI agent's write tool needs the exact same contract, and most of them don't have it yet.

I've written before about resuming a scraper that died at row 12,000 without re-writing rows, about conditional GET to skip re-downloading unchanged pages, and about an agent re-reading every page it already saw. Those are all about reading or rewriting your own data: making a resume clean, making a read cheap.

This is a different animal. This is about a tool call with an external side effect you do not own and cannot undo by truncating a file: a payment, an email, a refund, an order placed in someone else's system. You can't "resume" a charge by checking which rows you already wrote. The money is gone the instant the side effect fires. The fix isn't a file offset. It's a key that recognizes "I already did this exact action" and hands back the original answer.

And here's the part that decides where the key goes. If the side effect is external and you don't own it (Stripe charging a card), the dedup has to happen on the callee's side.
The provider has to see your key, recognize the repeat, and refuse to charge again. A ledger sitting in front of your own process can't help with the lost-response case: the remote charge already fired, your ledger recorded nothing, and the retry walks right past it into a second charge. That's exactly the opening bug. For your own side effects (a row you write, a job you enqueue, a service you control end to end), a ledger you own is the whole fix, because you control the boundary the key is checked at. Keep those two cases apart; the demo below collapses them on purpose, into one process, to make the mechanism visible.

Same family as retry hygiene, completely different failure and completely different fix.

Here's a self-contained simulation. No network, no dependencies, just hashlib and json from the standard library, so you can run it in five seconds and watch the numbers.

A toy PaymentAPI has a real side effect (a balance and a call counter). We run 100 orders at $19.99. The transport "loses the response" on every 5th call, so 20 of the 100 calls get retried. The naive runtime retries by just calling charge again. The ledger runtime keys each logical action and replays the recorded result on a retry.

```python
"""
at-most-once tool calls for AI agents: naive retry vs idempotency ledger.

Deterministic, stdlib-only (hashlib, json). No network, no external deps.
Run: python3 idempotency_ledger_demo.py

Scenario: an agent calls a write tool (charge a card) 100 times. The transport
loses the RESPONSE on every 5th call -- the side effect already happened, the
agent just never heard back. The naive runtime retries the action; the ledger
runtime replays the recorded result instead of re-running the side effect.
"""
import hashlib
import json


class PaymentAPI:
    """Toy external service with a REAL side effect (balance + call counter)."""

    def __init__(self):
        self.balance_cents = 0
        self.side_effect_calls = 0  # every real charge increments this

    def charge(self, order_id, amount_cents):
        # This is the side effect. It runs on EVERY call -- that's the danger.
        self.side_effect_calls += 1
        self.balance_cents += amount_cents
        return {"order_id": order_id, "charged_cents": amount_cents, "status": "ok"}


def idem_key(workflow_id, step, args):
    """Stable key for one logical action. Same inputs -> same key, always."""
    payload = json.dumps([workflow_id, step, args], sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()[:16]


def run_naive(orders, lost_every):
    """Naive runtime: on a lost response, just retry the call. Double-spend."""
    api = PaymentAPI()
    duplicate_charges = 0
    for i, (order_id, amount) in enumerate(orders, start=1):
        api.charge(order_id, amount)  # first attempt: side effect fires
        if i % lost_every == 0:
            # response was lost -> retry -> side effect fires AGAIN
            api.charge(order_id, amount)
            duplicate_charges += 1
    return api, duplicate_charges


def run_ledger(orders, lost_every):
    """Ledger runtime: key the action; replay recorded result on retry."""
    api = PaymentAPI()
    ledger = {}  # idem_key -> recorded result (in-memory; see caveat in article)
    duplicate_charges = 0

    def call_once(workflow_id, step, order_id, amount):
        key = idem_key(workflow_id, step, (order_id, amount))
        if key in ledger:
            return ledger[key], True  # replay: no side effect
        result = api.charge(order_id, amount)
        ledger[key] = result  # record BEFORE the response can be lost
        return result, False

    for i, (order_id, amount) in enumerate(orders, start=1):
        call_once("wf-checkout", "charge", order_id, amount)
        if i % lost_every == 0:
            _, replayed = call_once("wf-checkout", "charge", order_id, amount)
            if not replayed:
                duplicate_charges += 1  # would mean a real double-spend
    return api, duplicate_charges


def main():
    N = 100
    PRICE_CENTS = 1999  # $19.99
    LOST = 5  # response lost on every 5th call -> 20 retries

    orders = [(f"order-{i:03d}", PRICE_CENTS) for i in range(N)]
    expected_cents = N * PRICE_CENTS
    retries = N // LOST

    naive_api, naive_dup = run_naive(orders, LOST)
    ledger_api, ledger_dup = run_ledger(orders, LOST)

    def dollars(c):
        return c / 100

    print(f"Scenario: {N} orders @ $19.99, response lost on every {LOST}th "
          f"(so {retries} retries)\n")
    print(f"NAIVE orders={N} side_effect_calls={naive_api.side_effect_calls:>4} "
          f"balance=${dollars(naive_api.balance_cents):>8.2f} "
          f"expected=${dollars(expected_cents):>8.2f} duplicate_charges={naive_dup}")
    print(f"LEDGER orders={N} side_effect_calls={ledger_api.side_effect_calls:>4} "
          f"balance=${dollars(ledger_api.balance_cents):>8.2f} "
          f"expected=${dollars(expected_cents):>8.2f} duplicate_charges={ledger_dup}")

    overcharge = naive_api.balance_cents - expected_cents
    print(f"\nNAIVE overcharged customers by ${dollars(overcharge):.2f} "
          f"({naive_dup} duplicate charges).")
    print(f"LEDGER overcharge: $0.00 ({ledger_dup} duplicate charges). "
          f"Same retries, zero double-spend.")

    # Willison gate: assert the numbers, so the demo can't silently drift.
    assert naive_dup == retries
    assert naive_api.side_effect_calls == N + retries
    assert naive_api.balance_cents == expected_cents + retries * PRICE_CENTS
    assert ledger_dup == 0
    assert ledger_api.side_effect_calls == N
    assert ledger_api.balance_cents == expected_cents
    print("\nAll asserts passed (deterministic).")


if __name__ == "__main__":
    main()
```

Run it. This is the exact output on my machine, copied straight from stdout, not retyped:

```
Scenario: 100 orders @ $19.99, response lost on every 5th (so 20 retries)

NAIVE orders=100 side_effect_calls= 120 balance=$ 2398.80 expected=$ 1999.00 duplicate_charges=20
LEDGER orders=100 side_effect_calls= 100 balance=$ 1999.00 expected=$ 1999.00 duplicate_charges=0

NAIVE overcharged customers by $399.80 (20 duplicate charges).
LEDGER overcharge: $0.00 (0 duplicate charges). Same retries, zero double-spend.

All asserts passed (deterministic).
```
Same 20 retries in both runs. The ledger didn't retry less. It retried just as much, and still landed on the correct $1,999.00. The naive runtime sailed past it to $2,398.80 and never threw an error, because nothing errored. Every charge "succeeded." That's the part that makes this bug nasty: it's invisible until a customer emails you.

The whole mechanism is in call_once. It's three lines of logic with one ordering rule that's easy to get wrong.

1. Derive a stable key for the logical action. idem_key("wf-checkout", "charge", (order_id, amount)) hashes the workflow, the step, and the arguments. The word that matters is stable: the same logical action has to produce the same key on every attempt.

2. Look before you leap. If the key is already in the ledger, return the recorded result and do not touch the side effect. That's the replay. This is the same contract as Stripe returning the saved status and body.

3. Record before the response can be lost. Look at the order in call_once: we call api.charge(...), then immediately ledger[key] = result, not after the response makes it back to the agent. Because the whole point is that the response can be lost.

That's it. No backoff change, no new framework. A dict in the demo; in production, a row with a unique constraint.

I run scrapers and data tools in production: 32 published actors, 2,190 lifetime runs as of June 2026 (raw lifetime counter on my Apify profile, apify.com/knotless_cadence; the Trustpilot one alone is past 962 runs). None of those charge a card. So why am I writing about payments?

Because at that volume you stop believing the happy path. Over thousands of runs, "the request finished but the acknowledgement got lost" stops being a textbook edge case and becomes a Tuesday. Proxies hang after forwarding. A worker gets OOM-killed between doing the work and writing "done." A 200 arrives for a body that never got read.

The operational lesson that 2,190 runs beat into me isn't "add retries"; every framework has retries.
It's "a retry without a notion of identity is a bet that nothing irreversible happened on the last attempt," and on a long enough timeline that bet loses. For reads I lose nothing. The day an agent points that same naive retry at a charge, the bet costs real money.

That's the bridge to agents. We're now wiring LLMs directly to write tools (charge, refund, send, book) and handing them the same naive retry loop that was always lurking under the reads. The blast radius just changed from "re-downloaded a page" to "billed a human twice."

I'd be lying if I sold you exactly-once. This is at-most-once, and it has sharp edges. Here's the honest list.

- It's at-most-once, not exactly-once, and only when the record is atomic with the effect. Look at call_once: it calls api.charge(...), then writes ledger[key] = result. Those are two steps. If the process crashes in that gap, after the charge fired but before the ledger write lands, the retry finds no key and charges again. So the toy actually demonstrates the replay mechanism (key hit, recorded result, no re-run), not a crash-proof atomic commit. The at-most-once guarantee holds only if the record commits atomically with the side effect; the charge-then-write window is the exact hole where double-charge still lives, and it's the boundary between at-most-once and exactly-once. In production you close it with a two-phase write (reserve a pending row before the call, finalize it after) or by pushing the key down to a callee that dedups for you. At-most-once means you accept "maybe zero" to guarantee "never two." That's the right trade for money. It is not free, and it is not automatic.

- The key must be deterministic and stable, or none of this works. I said it above; it's worth its own bullet because it's the #1 way people break this in practice. An LLM that regenerates its tool arguments on retry (re-sampling, re-formatting, adding a fresh request id) produces a new key and walks right past the ledger.
Pin the key upstream, before the model can wobble it.

- Concurrency needs an atomic check-and-record. My demo is single-threaded, so the gap between if key in ledger and ledger[key] = result is safe. In real life two retries can race into that gap simultaneously and both miss. You need an atomic operation: a unique constraint in Postgres, a conditional put, INSERT ... ON CONFLICT DO NOTHING. Stripe is candid about this exact corner, and it's worth quoting because it's the failure people forget: "If incoming parameters fail validation, or the request conflicts with another request that's executing concurrently, we don't save the idempotent result... You can retry these requests." The race is real; handle it at the storage layer.

- The ledger must be persisted and pruned. An in-memory dict dies with the process, and your dedup history dies with it, so a retry after a restart double-charges. Persist it. It also grows forever, so prune it on a TTL. Stripe prunes keys after 24 hours: "You can remove keys from the system automatically after they're at least 24 hours old. We generate a new request if a key is reused after the original is pruned." Pick a TTL longer than your worst retry window. Too short and a late retry sails past a pruned key.

- A too-coarse key fails the other direction: false dedup. The determinism bullet warns about a key that's too fragile (different every attempt -> miss -> double-charge). The mirror image is just as real: a key that's too coarse. If you build it from (amount, SKU) instead of the logical action, two genuinely different charges (the same customer buying the same item twice on purpose) collide on one key. The second one hits the ledger, looks like a duplicate, and gets silently swallowed. Now you've lost a legitimate charge. The key has to be unique per logical action (a stable order or checkout id), not per value. Stability and uniqueness both matter, and they cut in opposite directions.

- Don't record a transient failure as a final result.
Stripe saves the result "regardless of whether it succeeds or fails," "including 500 errors," and that's right for their boundary, where the key maps to one HTTP exchange. In your own ledger it's a trap if you're not deliberate. If the first attempt hit a timeout or a 500 that didn't actually charge, and you record that failure as the result, every retry until the TTL expires replays the frozen error instead of trying again. You've turned a transient failure into a permanent one. Decide per error class what counts as "final": a terminal result (charged, or hard-declined) gets recorded; a retryable transient one does not, so a fresh attempt can still run.

None of these are reasons to skip the ledger. They're reasons to build the minimum version correctly: a key that's both stable and unique per action, an atomic check-and-record, only terminal results recorded, persistence, a TTL.

If your agent has any tool with an irreversible side effect (payment, email, refund, an external POST), do three things this week.

- Send the provider an Idempotency-Key header where it accepts one (Stripe and many others do, so use theirs, don't reinvent it). For an external side effect this is what actually stops the double-charge, because the dedup happens where the charge happens.

Backoff and jitter stay. They were never the problem. They just can't see the side effect, and the side effect is the whole game.

Here's my open question, and I don't have a clean answer: where should the idempotency key actually live for an LLM agent? At the model layer (the agent commits to a key before it ever calls the tool), at the runtime layer (the framework derives it from the tool name + args), or only at the API boundary (let Stripe-style services own it and treat your own tools as unsafe)? I've shipped the runtime-layer version. I suspect the model-layer version is more correct and more fragile. If you've wired at-most-once tool calls into a real agent, I want to hear where you put the key, and what broke.
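For the "minimum version" checklist above, here is one way the pieces could fit together for a side effect you own. Treat it as a sketch under stated assumptions, not the demo's code: SQLite stands in for whatever store you run, INSERT OR IGNORE on a primary key plays the role of the unique-constraint check-and-record, and the names open_ledger, call_once_durable, prune and TransientError are all hypothetical. It reserves a pending row before the call, finalizes it after, releases the key on a failure known to be transient, and prunes only finished rows.

```python
import json
import sqlite3
import time


class TransientError(Exception):
    """Hypothetical marker: the side effect is known NOT to have fired."""


def open_ledger(path=":memory:"):
    """Open (or create) the ledger table. The schema is an assumption."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS ledger ("
        " key TEXT PRIMARY KEY,"   # uniqueness makes the reserve step atomic
        " status TEXT NOT NULL,"   # 'pending' or 'done'
        " result TEXT,"
        " created_at REAL NOT NULL)"
    )
    db.commit()
    return db


def call_once_durable(db, key, side_effect):
    """Run side_effect at most once per key; replay the recorded result after."""
    # Phase 1: reserve the key. Of two racing callers, only one insert lands.
    cur = db.execute(
        "INSERT OR IGNORE INTO ledger (key, status, created_at) "
        "VALUES (?, 'pending', ?)",
        (key, time.time()),
    )
    db.commit()
    if cur.rowcount == 0:  # key already present
        status, result = db.execute(
            "SELECT status, result FROM ledger WHERE key = ?", (key,)
        ).fetchone()
        if status == "done":
            return json.loads(result), True  # replay: no side effect
        # Pending: an earlier attempt may have fired. Accept "maybe zero".
        raise RuntimeError(f"{key} is pending; reconcile before retrying")
    try:
        result = side_effect()
    except TransientError:
        # Nothing fired, so release the key and let a fresh attempt run.
        db.execute("DELETE FROM ledger WHERE key = ?", (key,))
        db.commit()
        raise
    # Phase 2: finalize with the terminal result.
    db.execute(
        "UPDATE ledger SET status = 'done', result = ? WHERE key = ?",
        (json.dumps(result), key),
    )
    db.commit()
    return result, False


def prune(db, ttl_seconds, now=None):
    """Delete finished rows older than the TTL; pending rows are kept."""
    now = time.time() if now is None else now
    cur = db.execute(
        "DELETE FROM ledger WHERE status = 'done' AND created_at < ?",
        (now - ttl_seconds,),
    )
    db.commit()
    return cur.rowcount
```

The design choice worth noticing is what happens on a crash between the two phases: the row stays pending, and a retry refuses to run instead of charging again. That is the "maybe zero, never two" trade, and it means something (a human, or a reconciliation job) has to resolve pending rows. For an external provider this still does not replace sending the key to the callee.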
Follow for the next teardown from production, and if you've watched an agent double-fire a write tool, tell me what the side effect was and how you caught it. I read every comment.

AI-disclosure: drafted with an AI writing assistant, edited by a human. The Python above was run on my machine before publishing (Python 3, stdlib only); the output block is copied verbatim from stdout and the asserts pass deterministically.