Last month's Anthropic invoice was $312. After one architectural change, May came in at $156 — exactly half. The culprit wasn't prompt bloat or model choice. It was the absence of compensating actions in my multi-step agent workflow.
The pattern is embarrassingly common: a 5-step pipeline fails at Step 4, so you restart from the top. Every restart re-runs every LLM call before the failure point. My ad analytics SaaS runs Claude Sonnet to summarize raw data in Step 2. That step averages ~8K input tokens per advertiser. At $3 per million input tokens (Sonnet 3.7), one restart costs about $0.024 per advertiser in input tokens alone. That is trivial in isolation, but I have 200+ advertisers and this pipeline was failing repeatedly. Step 2 alone burned $40–50 in duplicated calls over April.
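As a rough check on those numbers, here is a small sketch of the arithmetic. The helper name and constant are hypothetical, and it counts input tokens only, so it understates the real cost of each duplicated call.

```typescript
// Hypothetical helper: input-token spend wasted by re-running one LLM step.
const INPUT_PRICE_PER_MILLION_USD = 3; // Sonnet input price quoted in the post

function wastedInputCostUsd(inputTokens: number, reruns: number): number {
  return (inputTokens / 1_000_000) * INPUT_PRICE_PER_MILLION_USD * reruns;
}

const perRestart = wastedInputCostUsd(8_000, 1);     // one advertiser, one restart
const perFullSweep = wastedInputCostUsd(8_000, 200); // 200 advertisers, one restart each

console.log(perRestart.toFixed(3));   // "0.024"
console.log(perFullSweep.toFixed(2)); // "4.80"
```

On input tokens alone, $40–50 works out to somewhere around 1,700 to 2,100 redundant Step 2 calls at this price. Counting output tokens as well would lower that figure.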
The deeper problem: I had no rollback mechanism at all. When Step 5 (a Slack webhook to an advertiser portal) failed with a 503 on a cold-start Worker, R2 already had the file, D1 already had the log row. Restarting the pipeline created duplicate files, duplicate database rows, and one advertiser asking why they got the same report twice. I'd assumed "restart = safe." That assumption was wrong.
The fix has two parts. First, I write a `pipeline_runs` row at the start of every run, updating it with a `step_completed` checkpoint and a `step_output_ref` (the actual R2 key or D1 row ID) after each step succeeds. Second, on failure, a `rollbackPipelineRun()` function reads those refs and deletes whatever was written: R2 file gone, D1 row gone, status flipped to `rolled_back`. On retry, the agent checks for an existing in-progress run and skips already-completed steps entirely:
```typescript
if (existingRun && existingRun.step_completed >= 2) {
  // Step 2 already succeeded in an earlier attempt: reuse its stored output.
  summary = existingRun.cached_summary; // no Claude call
} else {
  summary = await callClaude(data);
}
```
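The checkpoint-and-rollback half of the fix could look roughly like the sketch below. It is not the production code: two `Map` objects stand in for R2 and D1, the step numbers for the two writes are assumed, and every name other than `rollbackPipelineRun`, `step_completed` and `rolled_back` is hypothetical.

```typescript
// Sketch only: Maps stand in for R2 and D1; real calls would be awaited network requests.
type OutputRef = { store: "r2" | "d1"; ref: string };

interface PipelineRun {
  advertiserId: string;
  status: "in_progress" | "completed" | "rolled_back";
  step_completed: number;
  output_refs: OutputRef[]; // one entry per side effect, in write order
}

const r2Files = new Map<string, string>(); // R2 key -> file body
const d1Rows = new Map<string, string>();  // D1 row ID -> log line

// Called after a step succeeds: advance the checkpoint and remember what it wrote.
function checkpoint(run: PipelineRun, step: number, written?: OutputRef): void {
  run.step_completed = step;
  if (written) run.output_refs.push(written);
}

// Compensating action: delete recorded writes newest-first, then flag the run.
async function rollbackPipelineRun(run: PipelineRun): Promise<void> {
  for (const { store, ref } of [...run.output_refs].reverse()) {
    if (store === "r2") r2Files.delete(ref);
    else d1Rows.delete(ref);
  }
  run.output_refs = [];
  run.status = "rolled_back";
}

// Simulated run: the R2 and D1 writes succeed, then the webhook step fails.
async function demo(): Promise<PipelineRun> {
  const run: PipelineRun = {
    advertiserId: "adv_1", status: "in_progress", step_completed: 0, output_refs: [],
  };
  try {
    r2Files.set("reports/adv_1.json", "{}");
    checkpoint(run, 3, { store: "r2", ref: "reports/adv_1.json" });
    d1Rows.set("log_1", "report generated");
    checkpoint(run, 4, { store: "d1", ref: "log_1" });
    throw new Error("503 from webhook"); // Step 5 fails
  } catch {
    await rollbackPipelineRun(run);
  }
  return run;
}
```

Undoing writes in reverse order is a design choice in this sketch rather than something the post specifies. It mirrors how compensating actions are usually sequenced when later writes depend on earlier ones.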
One thing idempotency keys don't solve here: they prevent duplicate side effects, but they don't prevent re-spending tokens on an identical LLM call. You need both — idempotency on the storage writes and checkpointed caching on the inference steps.
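To make the distinction concrete, here is a minimal sketch that uses both techniques side by side. All names are hypothetical, the `Map` objects stand in for R2 and a summary cache, and `callClaudeStub` replaces the real model call so the example runs offline.

```typescript
// Sketch only: all names are hypothetical stand-ins, not the pipeline's real code.
const reportFiles = new Map<string, string>();  // stand-in for R2
const summaryCache = new Map<string, string>(); // stand-in for a checkpointed summary
let claudeCalls = 0;

// Offline stub for the model call, so the sketch runs without an API key.
async function callClaudeStub(data: string): Promise<string> {
  claudeCalls += 1;
  return `summary of ${data.length} chars`;
}

// Idempotent write: the key is derived from the run's identity, so a retry
// overwrites the same object instead of creating a second file.
function writeReport(advertiserId: string, period: string, body: string): string {
  const key = `reports/${advertiserId}/${period}.json`;
  reportFiles.set(key, body);
  return key;
}

// Checkpointed inference: reuse the stored summary if this run already has one.
async function summarizeOnce(runId: string, data: string): Promise<string> {
  const cached = summaryCache.get(runId);
  if (cached !== undefined) return cached;
  const summary = await callClaudeStub(data);
  summaryCache.set(runId, summary);
  return summary;
}

// Two attempts at the same run: one model call, one stored file.
async function runTwice(): Promise<void> {
  for (let attempt = 0; attempt < 2; attempt++) {
    const summary = await summarizeOnce("run_adv_1_2025-04", "raw ad data");
    writeReport("adv_1", "2025-04", summary);
  }
}
```

Dropping `summarizeOnce` would still leave a single file, but the model would be called twice. Dropping the deterministic key would save the tokens but write two files.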
There are still rough edges: a race condition when two runs start simultaneously for the same advertiser (D1 doesn't fully guarantee serializable isolation between a SELECT and INSERT), and no clean answer for truly irreversible actions like sent emails or processed payments.
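The race is easy to reproduce in miniature. The sketch below only illustrates the failure mode and is not a fix: an array stands in for the `pipeline_runs` table, the function names are hypothetical, and the `await` between the check and the write plays the role of a network round trip to D1.

```typescript
// Sketch only: an in-memory array stands in for the pipeline_runs table.
const activeRuns: string[] = [];

// Check-then-insert with a gap in between, like a SELECT followed by an INSERT.
async function startRunUnsafe(advertiserId: string): Promise<void> {
  const alreadyRunning = activeRuns.includes(advertiserId); // the "SELECT"
  await Promise.resolve(); // yield, as a real database round trip would
  if (!alreadyRunning) activeRuns.push(advertiserId);       // the "INSERT"
}

// Two runs for the same advertiser start at the same moment.
async function raceDemo(): Promise<number> {
  await Promise.all([startRunUnsafe("adv_1"), startRunUnsafe("adv_1")]);
  return activeRuns.length; // 2: both checks ran before either insert
}
```

Both callers see "no existing run" because neither insert has happened yet when the checks execute, so the guard passes twice.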
I wrote up the full breakdown — including the race condition I haven't fixed yet and why Durable Objects might be the answer — over on riversealab.com.