{"slug": "our-retry-loop-made-an-outage-worse-the-circuit-breaker-stopped-the-cascade", "title": "Our retry loop made an outage worse. The circuit breaker stopped the cascade.", "summary": "An overly aggressive retry policy in an agent service worsened a partial API outage from Anthropic, causing 18,000 wasted calls and 9 extra minutes of recovery time. The author created a lightweight Rust circuit breaker crate (under 400 lines) that uses a simple state machine (Closed, Open, and HalfOpen) to immediately block retries when failures exceed a threshold, preventing cascading failures. The circuit breaker is designed to work alongside a retry library, with shared breakers recommended for multi-worker setups to protect all workers from upstream failures.", "body_md": "A few weeks back there was a 22-minute window where Anthropic returned a high rate of 5xx responses. Not a full outage. Degraded.\n\nOur agent service had a retry policy that backed off and tried again on 5xx. With six workers and a shared retry budget that I had set too high, we were re-issuing failed calls about as fast as the API was returning errors. When the API recovered, we had a backlog of in-flight retries that pushed us straight back into rate limiting.\n\nTotal cost of the bad decision: about 18,000 wasted Anthropic calls and 9 minutes of additional recovery time after their side came back. Nothing user-visible blew up, but I felt bad about it.\n\nI wrote `llm-circuit-breaker` the next day. It is small. The whole crate is under 400 lines of Rust. 
It pairs with `llm-retry`.\n\n## The state machine\n\n```text\n               failures >= threshold\n   +--------+ ----------------------> +------+\n   | Closed |                         | Open | <---------+\n   +--------+                         +------+           |\n       ^                                  |              |\n       | half_open               cooldown |    half_open |\n       | success                 elapsed  v    failure   |\n       |                             +----------+        |\n       +---------------------------- | HalfOpen | -------+\n                                     +----------+\n```\n\n- **Closed**: calls go through. Failures are counted.\n- **Open**: calls return `BreakerError::Open` immediately without hitting the API. After a cooldown, the breaker transitions to HalfOpen.\n- **HalfOpen**: exactly one trial call is allowed through. If it succeeds, back to Closed. If it fails, back to Open with the cooldown reset.\n\nThat is the entire state machine. No leaky bucket. No fancy sliding window. Just enough to stop a runaway retry from making a partial outage into a full one.\n\n## What it looks like in code\n\n```rust\nuse llm_circuit_breaker::{Breaker, BreakerConfig};\nuse std::time::Duration;\n\nlet breaker = Breaker::new(BreakerConfig {\n    failure_threshold: 5,\n    success_threshold: 1,\n    cooldown: Duration::from_secs(30),\n});\n\nlet result = breaker.call(|| async {\n    client.messages().create(payload).await\n}).await;\n\nmatch result {\n    Ok(resp) => handle(resp),\n    Err(BreakerError::Open) => {\n        // skip the call entirely, return cached or fallback\n    }\n    Err(BreakerError::Inner(e)) => {\n        // the upstream failed; the breaker counted it\n    }\n}\n```\n\nEverything is `Arc<Mutex<...>>` inside, so it is safe to share across tasks. There is also an `is_open()` for a cheap pre-check that does not hold the lock for long.\n\n## Threshold tuning, honestly\n\nI tuned the numbers by accident at first and got them wrong. 
Here is what worked after a few iterations.\n\nFor Anthropic `messages.create` from a single-worker process:\n\n- `failure_threshold: 5`. Three is too jumpy. Ten is too slow. Five is a few seconds of bad calls before tripping.\n- `cooldown: Duration::from_secs(30)`. Long enough that you do not spam the half-open probe. Short enough that recovery is fast.\n- `success_threshold: 1`. One good response is enough to close. The breaker is not a health system, it is a stop-loss.\n\nFor a multi-worker pool sharing one breaker (which is what we have in production now):\n\n- Scale the failure threshold by sqrt of worker count, not linearly. Six workers do not need 30 failures to trip. They need maybe 12 (5 times sqrt(6) is about 12.2).\n- Keep the cooldown the same. Cooldown is about the upstream, not your side.\n\nWhat I learned the hard way: a per-worker breaker is not the same as a shared breaker. Per-worker, every worker has to learn the upstream is down independently. Shared, one worker's failure protects the others. We use one shared breaker per upstream provider now.\n\n## Composing with retry\n\n`llm-retry` does exponential backoff with jitter. `llm-circuit-breaker` cuts off retries when the breaker is open. Together they cover both the rare-flake case (retry handles it) and the cascading-failure case (breaker handles it).\n\n```rust\nuse llm_retry::{retry, RetryConfig};\nuse llm_circuit_breaker::Breaker;\nuse std::time::Duration;\n\nlet cfg = RetryConfig {\n    max_attempts: 4,\n    base_delay: Duration::from_millis(500),\n    max_delay: Duration::from_secs(8),\n    ..Default::default()\n};\n\nretry(cfg, || async {\n    breaker.call(|| async { client.messages().create(p.clone()).await }).await\n}).await\n```\n\nIf the breaker is open, the inner closure returns `BreakerError::Open` immediately. `llm-retry` sees that as non-retryable (you configure it that way) and bails out fast. 
No retry storm.\n\n## Numbers from the dry-run\n\nI ran a simulation against a mock server that returns 5xx for the first 60 seconds and then 200s.\n\n- Without breaker, with a 4-attempt retry policy: 1,140 wasted requests during the bad window, full backlog hit on recovery.\n- With breaker (`threshold=5, cooldown=30s`): 19 wasted requests, recovered cleanly with one half-open probe.\n\nThat is the difference between \"we noticed the outage\" and \"we made the outage worse for ourselves.\"\n\n## What this does not solve\n\n- It does not detect that an upstream is slow but not erroring. If every call takes 25 seconds but eventually returns 200, the breaker stays closed. You would want a latency-based tripwire for that, and I have not added one. `agenttrace-rs` will at least surface the p95 so you notice.\n- It is not coordinated across processes. Each replica of your service has its own breaker. If you want a globally-coordinated breaker, you need a shared store (Redis, etc.) and this crate is not it.\n- The cooldown is fixed. There is no adaptive cooldown that grows on repeated trips. I considered it and decided fixed-cooldown is honest enough.\n- Half-open allows exactly one probe. If your traffic is very high and you do not want the recovery to bottleneck on a single probe, you would want a token-bucket half-open. Not implemented.\n\nThe crate has no async-runtime lock-in. It works under tokio, async-std, or sync (the closure shape changes).\n\nRepo: [https://github.com/MukundaKatta/llm-circuit-breaker](https://github.com/MukundaKatta/llm-circuit-breaker)\n\ncrates.io: `llm-circuit-breaker = \"0.1\"`\n\nPart of a small stack of Rust crates I publish for the unglamorous LLM plumbing (retry, budget, repair, cost, trace). 
Pairs cleanly with `llm-retry` and `agenttrace-rs`.", "url": "https://wpnews.pro/news/our-retry-loop-made-an-outage-worse-the-circuit-breaker-stopped-the-cascade", "canonical_source": "https://dev.to/mukundakatta/our-retry-loop-made-an-outage-worse-the-circuit-breaker-stopped-the-cascade-4n47", "published_at": "2026-05-21 01:52:16+00:00", "updated_at": "2026-05-21 02:02:17.031763+00:00", "lang": "en", "topics": ["developer-tools", "large-language-models", "artificial-intelligence", "open-source", "cloud-computing"], "entities": ["Anthropic", "llm-circuit-breaker", "llm-retry", "Rust"], "alternates": {"html": "https://wpnews.pro/news/our-retry-loop-made-an-outage-worse-the-circuit-breaker-stopped-the-cascade", "markdown": "https://wpnews.pro/news/our-retry-loop-made-an-outage-worse-the-circuit-breaker-stopped-the-cascade.md", "text": "https://wpnews.pro/news/our-retry-loop-made-an-outage-worse-the-circuit-breaker-stopped-the-cascade.txt", "jsonld": "https://wpnews.pro/news/our-retry-loop-made-an-outage-worse-the-circuit-breaker-stopped-the-cascade.jsonld"}}