Our retry loop made an outage worse. The circuit breaker stopped the cascade.

An overly aggressive retry policy in an agent service worsened a partial API outage from Anthropic, causing 18,000 wasted calls and 9 extra minutes of recovery time. The author created a lightweight Rust circuit breaker crate (under 400 lines) that uses a simple state machine—Closed, Open, and HalfOpen—to immediately block retries when failures exceed a threshold, preventing cascading failures. The circuit breaker is designed to work alongside a retry library, with shared breakers recommended for multi-worker setups to protect all workers from upstream failures.

A few weeks back there was a 22-minute window where Anthropic returned a high rate of 5xx responses. Not a full outage. Degraded. Our agent service had a retry policy that backed off and tried again on 5xx. With six workers and a shared retry budget that I had set too high, we were re-issuing failed calls about as fast as the API was returning errors. When the API recovered, we had a backlog of in-flight retries that pushed us straight back into rate limiting.

Total cost of the bad decision: about 18,000 wasted Anthropic calls and 9 minutes of additional recovery time after their side came back. Nothing user-visible blew up, but I felt bad about it.

I wrote `llm-circuit-breaker` the next day. It is small. The whole crate is under 400 lines of Rust. It pairs with `llm-retry`.

The state machine

```text
            failures >= threshold
+--------+ ----------------------> +------+
| Closed |                         | Open |
+--------+                         +------+
    ^                               |    ^
    | half-open success    cooldown |    | half-open failure
    |                       elapsed v    |
    |                          +----------+
    +------------------------- | HalfOpen |
                               +----------+
```

- `Closed`: calls go through. Failures are counted.
- `Open`: calls return `BreakerError::Open` immediately without hitting the API. After a cooldown, the breaker transitions to `HalfOpen`.
- `HalfOpen`: exactly one trial call is allowed through. If it succeeds, back to `Closed`. If it fails, back to `Open` with the cooldown reset.

That is the entire state machine. No leaky bucket. No fancy sliding window. Just enough to stop a runaway retry from making a partial outage into a full one.
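The three states and four transitions above can be sketched as a small synchronous state machine. This is a minimal illustration, not the crate's actual internals: `SketchBreaker`, `allow`, `on_success`, and `on_failure` are hypothetical names, time is passed in explicitly so the transitions are easy to test, and it assumes failures are counted consecutively (a success resets the count), which the post does not specify.

```rust
use std::time::{Duration, Instant};

/// The three breaker states from the diagram.
#[derive(Debug, Clone, Copy, PartialEq)]
enum State {
    Closed { failures: u32 },
    Open { since: Instant },
    HalfOpen,
}

/// Hypothetical stand-in for illustration; not the crate's real type.
struct SketchBreaker {
    state: State,
    failure_threshold: u32,
    cooldown: Duration,
}

impl SketchBreaker {
    fn new(failure_threshold: u32, cooldown: Duration) -> Self {
        Self {
            state: State::Closed { failures: 0 },
            failure_threshold,
            cooldown,
        }
    }

    /// Returns true if a call may go through at time `now`.
    /// An Open breaker moves to HalfOpen once the cooldown has elapsed.
    fn allow(&mut self, now: Instant) -> bool {
        match self.state {
            State::Closed { .. } => true,
            State::Open { since }
                if now.saturating_duration_since(since) >= self.cooldown =>
            {
                self.state = State::HalfOpen;
                true // this is the single trial call
            }
            State::Open { .. } => false,
            // A trial call is already in flight; block everything else.
            State::HalfOpen => false,
        }
    }

    /// Record a successful call.
    fn on_success(&mut self) {
        match self.state {
            // Late result from a call issued before the trip; ignore it.
            State::Open { .. } => {}
            // Closed: reset the count (assumed). HalfOpen: trial passed.
            _ => self.state = State::Closed { failures: 0 },
        }
    }

    /// Record a failed call observed at time `now`.
    fn on_failure(&mut self, now: Instant) {
        self.state = match self.state {
            State::Closed { failures } if failures + 1 < self.failure_threshold => {
                State::Closed { failures: failures + 1 }
            }
            // Late failure while already open; keep the running cooldown.
            State::Open { since } => State::Open { since },
            // Threshold reached, or the half-open trial failed:
            // open the breaker and start a fresh cooldown.
            _ => State::Open { since: now },
        };
    }
}
```

A caller would check `allow` before each request and report the outcome afterwards. The crate's `call` method, as described in this post, appears to bundle those steps into one.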
What it looks like in code

```rust
use llm_circuit_breaker::{Breaker, BreakerConfig, BreakerError};
use std::time::Duration;

let breaker = Breaker::new(BreakerConfig {
    failure_threshold: 5,
    success_threshold: 1,
    cooldown: Duration::from_secs(30),
});

let result = breaker.call(|| async {
    client.messages().create(payload).await
}).await;

match result {
    Ok(resp) => handle(resp),
    Err(BreakerError::Open) => {
        // skip the call entirely, return cached or fallback
    }
    Err(BreakerError::Inner(e)) => {
        // the upstream failed; the breaker counted it
    }
}
```

Everything is `Arc<Mutex<...>>` inside. Safe to share across tasks. There is also an `is_open()` for a cheap pre-check that does not hold the lock for long.

Threshold tuning, honestly

I tuned the numbers by accident at first and got them wrong. Here is what worked after a few iterations.

For Anthropic `messages.create` from a single-worker process:

- `failure_threshold: 5`. Three is too jumpy. Ten is too slow. Five is a few seconds of bad calls before tripping.
- `cooldown: Duration::from_secs(30)`. Long enough that you do not spam the half-open probe. Short enough that recovery is fast.
- `success_threshold: 1`. One good response is enough to close. The breaker is not a health system, it is a stop-loss.

For a multi-worker pool sharing one breaker (which is what we have in production now):

- Scale the failure threshold by the square root of the worker count, not linearly. Six workers do not need 30 failures to trip. They need maybe 12.
- Keep the cooldown the same. Cooldown is about the upstream, not your side.

What I learned the hard way: a per-worker breaker is not the same as a shared breaker. Per-worker, every worker has to learn the upstream is down independently. Shared, one worker's failure protects the others. We use one shared breaker per upstream provider now.

Composing with retry

`llm-retry` does exponential backoff with jitter. `llm-circuit-breaker` cuts off retries when the breaker is open.
The two together cover both the rare-flake case (retry handles it) and the cascading-failure case (breaker handles it).

```rust
use llm_retry::{retry, RetryConfig};
use llm_circuit_breaker::Breaker;
use std::time::Duration;

let cfg = RetryConfig {
    max_attempts: 4,
    base_delay: Duration::from_millis(500),
    max_delay: Duration::from_secs(8),
    ..Default::default()
};

retry(cfg, || async {
    breaker.call(|| async {
        client.messages().create(p.clone()).await
    }).await
}).await
```

If the breaker is open, the inner closure returns `BreakerError::Open` immediately. `llm-retry` sees that as non-retryable (you configure it that way) and bails out fast. No retry storm.

Numbers from the dry-run

I ran a simulation against a mock server that returns 5xx for the first 60 seconds and then 200s.

- Without breaker, with a 4-attempt retry policy: 1,140 wasted requests during the bad window, full backlog hit on recovery.
- With breaker (threshold=5, cooldown=30s): 19 wasted requests, recovered cleanly with one half-open probe.

That is the difference between "we noticed the outage" and "we made the outage worse for ourselves."

What this does not solve

- It does not detect that an upstream is slow but not erroring. If every call takes 25 seconds but eventually returns 200, the breaker stays closed. You would want a latency-based tripwire for that, and I have not added one. `agenttrace-rs` will at least surface the p95 so you notice.
- It is not coordinated across processes. Each replica of your service has its own breaker. If you want a globally coordinated breaker, you need a shared store (Redis, etc.) and this crate is not it.
- The cooldown is fixed. There is no adaptive cooldown that grows on repeated trips. I considered it and decided fixed-cooldown is honest enough.
- Half-open allows exactly one probe. If your traffic is very high and you do not want the recovery to bottleneck on a single probe, you would want a token-bucket half-open. Not implemented.

The crate has no async-runtime lock-in.
It works under tokio, async-std, or sync (the closure shape changes).

Repo: https://github.com/MukundaKatta/llm-circuit-breaker

crates.io: `llm-circuit-breaker = "0.1"`

Part of a small stack of Rust crates I publish for the unglamorous LLM plumbing (retry, budget, repair, cost, trace). Pairs cleanly with `llm-retry` and `agenttrace-rs`.