# Our retry loop made an outage worse. The circuit breaker stopped the cascade.

An overly aggressive retry policy in an agent service worsened a partial API outage from Anthropic, causing 18,000 wasted calls and 9 extra minutes of recovery time. The author created a lightweight Rust circuit breaker crate (under 400 lines) that uses a simple state machine (Closed, Open, and HalfOpen) to immediately block retries when failures exceed a threshold, preventing cascading failures. The circuit breaker is designed to work alongside a retry library, with shared breakers recommended for multi-worker setups to protect all workers from upstream failures.

A few weeks back there was a 22-minute window where Anthropic returned a high rate of 5xx responses. Not a full outage. Degraded.

Our agent service had a retry policy that backed off and tried again on 5xx. With six workers and a shared retry budget that I had set too high, we were re-issuing failed calls about as fast as the API was returning errors. When the API recovered, we had a backlog of in-flight retries that pushed us straight back into rate limiting.

Total cost of the bad decision: about 18,000 wasted Anthropic calls and 9 minutes of additional recovery time after their side came back. Nothing user-visible blew up, but I felt bad about it.

I wrote `llm-circuit-breaker` the next day. It is small. The whole crate is under 400 lines of Rust. It pairs with `llm-retry`.

## The state machine

```
            failures >= threshold
+--------+ ----------------------> +------+
| Closed |                         | Open |
+--------+                         +------+
    ^                               |    ^
    | half-open success             |    | half-open failure
    |              cooldown elapsed v    |
    |                          +----------+
    +------------------------- | HalfOpen |
                               +----------+
```

- **Closed**: calls go through. Failures are counted.
- **Open**: calls return `BreakerError::Open` immediately without hitting the API. After a cooldown, the breaker transitions to HalfOpen.
- **HalfOpen**: exactly one trial call is allowed through.
  If it succeeds, back to Closed. If it fails, back to Open with the cooldown reset.

That is the entire state machine. No leaky bucket. No fancy sliding window. Just enough to stop a runaway retry from making a partial outage into a full one.

## What it looks like in code

```rust
use llm_circuit_breaker::{Breaker, BreakerConfig, BreakerError};
use std::time::Duration;

let breaker = Breaker::new(BreakerConfig {
    failure_threshold: 5,
    success_threshold: 1,
    cooldown: Duration::from_secs(30),
});

let result = breaker.call(|| async {
    client.messages().create(payload).await
}).await;

match result {
    Ok(resp) => handle(resp),
    Err(BreakerError::Open) => {
        // skip the call entirely, return cached or fallback
    }
    Err(BreakerError::Inner(e)) => {
        // the upstream failed; the breaker counted it
    }
}
```

Everything is Arc