Fault-injecting our LLM provider to trust Bifrost fallbacks

Buildkite ran a game day that fault-injected OpenAI with 429s and 500s to test whether Bifrost's fallback config would reroute requests for an LLM-backed build-failure summariser. After fixing a retry ceiling and adding a request timeout, the gateway successfully rerouted to Anthropic's Claude Haiku, preventing any user-visible failures. The exercise demonstrated that slow responses, not just errors, must be treated as failures to trigger fallbacks.

TL;DR: We run an LLM-backed build-failure summariser at Buildkite. To stop a provider wobble from breaking it mid-deploy, I ran a game day that fault-injected OpenAI with 429s and 500s and watched whether Bifrost's fallback config actually rerouted. It did, but only after I fixed two things I'd set up wrong. We've got a small service that reads failed CI jobs and writes a one-paragraph summary into the build annotation, so engineers don't have to scroll 4,000 lines of test log to find the one assertion that broke. It calls an LLM. Handy when it works. Embarrassing when it doesn't, because a broken annotation makes people distrust every annotation. The problem is the thing it depends on isn't ours. OpenAI rate-limits, has the occasional 5xx spell, and we don't get a heads-up. "Never had an outage" usually means you never tested the failure path. So I tested it. I didn't want fallback logic smeared across our service code. Retry-with-jitter, secondary provider, key rotation, all of that wants to live in one place with metrics attached. We put Bifrost in front, an OpenAI-compatible gateway, so our service keeps talking the same /v1/chat/completions it always did and the routing decisions move to config. The pitch is plain. One endpoint, 23+ providers behind it, automatic fallbacks between them. Our code points at localhost:8080 instead of api.openai.com and stops caring which model actually answers. Here's the fallback config I started the game day with: { "providers": { "openai": { "keys": "env.OPENAI KEY A", "env.OPENAI KEY B" }, "anthropic": { "keys": "env.ANTHROPIC KEY" } }, "fallbacks": "openai/gpt-4o-mini", "anthropic/claude-haiku-4-5" } Two OpenAI keys for load balancing, then Anthropic as the lifeboat if OpenAI as a whole goes sideways. That was the theory. A game day is just a planned outage you cause on purpose, with people watching. I scheduled 45 minutes, told the team, and put a toxiproxy in front of OpenAI so I could inject faults without waiting for the real thing to break. Three scenarios: Scenario one went fine. Bifrost saw the 429s, rotated between key A and key B, then gave up on OpenAI and the requests landed on Haiku. Annotations kept writing. Reckoned I was done. Scenario two found my first mistake. I'd not set a sane retry ceiling, so on a 503 the gateway retried hard against the same struggling provider before failing over, and our p95 on annotation writes jumped to about 18 seconds. Fixed it by capping retries and letting the fallback fire sooner. The README's retries and fallbacks https://docs.getbifrost.ai/features/retries-and-fallbacks page covers the knobs; I'd skimmed it the first time. Scenario three is the one everyone gets wrong. Slow isn't down. A 30-second response isn't an error, so naive fallback never triggers, the request just sits there. We added a request timeout so a tar-pitted provider counts as a failure and trips the lifeboat. That single change is the actual reason this exercise was worth running. Bifrost ships native Prometheus metrics, so I didn't have to bolt on my own. I watched fallback rate and per-provider latency the whole time on a Grafana board. | Scenario | Without fallback | With Bifrost tuned | |---|---|---| | 429 storm | annotations stall | reroute to Haiku, ~2.1s p95 | | Hard 503s | 50% writes fail | 0 user-visible failures | | 30s latency | every write hangs | timeout trips fallback in 4s | The numbers that mattered: zero broken annotations across all three once tuned, and the fallback decisions were visible in metrics instead of buried in logs nobody reads. I'd used LiteLLM before. Worth being honest here. | Bifrost | LiteLLM | Portkey | | |---|---|---|---| | OpenAI-compatible endpoint | yes | yes | yes | | Automatic fallbacks | yes | yes | yes | | Native Prometheus metrics | yes | yes | yes | | Self-host story | single Go binary | Python proxy | gateway is OSS, control plane hosted | | Maturity / ecosystem | newer | large, lots of integrations | polished dashboards | LiteLLM has been around longer and has a bigger pile of community integrations, which counts for something when you hit an edge case at 2am. Portkey's hosted dashboards are nicer than anything I'd build myself, and if you don't want to run infra that's a fair trade. We picked Bifrost mostly because a single Go binary is easy for an infra team to operate and the Prometheus output https://docs.getbifrost.ai/features/observability/default dropped straight into our existing board with no glue. Not a knock on the others. Different priorities. A gateway is one more hop you have to keep alive. If Bifrost falls over, every LLM call falls with it, so we run two replicas behind a load balancer and the game day included killing one of them too. Fallback to a different model means a different model. Haiku doesn't write the exact same summary as gpt-4o-mini, and for a build annotation that's fine, but if you depend on a strict output schema you need to test the lifeboat actually produces it. We caught one prompt that assumed OpenAI-specific formatting. And fault injection in front of a proxy isn't the real provider misbehaving. Toxiproxy gives you 429s and delays, not the weird partial-stream failures you see in the wild. It's a model of the failure, not the failure. Better than nothing, not the whole story. Semantic caching is on the roadmap for us, not load-bearing yet, so I'm not going to claim numbers I haven't measured.