Fault-injecting our LLM provider to trust Bifrost fallbacks

wpnews.pro

cd /news/large-language-models/fault-injecting-our-llm-provider-to-… · home › topics › large-language-models › article

[ARTICLE · art-33997] src=dev.to ↗ pub=2026-06-19T13:26Z topic=large-language-models verified=true sentiment=↑ positive

Fault-injecting our LLM provider to trust Bifrost fallbacks

Buildkite ran a game day that fault-injected OpenAI with 429s and 500s to test whether Bifrost's fallback config would reroute requests for an LLM-backed build-failure summariser. After fixing a retry ceiling and adding a request timeout, the gateway successfully rerouted to Anthropic's Claude Haiku, preventing any user-visible failures. The exercise demonstrated that slow responses, not just errors, must be treated as failures to trigger fallbacks.

read5 min views2 publishedJun 19, 2026

TL;DR: We run an LLM-backed build-failure summariser at Buildkite. To stop a provider wobble from breaking it mid-deploy, I ran a game day that fault-injected OpenAI with 429s and 500s and watched whether Bifrost's fallback config actually rerouted. It did, but only after I fixed two things I'd set up wrong.

We've got a small service that reads failed CI jobs and writes a one-paragraph summary into the build annotation, so engineers don't have to scroll 4,000 lines of test log to find the one assertion that broke. It calls an LLM. Handy when it works. Embarrassing when it doesn't, because a broken annotation makes people distrust every annotation.

The problem is the thing it depends on isn't ours. OpenAI rate-limits, has the occasional 5xx spell, and we don't get a heads-up. "Never had an outage" usually means you never tested the failure path. So I tested it.

I didn't want fallback logic smeared across our service code. Retry-with-jitter, secondary provider, key rotation, all of that wants to live in one place with metrics attached. We put Bifrost in front, an OpenAI-compatible gateway, so our service keeps talking the same /v1/chat/completions

it always did and the routing decisions move to config.

The pitch is plain. One endpoint, 23+ providers behind it, automatic fallbacks between them. Our code points at localhost:8080

instead of api.openai.com

and stops caring which model actually answers.

Here's the fallback config I started the game day with:

{
  "providers": {
    "openai": { "keys": ["env.OPENAI_KEY_A", "env.OPENAI_KEY_B"] },
    "anthropic": { "keys": ["env.ANTHROPIC_KEY"] }
  },
  "fallbacks": [
    "openai/gpt-4o-mini",
    "anthropic/claude-haiku-4-5"
  ]
}

Two OpenAI keys for load balancing, then Anthropic as the lifeboat if OpenAI as a whole goes sideways. That was the theory.

A game day is just a planned outage you cause on purpose, with people watching. I scheduled 45 minutes, told the team, and put a toxiproxy in front of OpenAI so I could inject faults without waiting for the real thing to break.

Three scenarios:

Scenario one went fine. Bifrost saw the 429s, rotated between key A and key B, then gave up on OpenAI and the requests landed on Haiku. Annotations kept writing. Reckoned I was done.

Scenario two found my first mistake. I'd not set a sane retry ceiling, so on a 503 the gateway retried hard against the same struggling provider before failing over, and our p95 on annotation writes jumped to about 18 seconds. Fixed it by capping retries and letting the fallback fire sooner. The README's retries and fallbacks page covers the knobs; I'd skimmed it the first time.

Scenario three is the one everyone gets wrong. Slow isn't down. A 30-second response isn't an error, so naive fallback never triggers, the request just sits there. We added a request timeout so a tar-pitted provider counts as a failure and trips the lifeboat. That single change is the actual reason this exercise was worth running.

Bifrost ships native Prometheus metrics, so I didn't have to bolt on my own. I watched fallback rate and per-provider latency the whole time on a Grafana board.

Scenario	Without fallback	With Bifrost (tuned)
429 storm	annotations stall	reroute to Haiku, ~2.1s p95
Hard 503s	50% writes fail	0 user-visible failures
30s latency	every write hangs	timeout trips fallback in 4s

The numbers that mattered: zero broken annotations across all three once tuned, and the fallback decisions were visible in metrics instead of buried in logs nobody reads.

I'd used LiteLLM before. Worth being honest here.

Bifrost	LiteLLM	Portkey
OpenAI-compatible endpoint	yes	yes	yes
Automatic fallbacks	yes	yes	yes
Native Prometheus metrics	yes	yes	yes
Self-host story	single Go binary	Python proxy	gateway is OSS, control plane hosted
Maturity / ecosystem	newer	large, lots of integrations	polished dashboards

LiteLLM has been around longer and has a bigger pile of community integrations, which counts for something when you hit an edge case at 2am. Portkey's hosted dashboards are nicer than anything I'd build myself, and if you don't want to run infra that's a fair trade. We picked Bifrost mostly because a single Go binary is easy for an infra team to operate and the Prometheus output dropped straight into our existing board with no glue. Not a knock on the others. Different priorities.

A gateway is one more hop you have to keep alive. If Bifrost falls over, every LLM call falls with it, so we run two replicas behind a load balancer and the game day included killing one of them too.

Fallback to a different model means a different model. Haiku doesn't write the exact same summary as gpt-4o-mini, and for a build annotation that's fine, but if you depend on a strict output schema you need to test the lifeboat actually produces it. We caught one prompt that assumed OpenAI-specific formatting.

And fault injection in front of a proxy isn't the real provider misbehaving. Toxiproxy gives you 429s and delays, not the weird partial-stream failures you see in the wild. It's a model of the failure, not the failure. Better than nothing, not the whole story.

Semantic caching is on the roadmap for us, not load-bearing yet, so I'm not going to claim numbers I haven't measured.

source & further reading

dev.to — original article #1.—Use AI as a calculator. Summer Solstice Is Tangled: The Final Knot Reading the web with half-understood words everywhere

~/api · this article 200

$curl api.wpnews.pro/v1/news/fault-injecting-our-llm-…

Read original on dev.to → dev.to/claire_nguyen/fault-injecting-our-llm-pro…

mentioned entities

Buildkite

OpenAI

Bifrost

Anthropic

Claude Haiku

Grafana

Prometheus

LiteLLM

metadata

slugfault-injecting-our-llm-provider-to-trust-bifrost-fallbacks

topic#large-language-models

secondary4 topics

sentimentpositive

canonicaldev.to

navigation

← prevWe Just Open-Sourced the Fastest…

next →I had an AI grade five real pitc…

── more in #large-language-models 4 stories · sorted by recency

letsdatascience.com · 19 Jun · #large-language-models

Server-Side Tools Reshape AI Agent Architecture and Latency

dev.to · 19 Jun · #large-language-models

Reading the web with half-understood words everywhere

dev.to · 19 Jun · #large-language-models

I Can't Tell If the Model Matters

cryptobriefing.com · 19 Jun · #large-language-models

Databricks opts for private funding over IPO amid market lull

── more on @buildkite 3 stories trending now

wpnews · 18 Jun · #ai-chips

Apple and Intel join forces in Trump’s push to bring chipmaking home

wpnews · 18 Jun · #ai-agents

How to Automate Business Reports With an AI Agent Instead of Dashboards

wpnews · 18 Jun · #large-language-models

ICYMI: ZAI launches GLM-5.2 open model with 1M context

sponsored brought to you by zahid.host 4,200+ EU-deployed projects

reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain — perfect for shipping the agent you just read about.

$git push zahid main

→ Live at https://your-agent.zahid.host ✓

Get free account → Pricing

from €0/mo · no card required