{"slug": "async-llm-inference-in-ci-stop-build-workers-blocking-on-slow-jobs", "title": "Async LLM inference in CI: stop build workers blocking on slow jobs", "summary": "Buildkite engineers implemented async LLM inference through the Bifrost AI gateway to prevent build workers from blocking on slow model calls. By switching to a submit-and-poll pattern, workers no longer wait up to 35 seconds for a single LLM summarization job, freeing compute resources across hundreds of concurrent builds. The change required only two additional headers and split a blocking call into two steps, with Bifrost's OpenAI-compatible endpoint requiring no request body modifications.", "body_md": "**TL;DR:** Async inference through an AI gateway lets CI build workers submit a long LLM job, get an id back, and poll later, so a 30-second model call stops holding a worker hostage. Here's how I wired it with Bifrost.\n\nOur build workers at Buildkite were each blocked for up to 35 seconds waiting on a single LLM call that summarised failed test output. With a few hundred concurrent builds running through our compute cluster, that's a pile of expensive compute sitting idle on one synchronous request to a model provider. We moved those jobs behind [Bifrost](https://www.getmaxim.ai/bifrost), [the open-source AI gateway](https://github.com/maximhq/bifrost) by Maxim AI, and switched them to async submit-and-poll so the worker could get back to running the actual build while the summary cooked in the background.\n\nAsync inference is a request pattern where the client submits a job, gets an identifier back straight away, and polls for the result later instead of holding the connection open. With Bifrost you set `x-bf-async: true` on the request and get an `x-bf-async-id` in return, then poll that id once the model has finished. The [docs overview](https://docs.getbifrost.ai/overview) covers the submit and poll lifecycle.\n\nThe win is mechanical, not magic. 
A worker that no longer blocks on a slow upstream can pick up the next build step. On a fleet where each agent costs real money per minute, freeing 35 seconds per build adds up fast across a few hundred concurrent runs.\n\nA build agent is a finite resource. When it makes a blocking HTTP call to an LLM and the provider takes 30-plus seconds, that agent is doing nothing but waiting on a socket. Multiply that by every failing build wanting a summary, and you've quietly turned your model provider's latency into your queue depth.\n\nWe saw exactly this. P95 latency on the summariser sat around 28 seconds, and during a flaky-test storm the build queue backed up because agents were parked on those calls. The compute was healthy; the scheduling was wrong. The fix is to decouple \"ask for a summary\" from \"wait for a summary.\"\n\nThe change was small. We added two headers and split one blocking call into a submit step and a later poll step. The Bifrost endpoint stays OpenAI-compatible, so the request body didn't change at all, which is the point of a [drop-in replacement](https://docs.getbifrost.ai/features/drop-in-replacement).\n\n```\n# Submit: fire the job, return immediately with an id\ncurl -s -X POST http://bifrost:8080/v1/chat/completions \\\n  -H \"Content-Type: application/json\" \\\n  -H \"x-bf-async: true\" \\\n  -H \"x-bf-dim-team: build-platform\" \\\n  -d '{\n    \"model\": \"openai/gpt-4o-mini\",\n    \"messages\": [{\"role\": \"user\", \"content\": \"Summarise this failed test log...\"}]\n  }'\n# response carries x-bf-async-id: job-8f21\n\n# Poll later, from a separate build step, once the agent has moved on\ncurl -s http://bifrost:8080/v1/chat/completions \\\n  -H \"x-bf-async-id: job-8f21\"\n```\n\nWe submit at the start of the post-build hook, run the rest of the cleanup, then poll near the end. By the time we poll, the summary is usually ready, so the agent almost never waits. 
The `x-bf-dim-team` header tags the request with our team name, which Bifrost auto-forwards to logs, traces, and Prometheus so we can see which team's jobs are driving spend.\n\nAsync jobs are easy to lose track of, so observability matters more, not less. With [Bifrost](https://www.getmaxim.ai/bifrost) the custom `x-bf-dim-*` dimension headers flow straight into the [observability](https://docs.getbifrost.ai/features/observability) layer, which writes asynchronously and adds under 0.1ms of overhead, per the [benchmarking docs](https://docs.getbifrost.ai/benchmarking/getting-started). That let us build a Grafana panel keyed on team and job type without instrumenting our own code.\n\nFailover still applies to async jobs. We kept [automatic fallbacks](https://docs.getbifrost.ai/features/fallbacks) configured so that if our primary provider returns 502s, the gateway retries against a secondary before the job id ever comes back failed. On the throughput side, a single instance sustains 5,000 RPS at 100% success with roughly 11µs of gateway overhead on a t3.xlarge, per the [published benchmarks](https://www.getmaxim.ai/bifrost/resources/benchmarks), so the gateway itself was never the bottleneck in our queue.\n\nAsync is not free. You now own a polling loop and the job ids it depends on. If a build agent dies between submit and poll, you need those ids in durable storage or the result is orphaned. We push ids into the build's metadata so a retried step can recover them.\n\nOn the Bifrost side, self-hosting carries real operational weight. A production deployment needs Postgres backing it, which is one more stateful service for my team to run and patch. Clustering for high availability is an enterprise feature, not part of the open-source core, so a single-node deploy is a single point of failure you have to plan around. The ecosystem is also younger than LiteLLM, so there's less community Q&A when you hit an edge case. 
None of that was a dealbreaker for us, but plan the operational side before you commit.\n\nSwitching CI summarisation to async inference through Bifrost took the blocking time off our build agents and stopped a slow model provider from setting our queue depth. The headers are simple, the endpoint stays OpenAI-compatible, and the spend stays visible per team. If you run LLM calls inside a build fleet and your agents are parked waiting on them, async submit-and-poll is worth a look: [https://getmaxim.ai/bifrost/book-a-demo](https://getmaxim.ai/bifrost/book-a-demo)", "url": "https://wpnews.pro/news/async-llm-inference-in-ci-stop-build-workers-blocking-on-slow-jobs", "canonical_source": "https://dev.to/claire_nguyen/async-llm-inference-in-ci-stop-build-workers-blocking-on-slow-jobs-26ab", "published_at": "2026-06-25 13:21:46+00:00", "updated_at": "2026-06-25 13:43:44.732947+00:00", "lang": "en", "topics": ["large-language-models", "ai-infrastructure", "developer-tools"], "entities": ["Buildkite", "Bifrost", "Maxim AI", "GPT-4o-mini", "OpenAI", "Grafana", "Prometheus"], "alternates": {"html": "https://wpnews.pro/news/async-llm-inference-in-ci-stop-build-workers-blocking-on-slow-jobs", "markdown": "https://wpnews.pro/news/async-llm-inference-in-ci-stop-build-workers-blocking-on-slow-jobs.md", "text": "https://wpnews.pro/news/async-llm-inference-in-ci-stop-build-workers-blocking-on-slow-jobs.txt", "jsonld": "https://wpnews.pro/news/async-llm-inference-in-ci-stop-build-workers-blocking-on-slow-jobs.jsonld"}}