Stop paying for idle GPUs in your CI: batching LLM eval jobs

Running LLM evaluation jobs on every pull request in CI leads to high GPU costs, with GPUs idle about 70% of the time due to cold starts and model loading. The solution involves batching eval jobs into windowed runs on shared, warm GPU pools with a priority queue that distinguishes between quick smoke tests and full regression runs, cutting GPU spend by approximately 60%. Key trade-offs include occasional idle capacity costs and added latency variance from batching.

TL;DR: Running LLM evaluations on every PR will burn your GPU budget faster than you can blink. We cut our eval spend by about 60% by batching jobs into windowed runs on shared GPU pools, plus a smarter queue that knows the difference between a "smoke test" eval and a full regression run. Here's how, and where the trade-offs hurt.

Right, so a few months back I got pulled into a conversation that's becoming pretty familiar around here. A team had wired up an LLM-based evaluation suite into their CI. Every PR triggered a run against a set of prompts, scored the outputs, and posted results back to the PR. Lovely in theory. The cloud bill was not lovely.

They were spinning up a g5.xlarge per PR, sometimes three or four in parallel during peak hours, and the GPU sat idle for about 70% of the run because most of the time was spent on cold starts, model loading, and prompt formatting. Classic case of treating GPUs like CPUs.

I reckon a lot of teams are hitting this wall right now. So let's talk about what actually works.

CI runners are designed for stateless, throwaway compute. That model breaks the second you involve a 7B+ parameter model that takes 30-90 seconds to load into VRAM. Here's the rough breakdown of a typical eval job we measured:

So out of about 3 minutes of billable GPU time, you're getting 40 seconds of useful work. That's brutal economics.

The trick isn't fancy. You stop spinning up a GPU per job and start treating the GPU like a long-lived service that consumes jobs from a queue.

We run a small pool of g5.xlarge instances (usually 2-4, depending on load) that stay warm. Each runner has the model preloaded in VRAM. CI jobs push eval requests to an SQS queue, runners pull from the queue, batch up to N prompts per inference pass, and post results back.
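
To make the pull-and-batch step concrete, here's a minimal sketch of how a runner might gather a batch before each inference pass. It's an illustration rather than our actual runner code: the in-process `queue.Queue` stands in for SQS, and `gather_batch`, `max_batch_size`, `max_wait_s` and `run_inference` are hypothetical names. The cap on how long to wait for a batch to fill is also an assumption of this sketch.

```python
import queue
import time


def gather_batch(jobs, max_batch_size=16, max_wait_s=2.0):
    """Collect up to max_batch_size jobs, waiting at most max_wait_s in total."""
    batch = []
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # time budget spent: go with whatever we have
        try:
            # Block until a job arrives or the remaining budget runs out
            batch.append(jobs.get(timeout=remaining))
        except queue.Empty:
            break
    return batch


def run_inference(prompts):
    # Placeholder for the real batched call against the preloaded model
    return [f"output for: {p}" for p in prompts]


if __name__ == "__main__":
    jobs = queue.Queue()  # stand-in for the SQS queue
    for i in range(5):
        jobs.put(f"prompt {i}")
    batch = gather_batch(jobs, max_batch_size=4, max_wait_s=0.5)
    print(run_inference(batch))
```

A full batch fires immediately, a partial batch fires once the wait budget is spent, and an empty queue just returns an empty list so the runner can loop again.
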

Rough sketch of the runner config:

    runner:
      instance_type: g5.xlarge
      pool_size_min: 2
      pool_size_max: 6
      scale_metric: queue_depth
      scale_threshold: 25  # jobs in queue

    model:
      name: llama-3.1-8b-instruct
      preload: true
      keep_warm_seconds: 1800

    batching:
      max_batch_size: 16
      max_wait_ms: 2000

    job_types:
      smoke_eval:
        priority: high
        max_prompts: 10
      full_regression:
        priority: low
        max_prompts: 500
        window: nightly_only

That max_wait_ms is doing the heavy lifting. The runner waits up to 2 seconds to gather a batch before firing inference. For CI, 2 seconds of latency is nothing. For inference throughput, it's everything.

Once you've got a warm pool, you might as well route different model calls through one place. We have eval suites that hit a mix of self-hosted Llama, Claude via API, and OpenAI. Instead of each CI job authenticating separately and managing keys, we put a gateway in front.

There's a bunch of options here. LiteLLM is popular, Bifrost (https://github.com/maximhq/bifrost) is another one that does the same kind of multi-provider routing with rate limit handling, and you can roll your own with a thin FastAPI wrapper if you're feeling keen. The point is you stop scattering API keys across twenty CI configs.

Classifying the workload was the biggest single win, honestly. We split eval jobs into tiers:

Before this, every PR triggered the full 500-prompt suite because nobody had bothered to think about what they actually needed to know per PR. The answer is "did this change break something obvious?", not "is this model production-ready?" That change alone cut our GPU-hours by about 40%, before any of the batching work.

After about three weeks of running the new setup:

Faster and cheaper, which is the dream combination you almost never get.

Nothing's free, so here's what actually hurt:

Cold start on scale-up is still painful. When the queue spikes past what the warm pool can handle, the new runners take 90+ seconds to come online with the model loaded. We mitigated by being more aggressive on the scale threshold than felt comfortable, which means we're occasionally paying for idle capacity. You can't have both.

Batching adds latency variance. A job that arrives just after a batch fires can wait the full max_wait_ms. For CI this is fine. For production inference it might not be, so don't blindly copy this config to your prod inference pipeline.

Pool exhaustion is a real failure mode. If your queue grows faster than you can scale, jobs back up. We had a Friday afternoon where a misconfigured test suite generated 4,000 eval jobs in 10 minutes and the queue depth alert woke me up at 11pm. Add circuit breakers and per-team quotas early, not after the first incident.

Model updates are now an event. When you preload models, swapping versions means a rolling restart of the pool. We do this during low-traffic windows, but it's operational overhead that didn't exist with the per-job model.

No worries if your setup looks different. The general shape holds: warm pools, batched jobs, classified workloads. Apply where it fits.
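
On the per-team quotas: if you want a starting point, one simple shape is an admission check that sits in front of the queue and refuses jobs once a team has used up its allowance for a time window. The sketch below is an assumption about how that could look, not what we run. `TeamQuota`, `admit`, and the limits in the example are all hypothetical, and a real deployment would want the counters in shared storage rather than in process memory.

```python
import time
from collections import defaultdict, deque


class TeamQuota:
    """Sliding-window cap on how many eval jobs each team may enqueue."""

    def __init__(self, max_jobs, window_s, clock=time.monotonic):
        self.max_jobs = max_jobs
        self.window_s = window_s
        self.clock = clock  # injectable so the window can be tested
        self._seen = defaultdict(deque)  # team -> timestamps of admitted jobs

    def admit(self, team):
        now = self.clock()
        stamps = self._seen[team]
        # Drop timestamps that have aged out of the window
        while stamps and now - stamps[0] >= self.window_s:
            stamps.popleft()
        if len(stamps) >= self.max_jobs:
            return False  # over quota: reject instead of enqueueing
        stamps.append(now)
        return True


if __name__ == "__main__":
    quota = TeamQuota(max_jobs=200, window_s=600)  # illustrative limits
    admitted = sum(quota.admit("team-a") for _ in range(4000))
    print(f"admitted {admitted} of 4000 jobs")
```

With a check like this, a runaway suite gets rejected at the door after its first couple of hundred jobs, and other teams' smoke evals keep flowing.
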