# Google Turned LLM Load Balancing Into Scheduling. What That Means for the Rest of Us

> Source: <https://pub.towardsai.net/google-turned-llm-load-balancing-into-scheduling-what-that-means-for-the-rest-of-us-1dd30c1467b6?source=rss----98111c9905da---4>
> Published: 2026-06-25 13:31:03+00:00

Picture two LLM requests that arrive a few seconds apart. Both carry the same 2,000-token block of context: a policy, an output schema, a ranking rubric, and a few examples. The only thing that differs is the last 50 tokens, where each request includes a different customer query.

The first request lands on one replica. The model reads the shared context and builds a KV cache for it, the internal state that lets it skip recomputing tokens it has already processed. A few seconds later the second request arrives, and a normal load balancer sends it to a different replica because that replica happens to have a shorter queue. On its own terms the load balancer made a sensible choice, but from the model’s point of view it threw away work that a nearby replica had already paid for, and the second replica now has to process all 2,000 shared tokens from scratch.

Traditional load balancing treats requests as interchangeable, which holds for most stateless services and fails for a lot of LLM traffic. A request can leave behind reusable state that lives on one specific serving path, and the moment the next request is routed elsewhere, that state stops being useful.

Google’s [GKE Inference Gateway](https://cloud.google.com/blog/products/ai-machine-learning/gke-inference-gateway-and-quickstart-are-ga) is built around this observation. Rather than spreading requests as evenly as possible, it can prefer a replica that already holds useful context for the request in front of it. The broader idea is not specific to Google’s infrastructure, which is what makes it worth understanding even if you never touch GKE.

Most discussions of LLM cost focus on generated tokens, but that is only half of the serving path. Before a model produces its first token, it has to read the prompt. This is the prefill phase, and for enterprise workloads with long policies, schemas, retrieval instructions, examples, and guardrails, prefill can account for a large share of both cost and time to first token.

Take a recommendation service that attaches the same large ranking rubric to every request. The per-user portion may be tiny, but the model still has to process the entire shared prompt on every call unless something on the serving path can reuse that work.

When a model processes a prompt, it produces a KV cache holding the internal representations of the tokens it has read. If a later request begins with the same prefix and lands on a compatible serving path that still holds that cache, the system can reuse the cached prefix and process only the new portion. A round-robin or least-connections balancer has no visibility into this. It sees two requests and two replicas, and it does not know that one of those replicas has already done most of the work for one of the requests.
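To put numbers on the opening example, here is a minimal sketch that counts prefill tokens under two routing policies. It is a toy model, not how any real gateway works: the `Replica` class, the `rubric-v1` prefix ID, and both routing functions are hypothetical, real caches match on token blocks rather than a string ID, and the sketch ignores eviction and load entirely.

```python
from dataclasses import dataclass, field
from itertools import cycle

SHARED_PREFIX_TOKENS = 2_000  # policy, schema, rubric, examples
UNIQUE_TOKENS = 50            # per-request customer query
PREFIX_ID = "rubric-v1"       # hypothetical label for the shared prefix


@dataclass
class Replica:
    name: str
    cached_prefixes: set[str] = field(default_factory=set)

    def prefill(self, prefix_id: str) -> int:
        """Return how many prompt tokens this replica has to process."""
        if prefix_id in self.cached_prefixes:
            return UNIQUE_TOKENS  # warm: only the new tail is processed
        self.cached_prefixes.add(prefix_id)  # cold: build and keep the cache
        return SHARED_PREFIX_TOKENS + UNIQUE_TOKENS


def round_robin_total(num_requests: int, num_replicas: int) -> int:
    """Total prefill tokens when requests rotate across replicas."""
    replicas = [Replica(f"r{i}") for i in range(num_replicas)]
    picker = cycle(replicas)
    return sum(next(picker).prefill(PREFIX_ID) for _ in range(num_requests))


def prefix_aware_total(num_requests: int, num_replicas: int) -> int:
    """Total prefill tokens when a warm replica is preferred if one exists."""
    replicas = [Replica(f"r{i}") for i in range(num_replicas)]
    total = 0
    for _ in range(num_requests):
        warm = [r for r in replicas if PREFIX_ID in r.cached_prefixes]
        target = warm[0] if warm else replicas[0]
        total += target.prefill(PREFIX_ID)
    return total
```

For the two requests and two replicas from the opening, `round_robin_total(2, 2)` comes to 4,100 prompt tokens and `prefix_aware_total(2, 2)` to 2,100. The gap is the 2,000 shared tokens that the second replica would otherwise rebuild.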

This does not matter equally everywhere. A request dominated by long generation spends most of its time in the decode phase, producing tokens one at a time, and there the savings from skipping prompt processing are smaller. The opportunity grows when requests repeatedly carry long, stable prefixes, which is common in scoring, tagging, extraction, and retrieval-heavy workloads.

Google’s GKE Inference Gateway routes with [prefix cache reuse](https://cloud.google.com/blog/products/containers-kubernetes/gke-inference-gateway-prefix-caching-accelerates-ai-inference) in mind. Instead of treating every replica as equally suitable, it can send a request to a replica that already holds a matching cached prefix, which cuts repeated prefill work for the stable parts of a prompt such as policies, schemas, examples, and system instructions.

Underneath, the gateway does not invent this logic from scratch. It delegates routing to the [llm-d Endpoint Picker](https://cloud.google.com/blog/products/containers-kubernetes/llm-d-officially-a-cncf-sandbox-project), which runs a multi-objective policy rather than a single heuristic. For each request it weighs the replica’s KV cache hit rate, the number of in-flight requests, and the queue depth, then picks the backend that gives the best overall result. There is also a [predicted-latency option](https://docs.cloud.google.com/kubernetes-engine/docs/concepts/about-gke-inference-gateway) that routes using a model trained continuously on live traffic to optimize time to first token and time per output token directly.

Cache affinity on its own creates a new failure mode. If a popular prefix always routes to the one warm replica, that replica can saturate while others sit idle. The multi-objective policy is what keeps that in check, by balancing the value of a cache hit against current load instead of pinning every matching request to the same backend.
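The tradeoff can be sketched as a scoring function. This is an illustration of the general idea only, not the llm-d Endpoint Picker's actual algorithm: the `ReplicaState` fields, the weights, and the capacity limits are all assumptions I picked to make the behavior visible.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicaState:
    name: str
    prefix_match: float  # fraction of the prompt prefix already cached, 0.0 to 1.0
    in_flight: int       # requests currently being processed
    queue_depth: int     # requests waiting to start


# Hypothetical weights and capacity limits, chosen only for illustration.
W_CACHE, W_IN_FLIGHT, W_QUEUE = 1.0, 0.5, 0.5
MAX_IN_FLIGHT, MAX_QUEUE = 16, 32


def score(replica: ReplicaState) -> float:
    """Higher is better: reward cache reuse, penalize current load."""
    load_penalty = (
        W_IN_FLIGHT * min(replica.in_flight / MAX_IN_FLIGHT, 1.0)
        + W_QUEUE * min(replica.queue_depth / MAX_QUEUE, 1.0)
    )
    return W_CACHE * replica.prefix_match - load_penalty


def pick(replicas: list[ReplicaState]) -> ReplicaState:
    """Choose the replica with the best combined score."""
    return max(replicas, key=score)


warm_light = ReplicaState("warm", prefix_match=0.9, in_flight=4, queue_depth=0)
warm_busy = ReplicaState("warm", prefix_match=0.9, in_flight=16, queue_depth=32)
cold_idle = ReplicaState("cold", prefix_match=0.0, in_flight=2, queue_depth=0)
```

With these made-up weights, `pick([warm_light, cold_idle])` returns the warm replica, while `pick([warm_busy, cold_idle])` returns the cold one: once the warm replica is saturated, the load penalty outweighs the value of the cache hit.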

This is why the result is better described as inference scheduling than as load balancing. The gateway is choosing where to run a request based partly on the state that already exists there, not only on how busy each replica is at that instant.

The [production numbers Google reported](https://cloud.google.com/blog/products/containers-kubernetes/how-gke-inference-gateway-improved-latency-for-vertex-ai) make the split between these forces concrete, and they are more specific than a single headline figure suggests. For a context-heavy coding workload on Qwen3-Coder, time to first token dropped by more than 35 percent. For bursty, unpredictable chat traffic on DeepSeek V3.1, P95 time to first token improved by about 2x (roughly a 52 percent reduction) as the gateway absorbed the load spikes. Separately, the prefix cache hit rate doubled from 35 percent to 70 percent. That last gain is the interesting one, because it did not come from turning on cache-aware routing, which was already running. The hit rate doubled after Google added admission control and fairness at the ingress layer, which smoothed traffic distribution and cleared the hotspots that cache affinity had been creating. In other words, the cache improvement came from balancing affinity against load, the same tradeoff the rest of this article keeps returning to.

A useful way to think about LLM architecture is to route at the boundary of shared state. The unit at which you make a routing decision should line up with the unit of state you want to keep alive. If two requests share nothing meaningful, you can route them independently and lose nothing. If they share a large prompt prefix, a conversation history, or workflow context, then moving one of them to a different path can throw away work that already exists.

Google’s gateway works mainly at the inference scheduling layer, deciding where a request runs based partly on reusable cache state. That layer increasingly interacts with the ones around it, though: a model routing choice can change what cache is reusable, replica placement changes latency, and workflow ordering can decide whether expensive context gets reused or rebuilt over and over.

Recent research pushes this further. [SAGA](https://arxiv.org/abs/2605.00528) treats an agent workflow as a single scheduling problem rather than a sequence of independent calls, on the grounds that the best placement for one step depends on the context, dependencies, and latency constraints across the whole workflow. The payoff it reports is concrete: in their setup a request-level baseline spent 38 percent of its time regenerating KV cache between agent steps, and workflow-aware scheduling cut that to 8 percent.

The right policy depends on what state the workload actually shares. Here is how I would think about the common cases.

A quick classification, a lightweight extraction, or a one-off safety check carries almost no reusable state. There is usually no long prompt, no conversation history, and no real dependency on the previous request. For this kind of traffic, routing per request is fine, and you can pick a model on capability, latency, and cost without giving much up. This is also the cleanest place to send easy requests to a smaller or cheaper model and reserve the stronger models for the hard ones.

A lot of enterprise workloads are prefix-heavy services. A scoring service, a tagging pipeline, a recommendation engine, or a structured extraction job handles each user request on its own while still carrying the same long policy, schema, examples, and rules on every call. The requests are independent at the user level but not at the serving level, and that gap is where the cost hides.

Two choices matter most. The first is prompt structure. Stable, reusable content should come first and the dynamic, per-request content last. Putting a timestamp, request ID, or user-specific detail near the top changes the prefix and quietly destroys reuse, even though the bulk of the prompt is identical. The second is routing stability. Cache state is tied to a model and a serving path, so unnecessary switching fragments it. For prefix-heavy paths, I would weigh the savings from dynamic model routing against the prefill cost, retry risk, and latency hit you take whenever a warm path goes cold.
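The prompt-structure point is easy to check with a small sketch. Everything here is a stand-in: `tokenize` splits on whitespace instead of using a real tokenizer, and `RUBRIC`, `dynamic_first`, and `stable_first` are hypothetical names for illustration.

```python
def tokenize(text: str) -> list[str]:
    """Whitespace split as a stand-in for a real tokenizer."""
    return text.split()


def shared_prefix_len(a: list[str], b: list[str]) -> int:
    """Count the leading tokens two prompts have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


# Hypothetical stable content that every request carries.
RUBRIC = "Rank items by relevance. Return JSON that matches the schema."


def dynamic_first(request_id: str, query: str) -> str:
    """Per-request detail at the top: the prefix changes on every call."""
    return f"request_id={request_id} {RUBRIC} Query: {query}"


def stable_first(request_id: str, query: str) -> str:
    """Stable content first, per-request detail last."""
    return f"{RUBRIC} request_id={request_id} Query: {query}"


reusable_bad = shared_prefix_len(
    tokenize(dynamic_first("a1", "red shoes")),
    tokenize(dynamic_first("b2", "blue hat")),
)
reusable_good = shared_prefix_len(
    tokenize(stable_first("a1", "red shoes")),
    tokenize(stable_first("b2", "blue hat")),
)
```

Here `reusable_bad` is 0, because the very first token differs, while `reusable_good` covers the whole rubric. The two prompts contain the same content in both cases; only the ordering decides how much of it a prefix cache could reuse.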

In a conversation, the shared state is the thread itself. Switching models late in a long session can force the new path to reprocess the entire history, and the longer the thread, the bigger that prefill penalty gets. There is a product cost too. Users experience one assistant, not a series of routing decisions, so a mid-conversation switch can surface as a change in tone, reasoning depth, formatting, or tool use.

My default here is model stickiness within a session, with a short list of explicit reasons to break it: a safety threshold that requires a different model or policy path, a task that exceeds the current model’s capability, repeated schema or tool-call failures, a latency target the current path cannot meet, or a topic shift large enough that the earlier context no longer helps. The distinction I care about is deliberate escalation versus constant re-selection. A switch should be a named decision, not an accident of optimizing one request in isolation.
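One way to make a switch a named decision is to require a reason before the session's model can change. This is a minimal sketch under my own assumptions: `SwitchReason`, `Session`, and `next_model` are hypothetical names, and the model identifiers are placeholders.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class SwitchReason(Enum):
    SAFETY_THRESHOLD = "safety threshold requires a different model or policy path"
    CAPABILITY_EXCEEDED = "task exceeds the current model's capability"
    REPEATED_FAILURES = "repeated schema or tool-call failures"
    LATENCY_TARGET_MISSED = "current path cannot meet the latency target"
    TOPIC_SHIFT = "earlier context no longer helps"


@dataclass
class Session:
    session_id: str
    model: str
    # Audit log of (old_model, new_model, reason) for every switch.
    switches: list[tuple[str, str, SwitchReason]] = field(default_factory=list)


def next_model(
    session: Session, candidate: str, reason: Optional[SwitchReason]
) -> str:
    """Stay on the session's model unless a named reason justifies a switch."""
    if candidate != session.model and reason is not None:
        session.switches.append((session.model, candidate, reason))
        session.model = candidate
    return session.model


session = Session("s1", model="model-a")
stayed = next_model(session, "model-b", reason=None)
escalated = next_model(session, "model-b", SwitchReason.REPEATED_FAILURES)
```

A per-request optimizer that merely prefers `model-b` gets `model-a` back, because it supplied no reason. The second call switches and leaves a record in `session.switches`, so every cold path in a long thread can be traced to a decision someone chose to allow.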

Multi-agent workflows look like conversations, but the routing boundary sits somewhere else. A planning agent, an extraction agent, a retrieval agent, and a validation agent can reasonably run on different models, because each one owns a distinct role and context. Using a strong model for planning and a smaller, faster one for extraction is a sensible specialization boundary, not harmful switching.

The cost shows up when adjacent agents keep reprocessing the same large context. If two back-to-back steps both need the same policy, source documents, or structured context, the workflow can pay prefill for it twice across separate paths. There is also a critical-path question, because the cheapest model per token is not always the cheapest model for the workflow as a whole. A slow model on the step that blocks the final answer can add more end-to-end latency than its token savings are worth. So I would route each agent by its task but schedule the workflow as a unit, paying attention to which steps share context, which can run in parallel, and which sit on the user-visible critical path.
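The critical-path point can be made concrete with a small dependency graph. The step names, dependencies, and latencies below are invented for illustration, and the sketch uses Python's standard `graphlib` to order the steps; it only models latency, not token cost or shared context.

```python
from graphlib import TopologicalSorter


def critical_path_seconds(
    latency: dict[str, float], deps: dict[str, set[str]]
) -> float:
    """End-to-end latency when independent steps run in parallel."""
    finish: dict[str, float] = {}
    for step in TopologicalSorter(deps).static_order():
        # A step starts once all of its dependencies have finished.
        start = max((finish[d] for d in deps.get(step, ())), default=0.0)
        finish[step] = start + latency[step]
    return max(finish.values())


# Hypothetical workflow: retrieve and extract both depend on plan
# and run in parallel; validate waits for both.
DEPS = {
    "plan": set(),
    "retrieve": {"plan"},
    "extract": {"plan"},
    "validate": {"retrieve", "extract"},
}

# Hypothetical per-step latencies in seconds.
baseline = {"plan": 2.0, "retrieve": 1.0, "extract": 1.5, "validate": 1.0}

# Cheaper, slower model on a step that is off the critical path.
slow_retrieve = {**baseline, "retrieve": 1.4}

# Cheaper, slower model on the step that blocks the final answer.
slow_validate = {**baseline, "validate": 4.0}
```

With these numbers, `critical_path_seconds(baseline, DEPS)` is 4.5 seconds. Slowing down `retrieve` leaves it at 4.5, because `extract` still finishes later, while slowing down `validate` pushes it to 7.5. The same per-token saving is free in one place and expensive in the other, which is why I would look at placement per workflow rather than per call.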

None of this is a reason to make every path sticky. A stable prompt does not guarantee that a managed provider will reuse it, because cache behavior depends on the model, region, deployment, provider implementation, token threshold, time to live, and traffic pattern. And as Google’s own hotspot example shows, too much traffic sharing one prefix can overload a single replica, which is exactly why the gateway balances cache matching against queue depth and load instead of pinning everything to the warmest backend.

Preserving state is a design preference, not an absolute rule. A cold switch can still be the right call when quality, safety, capacity, or latency outweigh the value of reuse. What matters is making that tradeoff explicit rather than letting it happen by accident.

Before you add a routing policy, it helps to ask one thing:

What state has this request path already paid to create, and what gets lost if the next call lands somewhere else?

For a short stateless classification, the honest answer is close to nothing. For a prefix-heavy service, a long conversation, or a multi-step agent workflow, it can be a meaningful share of the latency and cost you were trying to reduce in the first place.

Google’s gateway is a fleet-level version of this idea, and most teams will not be making routing decisions at that level. The underlying habits still carry over. Keep stable prompt prefixes stable, avoid switching models on stateful paths without a reason, and treat workflow placement as a scheduling problem whenever the calls depend on each other.

The model router is becoming one piece of a larger serving control plane. Choosing which model answers still matters, but it now sits next to questions about where the request runs, what state already exists there, and whether moving it is worth the work you would throw away.

[Google Turned LLM Load Balancing Into Scheduling. What That Means for the Rest of Us](https://pub.towardsai.net/google-turned-llm-load-balancing-into-scheduling-what-that-means-for-the-rest-of-us-1dd30c1467b6) was originally published in [Towards AI](https://pub.towardsai.net) on Medium.