In-flight batching is the trick that keeps LLM serving from wasting GPU seats.
I used to think batching requests for a machine learning model was a solved problem. I have hosted models and served requests, and batching speeds up serving. If you don't batch and instead send one tiny request at a time, the GPU behaves like a giant bus carrying one passenger. It can move, but the economics are terrible. This is why batching exists. But LLMs make batching weird.
The Idea

In a traditional web backend or a standard computer vision pipeline, batching is straightforward. You put requests into a queue, wait until you hit a batch size of 4 or 8, slam them into the GPU, and return the results. Standard engineering. A single trip through the model.
An image classifier does this. An embedding model does this. You pass in data, the model runs, and you get the result. But LLM serving completely breaks this mental model. LLM generation is iterative. It generates one token at a time.
This means serving an LLM is not just "run the model once." It is a scheduling problem that repeats every token. If you treat LLM requests like traditional web requests, your GPU efficiency plummets, your latency spikes, and your cloud bill skyrockets.
LLM serving is a loop. Every token is another chance to waste the GPU or fill it.
Prefill, Decode, and the KV Cache

Each request has two phases. The first phase is prefill, where the model reads the prompt and builds the internal attention state. The second phase is decode, which uses autoregressive decoding to generate text one token at a time, feeding each generated token back into the model to predict the next.
Because decode steps run repeatedly until an end condition is met, requests vary wildly in duration. To avoid recomputing the prompt history at every step, the server maintains a KV cache in GPU memory. The scheduler's goal is to keep the GPU busy with token generation without running out of this finite cache memory.
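To get a feel for how finite that cache is, here is a rough back-of-the-envelope sketch. It assumes a standard transformer that stores one key vector and one value vector per layer and per KV head for every token. The model shape and the 16 GiB budget are hypothetical numbers chosen for illustration, not the configuration of any particular model or server.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache one token occupies.

    The leading 2 accounts for storing both a key and a value vector.
    bytes_per_value=2 assumes 16-bit storage (fp16 or bf16).
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_cached_tokens(cache_budget_bytes, per_token_bytes):
    """How many tokens (summed over all active requests) fit in the budget."""
    return cache_budget_bytes // per_token_bytes


# Hypothetical model shape, for illustration only.
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)

# Hypothetical budget: 16 GiB of GPU memory left over for the cache.
budget = 16 * 1024**3
tokens = max_cached_tokens(budget, per_token)

print(per_token // 1024, "KiB per token")   # 512 KiB per token
print(tokens, "tokens fit in the cache")    # 32768 tokens fit in the cache
```

Under these assumptions, the whole server can hold about 32K tokens of context across every active request combined, which is why the scheduler has to treat cache space as a budget.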
Static Batching

Imagine three requests, A, B, and C, arrive together, and A needs a much longer answer than B or C.
Static batching puts them on the same bus and makes the bus finish the whole trip before taking new passengers. Even though B and C finish early, their seats cannot be reassigned until A is done. The GPU keeps running, but it holds memory for finished requests and computes empty padding tokens. That is the waste.
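The waste is easy to put a number on. The sketch below counts slot-steps: one seat occupied for one generation iteration. The output lengths for A, B, and C are hypothetical and only there to make the arithmetic concrete.

```python
def static_batch_waste(output_lengths):
    """Return (steps, useful_slot_steps, wasted_slot_steps) for one static batch.

    A static batch runs until its longest request finishes, so every
    shorter request leaves its slot idle (or padded) for the remainder.
    """
    steps = max(output_lengths)
    total_slot_steps = steps * len(output_lengths)
    useful = sum(output_lengths)
    return steps, useful, total_slot_steps - useful


# Hypothetical output lengths in tokens.
lengths = {"A": 100, "B": 20, "C": 10}

steps, useful, wasted = static_batch_waste(list(lengths.values()))
utilization = useful / (useful + wasted)

print(steps, useful, wasted)     # 100 130 170
print(round(utilization, 2))     # 0.43
```

With these made-up lengths, more than half of the slot-steps in the batch do no useful work.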
The Fixed Tour Bus vs. The Dynamic City Transit Bus

I like thinking about the difference between static and in-flight batching as the difference between a pre-booked tour bus and a public city transit bus:
- The Fixed Tour Bus (Static Batching): A tour bus leaves the station with a set passenger list. Even if a passenger decides to get off early at stop 2, their seat must remain empty for the rest of the trip. The bus cannot pick up new passengers on the road. Instead, it must complete the entire tour and return to the station before a new group can board.
- The Dynamic City Transit Bus (In-Flight Batching): A transit bus runs a continuous loop. As soon as a passenger reaches their destination and steps off, the bus pauses briefly at the next stop, lets a new passenger board to fill the empty seat, and immediately continues its journey.
In LLM serving, the bus is the active batch. A seat is not just a "batch slot." It represents GPU memory and KV cache capacity. Getting off means a request has hit an end condition. Boarding means a new request has enough memory budget to join the active generation loop.
[Figure: LLM serving intuition. The bus that changes passengers while moving. Each seat is a batch slot backed by GPU memory and KV cache; each tick is one generation iteration.]
The important part is that the batch is no longer a fixed group of requests. It is a dynamic, fluid variable.
In-flight Batching

In-flight batching is also called continuous batching or iteration-level batching. So how do engines like vLLM or TensorRT-LLM actually implement this "dynamic bus" in code? They shift the scheduling boundary from the request level to the iteration level.
At every generation iteration, the scheduler asks:
- Which requests are still active?
- Which requests just finished?
- Which new requests are waiting?
- Is there enough KV cache space?
- Can we add new work without hurting latency too much?
Instead of waiting for a batch of requests to finish entirely before running the next batch, the execution engine runs a single forward pass of the transformer model (generating one token for each active request), pauses briefly to look at the queue, and rebuilds the batch for the very next token.
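Here is a minimal sketch of that iteration-level loop as a toy simulation. It is not how vLLM or TensorRT-LLM are implemented. The `Request` class, the `run_inflight` function, and every number are hypothetical. Two loud simplifications: prefill is treated as free and folded into admission, and each request reserves KV cache for its prompt plus its full output up front.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    prompt_tokens: int
    max_new_tokens: int
    generated: int = 0

    def kv_tokens_needed(self):
        # Simplification: reserve cache for the prompt plus the full output.
        return self.prompt_tokens + self.max_new_tokens


def run_inflight(requests, max_slots, kv_budget_tokens):
    """Simulate iteration-level scheduling; return {name: step it finished}."""
    waiting = deque(requests)
    active = []
    kv_used = 0
    finished_at = {}
    step = 0

    while waiting or active:
        # Boarding: admit queued requests while a slot and cache space exist.
        while (waiting and len(active) < max_slots
               and kv_used + waiting[0].kv_tokens_needed() <= kv_budget_tokens):
            req = waiting.popleft()
            kv_used += req.kv_tokens_needed()
            active.append(req)

        if not active:
            raise ValueError("next request can never fit in the KV budget")

        # One generation iteration: every active request gains one token.
        step += 1
        for req in active:
            req.generated += 1

        # Getting off: retire finished requests and free their cache.
        still_running = []
        for req in active:
            if req.generated >= req.max_new_tokens:
                kv_used -= req.kv_tokens_needed()
                finished_at[req.name] = step
            else:
                still_running.append(req)
        active = still_running

    return finished_at


# Hypothetical workload: three seats, four requests.
workload = [
    Request("A", prompt_tokens=10, max_new_tokens=6),
    Request("B", prompt_tokens=10, max_new_tokens=2),
    Request("C", prompt_tokens=10, max_new_tokens=3),
    Request("D", prompt_tokens=10, max_new_tokens=2),
]
finished = run_inflight(workload, max_slots=3, kv_budget_tokens=100)
print(finished)  # {'B': 2, 'C': 3, 'D': 4, 'A': 6}
```

In this toy run, D takes B's seat as soon as B finishes at step 2 and is done by step 4. With a static batch of three, D would have to wait for A to finish at step 6 and would not complete until step 8.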
Of course, production systems are more complicated. They deal with priorities, timeouts, chunked prefill, speculative decoding, multiple GPUs, tensor parallelism, and fairness. But this loop is the shape of the idea.
Why It Matters

The naive way to improve throughput is to make the batch larger. That works until it does not. Bigger batches can increase throughput, but they can also make users wait longer before the first token. In a chatbot, the first token matters a lot. A user can tolerate a long answer streaming over time, but waiting too long before anything appears feels broken. So LLM serving has a slightly different objective than plain inference:
- maximize tokens/sec
- without destroying time to first token
- without wasting KV cache
- without letting long requests block short ones
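The first two of these goals are easy to measure from per-request timestamps. The sketch below is one way to do it. The `RequestTrace` record and the sample numbers are hypothetical and do not come from any real server or logging format.

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    arrival_s: float      # when the request entered the queue
    first_token_s: float  # when the first output token was emitted
    finish_s: float       # when the last output token was emitted
    output_tokens: int


def time_to_first_token(trace):
    """Seconds the user waited before anything appeared."""
    return trace.first_token_s - trace.arrival_s


def throughput_tokens_per_s(traces):
    """Output tokens per second over the whole observed window."""
    start = min(t.arrival_s for t in traces)
    end = max(t.finish_s for t in traces)
    return sum(t.output_tokens for t in traces) / (end - start)


# Hypothetical traces: one long request and one short one.
traces = [
    RequestTrace(arrival_s=0.0, first_token_s=0.5, finish_s=4.0, output_tokens=100),
    RequestTrace(arrival_s=1.0, first_token_s=1.25, finish_s=2.0, output_tokens=20),
]

ttfts = [time_to_first_token(t) for t in traces]
tps = throughput_tokens_per_s(traces)

print(ttfts)  # [0.5, 0.25]
print(tps)    # 30.0
```

Tracking both numbers together matters because a change that raises tokens/sec, such as a larger batch, can quietly push time to first token in the wrong direction.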
In-flight batching resolves these constraints by keeping the active batch dense with useful work on every iteration. By making the batch dynamic, we avoid the idle slots and padding of static batching. The LLM itself gets all the attention, but the scheduler is what makes it economically viable.