I was going over my LLM inference notes from Carnegie Mellon University today and I thought this is one of the important topics I should write about. So here we go 🙂
If you’ve been following along, we’ve talked about the KV cache and speculative decoding. This one is about throughput: how do you serve hundreds of users efficiently at the same time without your GPU sitting idle? Before we get into continuous batching, let’s make sure we’re on the same page about how inference actually works.
The autoregressive generation that we discussed in the previous blog happens in two phases: prefill, where the model processes the entire prompt in one forward pass and fills the KV cache, and decode, where it generates the output one token at a time, with every new token needing another forward pass.
Batching just means running multiple input requests through these phases at the same time. If you’re running an LLM API and 8 users send a request at the same second, you don’t want to process them one by one. That would be painfully slow and wasteful. Instead you group them together into a batch and run them through the model simultaneously. The GPU handles all of them in parallel, which is way more efficient than sequential processing.
The simplest way to do this is static batching. You take the 8 requests, group them together, and process the entire batch from start to finish before accepting any new requests. The GPU pre-allocates memory for all requests in the batch upfront and everyone runs together until every single request completes.
Static batching has one painful flaw. Imagine you batch 8 requests together. Request 1 needs 10 tokens. Request 8 needs 500 tokens. Request 1 finishes at iteration 10, but that GPU slot cannot be freed. It just sits there, empty, for the next 490 iterations while request 8 finishes. Multiply this across all 8 requests and you’re looking at the majority of your GPU doing nothing most of the time. This is called the straggler problem: the slowest request sets the pace for everyone. And since output length is unpredictable, you have no way of knowing upfront how long each request will take, so you end up with massive GPU underutilization. Naive serving systems can top out at around 20–30% GPU utilization even under heavy load (the rest is just idle!!!).
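To put a number on the waste, here is a small sketch. Only the 10 and 500 come from the example above; the other output lengths are made up for illustration, and `static_batch_utilization` is a hypothetical helper, not part of any serving library. It measures how many reserved slot-iterations do useful work, which is only a rough proxy for real GPU utilization.

```python
def static_batch_utilization(output_lengths):
    """Fraction of slot-iterations that do useful work in a static batch.

    The batch runs for as many iterations as its longest request, and
    every slot stays reserved for that whole time.
    """
    total_iterations = max(output_lengths)
    useful = sum(output_lengths)                 # one token per request per iteration
    reserved = total_iterations * len(output_lengths)
    return useful / reserved


# Hypothetical output lengths (in tokens) for the 8 requests in the batch.
lengths = [10, 40, 60, 80, 120, 150, 200, 500]
print(f"slot utilization: {static_batch_utilization(lengths):.0%}")  # slot utilization: 29%
```

With these made-up lengths, 1,160 of the 4,000 reserved slot-iterations produce a token. The single 500-token straggler is what drags the number down.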
Think of it like a restaurant where the entire table has to wait for the slowest eater before anyone new can sit down. Meanwhile half the chairs are empty.
To overcome this underutilization of GPUs and the painful wait time, continuous batching was introduced. The idea is to re-evaluate the batch at every single forward pass (every iteration), instead of locking it in until every request in it has finished.
The moment a request finishes, that slot is immediately freed and a new request from the queue jumps in at the next iteration. No waiting. No idle slots. The batch composition changes dynamically at every step: the scheduler checks which requests finished (hit their end-of-sequence token) and whether there is enough free KV cache memory to bring in a new request. If both conditions are met, the new request joins immediately at the next iteration.
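The scheduling pass described above can be sketched in a few lines. This is a simplified illustration, not how any particular framework implements it: `Request`, `schedule_step`, and the fixed `blocks_needed` per request are all assumptions made for the example (in a real system the KV cache grows as the request decodes, and "finished" means the model emitted its end-of-sequence token).

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    req_id: int
    tokens_left: int      # tokens still to generate (unknown in a real system)
    blocks_needed: int    # KV cache blocks needed to admit the request (simplified)


def schedule_step(running, waiting, free_blocks, max_slots):
    """One scheduling pass, run before every forward pass.

    Returns the updated running list and the free block count.
    """
    # 1. Drop requests that finished and release their KV cache.
    still_running = []
    for req in running:
        if req.tokens_left == 0:
            free_blocks += req.blocks_needed
        else:
            still_running.append(req)

    # 2. Admit waiting requests while there is a free slot and enough memory.
    while waiting and len(still_running) < max_slots:
        if waiting[0].blocks_needed > free_blocks:
            break                                  # not enough KV cache yet
        req = waiting.popleft()
        free_blocks -= req.blocks_needed
        still_running.append(req)

    return still_running, free_blocks


running = [Request(1, 0, 4), Request(2, 7, 4)]     # request 1 has just finished
waiting = deque([Request(3, 20, 4), Request(4, 5, 8)])
running, free = schedule_step(running, waiting, free_blocks=0, max_slots=2)
print([r.req_id for r in running], free)           # [2, 3] 0
```

Request 1 leaves, its 4 blocks are released, and request 3 takes the slot in the same pass. Request 4 stays in the queue because both slots are taken again.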
Let’s make it concrete. Say you have 8 slots and 20 requests waiting in the queue.
With static batching you fill all 8 slots and wait for all 8 to finish. Request 1 finishes at iteration 10 but its slot just sits there until request 8 finishes at iteration 500. You’re burning 490 iterations worth of GPU time on an empty slot. If 6 requests finish early, only 2 of your 8 slots are doing useful work, so you’re running at 25% slot utilization while you wait for the last 2. Painful.
With continuous batching the moment slot 3 finishes at iteration 47, a new request slides in at iteration 48. As long as there are requests waiting in the queue, the GPU never sees an idle slot. All 20 requests get processed faster because the GPU stays full the entire time.
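Here is a toy simulation of that 8-slot, 20-request scenario. It is a sketch under strong assumptions: the output lengths are random and hypothetical, every iteration is treated as costing the same regardless of how many slots are busy, and prefill and memory limits are ignored. The function names are mine, not from any framework.

```python
import random
from collections import deque


def static_batching_iterations(lengths, slots):
    """Fixed batches: each batch runs until its longest request ends."""
    total = 0
    for start in range(0, len(lengths), slots):
        total += max(lengths[start:start + slots])
    return total


def continuous_batching_iterations(lengths, slots):
    """Refill a slot from the queue as soon as its request finishes."""
    queue = deque(lengths)
    running = []                      # tokens left for each active request
    iterations = 0
    while queue or running:
        # Fill any free slots before the forward pass.
        while queue and len(running) < slots:
            running.append(queue.popleft())
        # One forward pass: every active request emits one token.
        running = [left - 1 for left in running]
        running = [left for left in running if left > 0]
        iterations += 1
    return iterations


random.seed(0)
lengths = [random.randint(10, 500) for _ in range(20)]  # hypothetical output lengths
print("static:    ", static_batching_iterations(lengths, slots=8))
print("continuous:", continuous_batching_iterations(lengths, slots=8))
```

Because requests are admitted in the same order in both versions, the continuous version never needs more iterations than the static one. How much it saves depends on how spread out the lengths are.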
The Orca paper, which introduced this idea in 2022, reported throughput improvements of up to roughly 36x at the same level of latency over its request-level (static) batching baseline, NVIDIA FasterTransformer. All of that comes from re-evaluating the batch at every iteration instead of waiting for everyone to finish. Real-world gains, however, depend heavily on your workload, request length distribution, model size, and hardware. But the directional improvement is significant across the board.
If you read my KV cache post you already know where this is going (if you haven’t, it’s worth reading first). Every request in the continuous batch needs its own KV cache blocks. And here’s the thing: these blocks grow at every decoding iteration as the request generates more tokens. So you have a batch where requests are constantly joining, leaving, and growing their KV cache simultaneously (yep, a memory management nightmare).
Without PagedAttention you’d need to pre-allocate a big contiguous chunk of memory for every request upfront, sized for the worst-case output length, just in case. But you have no idea how long each request will be. So you either over-allocate and waste memory, or under-allocate and crash. Neither is great. And if you’re running out of memory you can’t keep the batch full, which defeats the entire point of continuous batching.
PagedAttention fixes this by allocating KV cache in small fixed-size blocks on demand: no upfront reservation, no contiguous memory requirement. As a request generates more tokens it just gets handed a new block. When it finishes those blocks are immediately freed for the next request.
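The on-demand allocation can be sketched with a free list of block ids. This is only an illustration of the idea, not vLLM’s actual implementation: `BlockAllocator`, its method names, and the block size of 16 tokens are assumptions for the example, and the real thing also has to store the tensors and teach the attention kernel to follow the block table.

```python
class BlockAllocator:
    """Toy paged KV cache: hands out fixed-size blocks from a shared free list."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size                # tokens per block (assumed value)
        self.free_blocks = list(range(num_blocks))  # ids of unused physical blocks
        self.block_tables = {}                      # request id -> list of block ids
        self.num_tokens = {}                        # request id -> tokens stored so far

    def append_token(self, req_id):
        """Reserve room for one more token; allocate a block only when needed."""
        table = self.block_tables.setdefault(req_id, [])
        count = self.num_tokens.get(req_id, 0)
        if count == len(table) * self.block_size:   # all current blocks are full
            if not self.free_blocks:
                raise MemoryError("no free KV cache blocks")
            table.append(self.free_blocks.pop())    # any free block will do
        self.num_tokens[req_id] = count + 1

    def free(self, req_id):
        """Return all of a finished request's blocks to the free list."""
        self.free_blocks.extend(self.block_tables.pop(req_id, []))
        self.num_tokens.pop(req_id, None)


alloc = BlockAllocator(num_blocks=4, block_size=16)
for _ in range(17):                                 # 17 tokens need 2 blocks of 16
    alloc.append_token("req-a")
print(len(alloc.block_tables["req-a"]), len(alloc.free_blocks))  # 2 2
alloc.free("req-a")
print(len(alloc.free_blocks))                       # 4
```

The request never reserves more than one partially filled block, and the blocks it holds do not have to sit next to each other in memory.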
This is why PagedAttention and continuous batching are designed to work together. Continuous batching manages when requests enter and leave the batch. PagedAttention manages where their KV cache lives in memory. One handles time, the other handles space. Without both working together, neither works as well.
Continuous batching is not free though. There’s a real tradeoff here that’s worth understanding before we just crank up the batch size and call it a day.
To get us started, imagine we have 8 slots and we’re running 8 requests simultaneously. Each request is competing for GPU cycles with 7 others, which means each individual request takes longer to complete than if it had the GPU all to itself. Higher batch size = higher throughput (more tokens processed per second across all users) but higher latency per individual request.
Go the other way and run a batch size of 2: each request gets way more GPU attention and finishes faster, but you’re leaving 6 slots empty. Lower latency, terrible throughput. And at very small batch sizes you hit a different problem. Decoding is typically memory-bandwidth bound, so every step still has to read the full model weights from GPU memory but only produces a couple of tokens, which leaves most of the compute sitting idle. There’s a sweet spot and most production systems tune for it empirically.
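A toy cost model makes the tradeoff easier to see. Everything here is an assumption for illustration: the linear form of `decode_step_ms` and both constants are made up, not measured on any real model or GPU. The fixed term stands in for work paid once per step no matter the batch size (such as reading the weights), and the per-request term for work that grows with the batch.

```python
def decode_step_ms(batch_size, fixed_ms=20.0, per_seq_ms=0.5):
    """Toy cost of one decode iteration, in milliseconds.

    fixed_ms   : cost paid once per step regardless of batch size (assumed value)
    per_seq_ms : extra cost for each request in the batch (assumed value)
    """
    return fixed_ms + per_seq_ms * batch_size


def tokens_per_second(batch_size):
    """Aggregate throughput: one token per request per step."""
    return batch_size * 1000.0 / decode_step_ms(batch_size)


for batch_size in (1, 2, 8, 32, 128):
    step = decode_step_ms(batch_size)
    print(f"batch={batch_size:>3}  per-token latency={step:6.1f} ms  "
          f"throughput={tokens_per_second(batch_size):7.1f} tok/s")
```

In this model, growing the batch amortizes the fixed cost, so throughput climbs with diminishing returns while every user’s per-token latency creeps up. Real systems have a more complicated curve, which is why the sweet spot is found by measuring.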
So which do you optimize for? It depends entirely on your use case. If you’re building a chatbot serving thousands of users, throughput wins as individual users can tolerate a slightly slower response if it means the system doesn’t fall over under load. If you’re building a latency-critical application like real-time code completion, you might sacrifice some throughput to keep individual responses snappy.
Most production systems today expose this as a tunable parameter. You set your target batch size based on your latency budget and let the scheduler handle the rest.
Continuous batching is table stakes in every serious inference system today. vLLM has it, TensorRT-LLM has it, and SGLang has it. The original idea came from the Orca paper in 2022 and within a year it was in every major serving framework. If you’re running LLM inference in production without continuous batching you are genuinely leaving most of your GPU on the floor. It can be the difference between 20–30% utilization and something close to 100%.
And that’s continuous batching. Honestly one of those ideas that makes you go ‘why wasn’t this always the case’ once you understand it. Just re-evaluate the batch at every iteration, swap finished requests out immediately, keep the GPU full. Simple in hindsight, but it changed how every serious inference system is built today.
The cool part is that none of these three things works in isolation. PagedAttention keeps memory efficient so you can actually keep the batch full. Without it you’d run out of GPU memory before continuous batching could do its job. Continuous batching keeps the GPU busy so PagedAttention’s memory savings actually translate to real throughput gains. And speculative decoding sits on top of all that, squeezing more tokens out of every forward pass by using a draft model to guess ahead. Remove any one of these and the other two are less effective. Together they’re why modern serving systems can handle massive concurrent load without burning through GPU budget.
If you’ve been following along, we’ve now covered the three pillars of LLM inference: memory (KV cache), speed (speculative decoding), and throughput (this one). That’s the foundation that every major inference system, like vLLM, SGLang, and TensorRT-LLM, is built on. Next up I want to dig into how all of this comes together inside a real serving system. Stay tuned.
Continuous Batching: How to Keep Your GPU Actually Busy was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.