Prefill/Decode Disaggregation: Why Your GPU Can’t Do Two Things at Once

I’ve been digging into the LLM inference space since I took my ML systems course at Carnegie Mellon University. Modern LLM inference has a weird systems problem. The first token and every token after it want completely different hardware behavior. That sounds strange at first. After all, it is the same model generating the same response.

But under the hood, those two stages behave so differently that forcing them onto the same GPUs creates a surprisingly expensive compromise.

Prefill/decode disaggregation exists because of that compromise.

Autoregressive generation during inference happens in two phases:

Prefill phase: the model processes all your input tokens at once. If you send a 1,000-token prompt, the model runs one big forward pass over all 1,000 tokens simultaneously, building up the KV cache. This is the first iteration.

Decoding phase: the model generates one token at a time, using the KV cache from all previous tokens. This phase repeats for every token you generate.

These two phases feel similar but they’re fundamentally different workloads. And that difference is what this whole post is about.

You know that feeling when you ask ChatGPT something and just wait. The cursor blinks, nothing happens and then suddenly the response starts streaming.

That wait before the first word appears is TTFT (Time To First Token). It’s dominated by the prefill step, plus any time the request spends queued before prefill starts. The speed at which words stream after that is TPOT (Time Per Output Token), which is determined by the speed of the decode step. You can also think of TPOT as the latency between two consecutive output tokens.
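Both metrics can be read straight off the arrival times of the streamed tokens. Here is a minimal sketch; the function names and the timestamps are made up for illustration, not taken from any serving framework.

```python
from typing import List


def ttft(request_sent: float, token_times: List[float]) -> float:
    """Time To First Token: delay between sending the request and the first token."""
    return token_times[0] - request_sent


def tpot(token_times: List[float]) -> float:
    """Time Per Output Token: mean gap between consecutive output tokens."""
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure TPOT")
    return (token_times[-1] - token_times[0]) / (len(token_times) - 1)


# Hypothetical timestamps in seconds: request sent at t=0.0,
# first token at 0.8 s, then one token every 50 ms.
sent = 0.0
arrivals = [0.8, 0.85, 0.9, 0.95, 1.0]

print(f"TTFT = {ttft(sent, arrivals):.3f} s")
print(f"TPOT = {tpot(arrivals) * 1000:.1f} ms")
```

Note that TPOT is computed only from the gaps after the first token, so a slow prefill does not leak into it.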

A slow TTFT means the product feels broken: users think nothing is happening. A slow TPOT means the response feels sluggish to read. Both hurt user experience, but in completely different ways, and they require completely different optimizations.

The major problem is that on the same GPU, optimizing for one hurts the other.

These two metrics already hint at the problem. The phases do not just differ slightly. They want fundamentally different GPU behavior.

Prefill is compute-bound. During prefill, the model processes the entire prompt at once. For example, if your prompt has 1,000 tokens, all 1,000 tokens flow through every transformer layer together in large matrix multiplications. This keeps the GPU’s tensor cores busy doing heavy compute, so the bottleneck is mostly computation, not memory movement.

Because there’s so much compute happening, techniques like tensor parallelism work well here: the communication overhead (like synchronizing results across GPUs) is relatively small compared to the large amount of computation.

Decode is memory-bound. After prefill, generation becomes autoregressive: the model produces one token at a time. Suppose the model generates the next word, say “cat”. To produce just that single token, every transformer layer still has to run once.

That means for each decode step, the GPU repeatedly does something like:

load layer weights from HBM → compute → move to next layer → repeat

Even though the compute for one token is small, the model weights are still massive (billions of parameters), so a lot of time is spent just fetching weights from HBM (High Bandwidth Memory). The GPU spends more time waiting for data than doing the math.
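A rough way to see the difference is to count how many FLOPs the GPU gets to do per byte of weights it reads. The sketch below uses a deliberately crude model (about 2 FLOPs per parameter per token, each weight read once per forward pass, attention and activation traffic ignored), and the hardware numbers are hypothetical round figures, not the spec of any particular GPU.

```python
def arithmetic_intensity(num_tokens: int, bytes_per_param: int = 2) -> float:
    """FLOPs per byte of weights read for one forward pass.

    Approximation: each parameter costs ~2 FLOPs (multiply + add) per token,
    and each parameter is read from HBM once per forward pass.
    Ignores attention over the KV cache and activation traffic.
    """
    flops_per_param = 2 * num_tokens
    return flops_per_param / bytes_per_param


def bottleneck(num_tokens: int, peak_flops: float, hbm_bandwidth: float) -> str:
    """Compare intensity against the ridge point (FLOP/s divided by bytes/s)."""
    ridge = peak_flops / hbm_bandwidth
    if arithmetic_intensity(num_tokens) >= ridge:
        return "compute-bound"
    return "memory-bound"


# Hypothetical accelerator: 1e15 FLOP/s of fp16 compute, 3e12 bytes/s of HBM bandwidth.
PEAK_FLOPS = 1e15
HBM_BANDWIDTH = 3e12

print("prefill, 1000 tokens:", bottleneck(1000, PEAK_FLOPS, HBM_BANDWIDTH))
print("decode, 1 token:     ", bottleneck(1, PEAK_FLOPS, HBM_BANDWIDTH))
```

Under this model the intensity is simply the number of tokens in the forward pass, which is also why batching many decode requests together helps: it raises the token count per weight read.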

The same GPU has two very different bottlenecks. And because the bottlenecks are different, the way you parallelize each phase across multiple GPUs is also completely different. What works great for prefill actively hurts decode, and vice versa. That’s the core tension, and to understand it properly we need to talk about the three main ways you can split an LLM across GPUs.

When you scale LLM inference across multiple GPUs you have three main strategies. Each one works great for one phase and poorly for the other. Let’s break them down.

Imagine you have a weight matrix W that’s 8192 × 8192. That’s 67 million numbers. In tensor parallelism you split this matrix across GPUs. GPU 1 gets the left half, GPU 2 gets the right half.

When a forward pass runs, each GPU multiplies the input by its shard of the weight matrix simultaneously. The GPUs then combine their partial results with a collective operation, typically an all-reduce, in which the partial results are summed so that every GPU ends up with the full answer. Then the next layer runs the same way.

The communication happens at every single layer. For a 70B model with 80 layers, that’s at least 80 all-reduce operations per forward pass (Megatron-style implementations use two per layer, one after the attention block and one after the MLP).
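How the partial results are combined depends on how the matrix is cut. The toy sketch below uses plain Python lists, with two shards standing in for two GPUs and tiny made-up matrices. It shows both cases: splitting by columns gives each GPU a slice of the output, which is concatenated, while splitting by rows gives each GPU a full-size partial output, which is summed. The summed case is the one an all-reduce implements.

```python
from typing import List

Matrix = List[List[int]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product, enough for a toy example."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def add(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise sum of two equally shaped matrices."""
    return [[p + q for p, q in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


x = [[1, 2, 3, 4],
     [5, 6, 7, 8]]            # 2 tokens, hidden size 4
W = [[1, 0, 2, 1],
     [0, 1, 1, 3],
     [2, 1, 0, 1],
     [1, 2, 1, 0]]            # full 4 x 4 weight matrix

# Column split: each "GPU" owns half of the output columns.
# Partial outputs are concatenated (an all-gather in a real system).
W_left = [row[:2] for row in W]
W_right = [row[2:] for row in W]
y_column = [l + r for l, r in zip(matmul(x, W_left), matmul(x, W_right))]

# Row split: each "GPU" owns half of the rows of W and the matching slice of x.
# Partial outputs have the full shape and are summed (an all-reduce in a real system).
x_first = [row[:2] for row in x]
x_second = [row[2:] for row in x]
y_row = add(matmul(x_first, W[:2]), matmul(x_second, W[2:]))

reference = matmul(x, W)
print(y_column == reference, y_row == reference)
```

Either way, no GPU can start the next layer until the exchange has finished, which is where the per-layer communication cost comes from.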

Why does this work well for prefill and not for decode? For prefill you’re processing 1,000 tokens at once, so each all-reduce syncs results for 1,000 tokens’ worth of computation. The compute is so large that the communication overhead (the time spent exchanging and summing results across GPUs) is a small fraction of the total time, so it’s worth paying.

For decode you’re generating one token. Each all-reduce is syncing results for one token’s worth of computation. The compute is tiny, but you’re still paying the full communication cost at every one of the 80 layers for each token. It’s like organizing a 4-person video call to share a single sentence. The overhead of setting up the call is bigger than the actual conversation.

Instead of splitting weight matrices, pipeline parallelism splits the model layers. If you have an 80-layer model and 4 GPUs, GPU 1 handles layers 1–20, GPU 2 handles layers 21–40, GPU 3 handles layers 41–60, GPU 4 handles layers 61–80.

Think of it like a factory assembly line. The input goes into GPU 1, gets processed through layers 1–20, then the output (called activations) gets passed to GPU 2 which processes layers 21–40, and so on until GPU 4 produces the final output.

The communication here is much simpler than in tensor parallelism: you’re just passing a single activation tensor from one GPU to the next. No all-reduce, no broadcasting to every GPU, just GPU 1 handing off to GPU 2.

This sounds great but pipeline parallelism introduces bubble overhead. Unless the pipeline is fully saturated with enough concurrent work, some GPUs remain underutilized while others are active.
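The size of the bubble can be estimated with an idealized model in which every stage takes exactly one time unit per item and items enter the pipeline back to back. This is a simplification for illustration; real stages are never perfectly balanced.

```python
def bubble_fraction(num_stages: int, num_in_flight: int) -> float:
    """Idle fraction of an idealized pipeline where each stage takes one time unit.

    The whole schedule lasts (num_in_flight + num_stages - 1) steps,
    and each stage is busy for only num_in_flight of them.
    """
    total_steps = num_in_flight + num_stages - 1
    return (num_stages - 1) / total_steps


for items in (1, 4, 32):
    idle = bubble_fraction(4, items)
    print(f"4 stages, {items:>2} items in flight: {idle:.0%} of GPU time idle")
```

With a single item in flight, three of the four GPUs are idle at any moment. The idle share only shrinks as more independent work is pushed through the pipeline.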

For prefill you can fill these bubbles by processing multiple requests in a pipeline: while GPU 2 is processing request 1, GPU 1 starts processing request 2. The pipeline stays full. But for decode, where you’re generating one token at a time for each request, keeping the pipeline full is much harder.

Data parallelism is the simplest idea of the three. Don’t split the model at all. Just run full copies of it on different GPUs and give each copy different requests to handle.

GPU 1 has the complete 70B model and handles users 1–100. GPU 2 has a complete copy and handles users 101–200. They never need to talk to each other. No all-reduce, no activation passing, no pipeline bubbles. Each GPU is fully independent.

The tradeoff is memory. A 70B model in fp16 is 140GB (70 billion parameters at 2 bytes each). Each GPU needs 140GB just to hold the model weights, before any KV cache (expensive!).
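The 140GB figure is just parameter count times bytes per parameter, and data parallelism multiplies it by the number of copies. A small sketch of that arithmetic, using a hypothetical helper name:

```python
def weight_memory_gb(num_params: float, bytes_per_param: int, replicas: int = 1) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes), across all replicas."""
    return num_params * bytes_per_param * replicas / 1e9


print(weight_memory_gb(70e9, 2))               # one fp16 copy
print(weight_memory_gb(70e9, 2, replicas=4))   # four data-parallel copies
```

The KV cache and activations come on top of this, so the real footprint per replica is larger.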

There’s another issue beyond raw hardware efficiency: scheduling.

Imagine two requests arrive at the same time.

Request A has a 200k token prompt and needs a massive prefill pass.

Request B is a normal chat request already in decode and just needs the next token.

If both share the same GPU pool, Request A can delay Request B even though their workloads are completely different. That creates head-of-line blocking: a long, prompt-heavy request can slow down latency-sensitive token generation for everyone else. So the problem isn’t just compute versus memory. It’s also interference between different classes of traffic. That makes the case for disaggregation even stronger.
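The effect is easy to reproduce with a toy first-in, first-out queue. The service times below are invented for illustration, and a real scheduler is far more sophisticated (it can batch and chunk work), but the basic interference looks like this:

```python
from typing import Dict, List, Tuple

Job = Tuple[str, float]  # (name, service time in seconds)


def fifo_completion_times(jobs: List[Job]) -> Dict[str, float]:
    """Run jobs back to back on one worker; return when each one finishes."""
    clock = 0.0
    finished = {}
    for name, service_time in jobs:
        clock += service_time
        finished[name] = clock
    return finished


# Hypothetical service times: a long prefill and a single decode step.
prefill_a = ("A: 200k-token prefill", 10.0)
decode_b = ("B: one decode step", 0.03)

# Shared pool: B queues behind A.
shared = fifo_completion_times([prefill_a, decode_b])

# Separate pools: each request only waits for its own kind of work.
prefill_pool = fifo_completion_times([prefill_a])
decode_pool = fifo_completion_times([decode_b])

print("shared pool:    B done at", shared[decode_b[0]], "s")
print("separate pools: B done at", decode_pool[decode_b[0]], "s")
```

Request A takes just as long in both setups. The only thing that changes is that Request B no longer pays for it.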

At this point, we know that tensor parallelism, pipeline parallelism, and data parallelism all have different trade-offs, but they all make the same assumption:

Prefill and decode run on the same GPUs (and that’s the real problem). So, what if prefill and decode didn’t run on the same GPUs at all? That’s where disaggregation comes in.

Instead of one shared GPU pool, you create two separate GPU pools, each optimized for one job. Prefill GPUs handle incoming prompts and build the KV cache using aggressive batching and tensor parallelism. Decode GPUs focus purely on token generation and are tuned for memory bandwidth and concurrency.

Once prefill finishes, the prefill GPU simply ships the KV cache over the network to a decode GPU, which takes over generation. The request flow now looks like this:

Prompt -> Prefill Cluster -> KV Cache Transfer -> Decode Cluster -> Output Tokens

Now each cluster can use the parallelism strategy that actually works best for its workload (no more compromise).
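The flow can be sketched in a few lines. Everything here is a stand-in: the class names are hypothetical, the "KV cache" is just a list of integers, and the "model" is a toy function. It is only meant to show which component owns which step, not how any real serving system implements it.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class KVCache:
    """Stand-in for per-token key/value tensors; here just one int per token."""
    entries: List[int] = field(default_factory=list)


def toy_next_token(cache: KVCache, vocab_size: int = 100) -> int:
    """Placeholder for a real model's decode step."""
    return (sum(cache.entries) + len(cache.entries)) % vocab_size


class PrefillWorker:
    def run(self, prompt: List[int]) -> KVCache:
        # One pass over the whole prompt builds the cache.
        return KVCache(entries=list(prompt))


class DecodeWorker:
    def generate(self, cache: KVCache, max_new_tokens: int) -> List[int]:
        output = []
        for _ in range(max_new_tokens):
            token = toy_next_token(cache)
            cache.entries.append(token)  # each new token extends the cache
            output.append(token)
        return output


def transfer(cache: KVCache) -> KVCache:
    """Models the network hop: the decode side receives its own copy."""
    return KVCache(entries=list(cache.entries))


def serve(prompt: List[int], max_new_tokens: int) -> List[int]:
    cache = PrefillWorker().run(prompt)                     # prefill cluster
    cache = transfer(cache)                                 # KV cache transfer
    return DecodeWorker().generate(cache, max_new_tokens)   # decode cluster


print(serve([1, 2, 3], max_new_tokens=4))
```

The two worker classes never share state except through the transferred cache, which is what lets each pool be sized and tuned independently.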

Disaggregation isn’t free though :/

When prefill finishes, the KV cache needs to physically move from the prefill GPU to the decode GPU over the network. For long prompts on large models, KV cache size can easily reach hundreds of MB or even multiple GB. That transfer shows up as a gap between prefill finishing and the first token streaming, which directly hurts TTFT, the exact thing disaggregation was supposed to fix.
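To get a feel for the sizes involved: the KV cache stores one key and one value vector per token, per layer, per KV head. The sketch below assumes a model shape roughly like a 70B model with grouped-query attention (80 layers, 8 KV heads, head dimension 128, fp16 values) and a hypothetical 50 GB/s link. Both are assumptions for illustration, so substitute your own numbers.

```python
def kv_cache_bytes(seq_len: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Size of the KV cache for one request: keys and values for every layer."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len


def transfer_seconds(num_bytes: int, link_gbytes_per_s: float) -> float:
    """Time to move the cache over a link, ignoring latency and protocol overhead."""
    return num_bytes / (link_gbytes_per_s * 1e9)


# Assumed model shape, not taken from any specific deployment.
shape = dict(num_layers=80, num_kv_heads=8, head_dim=128)
LINK_GB_PER_S = 50.0  # hypothetical interconnect bandwidth

for tokens in (1_000, 32_000, 128_000):
    size = kv_cache_bytes(tokens, **shape)
    millis = transfer_seconds(size, LINK_GB_PER_S) * 1000
    print(f"{tokens:>7} tokens: {size / 1e9:6.2f} GB, {millis:7.1f} ms to transfer")
```

Under these assumptions each token costs 327,680 bytes of cache, so the cache grows linearly with prompt length and the transfer time grows with it.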

So how do you deal with this? A few approaches:

In practice the transfer cost is worth it: the gains from specialized GPU pools far outweigh the network overhead. But it requires real engineering to get right, and it’s an active area of research.

Disaggregation went from a research paper to production infrastructure embarrassingly fast. The Splitwise paper came out in 2023, and by 2024 many modern serving systems were either adopting or actively experimenting with disaggregation.

SGLang has disaggregation integration and vLLM is adding disaggregation support. The major cloud providers are building their own versions internally. And Mooncake, the serving system behind Kimi (one of the largest LLM deployments in China), is built entirely around disaggregated prefill and decode.

The reason it spread so fast is actually pretty intuitive once you think about it. As context windows got longer (4k to 32k to 128k to 1M tokens), prefill got more expensive at least in proportion, since attention cost grows quadratically with prompt length. The interference problem that was mildly annoying at 4k context becomes completely painful at 128k context. Disaggregation went from a nice optimization to a hard requirement almost overnight.

And that’s the thing about disaggregation. It sounds almost too obvious once you see it. Prefill and decode are not the same workload, they never were, and yet we kept forcing them onto the same GPUs and wondering why latency was unpredictable.

The solution is to have two pools, two jobs, each optimized for exactly what it needs. The KV cache transfer cost is real but manageable. The scheduling isolation alone is worth it.

If you’ve been following this series: KV cache was about memory, speculative decoding was about speed, continuous batching was about throughput. This one is about realizing the architecture itself was the bottleneck all along.

Prefill/Decode Disaggregation: Why Your GPU Can’t Do Two Things at Once was originally published in Towards AI on Medium.

source & further reading

pub.towardsai.net: original article
