{"slug": "prefill-decode-disaggregation-why-your-gpu-cant-do-two-things-at-once", "title": "Prefill/Decode Disaggregation: Why Your GPU Can’t Do Two Things at Once", "summary": "Prefill/decode disaggregation separates the two phases of LLM inference—prefill (compute-bound) and decode (memory-bound)—onto different GPUs to avoid the performance compromise of running both on the same hardware. The prefill phase processes all input tokens at once, while the decode phase generates one token at a time, each requiring different GPU behavior and parallelization strategies.", "body_md": "I’ve been digging into the LLM inference space since I took my ML systems course at Carnegie Mellon University. Modern LLM inference has a weird systems problem. The first token and every token after it want completely different hardware behavior. That sounds strange at first. After all, it is the same model generating the same response.\n\nBut under the hood, those two stages behave so differently that forcing them onto the same GPUs creates a surprisingly expensive compromise.\n\nPrefill decode disaggregation exists because of that compromise.\n\nThe autoregressive generation during inference occurs in two phases:\n\n**Prefill phase:** the model processes all your input tokens at once. If you send a 1000 token prompt, the model runs one big forward pass over all 1000 tokens simultaneously, building up the KV cache. This is the first iteration.\n\n**Decoding phase**: the model generates one token at a time, using the KV cache from all previous tokens. This phase repeats for every token you generate.\n\nThese two phases feel similar but they’re fundamentally different workloads. And that difference is what this whole post is about.\n\nYou know that feeling when you ask ChatGPT something and just wait. The cursor blinks, nothing happens and then suddenly the response starts streaming.\n\nThat wait before the first word appears is TTFT (Time To First Token). 
It’s mostly the time taken by the prefill step, plus any time the request spends waiting in a queue. The speed at which words stream after that is TPOT (Time Per Output Token). TPOT is determined by the speed of the decode step. You can also think of this as the latency between two consecutive output tokens.\n\nA slow TTFT means the product feels broken: users think nothing is happening. A slow TPOT means the response feels sluggish to read. Both hurt user experience, but in completely different ways, and they require completely different optimizations.\n\nThe major problem is that on the same GPU, optimizing for one hurts the other.\n\nThis table already hints at the problem. These phases do not just differ slightly. They want fundamentally different GPU behavior.\n\n**Prefill is compute-bound.** During prefill, the model processes the entire prompt at once. For example, if your prompt has **1,000 tokens**, all 1,000 tokens flow through every transformer layer together in large matrix multiplications. This keeps the GPU’s tensor cores busy doing heavy compute, so the bottleneck is mostly **computation**, not memory movement.\n\nBecause there’s so much compute happening, techniques like tensor parallelism work well here: the communication overhead (like synchronizing results across GPUs) is small relative to the amount of computation.\n\n**Decode is memory-bound.** After prefill, generation becomes autoregressive: the model produces **one token at a time**. Suppose the model generates the next word, say *“cat”*. To produce just that single token, every transformer layer still has to run once.\n\nThat means for each decode step, the GPU repeatedly does something like:\n\n**load layer weights from HBM → compute → move to next layer → repeat**\n\nEven though the compute for one token is small, the model weights are still massive (billions of parameters), so a lot of time is spent just fetching weights from **HBM (High Bandwidth Memory)**. 
The GPU spends more time waiting for data than doing the math.\n\nThe same GPU has two very different bottlenecks. And because the bottlenecks are different, the way you parallelize each phase across multiple GPUs is also completely different. What works great for prefill actively hurts decode, and vice versa. That’s the core tension, and to understand it properly we need to talk about the three main ways you can split an LLM across GPUs.\n\nWhen you scale LLM inference across multiple GPUs you have three main strategies. Each one comes with trade-offs that suit one phase better than the other. Let’s break them down.\n\nImagine you have a weight matrix W that’s 8192 × 8192. That’s about 67 million numbers. In tensor parallelism you split this matrix across GPUs. GPU 1 gets one half, GPU 2 gets the other half.\n\nWhen a forward pass runs, each GPU multiplies the input by its shard of the weight matrix simultaneously. Then they communicate via an all-reduce operation, in which the GPUs exchange their partial results and sum them so that every GPU ends up with the final answer. Then the next layer runs the same way.\n\nThe communication happens at every single layer. In the common Megatron-style scheme there are two all-reduces per transformer layer (one after attention, one after the MLP), so for a 70B model with 80 layers that’s roughly 160 all-reduce operations per forward pass.\n\nWhy does this work well for prefill and not for decode? For prefill you’re processing 1,000 tokens at once. Each all-reduce is syncing results for 1,000 tokens’ worth of computation. The compute is so large that the communication overhead, i.e. the time spent exchanging and summing results across GPUs, is a small fraction of total time, so it is worth paying.\n\nFor decode you’re generating one token. Each all-reduce is syncing results for one token’s worth of computation. The compute is tiny, but you’re still paying the communication latency roughly 160 times per token. It’s like organizing a 4-person video call to share a single sentence. 
The overhead of setting up the call is bigger than the actual conversation.\n\nInstead of splitting weight matrices, pipeline parallelism splits the model by layers. If you have an 80-layer model and 4 GPUs, GPU 1 handles layers 1–20, GPU 2 handles layers 21–40, GPU 3 handles layers 41–60, GPU 4 handles layers 61–80.\n\nThink of it like a factory assembly line. The input goes into GPU 1, gets processed through layers 1–20, then the output (called activations) gets passed to GPU 2, which processes layers 21–40, and so on until GPU 4 produces the final output.\n\nThe communication here is much simpler than in tensor parallelism, as you’re just passing a single activation tensor from one GPU to the next. No all-reduce, no broadcasting to every GPU. It’s just GPU 1 handing off to GPU 2.\n\nThis sounds great, but pipeline parallelism introduces bubble overhead. Unless the pipeline is fully saturated with enough concurrent work, some GPUs sit idle while others are active.\n\nFor prefill you can fill these bubbles by processing multiple requests in a pipeline: while GPU 2 is processing request 1, GPU 1 starts processing request 2. The pipeline stays full. But for decode, where you’re generating one token at a time for each request, keeping the pipeline full is much harder, because a request cannot start its next token until the current one has passed through every stage.\n\nData parallelism is the simplest idea of the three. Don’t split the model at all. Just run full copies of it on different GPUs and give each copy different requests to handle.\n\nGPU 1 has the complete 70B model and handles users 1–100. GPU 2 has a complete copy and handles users 101–200. They never need to talk to each other. No all-reduce, no activation passing, no pipeline bubbles. Each GPU is fully independent.\n\nThe tradeoff is memory. A 70B model in fp16 is 140GB. 
Each GPU needs 140GB just to hold the model weights, before any KV cache (expensive!).\n\nThere’s another issue beyond raw hardware efficiency: scheduling.\n\nImagine two requests arrive at the same time.\n\nRequest A has a 200k-token prompt and needs a massive prefill pass.\n\nRequest B is a normal chat request already in decode and just needs the next token.\n\nIf both share the same GPU pool, Request A can delay Request B even though their workloads are completely different. That creates head-of-line blocking. A single long, prompt-heavy request can slow down latency-sensitive token generation for everyone else. So the problem isn’t just compute versus memory. It’s also interference between different classes of traffic.\n\nThat makes the case for disaggregation even stronger.\n\nAt this point, we know that tensor parallelism, pipeline parallelism, and data parallelism all have different trade-offs, but they all make the same assumption:\n\nPrefill and decode run on the same GPUs (and that’s the real problem). So, what if prefill and decode didn’t run on the same GPUs at all? That’s where disaggregation comes in.\n\nInstead of one shared GPU pool, you create two separate GPU pools, each optimized for one job. Prefill GPUs handle incoming prompts and build the KV cache using aggressive batching and tensor parallelism. Decode GPUs focus purely on token generation and are tuned for memory bandwidth and concurrency.\n\nOnce prefill finishes, the prefill GPU ships the KV cache over the network to a decode GPU, which takes over generation. 
The request flow now looks like this:\n\nPrompt -> Prefill Cluster -> KV Cache Transfer -> Decode Cluster -> Output Tokens\n\nNow each cluster can use the parallelism strategy that actually works best for its workload (no more compromise).\n\nThe [Splitwise paper](https://arxiv.org/html/2311.18677v2) came out in 2023 and by 2024 many modern serving systems were already adopting the idea.\n\nDisaggregation isn’t free though :/\n\nWhen prefill finishes, the KV cache needs to physically move from the prefill GPU to the decode GPU over the network. For long prompts on large models, KV cache size can easily reach hundreds of MB or even multiple GB. That transfer shows up as a gap between prefill finishing and the first token streaming, which directly hurts TTFT, the exact thing disaggregation was supposed to fix.\n\nSo how do you deal with this? A few approaches:\n\nIn practice the transfer cost is worth it: the gains from specialized GPU pools far outweigh the network overhead. But it requires real engineering to get right, and it’s an active area of research.\n\nDisaggregation went from a research paper to production infrastructure embarrassingly fast. The Splitwise paper came out in 2023, and since then many modern serving systems have either adopted disaggregation or are actively experimenting with it.\n\nSGLang has disaggregation integration and vLLM is adding disaggregation support. The major cloud providers are building their own versions internally. And Mooncake, the serving system behind Kimi (one of the largest LLM deployments in China), is built entirely around disaggregated prefill and decode.\n\nThe reason it spread so fast is actually pretty intuitive once you think about it. As context windows got longer (4k to 32k to 128k to 1M tokens), prefill got far more expensive, since attention cost grows faster than linearly with prompt length. The interference problem that was mildly annoying at 4k context becomes completely painful at 128k context. 
Disaggregation went from a nice optimization to a hard requirement almost overnight.\n\nAnd that’s the thing about disaggregation. It sounds almost too obvious once you see it. Prefill and decode are not the same workload, they never were, and yet we kept forcing them onto the same GPUs and wondering why latency was unpredictable.\n\nThe solution is to have two pools, two jobs, each optimized for exactly what it needs. The KV cache transfer cost is real but manageable. The scheduling isolation alone is worth it.\n\nIf you’ve been following this series: KV cache was about memory, speculative decoding was about speed, continuous batching was about throughput. This one is about realizing the architecture itself was the bottleneck all along.\n\n[Prefill/Decode Disaggregation: Why Your GPU Can’t Do Two Things at Once](https://pub.towardsai.net/prefill-decode-disaggregation-why-your-gpu-cant-do-two-things-at-once-f11ba0bdd9de) was originally published in [Towards AI](https://pub.towardsai.net) on Medium, where people are continuing the conversation by highlighting and responding to this story.", "url": "https://wpnews.pro/news/prefill-decode-disaggregation-why-your-gpu-cant-do-two-things-at-once", "canonical_source": "https://pub.towardsai.net/prefill-decode-disaggregation-why-your-gpu-cant-do-two-things-at-once-f11ba0bdd9de?source=rss----98111c9905da---4", "published_at": "2026-06-25 16:31:00+00:00", "updated_at": "2026-06-25 16:52:22.547840+00:00", "lang": "en", "topics": ["large-language-models", "ai-infrastructure", "machine-learning"], "entities": ["Carnegie Mellon University", "ChatGPT", "GPU", "HBM"], "alternates": {"html": "https://wpnews.pro/news/prefill-decode-disaggregation-why-your-gpu-cant-do-two-things-at-once", "markdown": "https://wpnews.pro/news/prefill-decode-disaggregation-why-your-gpu-cant-do-two-things-at-once.md", "text": "https://wpnews.pro/news/prefill-decode-disaggregation-why-your-gpu-cant-do-two-things-at-once.txt", "jsonld": 
"https://wpnews.pro/news/prefill-decode-disaggregation-why-your-gpu-cant-do-two-things-at-once.jsonld"}}