{"slug": "pagedattention-is-more-than-virtual-memory", "title": "PagedAttention is more than virtual memory", "summary": "PagedAttention, a memory optimization technique in the vLLM inference server, applies virtual memory concepts to manage the KV cache in large language models, improving throughput by reducing fragmentation. The KV cache grows rapidly with sequence length, limiting batch sizes and thus throughput; PagedAttention enables non-contiguous storage and sharing, boosting efficiency but introducing potential privacy leaks.", "body_md": "PagedAttention is more than virtual memory\n\nIf you’re reading this via email, I recommend reading in browser to see the visualisations\n\nIn this week’s missive I want to talk about PagedAttention, the memory optimisation innovation in vLLM. If you’ve never heard of vLLM, it’s an LLM inference server that’s used to efficiently serve models. Now, PagedAttention is very like virtual memory in operating systems, and indeed the vLLM paper makes that exact comparison. It would be a perfect fit for my “AI infrastructure is recreating computer science fundamentals” thesis, except there are already about half a million posts telling you that PagedAttention is like virtual memory.\n\nSo I’m going to do something more interesting. We’ll start off by seeing why serving LLMs is memory intensive (stand up, KV cache!) and, yes, understand how PagedAttention uses ideas from OS virtual memory systems to optimise memory usage. But I want the bulk of this post to focus on how this optimisation entails several interesting consequences that don’t get as much attention. 
For instance: how you get virtual memory’s payoff without the hardware that usually makes it fast, why the very same sharing that saves memory can leak one user’s prompt to another, and how the cache turned into something you can share across machines (and why you’d care).\n\nBy the end of this post you’ll have a good grasp of the computer architectural issues around serving LLMs, the optimisations coming into use and how they link back to existing computer science patterns.\n\nRespect the KV cache\n\nYou probably know that large language models can be very… large. Frontier models have billions or trillions of parameters, all needing storage. Even a smallish model intended for local inference is going to require at least a few gigabytes of memory. But when the model starts generating output, memory is swallowed by another source.\n\nBecause decoder models use masked self-attention, each new token attends to every token before it. Attention needs a key and a value vector for every earlier token in every layer of the network. These vectors never change once computed, so the obvious optimisation is to memoise – compute once and store for reuse. That is the KV cache and it gets big quickly.\n\nFor a worked example, Llama 3 8B has 32 layers and, because it uses grouped-query attention, just 8 key-value heads per layer, each storing a key and a value vector of length 128 at two bytes per element. That’s 32 × 8 × 2 × 128 × 2 bytes = 128 KB of cache per token, adding up to half a gigabyte for a single 4,096-token conversation. An 80-layer 70B model with the same key-value head layout needs 320 KB per token. Imagine this during a long, agentic workflow reading lots of files into context and you can see how the KV cache will rapidly balloon in size with no warning of when it’ll stop.\n\nManaging the KV cache is critical for LLM inference performance because it limits throughput. A GPU generating text spends most of its time moving weights rather than doing maths. 
Every decoding step streams all 16 GB of Llama 3 8B’s weights (8 billion parameters at two bytes each) past the compute units, producing one token for each running sequence. Serve one user’s sequence and you get one token. Serve forty users in one batch and you get forty tokens in the same time. Generation is bound by memory bandwidth, not compute, and the same weight-read serves the whole batch at once. So the more sequences you fit in memory, the more tokens each read produces, and your throughput rises with the batch size. The catch is that the batch can only be as large as the KV cache has room for, so every byte you waste on fragmentation is a sequence you can’t run and a user you can’t serve.\n\nThat’s why the vLLM team thought wasted KV cache memory was worth fixing with a whole new memory manager in software. PagedAttention is virtual memory for the KV cache. Before vLLM, servers stored each request’s KV cache in a contiguous chunk of GPU memory sized for the longest reply allowed. A request that stopped after 100 tokens kept its full allocation anyway, and requests finishing at different times left gaps too small to use, so only 20–40% of cache memory held live tokens.\n\nOperating system engineers met exactly this problem way back in the 1960s and solved it with paging. For a fuller treatment read The Computer Science Book, but very briefly the solution is: chop memory into fixed-size pieces (pages), let any piece sit anywhere, and keep a per-program page table translating the addresses the program sees into wherever its bytes actually are. The program thinks its address space is arranged consistently in memory while, behind the scenes, the OS is free to rearrange pages to optimise memory usage and reduce fragmentation.\n\nvLLM does the same to the KV cache: virtual blocks of sixteen tokens, a pool of physical blocks, and a per-sequence block table mapping virtual token positions to their physical counterparts. Fragmentation falls from 60–80% of memory to near zero. 
The only wasted memory is the unused tail of each request’s last block.\n\nWatch the visualisation below to understand how it works. It shows the same GPU serving the same request stream. On the left is a naive memory allocator and on the right is a vLLM-style paged allocator:", "url": "https://wpnews.pro/news/pagedattention-is-more-than-virtual-memory", "canonical_source": "https://thecomputersciencebook.com/posts/pagedattention-is-more-than-virtual-memory/", "published_at": "2026-06-16 08:45:33+00:00", "updated_at": "2026-06-16 08:48:17.039784+00:00", "lang": "en", "topics": ["large-language-models", "ai-infrastructure", "ai-tools"], "entities": ["vLLM", "PagedAttention", "Llama 3 8B", "KV cache"], "alternates": {"html": "https://wpnews.pro/news/pagedattention-is-more-than-virtual-memory", "markdown": "https://wpnews.pro/news/pagedattention-is-more-than-virtual-memory.md", "text": "https://wpnews.pro/news/pagedattention-is-more-than-virtual-memory.txt", "jsonld": "https://wpnews.pro/news/pagedattention-is-more-than-virtual-memory.jsonld"}}