{"slug": "kv-cache-explained-like-you-re-an-llm-engineer", "title": "KV Cache Explained Like You're an LLM Engineer", "summary": "The KV cache is a critical optimization for LLM inference that stores the Key and Value matrices from previously generated tokens, eliminating the need to recompute attention over the entire sequence at each generation step. Without it, autoregressive generation would require a full forward pass through the transformer for every new token, making long responses computationally prohibitive. By caching K and V values for already-processed tokens, the KV cache trades increased memory usage for dramatically faster generation, enabling practical deployment of large models.", "body_md": "How transformer inference actually works under the hood, and why KV cache is the single most important optimization keeping your LLM from crawling.\nIf you've ever wondered why LLMs respond fast even on long prompts, the answer is KV cache. But most explanations stop at \"it stores keys and values.\" This goes deeper.\nWhat You'll Learn\nBy the end of this article you'll understand:\nIntroduction: Why LLM Inference is Expensive\nLet's start with an uncomfortable truth.\nWhen you send a prompt to GPT-4 or Claude and watch that first token appear, your GPU has just burned through trillions of floating-point operations before producing a single character. And then, for every subsequent token in the response, it does it again.\nNot a small version. The full computation. Attention over the entire sequence. Every time.\nWithout optimization, a 7B parameter model generating a 200-token response would recompute attention across the full growing sequence 200 times. For a 70B model on a context of 4,096 tokens, that's not just slow, it's practically unusable in production.\nThis is the core economics problem of LLM inference: autoregressive generation is inherently sequential and expensive. 
You can't parallelize generation across output tokens the way you parallelize training across a batch. Each new token depends on every token that came before it.\nKV cache is the engineering solution that makes modern LLM inference economically viable. It's not magic; it's a deliberate memory-compute tradeoff. Understanding it deeply is the difference between an ML engineer who deploys models and one who optimizes them.\nWhat Happens During Token Generation\nBefore we talk about caching, let's understand what we're caching.\nLLMs generate text one token at a time, left to right. Each generation step:\nTakes the full input prompt + all previously generated tokens\nRuns a complete forward pass through the transformer\nProduces a probability distribution over the vocabulary\nSamples the next token from that distribution\nThen the new token is appended to the sequence, and the process repeats.\nHere's that loop in Python, with caching explicitly turned off:\nimport torch\nfrom transformers import AutoTokenizer, AutoModelForCausalLM\nmodel = AutoModelForCausalLM.from_pretrained(\"meta-llama/Llama-2-7b-hf\")\ntokenizer = AutoTokenizer.from_pretrained(\"meta-llama/Llama-2-7b-hf\")\ninput_ids = tokenizer.encode(\"The capital of France is\", return_tensors=\"pt\")\ngenerated = input_ids\nfor _ in range(50):  # generate 50 tokens\n    with torch.no_grad():\n        # Full forward pass over the whole sequence at every step: expensive!\n        outputs = model(generated, use_cache=False)\n    logits = outputs.logits[:, -1, :]  # last token's logits\n    next_token = torch.argmax(logits, dim=-1)  # greedy pick, shape [batch]\n    generated = torch.cat([generated, next_token.unsqueeze(-1)], dim=-1)\nprint(tokenizer.decode(generated[0]))\nNotice the problem: by the final iteration, generated holds the prompt plus 49 new tokens, and we run a full transformer forward pass over all of them just to predict one more. 
The cost of each step keeps growing with sequence length.\nThe question is: why do we need to reprocess all previous tokens every time?\nTransformer Attention: A Quick Refresher\nTo understand why KV cache exists, you need to understand what attention actually computes.\nThe core of transformer inference is multi-head self-attention. For each layer and each token position, attention computes three projections:\nQ (Query): \"What am I looking for?\"\nK (Key): \"What do I contain?\"\nV (Value): \"What do I actually carry?\"\nThe attention output for position i is:\nAttention(Q_i, K, V) = softmax(Q_i · K^T / √d_k) · V\nIn words: token i asks a question (Q), compares it against every token's key (K) to get attention weights, then uses those weights to take a weighted sum of the values (V).\nFor a sequence of n tokens, each token attends to up to n tokens (in a causal decoder, itself and everything before it). Computed naively, this is O(n²) in both time and memory, which is why long contexts hurt so much.\nHere's the key insight:\nQ changes at every step. But K and V for already-processed tokens do NOT change.\nOnce a token has been processed by the transformer, its Key and Value projections are fixed. Under the causal mask they depend only on that token and the tokens before it, never on future tokens.\nThis is the mathematical justification for KV cache.\nWhy Recomputing Attention is Inefficient\nLet's make this concrete with numbers.\nA standard LLaMA-2 7B model has:\n32 transformer layers\n32 attention heads\nHidden dimension of 4096\nHead dimension of 128 (4096 / 32)\nFor a single token, the K and V projections for one head at one layer are each vectors of size 128. Across 32 heads and 32 layers, storing the KV state for one token costs:\n2 (K and V) × 32 (layers) × 32 (heads) × 128 (head_dim) × 2 bytes (fp16)\n= 524,288 bytes ≈ 0.5 MB per token\nNow imagine a prompt of 2,048 tokens:\n2,048 tokens × 0.5 MB = 1 GB of KV state\nWithout caching, every decode step recomputes that entire 1 GB of KV state from scratch. 
Over 200 decode steps, that is roughly 200 GB worth of K and V values recomputed (more in practice, since the sequence grows with every step), just to generate a few hundred words.\nDecode is also limited by memory bandwidth. On an A100 (about 2 TB/s), just reading the fp16 weights of a 13B model once per step takes:\n13B params × 2 bytes/param = 26 GB per forward pass\n26 GB / 2000 GB/s = ~13ms per token → ~77 tokens/sec ceiling\nAny redundant recomputation cuts directly into this budget.\nWhat KV Cache Stores\nThe KV cache stores the Key and Value tensors for all previously processed tokens, so they don't need to be recomputed on subsequent decode steps.\nHere's the conceptual layout:\nEach layer keeps two tensors, K and V, each of shape [batch, seq_len, num_heads, head_dim]. seq_len grows by one with every generated token.\nAt each decode step, the model:\nComputes Q, K, and V for the new token only\nAppends the new K and V to the cache\nAttends from the new Q over every cached K and V\nAttention with KV cache:\nimport math\nimport torch\ndef attention_with_cache(query_new, key_cache, value_cache, key_new, value_new):\n    # Shapes: query_new [b, h, d]; key_new, value_new [b, 1, h, d]; caches [b, s, h, d]\n    d_k = query_new.shape[-1]\n    # Append new K/V to cache along the sequence dimension\n    keys = torch.cat([key_cache, key_new], dim=1)\n    values = torch.cat([value_cache, value_new], dim=1)\n    # New query attends to ALL keys (cached + new)\n    scores = torch.einsum(\"bhd,bshd->bhs\", query_new, keys) / math.sqrt(d_k)\n    weights = torch.softmax(scores, dim=-1)\n    output = torch.einsum(\"bhs,bshd->bhd\", weights, values)\n    return output, keys, values  # return updated cache\nResult: instead of recomputing K and V for all n tokens at every step, we project only the one new token, a constant cost no matter how long the sequence has grown. The attention read over the cache is still O(n) per step, which is why decode ends up bound by memory bandwidth.\nPrefill Phase vs. Decode Phase\nProduction LLM inference has two fundamentally different operating modes.\nPrefill Phase\nWhen you first submit a prompt, the model processes all prompt tokens in parallel:\nPrompt: [T_0, T_1, T_2, ..., T_2047]\n↓ ↓ ↓ ↓\n[All processed simultaneously via batch matrix ops]\n↓\n[Full KV cache populated for all 2048 positions]\n↓\n[First output token generated]\nPrefill is compute-bound: it consists of large matrix multiplications across all positions at once. GPUs excel here.\nPrefill time largely determines your Time to First Token (TTFT). 
Users feel this as the delay before the first character appears.\nDecode Phase\nAfter prefill, each new token is generated one at a time:\nStep 1: New token T_2048\n→ Compute Q, K, V for T_2048 only\n→ Append K_2048, V_2048 to cache\n→ Attend over positions 0..2048\n→ Sample T_2049\nStep 2: New token T_2049\n→ Cache now has 2050 entries\n→ Attend over positions 0..2049\n→ Sample T_2050\nDecode is memory-bandwidth-bound — constantly reading the growing KV cache from GPU HBM. Compute cores sit largely idle waiting for memory reads.\nDecode step memory reads (LLaMA 7B, 2048 cache):\nModel weights: ~14 GB\nKV cache: ~1 GB (and growing)\nTotal: ~15 GB\nAt 2 TB/s bandwidth: ~7.5ms/token → ~133 tokens/sec theoretical max\nGPU Memory and KV Cache Growth\nHere's where things get expensive fast.\nKV cache size =\nbatch_size × seq_len × num_layers × num_heads × head_dim × 2 × dtype_bytes\nReal example — LLaMA-2 13B, batch size 8, 4K context:\nbatch_size = 8\nseq_len = 4096\nnum_layers = 40\nnum_kv_heads = 40\nhead_dim = 128\ndtype_bytes = 2 # fp16\nkv_cache_bytes = (\nbatch_size * seq_len * num_layers\n* num_kv_heads * head_dim * 2 * dtype_bytes\n)\nprint(f\"KV Cache: {kv_cache_bytes / 1e9:.2f} GB\")\n# KV Cache: 26.84 GB\nNow scale to batch 16 at 8K context:\nbatch_size = 16\nseq_len = 8192\n# KV Cache: 107.37 GB ← won't fit on a single A100\nA100-80GB Memory Budget (LLaMA-2 13B, fp16)\n┌─────────────────────────────────────────────┐\n│ Model Weights ~26 GB │\n├─────────────────────────────────────────────┤\n│ KV Cache ~30–40 GB │ ← the battleground\n├─────────────────────────────────────────────┤\n│ Activations / Other ~5–10 GB │\n├─────────────────────────────────────────────┤\n│ CUDA Runtime / Misc ~2–3 GB │\n└─────────────────────────────────────────────┘\nThe KV cache is the only dynamic component. It grows as sequences get longer, shrinks as sequences end, and fragments if not managed carefully. 
Every other component is fixed at load time.\nContinuous Batching and PagedAttention\nThe Problem with Static Batching\nTraditional inference servers used static batching: wait for N requests, run until all finish. But requests finish at different times. 7 out of 8 requests finishing at step 50 while one runs to step 500 means your GPU is 87.5% idle on the long tail.\nContinuous Batching\nModern engines batch at the iteration level, not the request level:\nTime → T0 T1 T2 T3 T4 T5 T6 T7\nSlot 0: [A A A A ←done→ E E E ]\nSlot 1: [B B ←done→ D D D D ]\nSlot 2: [C C C C C C ←done→ F ]\nAs soon as request A finishes, slot 0 immediately starts request E. GPU utilization stays high.\nBut this creates a new problem: KV cache memory fragmentation. Different requests have different lengths. You can't pre-allocate contiguous memory blocks without wasting huge amounts.\nPagedAttention (vLLM)\nvLLM's PagedAttention (2023) solved this with a virtual memory analogy borrowed from OS paging.\nInstead of contiguous KV blocks per request, memory is split into fixed-size pages (typically 16 tokens each). 
A block table maps logical pages to physical GPU memory:\nLogical view (Sequence A):\n[Page 0 → Page 1 → Page 2 → Page 3]\nPhysical GPU memory:\nBlock 7: tokens 0–15 of Sequence A\nBlock 2: tokens 16–31 of Sequence A\nBlock 15: tokens 32–47 of Sequence A\nBlock 3: tokens 48–63 of Sequence A\nBlock table:\nSeq A: { logical 0 → physical 7,\nlogical 1 → physical 2,\nlogical 2 → physical 15,\nlogical 3 → physical 3 }\nBenefits:\nThe vLLM authors report that KV cache memory waste drops from roughly 60–80% in earlier serving systems (fragmentation and over-reservation) to under 4%.\n# Simplified block table concept\nclass BlockTable:\n    def __init__(self, block_size=16, num_blocks=1000):\n        self.block_size = block_size\n        self.free_blocks = list(range(num_blocks))\n        self.block_data = {}  # physical_block_id → KV tensor\n        self.tables = {}  # seq_id → {logical → physical}\n    def allocate_block(self, seq_id):\n        # Map the sequence's next logical block to any free physical block\n        physical = self.free_blocks.pop()\n        table = self.tables.setdefault(seq_id, {})\n        table[len(table)] = physical\n        return physical\n    def get_kv(self, seq_id, logical_block_idx):\n        physical = self.tables[seq_id][logical_block_idx]\n        return self.block_data[physical]\nHow vLLM and Inference Engines Optimize KV Cache\nvLLM in Practice\nfrom vllm import LLM, SamplingParams\nllm = LLM(\n    model=\"meta-llama/Llama-2-7b-hf\",\n    tensor_parallel_size=2,  # split across 2 GPUs\n    gpu_memory_utilization=0.90,  # leave 10% headroom\n    max_model_len=4096,\n)\noutputs = llm.generate(\n    [\"Explain KV cache to an ML engineer\"],\n    SamplingParams(temperature=0.7, max_tokens=512),\n)\nvLLM ships PagedAttention, continuous batching, Flash Attention v2, tensor parallelism, and prefix caching out of the box.\nFlash Attention\nFlash Attention avoids the expensive O(n²) memory allocation by tiling computations to fit in SRAM:\nStandard attention:\nQ, K, V → HBM → compute N×N matrix → HBM → output\nMemory: O(n²)\nFlash Attention:\nQ, K, V → SRAM tiles → fused kernel → output\nNever materializes full N×N matrix in HBM\nMemory: O(n)\nSpeed: 2–4× faster\nPrefix Caching\nIf thousands of users share the same system prompt, their KV caches for that prefix are identical. 
Compute it once, share it everywhere:\nWithout prefix caching:\nRequest 1: compute KV for [SYSTEM: 2048 tokens] + [query A]\nRequest 2: compute KV for [SYSTEM: 2048 tokens] + [query B]\nRequest 3: compute KV for [SYSTEM: 2048 tokens] + [query C]\nWith prefix caching:\nRequest 1: compute + store KV for [SYSTEM]\nRequests 2, 3, ...: load cached KV, compute only [query N]\nTTFT reduction: 50–90% on system-prompt-heavy workloads.\nKV Cache Challenges in Long-Context Models\nModern models like GPT", "url": "https://wpnews.pro/news/kv-cache-explained-like-you-re-an-llm-engineer", "canonical_source": "https://dev.to/murali8k/kv-cache-explained-like-youre-an-llm-engineer-gbm", "published_at": "2026-05-20 06:20:37+00:00", "updated_at": "2026-05-20 06:31:32.312007+00:00", "lang": "en", "topics": ["large-language-models", "machine-learning", "artificial-intelligence", "research", "developer-tools"], "entities": ["GPT-4", "Claude", "KV cache"], "alternates": {"html": "https://wpnews.pro/news/kv-cache-explained-like-you-re-an-llm-engineer", "markdown": "https://wpnews.pro/news/kv-cache-explained-like-you-re-an-llm-engineer.md", "text": "https://wpnews.pro/news/kv-cache-explained-like-you-re-an-llm-engineer.txt", "jsonld": "https://wpnews.pro/news/kv-cache-explained-like-you-re-an-llm-engineer.jsonld"}}