{"slug": "prompt-caching-in-llms-the-hidden-optimization-saving-millions-of-gpu-hours", "title": "Prompt Caching in LLMs: The Hidden Optimization Saving Millions of GPU Hours", "summary": "Shrijith Venkatramana, developer of git-lrc, explains how prompt caching in LLMs can dramatically reduce latency and cost by reusing internal representations from previous requests. The technique caches key/value tensors from transformer layers, enabling models to skip recomputation for shared prompt prefixes. This optimization trades GPU memory for computation, saving millions of GPU hours in practice.", "body_md": "*Hello, I'm Shrijith Venkatramana. I'm building git-lrc, an AI code reviewer that runs on every commit. Star Us to help devs discover the project. Do give it a try and share your feedback for improving the product.*\n\nEvery developer eventually discovers the same frustrating pattern.\n\nYour application sends a 20,000-token prompt to an LLM. The first request takes 2 seconds. The next request contains the exact same 20,000 tokens plus a tiny user message at the end.\n\nAnd somehow the model processes the entire thing again.\n\nAt least, that's what many developers assume.\n\nModern LLM systems have a trick called **prompt caching** that can dramatically reduce latency and cost by reusing work from previous requests. But unlike traditional application caches, prompt caching isn't storing generated text. 
It's storing something much deeper inside the model.\n\nTo understand how prompt caching works, we need to follow a prompt all the way through the transformer itself.\n\nWhen a prompt enters a transformer model, it isn't immediately generating text.\n\nFirst, the model must process every input token through every layer of the network.\n\nImagine a prompt like:\n\n```\nSystem: You are a helpful coding assistant.\n\nProject Documentation:\n[20,000 tokens of documentation]\n\nUser: How does authentication work?\n```\n\nBefore generating a single output token, the model performs:\n\n...across dozens or even hundreds of transformer layers.\n\nFor a large model, this preprocessing is often more expensive than generating a short answer.\n\nIf another user asks:\n\n```\nSystem: You are a helpful coding assistant.\n\nProject Documentation:\n[Same 20,000 tokens]\n\nUser: Explain the database schema.\n```\n\nMost of the prompt is identical.\n\nWithout caching, the model would recompute everything from scratch.\n\nPrompt caching exists to avoid that waste.\n\nA common misconception is that prompt caching stores prompt text.\n\nThat's not particularly useful because the model would still need to process the text again.\n\nInstead, modern systems cache the transformer's internal representations.\n\nAfter processing a token through the network, the model produces vectors that represent the token's state at various stages.\n\nThe most important cached data is usually:\n\nThese are generated during self-attention.\n\nOnce a prefix has been processed, those K/V tensors can often be reused.\n\nConceptually:\n\n```\nPrompt\n  ↓\nToken Embeddings\n  ↓\nTransformer Layers\n  ↓\nKey/Value Tensors\n  ↓\nCache\n```\n\nWhen a future request begins with the same prefix, the system loads the cached tensors rather than recomputing them.\n\nThe model effectively starts from the middle of the computation.\n\nPrompt caching builds directly on a mechanism called the KV cache.\n\nDuring inference, 
each attention layer creates:\n\n```\nQ = Query\nK = Key\nV = Value\n```\n\nAttention is computed roughly as:\n\n$$\\text{Attention}(Q,K,V)=\\text{softmax}\\left(\\frac{QK^T}{\\sqrt{d}}\\right)V$$\n\nHere d is the dimension of the key vectors.\n\nWhen generating token 501, the model doesn't need to recompute the keys and values for tokens 1-500.\n\nInstead, it stores the previous K and V tensors.\n\nThis is the standard KV cache used during autoregressive generation.\n\nPrompt caching extends the same idea across requests.\n\nInstead of caching:\n\n```\nRequest A tokens 1-500\n```\n\nit caches:\n\n```\nShared prompt prefix\n```\n\nwhich can then be reused by:\n\n```\nRequest B\nRequest C\nRequest D\n```\n\nas long as the prefix remains identical.\n\nLet's use a realistic example.\n\nSuppose we have:\n\n```\nSystem Prompt: 2,000 tokens\nRepository Documentation: 18,000 tokens\nUser Message: 100 tokens\n```\n\nTotal:\n\n```\n20,100 tokens\n```\n\nAssume a model has:\n\nFor each layer, the system stores K and V tensors for every processed token.\n\nConceptually:\n\n```\nLayer 1:\n  K[20000]\n  V[20000]\n\nLayer 2:\n  K[20000]\n  V[20000]\n\n...\n\nLayer 80:\n  K[20000]\n  V[20000]\n```\n\nThe cache may occupy hundreds of megabytes or even gigabytes depending on:\n\nThis is why prompt caching isn't free.\n\nThe system trades memory for computation.\n\nGPU memory is expensive, but recomputing a 20,000-token prompt repeatedly is often even more expensive.\n\nMost production systems perform prompt caching using prefix matching.\n\nConsider:\n\n```\n[System Prompt]\n[Documentation]\nUser: Explain auth\n```\n\nand\n\n```\n[System Prompt]\n[Documentation]\nUser: Explain database\n```\n\nThe shared prefix is:\n\n```\n[System Prompt]\n[Documentation]\n```\n\nEverything after that differs.\n\nThe cache can be reused because the transformer state for the shared prefix is identical.\n\nBut even a small change inside the prefix invalidates the cache from that point onward:\n\n```\nVersion 1:\nRepository version: 2.1\n\nVersion 2:\nRepository version: 2.2\n```\n\nThat tiny change 
alters tokenization.\n\nDifferent tokens produce different embeddings.\n\nDifferent embeddings produce different K/V tensors.\n\nThe entire downstream computation changes.\n\nThis is why prompt caching systems often require exact token-level matches rather than semantic similarity.\n\nDifferent providers implement prompt caching differently, but the general architecture is similar.\n\n```\nIncoming Request\n       ↓\nPrefix Detection\n       ↓\nCache Lookup\n       ↓\nCache Hit?\n    /      \\\n  Yes      No\n   |         |\nLoad KV   Compute KV\n   |         |\nGenerate Response\n```\n\nThe difficult engineering problems include:\n\nGPU memory is limited.\n\nProviders must decide:\n\nThis resembles operating system page management more than traditional web caching.\n\nLarge serving systems spread requests across many GPUs.\n\nA cached prefix may exist on GPU A while the next request arrives on GPU B.\n\nProviders must either:\n\nA cache created for one customer should not leak information to another customer.\n\nProduction systems must maintain strict isolation boundaries.\n\nRetrieval-Augmented Generation systems are perfect candidates for prompt caching.\n\nImagine a code assistant.\n\nEvery request includes:\n\n```\nSystem Prompt\nRepository Rules\nArchitecture Docs\nCoding Standards\n```\n\nOnly the user question changes.\n\nWithout caching:\n\n```\n20,000 tokens processed\n20,000 tokens processed\n20,000 tokens processed\n20,000 tokens processed\n```\n\nWith caching:\n\n```\n20,000-token prefix processed once\n\nRequest 2:\nreuse cache\n\nRequest 3:\nreuse cache\n\nRequest 4:\nreuse cache\n```\n\nLatency drops.\n\nGPU utilization drops.\n\nCost drops.\n\nThis is one reason why modern coding assistants can feel much faster than their raw context sizes would suggest.\n\nToday's prompt caching mostly relies on exact token matches.\n\nResearchers are exploring more ambitious ideas:\n\nThe challenge is preserving correctness.\n\nExact matches guarantee identical 
transformer states.\n\nApproximate matches introduce uncertainty.\n\nFuture systems may combine both approaches, using exact caches when possible and semantic reuse when beneficial.\n\nPrompt caching is one of the least visible but most impactful optimizations in modern LLM serving.\n\nThe important realization is that the cache is not storing text and it is not storing generated responses.\n\nIt is storing the expensive internal transformer state—primarily key and value tensors—that would otherwise need to be recomputed.\n\nOnce you understand that, prompt caching starts looking less like an application-level optimization and more like a CPU instruction cache or an operating system memory cache: a mechanism for avoiding repeated work by preserving computation that has already been paid for.\n\nAs context windows continue growing from tens of thousands to millions of tokens, do you think exact prefix caching will remain dominant, or will future LLM systems need semantic and approximate caching techniques to stay efficient?\n\n*AI agents write code fast. They also silently remove logic, change behavior, and introduce bugs -- without telling you. You often find out in production.\n\ngit-lrc fixes this. It hooks into git commit and reviews every diff before it lands. 60-second setup. Completely free.*\n\nAny feedback or contributors are welcome! 
It's online, source-available, and ready for anyone to use.", "url": "https://wpnews.pro/news/prompt-caching-in-llms-the-hidden-optimization-saving-millions-of-gpu-hours", "canonical_source": "https://dev.to/shrsv/prompt-caching-in-llms-the-hidden-optimization-saving-millions-of-gpu-hours-4gmm", "published_at": "2026-06-14 18:54:16+00:00", "updated_at": "2026-06-14 19:10:37.977437+00:00", "lang": "en", "topics": ["large-language-models", "ai-infrastructure", "developer-tools"], "entities": ["Shrijith Venkatramana", "git-lrc"], "alternates": {"html": "https://wpnews.pro/news/prompt-caching-in-llms-the-hidden-optimization-saving-millions-of-gpu-hours", "markdown": "https://wpnews.pro/news/prompt-caching-in-llms-the-hidden-optimization-saving-millions-of-gpu-hours.md", "text": 
"https://wpnews.pro/news/prompt-caching-in-llms-the-hidden-optimization-saving-millions-of-gpu-hours.txt", "jsonld": "https://wpnews.pro/news/prompt-caching-in-llms-the-hidden-optimization-saving-millions-of-gpu-hours.jsonld"}}