{"slug": "how-much-vram-do-you-actually-need-to-run-llama-3-or-gemma-locally", "title": "How much VRAM do you actually need to run Llama 3 or Gemma locally?", "summary": "A developer calculated the actual VRAM requirements for running Llama 3 8B and Gemma 2 9B locally, revealing that the KV cache can consume far more memory than the model weights, especially at longer context lengths. For example, Llama 3 8B at Q4_K_M uses about 4.3GB for weights but the KV cache grows from 1GB at 8K context to 16GB at 128K, often causing out-of-memory errors. Gemma 2 9B requires even more VRAM due to its larger KV cache, despite having only one billion more parameters.", "body_md": "Every few days someone in a local LLM thread asks the same question: \"will this run on my 3060?\" And the answers are almost always vibes. \"Should be fine.\" \"Probably need to quantize.\" Nobody shows the math, so you download 16GB, load it up, and find out the hard way.\n\nI did exactly that a while back. Grabbed an 8B model, it loaded fine on a 12GB card, I felt clever, and then it OOM'd about 20,000 tokens into a long document. The weights fit. The KV cache didn't. That gap is the whole reason for this post.\n\nSo here is the actual math, with real numbers for Llama 3 and Gemma, including the part that surprised me, where two models that look identical on paper need very different amounts of memory.\n\nWhen you run a model locally, your GPU memory goes to three places:\n\n- Model weights\n- KV cache\n- Overhead (CUDA context, activation scratch space, allocator gaps)\n\nMost \"how much VRAM\" answers only talk about the first one. That is the mistake.\n\nThis one is simple. The weights take up `parameters × bytes per weight`. FP16 (half precision, the unquantized baseline here) is 2 bytes per weight, and quantization shrinks that:\n\n| Format | Bytes/weight | Llama 3 8B weights |\n|---|---|---|\n| FP16 | 2.0 | ~15 GB |\n| Q8_0 | ~1.06 | ~8 GB |\n| Q5_K_M | ~0.73 | ~5.5 GB |\n| Q4_K_M | ~0.58 | ~4.3 GB |\n| Q3_K_M | ~0.46 | ~3.5 GB |\n\nQ4_K_M is the one I reach for. 
It is the usual sweet spot: a bit over a quarter of the FP16 size, with quality that is hard to tell apart from FP16 for most tasks. So an 8B model is about 4.3GB of weights. Easy. Fits anything.\n\nAnd that is the number that lies to you, because it is only part of the story.\n\nWhen a model generates text, it caches the key and value vectors for every token it has already seen, so it does not recompute them on every new token. That cache is the KV cache, and it grows linearly with context length. Long prompt, big cache.\n\nThe formula:\n\n```\nKV bytes = 2 × layers × kv_dim × context_length × bytes_per_element\n```\n\nThe leading 2 is one slot for keys and one for values. For Llama 3 8B that is 32 layers, a KV dimension of 1024 (it uses grouped-query attention, so there are fewer KV heads than query heads: 8 KV heads × a head dimension of 128), and 2 bytes per element for an FP16 cache:\n\n```\n2 × 32 × 1024 × 8192 × 2  ≈  1 GB at 8K context\n```\n\nSo far so good, 1GB is nothing. But watch what happens as the context grows, because the weights stay put and the cache does not:\n\n| Context length | KV cache (FP16) |\n|---|---|\n| 8K | ~1 GB |\n| 32K | ~4 GB |\n| 128K | ~16 GB |\n\nSixteen gigabytes of KV cache for a model whose weights are four. That is why your model loads fine and then dies halfway through a long document. You did not run out of room for the model. You ran out of room for its memory of the conversation.\n\nCUDA reserves some memory, activations need scratch space, and allocators leave gaps. I budget about 10% on top of weights plus cache. It is a rule of thumb, not a law, but it keeps you from cutting it too fine.\n\nQ4_K_M weights (about 4.3GB) plus 1GB of KV at 8K plus 10% overhead lands around 5.8GB total. That fits a 12GB card with plenty of headroom, and even an 8GB card with a little room to spare. Push the context to 32K and you are at about 9GB, still fine on 12GB. Go to a 128K context and the KV cache alone is several times bigger than the weights, and now you need a 24GB card.\n\nSame model, same quant. 
The only thing that changed was how much text you fed it.\n\nGemma 2 9B and Llama 3 8B look like the same weight class. A billion parameters apart, both run on a normal gaming GPU, so you would assume they need about the same VRAM.\n\nRun the math. The weights are close, a touch over 4GB for Llama and about 5GB for Gemma at Q4_K_M. But the KV cache at 8K is roughly 2.6GB for Gemma, not 1GB. Gemma uses a larger head dimension and more layers, so its kv_dim is double Llama's (2048 versus 1024) and it has ten more layers to cache (42 versus 32). Total comes out around 8.4GB, versus Llama's 5.8GB.\n\nA billion more parameters, but about 2.5GB more VRAM, most of it hiding in the KV cache. You would never guess that from the parameter count, and it is exactly the kind of thing that turns \"should fit\" into an OOM at the worst moment.\n\nWorking this out per model, per quant, per context length got old, so I built a calculator that does it: [LLM VRAM Calculator](https://codeswap.net/llm/llm-vram-calculator/). Pick a model (or punch in your own params, layers, and KV dim), choose a quant and a context length, and it breaks out weights, KV cache, and overhead, then tells you which GPUs it fits on. It runs in the browser, and nothing gets uploaded.\n\nA few things worth knowing once you can see the breakdown:\n\nThe rule of thumb I actually use: take the weight size from your quant, add about 1GB of KV per 8K of context for a 7 to 8B model with grouped-query attention (more for Gemma-style architectures), then 10% on top. Or skip the arithmetic and check the calculator before you download 16 gigabytes.\n\nIf you run something with a wildly different memory profile than the parameter count suggests, I would genuinely like to hear it. 
Those are the ones worth knowing about before you hit buy on a GPU.", "url": "https://wpnews.pro/news/how-much-vram-do-you-actually-need-to-run-llama-3-or-gemma-locally", "canonical_source": "https://dev.to/sathvic_kollu/how-much-vram-do-you-actually-need-to-run-llama-3-or-gemma-locally-3heg", "published_at": "2026-06-17 03:56:47+00:00", "updated_at": "2026-06-17 04:51:37.402418+00:00", "lang": "en", "topics": ["large-language-models", "ai-infrastructure", "developer-tools"], "entities": ["Llama 3", "Gemma", "Q4_K_M", "FP16", "KV cache", "CUDA", "8B model", "9B model"], "alternates": {"html": "https://wpnews.pro/news/how-much-vram-do-you-actually-need-to-run-llama-3-or-gemma-locally", "markdown": "https://wpnews.pro/news/how-much-vram-do-you-actually-need-to-run-llama-3-or-gemma-locally.md", "text": "https://wpnews.pro/news/how-much-vram-do-you-actually-need-to-run-llama-3-or-gemma-locally.txt", "jsonld": "https://wpnews.pro/news/how-much-vram-do-you-actually-need-to-run-llama-3-or-gemma-locally.jsonld"}}