# How much VRAM do you actually need to run Llama 3 or Gemma locally?

> Source: <https://dev.to/sathvic_kollu/how-much-vram-do-you-actually-need-to-run-llama-3-or-gemma-locally-3heg>
> Published: 2026-06-17 03:56:47+00:00

Every few days someone in a local LLM thread asks the same question: "will this run on my 3060?" And the answers are almost always vibes. "Should be fine." "Probably need to quantize." Nobody shows the math, so you download 16GB, load it up, and find out the hard way.

I did exactly that a while back. Grabbed an 8B model, it loaded fine on a 12GB card, I felt clever, and then it OOM'd about 20,000 tokens into a long document. The weights fit. The KV cache didn't. That gap is the whole reason for this post.

So here is the actual math, with real numbers for Llama 3 and Gemma, including the part that surprised me, where two models that look identical on paper need very different amounts of memory.

When you run a model locally, your GPU memory goes to three places: the model weights, the KV cache, and runtime overhead.

Most "how much VRAM" answers only talk about the first one. That is the mistake.

The weights are the simple part. They take up `parameters × bytes per weight`. FP16 (half precision, the usual unquantized baseline) is 2 bytes per weight, and quantization shrinks that:

| Format | Bytes/weight | Llama 3 8B weights |
|---|---|---|
| FP16 | 2.0 | ~15 GB |
| Q8_0 | ~1.06 | ~8 GB |
| Q5_K_M | ~0.73 | ~5.5 GB |
| Q4_K_M | ~0.58 | ~4.3 GB |
| Q3_K_M | ~0.46 | ~3.5 GB |

Q4_K_M is the one I reach for. It is the usual sweet spot: roughly a quarter of the FP16 size, with quality that is hard to tell apart for most tasks. So an 8B model is about 4.3GB of weights. Easy. Fits anything.

And that is the number that lies to you, because it is only part of the story.
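For reference, the weight figures in the table can be reproduced with a few lines of Python. This is a minimal sketch: the parameter count of 8.03 billion for Llama 3 8B and the function name `weight_gib` are my assumptions, and the bytes-per-weight values are the approximate ones from the table. The table's numbers line up when a "GB" is treated as 1024³ bytes, so that is what the sketch uses.

```python
GIB = 1024 ** 3  # the table's "GB" figures match binary gigabytes

# Approximate bytes per weight, copied from the table above.
BYTES_PER_WEIGHT = {
    "FP16": 2.0,
    "Q8_0": 1.06,
    "Q5_K_M": 0.73,
    "Q4_K_M": 0.58,
    "Q3_K_M": 0.46,
}


def weight_gib(n_params: float, fmt: str) -> float:
    """Estimate weight size in GiB as parameters x bytes per weight."""
    return n_params * BYTES_PER_WEIGHT[fmt] / GIB


# Assumed parameter count for Llama 3 8B (roughly 8.03 billion).
LLAMA3_8B_PARAMS = 8.03e9

for fmt in BYTES_PER_WEIGHT:
    print(f"{fmt:8s} {weight_gib(LLAMA3_8B_PARAMS, fmt):5.1f} GiB")
```

Swapping in a different parameter count gives the same estimate for any other model, as long as the quant formats behave roughly like the table says.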

When a model generates text, it caches the key and value vectors for every token it has already seen, so it does not recompute them on every new token. That cache is the KV cache, and it grows linearly with context length. Long prompt, big cache.

The formula:

```
KV bytes = 2 × layers × kv_dim × context_length × bytes_per_element
```

The leading 2 covers one slot for keys and one for values. For Llama 3 8B that is 32 layers, a KV dimension of 1024 (it uses grouped-query attention, so there are only 8 KV heads of dimension 128 instead of one per query head), and 2 bytes per element for an FP16 cache:

```
2 × 32 × 1024 × 8192 × 2  ≈  1 GB at 8K context
```

So far so good, 1GB is nothing. But watch what happens as the context grows, because the weights stay put and the cache does not: the same formula gives about 4GB at 32K and about 16GB at 128K.

Sixteen gigabytes of KV cache for a model whose weights are four. That is why your model loads fine and then dies halfway through a long document. You did not run out of room for the model. You ran out of room for its memory of the conversation.
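The KV formula is easy to check in code. The sketch below plugs the Llama 3 8B numbers from the text (32 layers, KV dimension 1024, FP16 cache) into it at three context lengths. The function name `kv_cache_bytes` is mine, and the context lengths are just illustrative powers of two.

```python
GIB = 1024 ** 3


def kv_cache_bytes(layers: int, kv_dim: int, context_length: int,
                   bytes_per_element: int = 2) -> int:
    """KV bytes = 2 x layers x kv_dim x context_length x bytes_per_element."""
    return 2 * layers * kv_dim * context_length * bytes_per_element


# Llama 3 8B values as given in the text: 32 layers, kv_dim of 1024.
for context_length in (8192, 32768, 131072):
    size_gib = kv_cache_bytes(32, 1024, context_length) / GIB
    print(f"{context_length:>7} tokens -> {size_gib:.0f} GiB of KV cache")
```

The growth is strictly linear: every doubling of the context doubles the cache, while the weights do not change at all.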

The third piece is overhead. CUDA reserves some memory, activations need scratch space, and allocators leave gaps. I budget about 10% on top of weights plus cache. It is a rule of thumb, not a law, but it keeps you from cutting it too fine.

Put it together for Llama 3 8B. Q4_K_M weights (about 4.3GB) plus 1GB of KV at 8K plus 10% overhead lands around 5.8GB total. That fits a 12GB card with plenty of headroom, and even an 8GB card with a little room to spare. Push the context to 32K and you are at about 9GB, still fine on 12GB. Go to a 128K context and the KV cache alone is nearly four times the size of the weights, the total is around 22GB, and now you need a 24GB card.

Same model, same quant. The only thing that changed was how much text you fed it.
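Here is the same budget as a small sketch: weights plus KV cache, with the 10% overhead on top, checked against a few common card sizes. The function name `estimate_total_gib` and the card list are illustrative assumptions, and the comparison treats a card's advertised size as fully usable, which is optimistic.

```python
GIB = 1024 ** 3


def estimate_total_gib(weights_gib: float, layers: int, kv_dim: int,
                       context_length: int, kv_bytes_per_element: int = 2,
                       overhead: float = 0.10) -> float:
    """(weights + KV cache) with a flat overhead fraction added on top."""
    kv_gib = 2 * layers * kv_dim * context_length * kv_bytes_per_element / GIB
    return (weights_gib + kv_gib) * (1 + overhead)


CARD_SIZES_GB = (8, 12, 24)  # illustrative card sizes, not an exhaustive list

# Llama 3 8B at Q4_K_M: about 4.3 GB of weights, 32 layers, kv_dim of 1024.
for context_length in (8192, 32768, 131072):
    total = estimate_total_gib(4.3, 32, 1024, context_length)
    fits = [f"{size}GB" for size in CARD_SIZES_GB if total <= size]
    print(f"{context_length:>7} tokens: {total:5.1f} GB total, fits on {fits}")
```

The 10% figure is the rule of thumb from above passed in as a default, so it is easy to raise if your runtime turns out to be hungrier.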

Gemma 2 9B and Llama 3 8B look like the same weight class. A billion parameters apart, both run on a normal gaming GPU, so you would assume they need about the same VRAM.

Run the math. The weights are close, a touch over 4GB for Llama and about 5GB for Gemma at Q4_K_M. But the KV cache at 8K is roughly 2.6GB for Gemma, not 1GB. Gemma uses a larger head dimension and more layers, so its kv_dim is double Llama's and it has ten more layers to cache. Total comes out around 8.4GB, versus Llama's 5.8GB.

A billion more parameters, but about 2.5GB more VRAM, most of it hiding in the KV cache. You would never guess that from the parameter count, and it is exactly the kind of thing that turns "should fit" into an OOM at the worst moment.
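To see where the difference comes from, here is a side-by-side sketch using the same formula. The shape values are assumptions taken from the models' published configurations as I understand them (Llama 3 8B: 32 layers, 8 KV heads of dimension 128; Gemma 2 9B: 42 layers, 8 KV heads of dimension 256), and the `ModelShape` class is a hypothetical helper, not part of any library.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass(frozen=True)
class ModelShape:
    name: str
    layers: int
    kv_heads: int
    head_dim: int
    weights_gib: float  # Q4_K_M weight size quoted in the text

    @property
    def kv_dim(self) -> int:
        # KV dimension per layer is KV heads x head dimension.
        return self.kv_heads * self.head_dim

    def kv_gib(self, context_length: int, bytes_per_element: int = 2) -> float:
        return (2 * self.layers * self.kv_dim * context_length
                * bytes_per_element / GIB)

    def total_gib(self, context_length: int, overhead: float = 0.10) -> float:
        return (self.weights_gib + self.kv_gib(context_length)) * (1 + overhead)


# Assumed shapes; check each model's config before relying on them.
LLAMA3_8B = ModelShape("Llama 3 8B", layers=32, kv_heads=8, head_dim=128,
                       weights_gib=4.3)
GEMMA2_9B = ModelShape("Gemma 2 9B", layers=42, kv_heads=8, head_dim=256,
                       weights_gib=5.0)

for model in (LLAMA3_8B, GEMMA2_9B):
    print(f"{model.name}: kv_dim={model.kv_dim}, "
          f"KV at 8K={model.kv_gib(8192):.2f} GB, "
          f"total={model.total_gib(8192):.2f} GB")
```

The head dimension doubles the per-layer cache and the extra ten layers stretch it further, which is how a model with only slightly more parameters ends up with more than twice the KV cache.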

Working this out per model, per quant, per context length got old, so I built a calculator that does it: [LLM VRAM Calculator](https://codeswap.net/llm/llm-vram-calculator/). Pick a model (or punch in your own params, layers, and KV dim), choose a quant and a context length, and it breaks out weights, KV cache, and overhead, then tells you which GPUs it fits on. It runs in the browser, and nothing gets uploaded.

A few things worth knowing once you can see the breakdown:

The rule of thumb I actually use: take the weight size from your quant, add about 1GB of KV per 8K of context for a 7 to 8B model with grouped-query attention like Llama 3 8B (more for Gemma-style architectures), then 10% on top. Or skip the arithmetic and check the calculator before you download 16 gigabytes.
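If you want that rule of thumb as a one-liner you can paste into a REPL, a minimal sketch looks like this. The function name and the `kv_gib_per_8k` parameter are mine. The default of 1.0 matches the Llama 3 8B figure, and you would raise it for architectures with a larger KV cache, such as roughly 2.6 for Gemma 2 9B per the numbers earlier.

```python
def rule_of_thumb_gib(weights_gib: float, context_tokens: int,
                      kv_gib_per_8k: float = 1.0,
                      overhead: float = 0.10) -> float:
    """Weights + (KV per 8K of context) + a flat overhead fraction."""
    kv_gib = kv_gib_per_8k * context_tokens / 8192
    return (weights_gib + kv_gib) * (1 + overhead)


# Llama 3 8B at Q4_K_M with an 8K and a 32K context.
print(f"{rule_of_thumb_gib(4.3, 8192):.1f} GB")
print(f"{rule_of_thumb_gib(4.3, 32768):.1f} GB")

# Gemma 2 9B at Q4_K_M with an 8K context, using the larger KV figure.
print(f"{rule_of_thumb_gib(5.0, 8192, kv_gib_per_8k=2.6):.1f} GB")
```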

If you run something with a wildly different memory profile than the parameter count suggests, I would genuinely like to hear it. Those are the ones worth knowing about before you hit buy on a GPU.
