How Much RAM Do You Really Need to Run LLMs Locally? 2026 Benchmarks

A developer provides a formula for estimating RAM requirements to run large language models locally, explaining that a 7B model at Q4 quantization needs roughly 4.2GB plus overhead. Benchmarks show that a mid-range GPU with 12GB VRAM can achieve 45-70 tokens per second for 7B models, while CPU-only systems manage 5-9 tok/s. The developer recommends Q4_K_M quantization as the best balance of size and quality, and notes that Apple Silicon's unified memory offers an advantage over typical RAM-only PCs.

"Will it run on my machine?" is the first question everyone asks before pulling a model with Ollama. The honest answer is a formula, not a yes or no. Here's how to estimate memory before you download 9GB you can't fit, plus what to actually expect for speed on the hardware you already own. A model's memory footprint is roughly: RAM = parameters in billions bytes per parameter + overhead Bytes per parameter depends on quantization more on that below . For the common Q4 quantization, figure about 0.55 to 0.65 GB per billion parameters once you include the KV cache and runtime overhead. Ollama's default quants land here. So a 7B model at Q4 needs roughly 7 0.6 ≈ 4.2GB , and in practice Ollama reports qwen2.5-coder:7b at 4.7GB on disk, which is close to what it occupies in memory. The overhead grows with your context window: a long prompt fills the KV cache and adds anywhere from a few hundred MB to a couple of GB. Plan for headroom, not a tight fit. Models are trained in 16-bit floats. Quantization shrinks each weight to fewer bits so the model fits in less memory. You trade a little quality for a lot of RAM. | Quant | Bits/param | ~GB per 1B params | Quality | |---|---|---|---| | FP16 | 16 | ~2.0 | Full, reference | | Q8 0 | 8 | ~1.1 | Nearly lossless | | Q5 K M | ~5.5 | ~0.75 | Very good | | Q4 K M | ~4.5 | ~0.6 | Good the sweet spot | | Q3 K M | ~3.5 | ~0.5 | Noticeable degradation | | Q2 K | ~2.5 | ~0.4 | Often too lossy to trust | The K M suffix means "K-quant, medium": a smarter scheme that keeps the important weights at higher precision and squeezes the rest. Q4 K M is the default for most Ollama models because it's the best balance: roughly a quarter of the FP16 size with quality most people can't distinguish in normal use. My take: don't go below Q4 unless you're desperate for space. The jump from Q4 to Q3 buys you a little RAM and costs you real coherence, especially on code. To pull a specific quant in Ollama: ollama pull qwen2.5-coder:7b-instruct-q4 K M ollama pull qwen2.5-coder:7b-instruct-q8 0 This is the part beginners miss. There are two kinds of memory that matter, and which one you have changes everything about speed. Ollama loads as much of the model as fits in VRAM and runs the rest on CPU. A model that's half in VRAM and half in RAM runs at roughly the speed of the slow half, so partial offload helps less than you'd hope. The goal is to fit the entire model in VRAM. That's why a $300 used 12GB GPU often beats a $2000 laptop with 64GB of RAM for inference: the RAM is plenty, but without VRAM the CPU is the bottleneck. Apple Silicon is the exception. Unified memory means the GPU and CPU share one fast pool, so an M-series Mac with 16GB or more punches well above a typical RAM-only PC. These are representative figures from my own machines and what I see consistently reported, not lab results. Treat them as "what to expect," plus or minus a chunk depending on your exact CPU, RAM speed, and GPU. CPU numbers assume a recent multi-core desktop/laptop chip; GPU numbers assume a mid-range card roughly an RTX 3060/4060 class, 8 to 12GB VRAM with the model fully offloaded. | Model | Params | Q4 size | RAM to run | CPU tok/s | Mid GPU tok/s | |---|---|---|---|---|---| | qwen2.5-coder:1.5b | 1.5B | ~1.0GB | 4GB+ | 15 to 30 | 80 to 130 | | mistral:7b | 7B | ~4.1GB | 8GB+ | 5 to 9 | 45 to 70 | | qwen2.5-coder:7b | 7B | ~4.7GB | 8GB+ | 5 to 9 | 45 to 70 | | llama3.1:8b | 8B | ~4.7GB | 8GB+ | 4 to 8 | 40 to 65 | | deepseek-coder-v2 | 16B MoE | ~8.9GB | 16GB+ | 8 to 14 | 50 to 80 | A few notes that matter: deepseek-coder-v2 is a mixture-of-experts model.You can run 7B models, but only just. The OS and browser already eat 3 to 4GB, so a 4.7GB model leaves you scraping. Realistically: qwen2.5-coder:1.5b ~1.0GB . Fast on CPU, leaves room for everything else. mistral:7b or qwen2.5-coder:7b , but close your browser tabs first and expect 5 to 9 tok/s.Keep your context window modest. A 16K context fills the KV cache and can push you over the edge on 8GB. This is the comfortable RAM-only tier and where most developers sit. qwen2.5-coder:7b ~4.7GB . Plenty of headroom, good code quality. deepseek-coder-v2 ~8.9GB fits with room to spare and runs respectably thanks to MoE.Speed is still CPU-bound here single digits to low teens tok/s , so use the 7B for "thinking" tasks and the 1.5B for anything interactive. Now it's a different machine. If you have 8 to 12GB of VRAM, every model in the table above fits fully on the GPU and flies. qwen2.5-coder:7b fully offloaded, 45 to 70 tok/s. Feels like a hosted API. deepseek-coder-v2 for harder code tasks, still fast.Check what Ollama is actually doing: ollama ps The PROCESSOR column tells you the split. 100% GPU is what you want. If it says 50%/50% CPU/GPU , your model is too big for VRAM and you're leaving speed on the table. Drop to a smaller quant or a smaller model until it's fully on the GPU. Before you pull anything: You need less RAM than you think and more VRAM than you have. A 16GB machine with no GPU runs 7B code models comfortably, which covers most real developer work. A cheap GPU with 8 to 12GB of VRAM matters more than doubling your system RAM, because it turns a tolerable tool into an instant one. Start with qwen2.5-coder:1.5b to learn the workflow, move to 7b when you want quality, and only chase bigger models once you've got the VRAM to fit them. Everything I build, including spectr-ai https://github.com/pavelEspitia/spectr-ai , runs fine on a 16GB box with the 7B model. Local is more capable than the hardware fear suggests.