# Show HN: Local LLM Hardware Calculator

> Source: <https://vettedconsumer.com/can-i-run-it/>
> Published: 2026-06-21 11:47:21+00:00


**One step earlier:** not sure you should buy hardware at all? Our [cost calculator](https://vettedconsumer.com/cost-calculator/) compares buying vs renting cloud GPUs vs just paying for an API, with break-even math for your usage.

**Two ways to use it:** leave "Your machine" empty to shop across everything we track, or pick the hardware you already own (or enter its memory) to get a personal verdict. When a model doesn't fit, the verdict names the exact quant, context, or KV-cache change that would make it fit.

## How the estimate works

The tool uses the same math from our guides, shown in the open because that's the point of this site. A model's memory cost has three parts:

- **Weights**: parameters × bits-per-weight ÷ 8. A 70B model at Q4_K_M (~4.8 bits/weight) is about 42 GB. Quantization choices are covered in our [plain-English quantization guide](https://vettedconsumer.com/gguf-vs-gptq-vs-awq-the-plain-english-guide-to-llm-quantization-and-which-one-to-pick/).
- **KV cache**: grows with every token of context. We assume a GQA-typical attention shape and an FP16 cache; the KV-precision selector in the tool shows exactly what a Q8 or Q4 cache saves. Full math in [The KV cache, explained](https://vettedconsumer.com/tag/software-tools/).
- **Overhead**: a flat ~1.5 GB buffer for the runtime and activations.
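
The three parts can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual code: the function names are ours, and the default attention shape (80 layers, 8 KV heads, head dimension 128) is an assumed example of a "GQA-typical" 70B-class model, not a value taken from the calculator.

```python
GB = 1e9  # decimal gigabytes, consistent with the 70B -> ~42 GB example


def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weights: parameters x bits-per-weight / 8, expressed in GB."""
    return params_billion * 1e9 * bits_per_weight / 8 / GB


def kv_cache_gb(context_tokens: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_value: float = 2.0) -> float:
    """KV cache: one K and one V vector per layer per token.

    bytes_per_value=2.0 corresponds to an FP16 cache.
    """
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_tokens * per_token_bytes / GB


def total_gb(params_billion: float, bits_per_weight: float, context_tokens: int,
             n_layers: int = 80, n_kv_heads: int = 8, head_dim: int = 128,
             bytes_per_value: float = 2.0, overhead_gb: float = 1.5) -> float:
    """Weights + KV cache + flat overhead. Shape defaults are assumptions."""
    return (weights_gb(params_billion, bits_per_weight)
            + kv_cache_gb(context_tokens, n_layers, n_kv_heads, head_dim,
                          bytes_per_value)
            + overhead_gb)


if __name__ == "__main__":
    print(f"weights: {weights_gb(70, 4.8):.1f} GB")
    print(f"total at 8192 tokens: {total_gb(70, 4.8, 8192):.1f} GB")
```

Under these assumed shapes, a 70B model at ~4.8 bits/weight with an 8192-token FP16 cache comes to roughly 46 GB, most of it weights. Lowering `bytes_per_value` is how a lower-precision cache would show up in this sketch.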

For Mixture-of-Experts models, *memory* follows total parameters but *speed* follows active parameters. That's why a 120B MoE can be fast on a box that would crawl on a dense 70B. The one-line rule: **buy memory for the total, expect speed from the active** ([MoE, explained](https://vettedconsumer.com/mixture-of-experts-moe-explained-why-active-parameters-decide-what-runs-on-your-machine/)).
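
As a rough illustration of that rule, the sketch below separates the weights that must stay resident in memory from the weights read per generated token. The function name and the 5B active-parameter count for the 120B MoE are hypothetical, chosen only to make the contrast visible; real models publish their own active counts.

```python
def moe_footprint(total_params_b: float, active_params_b: float,
                  bits_per_weight: float) -> dict:
    """Split weight cost into what must fit in memory and what is read per token.

    For a dense model, pass active_params_b equal to total_params_b.
    """
    def to_gb(params_b: float) -> float:
        return params_b * bits_per_weight / 8

    return {
        "resident_gb": to_gb(total_params_b),             # buy memory for the total
        "streamed_per_token_gb": to_gb(active_params_b),  # expect speed from the active
    }


dense_70b = moe_footprint(70, 70, 4.8)
moe_120b = moe_footprint(120, 5, 4.8)  # 5B active is an illustrative assumption

if __name__ == "__main__":
    print("dense 70B:", dense_70b)
    print("120B MoE: ", moe_120b)
```

In this example the MoE needs more memory than the dense model but reads far fewer bytes per token, which is where the speed advantage comes from.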

The "gen ceiling" column is memory bandwidth ÷ bytes streamed per token. It is a theoretical upper bound that follows from token generation being bandwidth-bound, not compute-bound ([why that is](https://vettedconsumer.com/tag/software-tools/)). Real speeds come in below it.
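
The ceiling itself is a single division. The sketch below assumes the bytes streamed per token are just the weights read for that token (it ignores KV-cache reads, which is our simplification), and the 400 GB/s bandwidth figure is a hypothetical machine, not a listing from the tool.

```python
def gen_ceiling_tok_s(bandwidth_gb_s: float, gb_streamed_per_token: float) -> float:
    """Upper bound on tokens/second: memory bandwidth / bytes streamed per token."""
    if gb_streamed_per_token <= 0:
        raise ValueError("gb_streamed_per_token must be positive")
    return bandwidth_gb_s / gb_streamed_per_token


if __name__ == "__main__":
    bandwidth = 400.0  # GB/s, hypothetical machine
    # Dense 70B at ~4.8 bits/weight streams about 42 GB per token.
    print(f"dense 70B ceiling: {gen_ceiling_tok_s(bandwidth, 42.0):.1f} tok/s")
    # An MoE reading ~3 GB of active weights per token (illustrative figure).
    print(f"MoE ceiling:       {gen_ceiling_tok_s(bandwidth, 3.0):.1f} tok/s")
```

Treat the result as a bound to compare machines against each other, not a prediction of measured throughput.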

## Honest limits

These are estimates, not lab measurements. Real usage varies by runtime (llama.cpp vs vLLM vs MLX), KV-cache precision, batch settings, and model architecture. Unified-memory machines share RAM with the OS, so we subtract an 8 GB reserve; discrete GPUs lose ~1 GB to the desktop. When a result says "tight fit," believe it: being within 10% of capacity means long context or background apps will push you over. Hardware listings come from our [methodology](https://vettedconsumer.com/methodology/); affiliate links never influence what appears or how it ranks.
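
The reserve and "tight fit" rules can be sketched as below. This is one plausible reading, not the tool's code: we assume "within 10% of capacity" means the estimate uses more than 90% of the memory left after the reserve, and the function name and the "fits" and "does not fit" labels are ours.

```python
UNIFIED_RESERVE_GB = 8.0   # unified memory is shared with the OS
DISCRETE_RESERVE_GB = 1.0  # approximate desktop usage on a discrete GPU
TIGHT_THRESHOLD = 0.90     # assumed reading of "within 10% of capacity"


def verdict(required_gb: float, memory_gb: float, unified: bool) -> str:
    """Classify an estimate against the memory left after the reserve."""
    reserve = UNIFIED_RESERVE_GB if unified else DISCRETE_RESERVE_GB
    usable_gb = memory_gb - reserve
    if usable_gb <= 0 or required_gb > usable_gb:
        return "does not fit"
    if required_gb / usable_gb > TIGHT_THRESHOLD:
        return "tight fit"
    return "fits"


if __name__ == "__main__":
    need = 46.2  # GB, e.g. a 70B model at ~4.8 bits/weight with 8K context
    print(verdict(need, 64, unified=True))   # 56 GB usable
    print(verdict(need, 48, unified=False))  # 47 GB usable
    print(verdict(need, 48, unified=True))   # 40 GB usable
```

The same 48 GB of memory gives different answers depending on whether it is unified or discrete, which is why the reserve matters as much as the headline capacity.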
