# GGUF vs. GPTQ vs. AWQ: The Plain-English Guide to LLM Quantization

> Source: <https://vettedconsumer.com/gguf-vs-gptq-vs-awq-the-plain-english-guide-to-llm-quantization-and-which-one-to-pick/>
> Published: 2026-06-06 14:58:11+00:00

If you have spent any time trying to run a large language model on your own machine, you have hit the same wall everyone does: the model is enormous and your VRAM is not. A 70-billion-parameter model in its native 16-bit precision wants about **140 GB** of memory just to hold the weights. Almost nobody has that. Quantization is the trick that closes the gap — and it is also where the jargon avalanche begins. GGUF. GPTQ. AWQ. Q4_K_M. NF4. EXL2.

This guide is the version we wish existed when we started: what quantization actually does, what each of the major formats is really for, the honest trade-offs, and a decision table you can use in thirty seconds. No hand-waving, no assuming you already read the papers.

## What quantization actually is

A model's "weights" are just a giant pile of numbers. By default each one is stored at 16-bit precision (FP16 or BF16) — two bytes per weight. Quantization stores those same numbers using *fewer bits*: 8, 5, 4, sometimes as low as 2. Fewer bits per weight means a smaller file and less memory, at the cost of some precision.

The memory math is refreshingly simple. Multiply the parameter count by the bytes per weight:

**FP16 (16-bit):** 2 bytes/weight → a 70B model needs ~140 GB**8-bit:**~1 byte/weight → ~70–75 GB** 4-bit:**~0.5 byte/weight → ~40 GB

That single jump from 16-bit to 4-bit is what turns "needs a data-center GPU" into "runs on a 48 GB card, or a unified-memory box." The surprising part — and the reason quantization is everywhere — is that a well-done 4-bit model is **shockingly close in quality** to the original. The degradation is real but small, and for most use it is invisible. We break the exact numbers down in our companion piece on [how much VRAM you actually need for a 70B model](https://vettedconsumer.com/how-much-vram-do-you-actually-need-to-run-a-70b-model-locally/).

## The three big formats

### GGUF — the one most people should use

GGUF is the file format used by **llama.cpp** (and everything built on it: Ollama, LM Studio, Jan, KoboldCpp). It is the successor to the older GGML format. If you download a model from Hugging Face and the filename ends in `.gguf`

, this is what you have.

GGUF's superpower is **flexibility**. It runs on CPU, GPU, or a mix of both — you can offload as many layers to your GPU as fit and let the CPU handle the rest. That is why it is the default for Mac users (Apple Silicon via Metal) and for anyone whose model does not quite fit in VRAM. It also ships in a huge range of quantization levels, the "k-quants," which is where the cryptic suffixes come from:

**Q8_0**— ~8.5 bits/weight, essentially lossless. Use it when you have the memory and want zero compromise.** Q6_K**— ~6.6 bpw, near-indistinguishable from full precision.** Q5_K_M**— ~5.7 bpw, a high-quality middle ground.** Q4_K_M**— ~4.8 bpw. This is the** community default**and the sweet spot: about a 1% perplexity hit for roughly a third of the original size.** Q3_K_M**— ~3.9 bpw. Noticeably more degraded, but usable when memory is tight.** Q2_K**— ~3.4 bpw. The "I just want it to load at all" tier. Quality drops meaningfully; treat it as a last resort.

The letter suffix matters: `_M`

(medium) and `_S`

(small) trade a sliver of quality for size. The newer **I-quants** (IQ4_XS, IQ3_M, etc.) squeeze out a bit more quality per byte using importance-matrix calibration, at the cost of slightly slower inference on some hardware.

**Use GGUF if:** you are on a Mac, you are mixing CPU and GPU, you want the widest model selection, or you simply want the path of least resistance. For most readers, the honest answer is "start here."

### GPTQ — the GPU-native 4-bit standard

GPTQ is a post-training quantization method introduced by Frantar et al. in 2022 ([arXiv:2210.17323](https://arxiv.org/abs/2210.17323?ref=vettedconsumer.com), later presented at ICLR 2023). Rather than naively rounding every weight, it uses approximate second-order (Hessian) information to quantize weights one column at a time while compensating for the error introduced — a one-shot process that runs in a few GPU-hours even for huge models.

The practical point: GPTQ is **weight-only, GPU-only, and fast at inference**. It shines when the whole model fits in VRAM and you are serving it through a GPU runtime. It was the dominant 4-bit format on Hugging Face for a long time and is widely supported by serving stacks. Its weakness is that it does not gracefully spill to CPU the way GGUF does — it is an all-in-VRAM format.

**Use GPTQ if:** your model fits entirely in GPU memory and you want a mature, well-supported 4-bit format for GPU serving.

### AWQ — the accuracy-focused challenger

AWQ (Activation-aware Weight Quantization) comes from Lin et al. ([arXiv:2306.00978](https://arxiv.org/abs/2306.00978?ref=vettedconsumer.com), MLSys 2024 best paper). Its insight is clever: not all weights matter equally. A small fraction (~1%) of "salient" weight channels — identified by looking at the *activations* flowing through them, not the weights themselves — carry an outsized share of the model's quality. AWQ protects those channels by scaling them before quantizing, so the important parts survive 4-bit compression nearly intact.

In practice AWQ often **matches or beats GPTQ on accuracy at the same bit width**, and it is a first-class citizen in high-throughput serving engines like [vLLM](https://vettedconsumer.com/ollama-vs-lm-studio-vs-llama-cpp-which-local-llm-runtime-should-you-actually-use/). Like GPTQ, it is GPU-oriented and assumes the model lives in VRAM.

**Use AWQ if:** you are serving on a GPU (especially via vLLM) and want the best quality you can get at 4-bit.

## The honorable mentions

**bitsandbytes / NF4**— the 4-bit NormalFloat format from the QLoRA paper (Dettmers et al.,[arXiv:2305.14314](https://arxiv.org/abs/2305.14314?ref=vettedconsumer.com)). It is the go-to for*fine-tuning*quantized models on consumer GPUs, and it quantizes on the fly. Great for training; usually not your first pick for pure inference.**EXL2 (ExLlamaV2)**— a GPU-only format with*variable*bitrate (you can target, say, 4.65 bpw). Extremely fast and memory-efficient on NVIDIA cards; beloved by people optimizing for tokens-per-second on a single GPU.

## The decision table

| Your situation | Use this |
|---|---|
| Mac / Apple Silicon | GGUF (Q4_K_M or higher) |
| Model is bigger than your VRAM | GGUF (CPU+GPU offload) |
| Fits in one NVIDIA GPU, want max quality | AWQ |
| High-throughput serving (vLLM) | AWQ or GPTQ |
| Single-GPU, chasing raw speed | EXL2 |
| Fine-tuning on a consumer card | bitsandbytes / NF4 |
| "Just give me the safe default" | GGUF Q4_K_M |

## Which bit level should you pick?

This is the question that actually matters day to day. The consensus that has held up across countless community tests:

**Q4_K_M is the default for a reason.** It is the best quality-per-gigabyte for most people — roughly a 1% perplexity increase over full precision while cutting size by ~65%.**If you have spare memory, step up** to Q5_K_M or Q6_K before you reach for a*larger*model at a*lower*quant. A 70B at Q4 generally beats a 70B at Q2.**But a bigger model at moderate quant usually beats a smaller model at high quant.** A 70B at Q4_K_M will typically outperform a 13B at Q8. When in doubt, go bigger-model-lower-quant rather than smaller-model-higher-quant — down to about Q4. Below Q3, the math flips.**Avoid Q2 unless you have no other choice.** The quality cliff steepens fast below ~3 bits.

Here is what those choices look like in real memory for a 70B model (weights only; add headroom for context/KV cache):

| Quant | ~Bits/weight | ~VRAM (70B) | Verdict |
|---|---|---|---|
| FP16 | 16 | ~140 GB | Reference, rarely run locally |
| Q8_0 | 8.5 | ~74 GB | Lossless, if you can afford it |
| Q6_K | 6.6 | ~58 GB | Near-perfect |
| Q5_K_M | 5.7 | ~50 GB | Excellent |
| Q4_K_M | 4.8 | ~43 GB | Sweet spot |
| Q3_K_M | 3.9 | ~35 GB | Tight-memory option |
| Q2_K | 3.4 | ~29 GB | Last resort |

## Why low-bit quantization hurts reasoning more than recall

Here is the nuance the bit-level table hides: quantization does not degrade *all* tasks equally. The damage is uneven — and understanding why will keep you from picking the wrong quant for your actual workload.

Mechanically, quantization adds a small amount of rounding noise to every weight. For a **single-step task** — recalling a fact, answering a multiple-choice question, finishing a sentence — that noise is usually swamped by the model's confidence; the right answer still wins. This is why aggressive quants often look almost lossless on short benchmarks like HellaSwag. The signal survives.

For **multi-step reasoning** — chains of logic, code that has to compile, tool calls that must be formatted exactly, long-context tasks where attention has to track many details at once — that same small noise *compounds*. A tiny error in step one shifts step two, which shifts step three, and by the end the chain has drifted off course. The model did not "forget" anything; it accumulated rounding error across a long path. Long context makes it worse, because attention has more competing details to keep straight and low-bit weights blur the distinctions.

This matches what owners report. The recurring consensus on r/LocalLLaMA is that 4-bit is fine for chat but worth a second thought before you trust it with precise, multi-step work:

"Fine for chatting but don't use it for actual data work."— u/autonomousdev_, after Q8 caught two edge cases in an extraction task that Q4 missed

"Aggressive quants (<Q6) do fine on benchmarks that often use 4k contexts. They fall apart in real life, when you need to handle 30k or more."— u/ClearApartment2627

The practical upshot:

**Casual chat, brainstorming, summarizing, simple Q&A:** Q4_K_M is genuinely hard to tell apart from full precision. Save the VRAM.**Coding, agentic tool use, structured extraction, long documents:** step up to Q5_K_M or Q6_K if it fits. The extra bits buy reliability exactly where quantization noise does the most damage.**Watch your KV cache, too.** Quantizing the context cache to save memory is a*separate*knob from weight quantization — and at long context it often hurts more than the weight quant does. If a model falls apart past ~16–32k tokens, suspect KV-cache quantization before you blame the weights.

None of this contradicts "Q4_K_M is the default." It refines it: Q4 is the default for general use, and a deliberate step up is cheap insurance for the workloads where every step has to land.

## Sources & how we researched this

This explainer synthesizes the original method papers — GPTQ ([Frantar et al., 2022](https://arxiv.org/abs/2210.17323?ref=vettedconsumer.com)), AWQ ([Lin et al., 2023](https://arxiv.org/abs/2306.00978?ref=vettedconsumer.com)), and QLoRA/NF4 ([Dettmers et al., 2023](https://arxiv.org/abs/2305.14314?ref=vettedconsumer.com)) — alongside the [llama.cpp](https://github.com/ggml-org/llama.cpp?ref=vettedconsumer.com) documentation and quantization tables for the GGUF k-quant bits-per-weight figures. The bit-level guidance reflects the broad, durable consensus from the r/LocalLLaMA community and maintainer benchmarks, not first-hand testing on our part. VRAM figures are weights-only estimates rounded for clarity; real-world usage adds memory for the KV cache and context window, which we cover separately.
