GGUF vs. GPTQ vs. AWQ: The Plain-English Guide to LLM Quantization

GGUF, GPTQ, and AWQ are the three dominant formats for running quantized large language models locally, each optimized for different hardware and use cases. GGUF, the format used by llama.cpp and its derivatives, is the most flexible: it supports CPU, GPU, or mixed execution, making it the default choice for Mac users and anyone whose model does not fit entirely in VRAM. GPTQ is a GPU-native 4-bit standard that uses Hessian-based calibration to minimize quality loss, while AWQ is a GPU-oriented alternative that uses activation statistics to protect the most important weight channels, often matching or beating GPTQ on accuracy at the same bit width.

If you have spent any time trying to run a large language model on your own machine, you have hit the same wall everyone does: the model is enormous and your VRAM is not. A 70-billion-parameter model in its native 16-bit precision wants about 140 GB of memory just to hold the weights. Almost nobody has that. Quantization is the trick that closes the gap, and it is also where the jargon avalanche begins. GGUF. GPTQ. AWQ. Q4_K_M. NF4. EXL2.

This guide is the version we wish existed when we started: what quantization actually does, what each of the major formats is really for, the honest trade-offs, and a decision table you can use in thirty seconds. No hand-waving, no assuming you already read the papers.

What quantization actually is

A model's "weights" are just a giant pile of numbers. By default each one is stored at 16-bit precision (FP16 or BF16), which is two bytes per weight. Quantization stores those same numbers using fewer bits: 8, 5, 4, sometimes as low as 2. Fewer bits per weight means a smaller file and less memory, at the cost of some precision.

The memory math is refreshingly simple. Multiply the parameter count by the bytes per weight:

- FP16 (16-bit): 2 bytes/weight → a 70B model needs ~140 GB
- 8-bit: ~1 byte/weight → ~70–75 GB
- 4-bit: ~0.5 byte/weight → ~40 GB (real 4-bit formats store scale metadata alongside the weights, so files land above the naive 35 GB)

That single jump from 16-bit to 4-bit is what turns "needs a data-center GPU" into "runs on a 48 GB card, or a unified-memory box."

The surprising part, and the reason quantization is everywhere, is that a well-done 4-bit model is shockingly close in quality to the original. The degradation is real but small, and for most use it is invisible. We break the exact numbers down in our companion piece on how much VRAM you actually need for a 70B model (https://vettedconsumer.com/how-much-vram-do-you-actually-need-to-run-a-70b-model-locally/).

The three big formats

GGUF: the one most people should use

GGUF is the file format used by llama.cpp and everything built on it (Ollama, LM Studio, Jan, KoboldCpp).
It is the successor to the older GGML format. If you download a model from Hugging Face and the filename ends in .gguf, this is what you have.

GGUF's superpower is flexibility. It runs on CPU, GPU, or a mix of both: you can offload as many layers to your GPU as fit and let the CPU handle the rest. That is why it is the default for Mac users (Apple Silicon via Metal) and for anyone whose model does not quite fit in VRAM.

It also ships in a huge range of quantization levels, most of them "k-quants," which is where the cryptic suffixes come from:

- Q8_0: ~8.5 bits/weight (bpw), essentially lossless. Use it when you have the memory and want zero compromise.
- Q6_K: ~6.6 bpw, near-indistinguishable from full precision.
- Q5_K_M: ~5.7 bpw, a high-quality middle ground.
- Q4_K_M: ~4.8 bpw. This is the community default and the sweet spot: about a 1% perplexity hit for roughly a third of the original size.
- Q3_K_M: ~3.9 bpw. Noticeably more degraded, but usable when memory is tight.
- Q2_K: ~3.4 bpw. The "I just want it to load at all" tier. Quality drops meaningfully; treat it as a last resort.

The letter suffix matters: M (medium) and S (small) are variants of the same level, with S trading a sliver of quality for a smaller file. The newer I-quants (IQ4_XS, IQ3_M, etc.) squeeze out a bit more quality per byte using importance-matrix calibration, at the cost of slightly slower inference on some hardware.

Use GGUF if: you are on a Mac, you are mixing CPU and GPU, you want the widest model selection, or you simply want the path of least resistance. For most readers, the honest answer is "start here."

GPTQ: the GPU-native 4-bit standard

GPTQ is a post-training quantization method introduced by Frantar et al. in 2022 (arXiv:2210.17323, https://arxiv.org/abs/2210.17323?ref=vettedconsumer.com; later presented at ICLR 2023).
Rather than naively rounding every weight, it uses approximate second-order (Hessian) information to quantize weights one column at a time while compensating for the error introduced. It is a one-shot process that runs in a few GPU-hours even for huge models.

The practical point: GPTQ is weight-only, GPU-only, and fast at inference. It shines when the whole model fits in VRAM and you are serving it through a GPU runtime. It was the dominant 4-bit format on Hugging Face for a long time and is widely supported by serving stacks. Its weakness is that it does not gracefully spill to CPU the way GGUF does; it is an all-in-VRAM format.

Use GPTQ if: your model fits entirely in GPU memory and you want a mature, well-supported 4-bit format for GPU serving.

AWQ: the accuracy-focused challenger

AWQ (Activation-aware Weight Quantization) comes from Lin et al. (arXiv:2306.00978, https://arxiv.org/abs/2306.00978?ref=vettedconsumer.com; MLSys 2024 best paper). Its insight is clever: not all weights matter equally. A small fraction (~1%) of "salient" weight channels, identified by looking at the activations flowing through them rather than the weights themselves, carry an outsized share of the model's quality. AWQ protects those channels by scaling them before quantizing, so the important parts survive 4-bit compression nearly intact.

In practice AWQ often matches or beats GPTQ on accuracy at the same bit width, and it is a first-class citizen in high-throughput serving engines like vLLM (https://vettedconsumer.com/ollama-vs-lm-studio-vs-llama-cpp-which-local-llm-runtime-should-you-actually-use/). Like GPTQ, it is GPU-oriented and assumes the model lives in VRAM.

Use AWQ if: you are serving on a GPU (especially via vLLM) and want the best quality you can get at 4-bit.

The honorable mentions

bitsandbytes / NF4: the 4-bit NormalFloat format from the QLoRA paper (Dettmers et al., arXiv:2305.14314, https://arxiv.org/abs/2305.14314?ref=vettedconsumer.com).
It is the go-to for fine-tuning quantized models on consumer GPUs, and it quantizes on the fly. Great for training; usually not your first pick for pure inference.

EXL2 (ExLlamaV2): a GPU-only format with variable bitrate (you can target, say, 4.65 bpw). Extremely fast and memory-efficient on NVIDIA cards; beloved by people optimizing for tokens-per-second on a single GPU.

The decision table

| Your situation | Use this |
|---|---|
| Mac / Apple Silicon | GGUF (Q4_K_M or higher) |
| Model is bigger than your VRAM | GGUF (CPU+GPU offload) |
| Fits in one NVIDIA GPU, want max quality | AWQ |
| High-throughput serving (vLLM) | AWQ or GPTQ |
| Single-GPU, chasing raw speed | EXL2 |
| Fine-tuning on a consumer card | bitsandbytes / NF4 |
| "Just give me the safe default" | GGUF Q4_K_M |

Which bit level should you pick?

This is the question that actually matters day to day. The consensus that has held up across countless community tests:

- Q4_K_M is the default for a reason. It is the best quality-per-gigabyte for most people: roughly a 1% perplexity increase over full precision while cutting size by ~70%.
- If you have spare memory, step up to Q5_K_M or Q6_K before you reach for a larger model at a lower quant. A 70B at Q4 generally beats a 70B at Q2.
- But a bigger model at moderate quant usually beats a smaller model at high quant. A 70B at Q4_K_M will typically outperform a 13B at Q8. When in doubt, go bigger-model-lower-quant rather than smaller-model-higher-quant, down to about Q4. Below Q3, the math flips.
- Avoid Q2 unless you have no other choice. The quality cliff steepens fast below ~3 bits.
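
The weights-only memory arithmetic behind these bit levels is simple enough to sketch in a few lines of Python. This is a rough estimator, not a measurement: the function name `weights_gb` and the `BPW` table are our own illustrative choices, the bits-per-weight values are the approximate figures quoted in this guide, and the result ignores KV cache and runtime overhead.

```python
# Approximate bits per weight for common GGUF quant levels.
# These are the rounded figures quoted in this guide, not exact file sizes.
BPW = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q6_K": 6.6,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.8,
    "Q3_K_M": 3.9,
    "Q2_K": 3.4,
}


def weights_gb(params_billion: float, quant: str) -> float:
    """Estimate weights-only memory in GB (1 GB = 1e9 bytes)."""
    bits = BPW[quant]
    # params * bits / 8 gives bytes; "billions of params" cancels
    # against "1e9 bytes per GB", so no further scaling is needed.
    return params_billion * bits / 8


if __name__ == "__main__":
    for quant in BPW:
        print(f"70B at {quant}: ~{weights_gb(70, quant):.0f} GB")
```

By this estimate a 70B model at Q4_K_M needs about 42 GB for weights, while a 13B model at Q8_0 needs about 14 GB.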
Here is what those choices look like in real memory for a 70B model (weights only; add headroom for context/KV cache):

| Quant | ~Bits/weight | ~VRAM (70B) | Verdict |
|---|---|---|---|
| FP16 | 16 | ~140 GB | Reference, rarely run locally |
| Q8_0 | 8.5 | ~74 GB | Lossless, if you can afford it |
| Q6_K | 6.6 | ~58 GB | Near-perfect |
| Q5_K_M | 5.7 | ~50 GB | Excellent |
| Q4_K_M | 4.8 | ~43 GB | Sweet spot |
| Q3_K_M | 3.9 | ~35 GB | Tight-memory option |
| Q2_K | 3.4 | ~29 GB | Last resort |

Why low-bit quantization hurts reasoning more than recall

Here is the nuance the bit-level table hides: quantization does not degrade all tasks equally. The damage is uneven, and understanding why will keep you from picking the wrong quant for your actual workload.

Mechanically, quantization adds a small amount of rounding noise to every weight. For a single-step task (recalling a fact, answering a multiple-choice question, finishing a sentence) that noise is usually swamped by the model's confidence; the right answer still wins. This is why aggressive quants often look almost lossless on short benchmarks like HellaSwag. The signal survives.

For multi-step reasoning (chains of logic, code that has to compile, tool calls that must be formatted exactly, long-context tasks where attention has to track many details at once) that same small noise compounds. A tiny error in step one shifts step two, which shifts step three, and by the end the chain has drifted off course. The model did not "forget" anything; it accumulated rounding error across a long path. Long context makes it worse, because attention has more competing details to keep straight and low-bit weights blur the distinctions.

This matches what owners report.
The recurring consensus on r/LocalLLaMA is that 4-bit is fine for chat but worth a second thought before you trust it with precise, multi-step work:

"Fine for chatting but don't use it for actual data work." (u/autonomousdev, after Q8 caught two edge cases in an extraction task that Q4 missed)

"Aggressive quants (<Q6) do fine on benchmarks that often use 4k contexts. They fall apart in real life, when you need to handle 30k or more." (u/ClearApartment2627)

The practical upshot:

- Casual chat, brainstorming, summarizing, simple Q&A: Q4_K_M is genuinely hard to tell apart from full precision. Save the VRAM.
- Coding, agentic tool use, structured extraction, long documents: step up to Q5_K_M or Q6_K if it fits. The extra bits buy reliability exactly where quantization noise does the most damage.
- Watch your KV cache, too. Quantizing the context cache to save memory is a separate knob from weight quantization, and at long context it often hurts more than the weight quant does. If a model falls apart past ~16–32k tokens, suspect KV-cache quantization before you blame the weights.

None of this contradicts "Q4_K_M is the default." It refines it: Q4 is the default for general use, and a deliberate step up is cheap insurance for the workloads where every step has to land.

Sources & how we researched this

This explainer synthesizes the original method papers: GPTQ (Frantar et al., 2022, https://arxiv.org/abs/2210.17323?ref=vettedconsumer.com), AWQ (Lin et al., 2023, https://arxiv.org/abs/2306.00978?ref=vettedconsumer.com), and QLoRA/NF4 (Dettmers et al., 2023, https://arxiv.org/abs/2305.14314?ref=vettedconsumer.com). It also draws on the llama.cpp (https://github.com/ggml-org/llama.cpp?ref=vettedconsumer.com) documentation and quantization tables for the GGUF k-quant bits-per-weight figures. The bit-level guidance reflects the broad, durable consensus from the r/LocalLLaMA community and maintainer benchmarks, not first-hand testing on our part.
VRAM figures are weights-only estimates rounded for clarity; real-world usage adds memory for the KV cache and context window, which we cover separately.