{"slug": "how-i-implemented-gptq-from-scratch-and-what-i-learned", "title": "How I Implemented GPTQ from Scratch (and What I Learned)", "summary": "A developer implemented GPTQ quantization from scratch on a nanoGPT model, achieving only 1.1% perplexity degradation across 61 quantized layers. The implementation uses second-order optimization to redistribute quantization errors, preserving model accuracy better than naive post-training quantization.", "body_md": "I implemented GPTQ from scratch on a nanoGPT model and got only 1.1% perplexity degradation across 61 quantized layers. Here's exactly how it works and what I built.\n\nQuantization is one of the simplest and most effective ways to reduce the cost of running neural networks. Instead of storing weights in 32-bit floating point format, we reduce them to lower precision like INT8 or INT4. This reduces memory usage and can significantly speed up inference on hardware that supports low-precision arithmetic.\n\nThe simplest approach is Post-Training Quantization (PTQ), where each weight is independently rounded to the nearest quantized value. While this is fast and easy to implement, it ignores an important fact: neural network weights are not independent. Each weight contributes to a shared output, and small perturbations in one weight can interact with others in non-trivial ways.\n\nBecause of this, naive PTQ often introduces noticeable accuracy degradation. Some layers are extremely sensitive, and uniform rounding treats all weights as equally important. In practice, this leads to compounding errors across layers, especially in transformers where representations are tightly coupled.\n\nThis is the core problem: we need a way to quantize weights that respects the structure of the network, not just their individual magnitudes.\n\nGPTQ approaches quantization as a **local optimization problem per layer**, rather than a simple rounding operation. 
Instead of treating each weight independently, it asks: *how does changing this weight affect the layer’s output, and how should we compensate for it?*\n\nTo answer this, GPTQ uses a second-order approximation of the layer-wise reconstruction loss. The key object is the Hessian matrix, which measures how sensitive that loss is to changes in each pair of weights. Intuitively, it tells us which directions in weight space are “steep” and which are “flat.”\n\nBecause the per-layer objective is a squared error on the layer’s output, the Hessian depends only on the layer’s inputs. GPTQ can therefore estimate it directly from calibration data instead of backpropagating through the whole network (which is expensive). This gives us a compact representation of how perturbations in one weight influence others.\n\nThe key idea is error propagation. When a weight column is quantized, it introduces an error. Instead of leaving that error isolated, GPTQ distributes it across the remaining unquantized weights in proportion to their sensitivity. This prevents error accumulation and preserves the layer’s output behaviour much more closely than naive rounding.\n\nThe first step was collecting activation statistics for each linear layer. I used forward hooks in PyTorch to capture the inputs arriving at every `nn.Linear` layer during a calibration phase. This gave me a dataset of representative activations without modifying the model’s forward pass.\n\nFrom these activations, I constructed an approximation of the Hessian matrix using:\n\nH = 2 * Xᵀ X\n\nwhere X is the matrix of collected inputs, with one input vector per row. This step is critical because it encodes the geometry of how inputs interact with each weight column. Without calibration data, the algorithm has no notion of which directions in weight space matter most.\n\nOnce the Hessian was computed, I added damping for numerical stability and inverted it using Cholesky decomposition. This inverse Hessian is what allows GPTQ to propagate error across columns efficiently.\n\nThe core quantization loop processes each weight column sequentially. 
For each column, I quantize it, compute the resulting error, and then adjust the remaining unquantized columns to compensate. The update rule is:\n\n```\nW[:, j+1:] -= err_j * H_inv[j, j+1:] / H_inv[j, j]\n```\n\nHere `err_j` is the quantization error of column `j` (original minus quantized weights), treated as a column so that it broadcasts against the row `H_inv[j, j+1:]`. This line is the heart of GPTQ. It ensures that the error introduced by quantizing one column is redistributed according to the curvature of the loss landscape, rather than accumulating locally.\n\nAfter applying GPTQ to a nanoGPT-style model with 61 linear layers, the degradation in performance was surprisingly small given the simplicity of the implementation.\n\n| Model | Loss | Perplexity |\n|---|---|---|\n| Baseline | 1.8521 | 6.37 |\n| GPTQ | 1.8623 | 6.44 |\n\nPerplexity rises from 6.37 to 6.44, roughly a 1.1% increase across the full network (the loss itself rises by about 0.55%). Considering that every linear layer was quantized independently using only 10 calibration batches, this is a strong result. It suggests that second-order information is highly effective at preserving model behaviour even under aggressive compression.\n\nThis implementation is correct algorithmically, but it is not optimized for production use. All quantized weights are still stored in float32 after dequantization, meaning there are no actual memory savings at runtime. Production GPTQ implementations store weights in INT4 or INT8 format and dequantize on the fly during inference.\n\nAnother improvement would be per-channel or group-wise quantization instead of per-column quantization. 
This would reduce variance across weight distributions and likely improve accuracy further, especially in deeper transformer layers where activation statistics vary significantly.\n\nFinally, more sophisticated calibration strategies (larger datasets, better sampling) would improve Hessian estimation and reduce approximation error.\n\nYou can find the full implementation, including ptq.py, gptq.py, and benchmark results, at:\n\n[https://github.com/Thoki-Buthelezi/elite-ai-systems-engineer-2026/tree/main/phase2_content/quantization](https://github.com/Thoki-Buthelezi/elite-ai-systems-engineer-2026/tree/main/phase2_content/quantization)", "url": "https://wpnews.pro/news/how-i-implemented-gptq-from-scratch-and-what-i-learned", "canonical_source": "https://dev.to/thokozani_buthelezi_2cd41/how-i-implemented-gptq-from-scratch-and-what-i-learned-39d9", "published_at": "2026-06-27 10:53:17+00:00", "updated_at": "2026-06-27 11:03:58.498036+00:00", "lang": "en", "topics": ["machine-learning", "neural-networks", "large-language-models", "ai-research", "developer-tools"], "entities": ["GPTQ", "nanoGPT", "PyTorch"], "alternates": {"html": "https://wpnews.pro/news/how-i-implemented-gptq-from-scratch-and-what-i-learned", "markdown": "https://wpnews.pro/news/how-i-implemented-gptq-from-scratch-and-what-i-learned.md", "text": "https://wpnews.pro/news/how-i-implemented-gptq-from-scratch-and-what-i-learned.txt", "jsonld": "https://wpnews.pro/news/how-i-implemented-gptq-from-scratch-and-what-i-learned.jsonld"}}