A Black Box Made Less Opaque (part 4)

A new analysis explores how weight quantization affects model performance and sparse autoencoder (SAE) reconstruction of the residual stream, using fraction of variance unexplained (FVU) as a key metric. The study, inspired by Google DeepMind's Gemma 4 models, finds that compression degrades both accuracy and interpretability, with implications for AI safety and reliability.

This is the fourth installment in a series of analyses exploring basic AI interpretability mechanics and techniques. While this analysis is designed to stand on its own, readers interested in a comparative analysis of representational geometry and the effects of manipulating feature activation will likely appreciate a review of part 1 (https://www.lesswrong.com/posts/QRM3q9ZhLDZuxuDbz/a-black-box-made-less-opaque-part-1), part 2 (https://www.lesswrong.com/posts/Qnm6gAFnCPaJsbhSS/a-black-box-made-less-opaque-part-2), and part 3 (https://www.lesswrong.com/posts/2HdHD34QrzGazFJgZ/a-black-box-made-less-opaque-part-3-1) of this series. Readers are welcome to examine and reproduce this analysis via its colab (https://colab.research.google.com/drive/19e3PJ4HDFbS1vYJmgXL7v-0gRKF4R39T?usp=sharing) and supporting GitHub repo (https://github.com/TheBuleGanteng/interpretability pt4).

This series documents my exploration of key concepts related to machine learning (ML) generally and mechanistic interpretability (MI) specifically. The most immediate and tangible objective of this series is to further my own understanding of ML principles and interpretability. More broadly, it is my sincere belief that the greatest inhibitor to fuller realization of AI’s positive societal benefit is not a deficit in model capabilities, an inability to run those models, or a lack of human imagination, but rather our still-rudimentary ability to understand and control model outputs. Any movement along this dimension resulting in more reliable and pliable models could have an outsized impact in increasing AI model use cases, adoption, and, by extension, societal benefit. My hope is to make a contribution, however small, to that effort.

This installment seeks to illustrate the effects of compression on model performance and SAEs’ ability to reconstruct the model’s residual stream, as measured by the fraction of variance unexplained (FVU).
While a low FVU is not synonymous with interpretability (SAEs can faithfully reconstruct a model’s residual stream without correctly identifying features or their activation levels), it is a prerequisite for interpretability. For that reason, readers should consider it an important first step in better understanding the effect of model compression on interpretability.

This line of analysis was largely inspired by Google DeepMind’s Gemma 4 models (https://deepmind.google/models/gemma/gemma-4/), which use an array of novel compression techniques. This article examines a more widely used compression methodology, weight quantization, as the first step in what I hope will be a series examining a range of compression techniques and their effects on model interpretability.

Fundamentally, this installment’s analysis seeks to answer the following question: “How does model weight quantization affect model performance and SAE ability to reconstruct the model’s residual stream?”

In this analysis, model performance was assessed via cross-entropy and perplexity. Comparing the model’s next-token probability distributions against the actual text samples provided a sense of the model’s predictive accuracy at each level of compression tested.

This assessment of model accuracy matters because model accuracy and interpretability are largely independent variables. Put another way, a one-dimensional analysis of the effects of model weight quantization on interpretability is informative, but the practical implications of that analysis will vary depending on the model’s performance under those same levels of compression. An inaccurate-but-interpretable model, for example, carries far different practical safety implications than a model that performs well under weight quantization but is far less interpretable.
The second avenue of analysis undertaken in this article relates to how well the model’s official pretrained residual stream SAEs are able to reconstruct the model’s vectors at various levels of weight quantization. This is primarily measured by FVU, which represents the share of variance in the vectors originally created by the model that is not captured by the vectors encoded-then-decoded by the SAEs.

As a reminder, SAEs are tools that create “simplified” copies of the model’s residual stream vectors. By avoiding the use of “superposition”, in which a single dimension of the vector is associated with multiple, unrelated concepts, SAE vectors allow for a more straightforward, ideally 1:1, relationship between the vector’s dimensions and the model’s features. This process is referred to as “encoding” the residual stream vector.

After encoding the residual stream vector, this experiment then reconstructs the original vector from the expanded one (“decoding”) and compares the reconstructed vector against the original residual stream vector. The degree of mismatch between the encoded-then-decoded SAE vector and the original model’s vector is the FVU. It provides insight into the SAE’s ability to faithfully process the model’s vector and potentially, but not necessarily, accurately identify model features via the expanded version of the vector.

Figure 1: Illustration of SAE encoding and decoding

To explain by analogy, assume one wanted to assess whether an assistant correctly summarized a document. One way to do this is to take the document’s summary and then use the information contained within to recreate the original document. If the document recreated from the summary contained all the same information as the original document, then one could conclude that the summary captured all the relevant information from the original document. The encoding-then-decoding process undertaken in this experiment follows similar logic.
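The encode-then-decode comparison described above can be illustrated with a minimal sketch. Everything in it is a toy assumption rather than the setup used in this analysis: the two-dimensional "residual stream", the four-feature ReLU SAE, and the hand-picked weights are all hypothetical, and the FVU normalization shown (squared reconstruction error divided by the variance of the original vectors around their per-dimension mean) is a common convention rather than a confirmed detail of the colab.

```python
def relu(values):
    """Zero out negative entries so only 'active' features remain."""
    return [max(0.0, v) for v in values]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sae_reconstruct(x, w_enc, b_enc, w_dec, b_dec):
    """Encode x into a wider feature vector, then decode it back."""
    features = relu([h + b for h, b in zip(matvec(w_enc, x), b_enc)])
    return [r + b for r, b in zip(matvec(w_dec, features), b_dec)]


def fvu(originals, reconstructions):
    """Fraction of variance unexplained: residual SS / total SS."""
    n = len(originals)
    dims = len(originals[0])
    mean = [sum(vec[j] for vec in originals) / n for j in range(dims)]
    residual = sum(
        (x - y) ** 2
        for vec, rec in zip(originals, reconstructions)
        for x, y in zip(vec, rec)
    )
    total = sum((x - m) ** 2 for vec in originals for x, m in zip(vec, mean))
    return residual / total


# Hypothetical toy SAE: 2-d inputs expanded into 4 non-negative features.
W_ENC = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
W_DEC = [[1.0, -1.0, 0.0, 0.0], [0.0, 0.0, 1.0, -1.0]]
B_DEC = [0.0, 0.0]

activations = [[1.0, -2.0], [3.0, 0.5], [-1.0, 4.0]]


def fvu_for_encoder_bias(b_enc):
    """Run the toy SAE over all activations and score the reconstruction."""
    recon = [sae_reconstruct(x, W_ENC, b_enc, W_DEC, B_DEC) for x in activations]
    return fvu(activations, recon)


fvu_lossless = fvu_for_encoder_bias([0.0] * 4)   # features keep every detail
fvu_lossy = fvu_for_encoder_bias([-0.5] * 4)     # small activations are dropped
print(f"lossless FVU = {fvu_lossless:.4f}, lossy FVU = {fvu_lossy:.4f}")
```

An FVU of 0 means the decoded vectors match the originals exactly, while an FVU of 1 means the SAE does no better than always predicting the mean vector.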
The methodology employed in this analysis uses two relatively modern open-source models from Google DeepMind: Gemma 3 4B and Gemma 3 12B (https://deepmind.google/models/gemma/gemma-3/), sourced via Hugging Face (https://huggingface.co/collections/google/gemma-3-release). As usual, I sourced the corresponding pretrained residual stream SAEs from SAELens (https://decoderesearch.github.io/SAELens/latest/). I examined the residual stream at layers 17 and 24 for Gemma 3 4B and Gemma 3 12B, respectively, as they represent similar relative positions within each model’s total layer count.

To compress the model weights, I used the bitsandbytes library (https://huggingface.co/docs/bitsandbytes/en/index) to compress each model’s weights into 8-bit and 4-bit format, representing common compression levels in open-weight models. The 8-bit compression utilized bitsandbytes’ LLM.int8() method and the 4-bit compression utilized bitsandbytes’ NF4 datatype. Both compression methodologies were applied as post-training quantization to the bf16 checkpoint (i.e., the uncompressed version of each model) at load time.

Finally, I used NeelNanda/pile-10k (https://huggingface.co/datasets/NeelNanda/pile-10k), a Pile slice prepared for interpretability work, which I packed into fixed-length token chunks shuffled with a fixed seed for determinism and reproducibility. Since the SAEs used in this experiment were trained using activations resulting from this kind of text running through the models, the use of that text here is meant to provide a clean measurement of the SAEs' capabilities, relative to the compression methods undertaken.

Model accuracy and SAE reconstruction were each computed for each model at every bit-width (bf16, 8-bit, and 4-bit).
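To build intuition for what moving from 8-bit to 4-bit weights does, the sketch below round-trips a set of synthetic weights through a simple symmetric "absmax" grid and reports the resulting error. This is a deliberately simplified stand-in and not what bitsandbytes does: LLM.int8() adds vector-wise scaling and special handling of outliers, and NF4 uses non-uniform levels suited to normally distributed weights. The Gaussian weights, the seed, and the helper names are all hypothetical.

```python
import math
import random


def absmax_round_trip(weights, bits):
    """Quantize to a symmetric uniform grid, then map back to floats.

    Returns the restored weights and the grid spacing (scale).
    """
    levels = 2 ** (bits - 1) - 1  # 127 steps per side at 8-bit, 7 at 4-bit
    scale = max(abs(w) for w in weights) / levels
    restored = [round(w / scale) * scale for w in weights]
    return restored, scale


def rms(values):
    """Root-mean-square magnitude of a list of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))


# Synthetic stand-in for one weight matrix (assumption: roughly Gaussian).
rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(4096)]

relative_error = {}
for bits in (8, 4):
    restored, scale = absmax_round_trip(weights, bits)
    errors = [w - r for w, r in zip(weights, restored)]
    relative_error[bits] = rms(errors) / rms(weights)
    print(f"{bits}-bit: relative RMS weight error = {relative_error[bits]:.4f}")
```

Each weight moves by at most half a grid step, so the error per weight is bounded, and the coarser 4-bit grid produces the larger perturbation. How that perturbation propagates to the residual stream is what the experiments in this article measure.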
To further strengthen the methodologies employed and the resulting conclusions, I supplemented the two primary measurements with the following validation tests:

Table 1: Tests performed

| Category | Test | Metric |
| --- | --- | --- |
| Primary | Model accuracy | Cross-entropy / perplexity |
| Primary | SAE applicability | FVU (headline metric) |
| Validation | Sample-size stabilization | FVU snapshotted at increasing token counts |
| Validation | Noise floor | Activation perturbation ÷ SAE reconstruction residual |

Table 2: Summary of results

| Category | Test | Result | Interpretation |
| --- | --- | --- | --- |
| Primary | Model accuracy | Cross-entropy rise vs. bf16 | Quantization degrades next-token performance only modestly (~2.0-2.7% at 4-bit). The slight improvement in the 12B model under 8-bit compression is likely noise. |
| Primary | SAE reconstruction | FVU by bit-width | FVU stays essentially flat across compression levels, so the pretrained SAEs reconstruct the residual stream about as well on the compressed models as on the originals. |
| Validation | Sample-size stabilization | FVU stabilizes by ~200k tokens, with the 200k and 500k snapshots within 0.0001. Thus, 200k is the budget used for the headline results. | The chosen token budget sits comfortably in the stable regime, so the headline FVU/accuracy numbers aren't sample-size artifacts. |
| Validation | Noise floor | Perturbation ÷ SAE residual = ~1.1-1.9; verdict: flat FVU results are informative | Quantization genuinely perturbs the residual stream by an amount the SAE nonetheless absorbs. This means the flat FVU is a real result, not a change too small to detect. |

The first question explored by this analysis is how increasing levels of compression affect model performance. This was measured using cross-entropy (https://huggingface.co/blog/royswastik/cross-entropy-simplified) and its exponential, perplexity, both standard means of measuring a model’s next-token prediction accuracy.
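As a minimal sketch of how these two metrics relate, the snippet below computes cross-entropy from raw next-token scores (logits) and then exponentiates it to obtain perplexity. The four-token vocabulary, the logits, and the helper names (including `relative_rise`, which mirrors the "rise vs. bf16" comparison) are hypothetical illustrations rather than the code used in the colab.

```python
import math


def log_softmax(logits):
    """Convert raw scores into log-probabilities in a numerically stable way."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_norm for v in logits]


def cross_entropy(logits_per_position, next_token_ids):
    """Average negative log-probability assigned to the token that actually came next."""
    losses = [
        -log_softmax(logits)[token]
        for logits, token in zip(logits_per_position, next_token_ids)
    ]
    return sum(losses) / len(losses)


def perplexity(ce):
    """Perplexity is the exponential of cross-entropy."""
    return math.exp(ce)


def relative_rise(baseline_ce, quantized_ce):
    """Fractional increase in cross-entropy relative to the baseline."""
    return (quantized_ce - baseline_ce) / baseline_ce


# A model with no preference among 4 tokens vs. one that favors the right token.
targets = [2, 0, 1]
uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in targets]
confident_logits = [[0.0, 0.0, 5.0, 0.0], [5.0, 0.0, 0.0, 0.0], [0.0, 5.0, 0.0, 0.0]]

ce_uniform = cross_entropy(uniform_logits, targets)
ce_confident = cross_entropy(confident_logits, targets)
print(f"uniform:   CE = {ce_uniform:.4f}, perplexity = {perplexity(ce_uniform):.4f}")
print(f"confident: CE = {ce_confident:.4f}, perplexity = {perplexity(ce_confident):.4f}")
```

A model that guesses uniformly among four tokens has a perplexity of about 4, which is why perplexity is often read as the effective number of tokens the model is choosing between.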
The results from this experiment strongly suggest that the models are highly resilient to compression, demonstrating only modest declines in performance when subjected to 4-bit model weight quantization. More specifically, 8-bit compression yielded almost no change (under 0.6%) in cross-entropy, compared to the uncompressed bf16 models. 4-bit compression showed still-modest but slightly greater degradation in performance (~2.0-2.7%).

Figure 2: Model performance vs. compression level

These results are both expected and reassuring. Weight-quantized models are becoming the de facto choice in most open-source ML applications, and modern models tend to quantize gracefully, without significant declines in performance. The likely reason is mechanical: quantization perturbs each weight only slightly, and that small, distributed error tends to wash out across the model's many parameters rather than compounding into a large output change. This echoes a broader theme from Part 3 (https://www.lesswrong.com/posts/2HdHD34QrzGazFJgZ/a-black-box-made-less-opaque-part-3-1) in this series: that these models can, in some circumstances, absorb meaningful perturbations to their internal state with little change to their output.

The second question explored by this analysis is how increasing levels of compression affect the applicability of each model’s pretrained residual stream SAEs. This SAE applicability was approximated by assessing the SAEs’ ability to take the expanded (encoded) versions of each model’s vectors and use that information to reconstruct (decode) the vectors originally produced by the model. If the decoded vectors produced by the SAEs closely match those originally produced by the compressed model, this suggests that the SAEs are able to faithfully encode and decode the compressed model’s vectors, a prerequisite for useful feature identification and model interpretability.
To measure this, I employed FVU (https://en.wikipedia.org/wiki/Fraction_of_variance_unexplained), which measures the degree of difference between the SAE’s decoded vectors and those produced by the model. The results of this experiment demonstrate that the SAEs’ ability to reconstruct the residual stream vectors was nearly unaffected by increasing levels of compression. For both Gemma 3 4B and Gemma 3 12B, FVU increased only ~0.0002-0.0003 at 4-bit compression and was effectively flat (or slightly lower) at 8-bit compression.

Figure 3: FVU vs. compression level

To further strengthen these results and ensure that the low and stable FVU values observed were genuine reflections of the SAE’s ability to handle a model whose weights were changed (perturbed) by compression, I conducted a noise-floor test. This test compares how much compression affects a model’s vectors (the numerator) to the SAE’s reconstruction error on the uncompressed model (the denominator). A ratio above 1 indicates that compression changed the model’s vectors by more than the SAE’s own reconstruction error, so any failure to reconstruct the compressed model’s residual stream would have shown up as an increase in FVU. That those floor ratios ranged from ~1.10 to 1.87 confirms compression genuinely moved the residual stream by more than the SAE's own margin of error, so the stable FVU reflects the SAE absorbing a real perturbation rather than a change too small to detect.

Figure 4: Noise-floor testing

The experiments contained in this analysis demonstrate that at commonly used levels of model weight quantization, SAEs retain their ability to faithfully encode and decode the residual stream vectors produced by the models tested, as evidenced by their essentially constant FVUs even as the noise-floor test confirmed compression genuinely perturbed the residual stream.
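The noise-floor ratio described above can be sketched as follows. The article specifies only the two quantities being compared, so the way each is aggregated here (root-mean-square Euclidean distance across tokens) is an assumption, as are the function names and the toy vectors.

```python
import math


def rms_distance(vectors_a, vectors_b):
    """Root-mean-square Euclidean distance between paired vectors."""
    squared = [
        sum((a - b) ** 2 for a, b in zip(va, vb))
        for va, vb in zip(vectors_a, vectors_b)
    ]
    return math.sqrt(sum(squared) / len(squared))


def noise_floor_ratio(acts_bf16, acts_quantized, sae_recon_bf16):
    """Quantization-induced shift divided by the SAE's baseline reconstruction error."""
    perturbation = rms_distance(acts_quantized, acts_bf16)   # numerator
    sae_residual = rms_distance(sae_recon_bf16, acts_bf16)   # denominator
    return perturbation / sae_residual


# Hypothetical activations for two tokens in a 2-d residual stream.
acts_bf16 = [[1.0, 0.0], [0.0, 1.0]]
acts_quantized = [[1.0, 0.3], [0.3, 1.0]]    # same tokens after quantization
sae_recon_bf16 = [[0.8, 0.0], [0.0, 0.8]]    # SAE output on the bf16 activations

ratio = noise_floor_ratio(acts_bf16, acts_quantized, sae_recon_bf16)
print(f"noise-floor ratio = {ratio:.2f}")
```

A ratio above 1, as in this toy case, means the quantization-induced shift exceeds the SAE's baseline error, which is the condition under which a flat FVU is informative.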
While this does not prove the interpretability of compressed models writ large, it is a prerequisite for the feature identification on which interpretability depends, and thus is cause for optimism in the ability to understand and control the compressed models that are increasingly the default in open-source applications.

For model developers and researchers, this means that the interpretability “safety tax” (the cost of building interpretability tooling) may be overstated. These experiments showed that SAEs trained on the uncompressed bf16 versions of Gemma 3 4B and Gemma 3 12B reconstructed those models' compressed variants about as well as the originals. This suggests that model developers need not train SAEs for every conceivable level of quantization, but instead can reuse the same SAEs developed for the base, uncompressed model, saving time and money while lowering the barrier for interpretability-related work.

For society generally, the broader benefits of these findings are clear: compression remains a viable and valuable tool that allows models to run more economically on more constrained hardware, while still preserving the tools needed to interpret those models and more reliably control their behavior. Put another way, these results support the conclusion that model controllability and hardware flexibility are not in tension.

One clear area of further research would be to close the logical loop that allows for a clear declaration as to the effect of model weight quantization on interpretability. These experiments take an important step in that direction by confirming SAE ability to faithfully reconstruct a model’s residual stream. While that reconstruction is a prerequisite for useful feature identification, it is only part of the story. Only by actually comparing the features identified by SAEs in uncompressed vs. compressed models can one make definitive conclusions about the effect of compression on interpretability.
This stands as a logical next step for this research.

Another logical extension would be into other forms of compression. This analysis covered post-training weight quantization, but that represents only one compression approach. Google DeepMind’s Gemma 4 (https://deepmind.google/models/gemma/gemma-4/), for instance, ships quantization-aware-trained checkpoints and non-weight compression techniques such as KV-cache compression (e.g., TurboQuant, https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/), among other means of reducing the model’s hardware requirements. Whether SAE reconstruction holds up under these more varied and aggressive methods remains an open question.