InfoQuant: Shaping Activation Distributions for Low-Bit LLM Quantization

Researchers have developed InfoQuant, a training-free method that reshapes activation distributions in large language models (LLMs) so that they quantize more accurately at low bit widths. The approach combines Peak Suppression Orthogonal Transformation (PSOT) with adaptive outlier-token selection. Under W4A4KV4 quantization it preserves 97% of floating-point accuracy on average and narrows the LLaMA-2 13B performance gap by 42% relative to the previous state of the art. The work targets a key bottleneck in deploying LLMs with reduced memory and compute requirements.

arXiv:2605.26175v1. Abstract: Low-bit activation quantization remains a major bottleneck in efficient large language model (LLM) deployment. The difficulty is not only that activations contain outliers, but that their distributions are often poorly matched to a low-bit uniform quantizer. Existing post-training quantization (PTQ) methods suppress peaks, balance channels, or minimize reconstruction error, yet they rarely specify what activation distribution is actually easy to discretize. As a result, activations may appear numerically smoother while still incurring large quantization error, because the quantization range remains wide or most values collapse into a few levels near the mean. We recast activation transformation as quantizer-facing distribution design and analyze quantization error from an information-theoretic perspective. Our analysis shows that quantization-friendly activations should jointly have a smaller numerical range and sufficient dispersion within that range. Guided by this analysis, we propose InfoQuant, a training-free method that employs Peak Suppression Orthogonal Transformation (PSOT) to shape activations into more quantization-friendly distributions. We further introduce adaptive outlier-token selection to improve the robustness of PSOT during optimization. Across multiple LLM families, InfoQuant consistently outperforms prior PTQ and end-to-end training baselines. Under W4A4KV4, it preserves 97% of floating-point accuracy on average and reduces the LLaMA-2 13B performance gap by 42% over the previous state of the art. Code is available at https://github.com/LLIKKE/InfoQuant
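
The abstract's two criteria, a smaller numerical range and sufficient dispersion within that range, can be illustrated with a small NumPy sketch. This is not InfoQuant or PSOT: it substitutes a fixed normalized Hadamard rotation as a generic orthogonal transform, and the synthetic activations, the single outlier channel, the per-tensor symmetric 4-bit quantizer, and all function names are assumptions made for illustration only.

```python
import numpy as np

def to_levels(x, bits=4):
    """Map x to signed integer levels with a symmetric per-tensor scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(int)
    return q, scale

def fake_quantize(x, bits=4):
    """Quantize, then dequantize, so the error can be measured in float."""
    q, scale = to_levels(x, bits)
    return q * scale

def level_entropy(x, bits=4):
    """Entropy (in bits) of the histogram of occupied quantization levels."""
    q, _ = to_levels(x, bits)
    offset = 2 ** (bits - 1)  # shift levels into 0 .. 2**bits - 1
    counts = np.bincount((q + offset).ravel(), minlength=2 ** bits)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

def hadamard(n):
    """Normalized Hadamard matrix; n must be a power of two."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

# Synthetic activations: 256 tokens x 64 channels, one outlier channel.
rng = np.random.default_rng(0)
x = rng.normal(size=(256, 64))
x[:, 3] *= 50.0

H = hadamard(64)   # orthogonal: H @ H.T is the identity
x_rot = x @ H      # rotated activations seen by the quantizer

# Range seen by the quantizer, before and after rotation.
max_raw = float(np.max(np.abs(x)))
max_rot = float(np.max(np.abs(x_rot)))

# Reconstruction error, measured in the original space in both cases.
mse_raw = float(np.mean((x - fake_quantize(x)) ** 2))
mse_rot = float(np.mean((x - fake_quantize(x_rot) @ H.T) ** 2))

# Dispersion across the 16 available levels.
ent_raw = level_entropy(x)
ent_rot = level_entropy(x_rot)

print(f"max |x|: raw={max_raw:.2f} rotated={max_rot:.2f}")
print(f"MSE:     raw={mse_raw:.3f} rotated={mse_rot:.3f}")
print(f"entropy: raw={ent_raw:.2f} rotated={ent_rot:.2f} (bits)")
```

In this toy setting the outlier channel stretches the quantization range so far that nearly every other value rounds to the zero level, which is the "collapse into a few levels" failure the abstract describes. The rotation spreads the outlier's energy across channels, shrinking the range and using more of the available levels. How PSOT chooses its transformation, and how outlier-token selection enters the optimization, is specified in the paper and repository, not here.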