{"slug": "why-do-transformers-have-outliers", "title": "Why do transformers have outliers?", "summary": "Transformer models develop outlier channels—feature dimensions with unusually large values in weights and activations—due to the softmax normalization in attention layers, which forces tokens to assign importance even when no update is needed, leading to concentrated values that make models sensitive to compression techniques like pruning and quantization.", "body_md": "Modern machine learning models are trained with a very large number of parameters, often more than they need. This overparameterization is useful during training because it creates a vast search space in which the model can encode rich representations of the data. However, as it turns out, models do not use all these parameters efficiently, which creates a lot of redundancy.\n\nTraining is only a short stage in a model’s life cycle. The model spends the majority of its time serving inference, where overparameterization leads to excessive memory usage, slower response times, and increased power consumption. This redundancy is precisely why model compression techniques like pruning, quantization, or low-rank decomposition can often be applied with little loss in accuracy.\n\nTransformer models, however, contain unusually large values concentrated in certain feature dimensions of both weights and activations. These dimensions, known as *outlier channels*, are highly sensitive to model compression. In this blog post, we will build intuition for why outlier channels emerge in transformer models and what drives their formation.\n\n## How Transformer Models Work\n\n*Source: Mayank Pratap Singh: The need of Attention Mechanism*\n\nTransformers split the input into tokens (represented as integer IDs), and each token is mapped to a vector called an embedding. In Vision Transformers (ViTs), images are likewise split into patches, each projected into an embedding. 
As these representations pass through the model, the goal is to make them increasingly rich by incorporating contextual information from other tokens. Attention blocks play a central role in this contextual refinement.\n\nEach attention block contains multiple attention heads, where different heads capture different kinds of relationships between tokens. Modern transformer models stack many such attention blocks, continually refining token representations by injecting more and more contextual information. This structure is what allows transformers to model sequences so effectively and make strong predictions.\n\nHowever, this continuous refinement of token representations is also where our problem begins.\n\nThink about it for a moment: do all tokens actually contain equal amounts of contextual information or require the same degree of updating? Consider punctuation tokens (. , / ?), or special tokens such as beginning-of-sequence (BOS) or delimiter (SEP) tokens. Similarly, in Vision Transformers (ViTs), many patches may contain very little useful information (e.g., background regions like sky, walls, or empty spaces). Even semantically rich tokens may become “good enough” after passing through only a few attention blocks and may not require substantial further updates.\n\nThis naturally leads to an important question:\n\nDoes the transformer architecture allow token representations to stop updating once they become sufficiently good?\n\n## The Softmax Problem\n\nRecall how attention is computed. Each token embedding is first projected into **queries**, **keys**, and **values**. The query of each token is then matched against all keys via dot products to produce similarity scores. These scores are normalized using a softmax, yielding attention weights (values between 0 and 1) that determine how much each token should contribute to the output. 
Finally, these weights are used to compute a weighted sum of the value vectors, producing the updated token representations.\n\n*Source: Bobby Cheng: The Why, What and Where of Transformers*\n\nRecall that the attention weights must sum to 1 (because of how softmax works). That’s the heart of the problem. The attention layer cannot simply “do nothing”. It must assign some amount of importance somewhere.\n\nYou might imagine a simple workaround: a token just focuses entirely on itself and ignores everything else. Unfortunately, this wouldn’t work, because the attention output is always added back to the original representation through the *residual connection*, so attending to itself still changes the token.\n\n## The Residual Connection Problem\n\nRecall that in GPT-style models (pre-attention layer norm):\n\n\\(y = x + \\text{Attention}(\\text{LayerNorm}(x))\\)\n\n“No update” in this structure means that attention does not need to reproduce \\(x\\). It only needs to satisfy:\n\n\\(\\text{Attention}(\\text{LayerNorm}(x)) \\approx 0\\)\n\nso that:\n\n\\(y \\approx x\\)\n\nSo, how does it work?\n\n## The Hack that Transformers Use\n\nRecall that the attention output for token \\(i\\) is:\n\n\\(\\text{Attention}_i = \\sum_j \\alpha_{ij} V x_j\\)\n\nWhere:\n\n- \\(\\alpha_{ij}\\): attention weight from token \\(i\\) to token \\(j\\).\n- \\(x_j\\): hidden representation of token \\(j\\).\n- \\(V\\): value projection matrix.\n\nWe want:\n\n\\(\\sum_j \\alpha_{ij} V x_j \\approx 0\\)\n\nSince the \\(\\alpha_{ij}\\) cannot all be zero, transformer models tend to assign large weights to tokens that have near-zero value vectors. These are generally tokens with low semantic content, e.g., BOS tokens or delimiter tokens.\n\nSo, for certain tokens \\(s\\), if:\n\n\\(V x_s \\approx 0\\)\n\nAnd if attention concentrates on such a token:\n\n\\(\\alpha_{is} \\approx 1\\)\n\nthen:\n\n\\(\\sum_j \\alpha_{ij} V x_j \\approx V x_s \\approx 0\\)\n\nThe authors of the ICLR paper [Efficient Streaming Language Models with Attention Sinks](https://arxiv.org/pdf/2309.17453) further found that these are generally the initial tokens and called them “attention sinks”. They hypothesized that, because autoregressive transformers allow every token to attend to earlier tokens, the initial tokens (e.g., the BOS token) are visible to the entire sequence and therefore become natural accumulation points for excess attention mass. 
As a result, these early tokens frequently emerge as stable **attention sinks** across many transformer models.\n\nBut what does all of this have to do with outliers?\n\n## The Chain Reaction\n\nWe’ve established that for a token to skip its update, attention must concentrate almost entirely on a token with a near-zero value projection. This requires softmax to produce an almost one-hot distribution.\n\nBut how does softmax produce an output close to 1 for one entry and close to 0 for the rest?\n\nFrom the softmax equation, \\(\\text{softmax}(z)_j = e^{z_j} / \\sum_t e^{z_t}\\), it is easy to see that this requires one logit to exceed all the others by a wide margin, i.e., the input logits must have a very high dynamic range. Something like the middle row in the following example:\n\nMathematically, we want:\n\n\\(q_i \\cdot k_j \\gg q_i \\cdot k_t \\quad \\text{for all } t \\neq j\\)\n\nwhere:\n\n- \\(q_i\\): query vector for token \\(i\\)\n- \\(k_j\\): key vector for the sink token \\(j\\)\n- \\(k_t\\): key vector for any other token \\(t\\)\n- \\(\\cdot\\): dot product\n\nIntuitively, a token that wants to suppress its update must match very strongly with an attention sink token while matching weakly with all other tokens.\n\nFrom the definitions:\n\n\\(q_i = W_q x_i, \\quad k_j = W_k x_j, \\quad k_t = W_k x_t\\)\n\n(Subscripts \\(i, j, t\\) index tokens, and the \\(x\\)’s come from LayerNorm.)\n\nSubstituting:\n\n\\((W_q x_i) \\cdot (W_k x_j) \\gg (W_q x_i) \\cdot (W_k x_t)\\)\n\nUsing the dot product identity:\n\n\\(a \\cdot b = \\|a\\| \\|b\\| \\cos \\theta\\)\n\nwe get:\n\n\\(\\|W_q x_i\\| \\|W_k x_j\\| \\cos \\theta_{ij} \\gg \\|W_q x_i\\| \\|W_k x_t\\| \\cos \\theta_{it}\\)\n\nCanceling the common term \\(\\|W_q x_i\\|\\):\n\n\\(\\|W_k x_j\\| \\cos \\theta_{ij} \\gg \\|W_k x_t\\| \\cos \\theta_{it}\\)\n\nThis inequality can be satisfied along two axes: by making the left-hand side large, or by keeping the right-hand side small. 
To understand what each requires, we need to answer two questions: what does LayerNorm do to token representations, and how should we interpret a matrix-vector product?\n\n### Normalization\n\nRecall that LayerNorm normalizes each token using the mean \\(\\mu\\) and standard deviation \\(\\sigma\\) computed across its embedding dimension \\(d\\), and then applies a learned per-dimension scale and shift parameterized by \\(\\gamma\\) and \\(\\beta\\), respectively:\n\n\\(\\text{LayerNorm}(x) = \\gamma \\odot \\hat{x} + \\beta, \\quad \\hat{x} = \\frac{x - \\mu}{\\sigma}\\)\n\nThe squared norm of the output without \\(\\gamma, \\beta\\) is:\n\n\\(\\|\\hat{x}\\|^2 = \\sum_{m=1}^{d} \\frac{(x_m - \\mu)^2}{\\sigma^2} = \\frac{d \\sigma^2}{\\sigma^2} = d\\)\n\nsince\n\n\\(\\sigma^2 = \\frac{1}{d} \\sum_{m=1}^{d} (x_m - \\mu)^2\\)\n\nby definition, where \\(x_m\\) is the \\(m\\)-th coordinate of \\(x\\).\n\nSo every token after LayerNorm has norm \\(\\sqrt{d}\\) (ignoring the small \\(\\epsilon\\) added to the variance for numerical stability), i.e., they all lie on a hypersphere of radius \\(\\sqrt{d}\\). With a learnable \\(\\gamma\\), each coordinate is rescaled by its own factor, so the sphere is stretched into an axis-aligned ellipsoid, but that surface is still the same for every token. \\(\\beta\\) shifts it, but its effect is typically small and the key point holds: magnitude differences between tokens going into LayerNorm are erased.\n\nThis means we cannot make \\(\\|W_k x_j\\|\\) large simply by giving the sink token’s representation a large norm going in. Any difference in output norm must therefore come entirely from directional differences. Let’s unpack this next.\n\n### What a linear map does to a vector\n\nA linear transformation\n\n\\(y = W x, \\quad W \\in \\mathbb{R}^{d_{\\text{out}} \\times d_{\\text{in}}}\\)\n\ncan **always** be decomposed (via SVD) into three steps:\n\n\\(W = U S V^\\top\\)\n\nwhere:\n\n- \\(V^\\top \\in \\mathbb{R}^{d_{\\text{in}} \\times d_{\\text{in}}}\\) is an orthogonal rotation (it changes directions but not lengths or angles).\n- \\(S \\in \\mathbb{R}^{d_{\\text{out}} \\times d_{\\text{in}}}\\) is a (rectangular) diagonal matrix of non-negative singular values. It stretches/compresses each axis independently.\n- \\(U \\in \\mathbb{R}^{d_{\\text{out}} \\times d_{\\text{out}}}\\) is a second rotation, giving the final orientation.\n\nThus, applying \\(W\\) to a vector \\(x\\) consists of:\n\n- **Rotate** \\(x\\) by \\(V^\\top\\). This re-expresses \\(x\\) in the “principal” coordinate system of \\(W\\).
\n- **Scale** each coordinate by the corresponding singular value \\(\\sigma_i\\). If \\(x\\) happens to align with a singular vector with a large \\(\\sigma\\), the output norm becomes much larger than the input norm. If it aligns with a small \\(\\sigma\\) (or is orthogonal to the large ones), the output becomes tiny.\n- **Rotate** the result by \\(U\\). This gives the final output direction.\n\n### Satisfying the inequality\n\nWith both pieces in place, the dominance condition becomes a statement about directions:\n\n- Left-hand side: for \\(\\|W_k x_j\\| \\cos \\theta_{ij}\\) to be large, two things must hold simultaneously. First, the sink token’s representation \\(x_j\\) must align strongly with the high-singular-value directions of \\(W_k\\), so that the key vector \\(k_j = W_k x_j\\) has a large norm. Second, the query \\(q_i\\) must point in the same direction as \\(k_j\\) for \\(\\cos \\theta_{ij}\\) to be large. Since \\(k_j\\) is the output of \\(W_k\\), its direction is determined by \\(W_k\\)’s dominant singular structure. For \\(q_i\\) to reliably land in that same direction, \\(W_q\\) must strongly amplify it too. This is the core constraint: \\(W_q\\) and \\(W_k\\) must develop overlapping dominant singular subspaces, i.e., they must agree on which directions to amplify.\n- Right-hand side: for \\(q_i \\cdot k_t\\) to remain small for all ordinary tokens \\(t\\), those tokens’ key vectors must not overlap with those directions. The model achieves this by reserving a small set of specialized high-magnitude directions exclusively for the no-op mechanism.\n\nTaken together, both conditions force the model to concentrate representational energy into a small number of disproportionately amplified directions. In weight space, this shows up as a few rows or columns of \\(W_q\\) and \\(W_k\\) with anomalously large magnitude. 
In activation space, when inputs pass through these weight matrices, a small number of feature dimensions carry values far larger than the rest. These are the outlier channels. But how do they grow during training?\n\n## Training Dynamics\n\nOnce outlier channels exist, training actively reinforces them. Softmax with a near-one-hot distribution has gradients concentrated almost entirely on the dominant logit, the one produced by the outlier dimensions. The gradient signal flowing back into \\(W_q\\) and \\(W_k\\) is therefore concentrated in precisely those outlier directions, making them stronger with each update.\n\nLayerNorm compounds this. Because it suppresses global scale and equalizes norms across tokens, upstream MLPs must produce outputs large enough that the outlier signal survives normalization and generates sufficient dynamic range for softmax. This pushes earlier layers to amplify outlier dimensions further.\n\nThe result is a self-reinforcing loop: outlier channels produce sharp softmax distributions, sharp distributions concentrate gradients back into outlier channels, and the cycle repeats across training.\n\nTherefore, outliers are not isolated phenomena; they emerge systematically and are interconnected across multiple dimensions within the model. The ICLR paper [Systematic Outliers in Large Language Models](https://arxiv.org/pdf/2502.06415) studied, for Llama 2 7B, how weight outliers give rise to activation outliers, which in turn propagate into attention outliers.\n\n*Source: Systematic Outliers in Large Language Models*\n\n## Conclusion\n\nTo summarize: softmax forces every token’s attention weights to sum to 1, so a token that needs no update cannot simply skip attention. Instead, the model learns to park its attention on low-content sink tokens whose value vectors are near zero. Doing so requires a near-one-hot softmax, which in turn requires logits with a very high dynamic range. Because LayerNorm erases magnitude differences between tokens, that range has to come from a few strongly amplified directions shared by \\(W_q\\) and \\(W_k\\). Those directions are the outlier channels, and they are what make transformers so sensitive to pruning and quantization.\n\n## References & Further Reading\n\n- [Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing](https://arxiv.org/pdf/2306.12929): The first paper to show that outliers emerge because softmax outputs must sum to 1. 
The authors also propose a “Clipped Softmax” and a “Gated Attention” function, which in their experiments prevent the formation of outliers. Wider adoption remains to be seen, though.\n- [Attention Is Off By One](https://www.evanmiller.org/attention-is-off-by-one.html): Evan Miller read the above paper and suggested a simple tweak to the softmax formula: adding a 1 to the denominator so that the softmax outputs no longer need to sum to 1. This gained a lot of traction when it was released, but it does not appear to be widely used in practice.\n- [Systematic Outliers in Large Language Models](https://arxiv.org/pdf/2502.06415): The authors carried out a systematic analysis of outliers by defining weight outliers, activation outliers, and attention outliers, and studied how they give rise to each other. They also study several attention variants to find out how to prevent the formation of outliers, and find that a learnable scaling factor Sc(x) that dynamically adjusts the attention weights (Explicit Context-Aware Scaling) doesn’t lead to the formation of outliers. Again, wider adoption remains to be seen.\n- [Attention is not all you need](https://arxiv.org/pdf/2103.03404): Not directly related, but since the post discusses a subtle limitation of the standard attention mechanism, you’ll find this one interesting. The authors essentially show that if you have only attention layers and no MLPs or residual connections in the model, the output converges with a cubic rate to a rank-one matrix with identical rows. The MLPs and residual connections fight against this. 
So, the residual connection doesn’t just help with gradient flow but also prevents representational collapse.", "url": "https://wpnews.pro/news/why-do-transformers-have-outliers", "canonical_source": "https://hello-fri-end.github.io/2026/07/why-do-transformers-have-outliers/", "published_at": "2026-06-30 18:30:00+00:00", "updated_at": "2026-06-30 22:20:38.435649+00:00", "lang": "en", "topics": ["large-language-models", "machine-learning", "ai-research"], "entities": ["Transformer", "Vision Transformer", "softmax", "attention"], "alternates": {"html": "https://wpnews.pro/news/why-do-transformers-have-outliers", "markdown": "https://wpnews.pro/news/why-do-transformers-have-outliers.md", "text": "https://wpnews.pro/news/why-do-transformers-have-outliers.txt", "jsonld": "https://wpnews.pro/news/why-do-transformers-have-outliers.jsonld"}}