This post was produced as part of the Iliad Fellowship under the mentorship of Dmitry Vaintrob.
Tl;dr: Power-law ("heavy-tailed") distributions have universality theorems similar to those which make Gaussians common. We observe many things in ML are power-law distributed, most robustly and interestingly, the spectra of weight matrices. I explain how we can think of power-laws as being a natural generalization of the idea of 'sparsity', interpolating between true sparsity and Gaussianity according to the 'tail-index' of the distribution. I share some hypotheses about how this might relate to the 'sparse'/'discrete'/'factored' representations that neural networks seem to learn.
I promise this is not a Santa-Fe-Institute encomium for power laws or "black swans"; different genre.
Gaussian distributions are important because of the 'central limit theorem': morally, any phenomenon which is the sum of a bunch of tiny additive effects will give you a bell-curve. Lots of genes contribute a bit to height, so height is roughly normally distributed in the population, as each person is an approximately independent random draw of each of those height alleles.
Except the central limit theorem is a psyop.
The 'central limit theorem' which makes everything converge to Gaussian only holds, in its usual form, for random variables with finite variance. If you don't have finite variance, the 'generalized' central limit theorem applies, and you get what are called Lévy -stable distributions. If some underlying phenomenon is -stable, adding together many of them will also be an -stable distribution after the right rescaling. The Gaussian is the special case , but otherwise can be between 0 and 2.
The CLT needs finite variance: for i.i.d. with mean and variance ,
But there's a more general version of this for random variables which don't have finite variance (or, optionally, even no finite mean). These more general distributions are called stable distributions, and the non-Gaussian cases have power-law tails with exponent : roughly, for the density tail.
Closure + GCLT, in brief. Stable means sums of independent copies stay in the family, , with in the symmetric unit-scale case. Gaussian is , so to sum 0-mean Gaussians, sum their variances.
The Generalized CLT: sums of power-law-tailed variables (, ), centered when the mean exists and normalized by rather than , converge to the stable law . The tail exponent becomes the stability index. Familiar cases: Gaussian, Cauchy, "the" Lévy (since -stable distributions as a whole are often, incl. by me, called Lévy). NB that the actual law of usually isn't easy to write down, so everything in power-law world is done in terms of characteristic functions. In the symmetric, centered case, .
Pareto distributions aren't quite the same thing as alpha-stable distributions, but they are the cleanest toy model for the same tail exponent. The Pareto principle, "80% of the outcomes can be explained by 20% of the causes", is actually a precise statement about for a Pareto tail: the 80:20 split appears near . The exponent controls how much of the mean is carried by rare events, and in the limit almost all mass is carried by a finite number of extreme draws. (See stable distributions.)
[1] Figure: A simulated Brownian walk and a simulated heavy-tailed Lévy walk, shown on the same spatial scale. The Brownian path is made from many comparable steps; the Lévy path is dominated by rare jumps. Image credit to Wikipedia
Figure: Toy matrices with Gaussian versus heavy-tailed entries. The spectrum of is the diagnostic object: it records the strength and spread of directions in weight space.
The main ML motivation here is Heavy-Tailed Self-Regularization and the associated WeightWatcher analysis. Mahoney has a cool-looking presentation here that may be a bit more pedagogical. Their thesis is that weight matrices in neural networks often pass through spectrum-level training phases:
Figure: A simplified version of the HTSR story: an initially random-looking spectrum develops separated signal directions, then a broad heavy-tailed spectrum. Image credit ibid.
Martin, Peng & Mahoney show that spectrum-only tail exponents can predict trends in generalization without access to training or test data. [2] This is the observation that got me here. It's worth emphasising that this is more 'perihelion of mercury'-style theorizing - their measurements do not definitively show this transition occurring cleanly and consistently everywhere in neural networks. I think their measurements are suggestive, and I'm guided here more by theoretical intuition.
image credit ibid.
To make the mechanism more legible, I'll quickly summarize the idea of a BBP transition, as I think this is a useful thing to know anyways, and their theory builds on it.
The BBP transition is a phase transition that occurs when you're looking for a low-rank signal inside an otherwise random covariance matrix. You can do this with different models, but for here we'll be thinking about weight matrices of neural networks, and looking at their Gram matrix, . Above a threshold, the signal pops out as a separated spike eigenvalue. Schematically,
where is a matrix with iid random Gaussian entries, and is a rank-one signal with strength . Morally, you can "see" the signal iff its strength is above a threshold value .
I think this is worth knowing about on its own as a diagnostic of learning. [3] BBP is something like the "classical" or "linear" regime of learning: it is easy to say whether the model learned thing X, and where the knowledge of thing X "is" (in the corresponding eigenvector). You can even think of a linear-regression regularizer as the dividing line between spikes/signal and noise. This is great if there is a well-defined gap.
But if you try to learn too many things, and you get too many "spikes" ( is the comfortable BBP regime), this stops being a well-defined phase transition.
HTSR talks about what happens when they're all spikes. Empirically, they fit a power law with exponent to the tail of each layer's spectrum and show that an aggregated predicts generalization without access to held-out data.
[2:1] Why might this work? What does it mean that it does?
There are two useful ways of thinking about why this might work. One view is that it is just some metric of signal propagation: the spectral norm is already a decent generalization predictor, so the power-law fit may just give a better fit because it has a fittable parameter.
[4] The second view though is theoretical: we know BBP spikes are ontically robust and "have meaning", and it sure looks like power-law spectra are like "the limit where everything is BBP spikes". We are in a regime where BBP no longer cleanly applies, but we are also definitely not in Kansas.
Further, information-theoretically, power-law distributions are a natural generalization of sparsity to noisier settings: lighter tails correspond to gentler sparsity, and in the infinitely heavy-tail limit we recover the traditional sparsity notion of counting nonzero entries. If you buy this, then things start to make a bit more sense - we love sparsity! This is something like 'simplicity'. Thus it may be the case that power-law spectra are mechanistic signatures of neural networks implementing Occam's razor. This may give a way to measure how "memorize-y" vs generalize-y a neural network is, and might point toward a new ontology for "knowledge" in neural networks.
Mechinterp researchers may also enjoy: this anti-grokking paper also suggests that fitted can act as a diagnostic: anti-grokking corresponds to the average HTSR layer-quality metric deviating from , with treated as a warning sign in their experiments.
Image credit ibid.
Many heavy-tail claims about activations, gradient magnitudes, or batch noise appear secondary. The Outlier Features paper shows some of those phenomena can be mitigated with architectural tweaks, and Attention Sinks suggests some apparent heavy-tail behavior is a relatively simple signal artifact.
I remain confident that the weight SVD spectrum is structurally significant. The data-Gram spectrum story has support from Ruderman's natural-image statistics and from the solvable neural-scaling model of Maloney, Roberts & Sully, but these are just 'power-law in power-law out' - true but don't have the structural connection I want. Different origins of the power-law spectrum could lead to different mental models of NNs.
The LessWrong post Basic facts about language models during training is a useful empirical companion. An example plot of theirs for flavour:
The Wesley Erickson talk 'Heavy-tailed Noise & Stochastic Gradient Descent' is also quite good and is part of what turned me on to this line of thinking. Also has a good discussion of the debate over heavy-tailed noise. Two screenshots for flavour (oddly cropped to not include the speaker's face).
(Sections 3.A and 3.B, but not 3.C, are primarily LLM-generated.)
Nature doesn't believe in zeroes (or more precisely, in floating point equality) - all our metrics must be smooth. What is the "smooth" version of "sparsity" which doesn't rely on something being exactly zero? Obviously, small deviations should matter less than big things. Probably this should be relative: a '2.35' in a matrix whose entries are is different than the same '2.35' in a matrix whose entries are .
Let's imagine you have a list of magnitudes of length . For now, no tricky encoding: the operative definition of sparsity is "how much error reduction can I buy by representing one more mode?" If the magnitudes sorted from largest to smallest obey a power law,
where is the -error left after keeping the top modes. That is the headline result: below the threshold , best- error decays as a power law in .
For squared error, , so is the Gaussian/compressibility threshold. This is the same threshold as the finite-variance boundary from the GCLT section. If , a small fixed number of top modes captures a non-vanishing share of the squared mass, and any fixed fraction of top modes captures an increasing share as grows. If , the distribution is dense or marginal in the sense.
[[5]](https://www.lesswrong.com/feed.xml#fn-gEjj7JpwcHNXzmjvy-5)
As you get heavier tails, , the error after keeping only a few modes becomes essentially nil. In that limit, you recover the traditional definition of sparsity: only entries matter. So data distributed according to between 0 and 2 interpolates between truly sparse and minimally compressible.
[6] The mechanism: the Gaussian channel (, power constraint ) has a continuous Gaussian-optimal input (). Under -stable noise () the variance is infinite, so ordinary power is no longer the natural signal-strength measure. Fahs & Abou-Faycal study alpha-stable additive-noise channels under fractional-moment input constraints such as , and their broader support/discreteness results say that suitable super-logarithmic cost constraints can force bounded, discrete capacity-achieving inputs. Chain: heavy-tailed channel noise changed input-cost geometry discrete alphabets can become optimal.
(Note: I think this is still an interesting and plausible idea, but I would only bet at 30% that it's usefully true. I'm actively thinking about this.)
To send information in a channel which has Gaussian noise under a total-power constraint, one should send analog data. Under heavy-tailed noise with the right input-cost constraint, however, the optimal solution can become a discrete codebook. Consider that, if Lévy-like noise is fairly common in ML, then... well... your residual stream is a noisy channel with Lévy-like noise, and communication along it may prefer a discrete codebook. This is a mechanism by which we predict SAE features - and moreover, we might be able to use this model to predict when it fails.
Figure: In the Gaussian channel, the capacity-achieving input is continuous. In the alpha-stable noise channel under an appropriate input-cost constraint, the capacity-achieving input can become a finite discrete codebook.
Unironically big if true: a generic mechanism for an inductive bias towards discrete representations. General enough that a similar idea could possibly apply to the brain.
Why? Two perspectives. First, mechanically: on this account, Lévy-like noise is the mechanism which makes discrete codebooks optimal in some regimes. It is part of the mechanism by which we can expect discrete representations to occur. What's more, if we can work out this mechanism in more detail, this may give us the ability to directly extract those representations based on their noise signatures (c.f. how one could try to learn what parity-check code someone is using by flipping the bits of the encoding and seeing what happens to the decoding).
Second, universality: people differ in priors on how similarly they expect human brains and LLMs to learn, etc. Lévy-like noise provides the basis for an account which starts with a common mechanism: a prior on discrete representations can be encoded by learning algorithms which internally optimize communication under heavy-tailed noise. To zeroth order, a learning algorithm can promote discrete codebooks just by being in a universality class which prominently features heavy-tailed noise. That's not hard! There is at least suggestive neuroscience-adjacent work on heavy-tailed synaptic weight distributions and edge-of-chaos dynamics, but every time I ask about this Claude lists like 10 back-and-forth volleys in the intellectual flame-war over how to interpret these data. Caveas.
Anyways, this would be pleasing from a Kantian perspective: the apparent immanent "sparse" structure of reality revealed to be an artifact of the nature of intelligence. We need posit neither discreteness nor sparsity as a miraculous property of the ding-an-sich universe to which we have no access save by our perceptions. Instead, we can attribute this structure to the fact that 'discrete' learners are more efficient in a sufficiently wide regime than Gaussian learners. No unanswerable appeal to Solomonoff priors, no 'simplicity bias' (Why should the universe be simple? What does 'simple' mean, really? isn't it easier to believe that any fundamental simplicity is to be attributed to the perceivers?)
Two related ideas do not quite fit the main argument, but seem worth keeping on the table:
Power-law ("heavy-tailed") distributions have universality theorems similar to those which make Gaussians common. We observe many things in ML are power-law distributed, most robustly and interestingly, the spectra of weight matrices. I have argued that power laws are a natural generalization of sparsity: by varying the tail-index , they interpolate between true sparsity and Gaussian-like density. This gives one possible bridge between spectral structure, compressibility, and the factored or discrete representations neural networks seem to learn.
In writing this, I thought a lot about BBP transitions / learning quanta and random matrices with iid Lévy entries. Heavy-tailed random matrices have a lot of properties that could be the jumping-off point for mechanically principled definitions of things like "features" (localized eigenvectors, mobility edges, connections to sparse graph codes). But heavy-tailed random matrices are critically inconsistent with the light-tailed weight matrix entries found by Beren and others. I tried to rescue the Lévy-entry story: since Lévy noise is not isotropic and the eigenbasis is trivially sparse, one has to propose a new privileged basis other than the neuron basis. I poked at ICA, but at the end of the day the neuron basis is definitely the privileged basis, so trying to do anything else feels like cheating.
Light-tailed entries plus a power-law spectrum, morally, implies loads of small correlations. Some experiments on retinal ganglion cells show that maxent / Ising models based on one- and two-body statistics can recover much of the collective structure, and related work argues the fitted model can sit near criticality. This suggests that we might do better to think in terms of many low-order interactions rather than organized higher-order correlations. More on this in a later piece: the spirit of the thing is to move from the "sparse sensing" regime that inspired SAEs to a "weak sensing" regime characterized by many noisy overlapping measurements. I'm very excited to think more about this, as it feels much more brain-like.
LLM Usage Statement: I had Claude write out some of the standard math exposition in sections 3.A. and 3.B. based on a bullet list. It also handled MD formatting and other emendatory work. Plots and images not taken from papers were generated by GPT (by code and gpt-image, resp.).
Technically, expected values are often undefined for -stable distributions, but y'know what I mean. ↩︎
Martin, Peng & Mahoney report these spectrum-only generalization predictions in Predicting trends in the quality of state-of-the-art neural networks without access to training or testing data. ↩︎ ↩︎
Though please note, BBP is about statistical signal-detection more than "learning" per se. ↩︎ This surprise is somewhat mitigated if you look at the predictors to which they compare, especially the weight-matrix spectral norm; that such a simple quantity does nearly as well surprises me and is moderate counterevidence against the claim that power-law scaling is the mechanical bridge between sparsity and generalization. On the other hand they have a 140pg theory PDF justifying it - jury will remain out until that rainiest of days. ↩︎
This is extremely interesting if you don't believe in float equality - we shouldn't be allowed to have qualitatively different behaviour at ! This is a 'singular limit', and addressing it means using intermediate asymptotics - one of my favourite things in math, but paraphrasing the pokemon professor, this is neither the time nor the place. ↩︎
Note here I'm using a metric/error-based notion of compressibility - information-theoretic measures are only sensible over real-variables if you have noise - see the section below on channel coding! ↩︎