Hopfield's associative-memory equation from 1982 and the scaled dot-product attention from Vaswani 2017 are the same operation. One substitution turns one into the other. The 2024 Nobel Prize in Physics — to Hopfield and Hinton — is the academic acknowledgement that the mathematics behind today's LLMs was already written four decades ago, in a different vocabulary.
This is a condensed write-up of the longer, interactive piece at ki-mathias.de/en/hopfield.html. Seven chapters there, five live MNIST demos. Here I focus on the four steps where the story has interesting empirical edges.
Modern Hopfield (Ramsauer et al., 2020) writes the update rule as
v ← X · softmax(β · Xᵀv)
where X ∈ ℝ^(N×p) is the matrix of stored patterns and β > 0 is an inverse-temperature parameter.
Scaled dot-product attention (Vaswani et al., 2017) writes
Attention(Q, K, V) = V · softmax(Kᵀ Q / √d_k)
Set Q = v, K = X, V = X, and β = 1/√d_k. The two equations become identical. Not analogous. Identical. Same operation, written in two different notations.
In a Transformer, K and V are independent learned projections of the same input rather than the same matrix, and Q is yet another projection. Those are extra learnable transformations around the Hopfield core; the softmax-weighted lookup in the middle is unchanged.
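The substitution can be checked numerically. The following is a minimal NumPy sketch, not code from the original piece: the function names, the sizes N and p, and the Gaussian test data are all assumptions chosen for illustration, and attention is written in the same column convention as the formula above.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the first axis."""
    z = z - z.max(axis=0, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)

def hopfield_update(X, v, beta):
    """One modern-Hopfield step: v <- X softmax(beta X^T v)."""
    return X @ softmax(beta * (X.T @ v))

def attention(Q, K, V):
    """Column-convention attention: V softmax(K^T Q / sqrt(d_k))."""
    d_k = K.shape[0]
    return V @ softmax((K.T @ Q) / np.sqrt(d_k))

rng = np.random.default_rng(0)
N, p = 64, 10                      # illustrative sizes, not from the article
X = rng.standard_normal((N, p))    # stored patterns as columns
v = rng.standard_normal(N)         # query / current state

# Q = v, K = X, V = X, beta = 1/sqrt(d_k) with d_k = N
out_hopfield = hopfield_update(X, v, beta=1.0 / np.sqrt(N))
out_attention = attention(Q=v, K=X, V=X)
print(np.allclose(out_hopfield, out_attention))
```

Adding learned projections for Q, K and V would wrap this core in extra matrix multiplications without touching the softmax lookup itself.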
Krotov & Hopfield (2016) had already worked out the dense associative memory generalisation that gives this form its exponential storage capacity. Vaswani 2017 reached the same equation by iterating on machine-translation benchmarks. Ramsauer 2020 noticed they were the same. The independent rediscovery is itself diagnostic: the structure isn't a design choice, it's a forced consequence of the requirements.
The original 1982 recall rule is
v_i ← sign(Σ_j W_ij · v_j) # W = (1/N) Σ_μ ξ_μ ξ_μᵀ, W_ii = 0
This is the Hebb construction. Store ten MNIST digits, query each with 15 % pixel noise, observe what comes back.
Result: all ten queries collapse into the same end-state — an image that isn't visually any of the stored digits. Mean pairwise similarity between the ten "recalls": 0.99.
This is fully explained by the spectrum of W_Hebb. The eigenvalues are roughly
λ₁ ≈ 6.65, λ₂ ≈ 0.65, λ₃ ≈ 0.48, ...
A factor-of-ten gap between λ₁ and the rest. The top eigenvector is essentially ξ̄ = (1/p) Σ_μ ξ_μ, the per-pixel mean (cosine 0.9999).
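The collapse does not need MNIST to show up. Here is a hedged sketch on synthetic data: the stored patterns are noisy copies of one mostly-background template, which is my stand-in for correlated, biased digits and not the article's dataset. The helper name `recall`, the 80 % background rate and the 15 % flip rate are assumptions, so the printed numbers will differ from the MNIST figures quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 784, 10                        # MNIST-sized state, ten stored patterns

# Synthetic stand-in for MNIST (an assumption): a mostly-"background" template
# with 15 % of its pixels flipped independently in each stored pattern.
template = np.where(rng.random(N) < 0.8, -1, 1)
flips = rng.random((N, p)) < 0.15
X = np.where(flips, -template[:, None], template[:, None]).astype(float)

# Hebb matrix with zero diagonal, as in the recall rule above.
W = (X @ X.T) / N
np.fill_diagonal(W, 0.0)

def recall(W, v, sweeps=20):
    """Synchronous sign updates until a fixed point or the sweep limit."""
    for _ in range(sweeps):
        v_new = np.where(W @ v >= 0, 1.0, -1.0)
        if np.array_equal(v_new, v):
            break
        v = v_new
    return v

# Query every stored pattern with 15 % pixel noise.
noise = rng.random((N, p)) < 0.15
queries = np.where(noise, -X, X)
recalls = np.stack([recall(W, queries[:, k]) for k in range(p)], axis=1)

# Mean pairwise overlap between the ten recalls (1.0 means identical).
overlap = (recalls.T @ recalls) / N
mean_pairwise = overlap[np.triu_indices(p, k=1)].mean()

# Spectrum of W and alignment of the top eigenvector with the mean pattern.
eigvals, eigvecs = np.linalg.eigh(W)          # eigenvalues in ascending order
top = eigvecs[:, -1]
mean_pattern = X.mean(axis=1)
cosine = abs(top @ mean_pattern) / np.linalg.norm(mean_pattern)

print(f"mean pairwise overlap of recalls: {mean_pairwise:.3f}")
print(f"top eigenvalues: {eigvals[-1]:.2f}, {eigvals[-2]:.2f}")
print(f"|cos(top eigenvector, mean pattern)|: {cosine:.4f}")
```

The qualitative picture should match the MNIST experiment: one dominant eigenvalue, a top eigenvector aligned with the mean pattern, and recalls that are nearly indistinguishable from one another.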
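The Hebb rule is provably correct only under two conditions: the stored patterns are (nearly) mutually orthogonal, and they are unbiased, with +1 and −1 pixels roughly balanced.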
MNIST digits violate both: pairwise inner products are 400–600 out of 784 (a normalised overlap of roughly 0.5 to 0.77, i.e. about 75–88 % of pixels agree), and mean pixel values are −0.63 to −0.90 (much more "background" than "ink"). The failure is therefore not an implementation bug; it's the construction operating outside its range of validity. Centring the patterns kills the bias sink but reveals the next defect: the v → −v symmetry of E(v) = -½vᵀWv causes recalls to land on negations of stored patterns.
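The sign symmetry is easy to verify directly. A minimal sketch, assuming random ±1 patterns and the helper names `energy` and `step` (neither is from the original): the energy of a state and of its negation agree, and a state is a fixed point of the sign update exactly when its negation is.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 100, 5                                  # illustrative sizes
X = rng.choice([-1.0, 1.0], size=(N, p))       # random +/-1 patterns (assumption)
W = (X @ X.T) / N
np.fill_diagonal(W, 0.0)

def energy(W, v):
    """Classical Hopfield energy E(v) = -1/2 v^T W v."""
    return -0.5 * v @ W @ v

def step(W, v):
    """One synchronous sign update."""
    return np.where(W @ v >= 0, 1.0, -1.0)

xi = X[:, 0]
same_energy = np.isclose(energy(W, xi), energy(W, -xi))
# W @ (-xi) = -(W @ xi), so the update maps -xi to -step(xi)
# whenever no entry of W @ xi is exactly zero.
xi_fixed = np.array_equal(step(W, xi), xi)
neg_fixed = np.array_equal(step(W, -xi), -xi)
print(same_energy, xi_fixed, neg_fixed)
```

Nothing in the quadratic energy can tell a pattern from its photographic negative, which is why centring alone does not rescue the classical rule.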
The didactic point: a learning rule is correct or incorrect relative to a data geometry. "Hebb is broken" is not a sentence. "Hebb is broken on MNIST" is.
The Personnaz–Guyon–Dreyfus construction (1985) keeps the same recall machinery but builds W differently:
W_PI = X (XᵀX)⁻¹ Xᵀ
The factor (XᵀX)⁻¹ is exactly what's missing in Hebb: the inverse of the pattern-pattern Gram matrix. It removes correlations between stored patterns before the matrix becomes the energy landscape. For orthogonal patterns the two rules coincide; for correlated ones, only W_PI carries the algebraic guarantee
W_PI · ξ_p = ξ_p # every stored pattern is a fixed point with eigenvalue 1
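Both claims can be checked in a few lines. This is a sketch under stated assumptions: the function names are mine, the correlated patterns are noisy copies of a random template, the orthogonal patterns are columns of a Sylvester-Hadamard matrix, and the Hebb matrix keeps its diagonal here so that it can be compared entry by entry with W_PI.

```python
import numpy as np

def hebb(X):
    """Hebb matrix (1/N) X X^T; the diagonal is kept for a direct comparison."""
    return (X @ X.T) / X.shape[0]

def pseudoinverse_rule(X):
    """W_PI = X (X^T X)^{-1} X^T, computed with a linear solve instead of inv()."""
    return X @ np.linalg.solve(X.T @ X, X.T)

rng = np.random.default_rng(2)
N, p = 64, 8                                        # illustrative sizes

# Correlated +/-1 patterns: a shared template with 20 % of entries flipped.
template = rng.choice([-1.0, 1.0], size=N)
X_corr = np.where(rng.random((N, p)) < 0.2, -template[:, None], template[:, None])

# Orthogonal +/-1 patterns: columns of a Sylvester-Hadamard matrix.
H = np.array([[1.0]])
while H.shape[0] < N:
    H = np.kron(H, np.array([[1.0, 1.0], [1.0, -1.0]]))
X_orth = H[:, 1:p + 1]

W_pi = pseudoinverse_rule(X_corr)
pi_fixes_patterns = np.allclose(W_pi @ X_corr, X_corr)
hebb_fixes_patterns = np.allclose(hebb(X_corr) @ X_corr, X_corr)
rules_coincide = np.allclose(hebb(X_orth), pseudoinverse_rule(X_orth))

print("W_PI maps stored patterns to themselves:", pi_fixes_patterns)
print("Hebb does the same on correlated data: ", hebb_fixes_patterns)
print("rules coincide on orthogonal patterns: ", rules_coincide)
```

Using `np.linalg.solve` rather than an explicit inverse is a numerical-hygiene choice; it matters more as the Gram matrix approaches singularity.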
Empirical capacity on MNIST, p stored patterns, 10 % pixel noise, fraction of queries that recover the original:
| p | Hebb | Pseudoinverse |
|---|---|---|
| 10 | 0 % | 100 % |
| 100 | 0 % | 100 % |
| 150 | 0 % | 97 % |
| 200 | 0 % | 32 % |
| 250 | 0 % | 1 % |
| 300 | 0 % | 0 % |
A sharp phase transition between p ≈ 150 and p ≈ 250. Far below the algebraic ceiling p = N = 784, where the Gram matrix becomes singular. The identity W_PI ξ_p = ξ_p holds throughout, but the basin of attraction around each fixed point shrinks as the patterns crowd one another, and 10 % noise overshoots the basin once p exceeds ~150.
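The measurement behind a table like this can be sketched as a small harness. Everything here is an assumption for illustration: the data are random ±1 patterns with N = 200 rather than MNIST, the helper names are mine, and "recovered" means the iteration ends exactly on the original pattern. The numbers it prints are therefore not comparable to the table above; only the procedure is.

```python
import numpy as np

def pseudoinverse_rule(X):
    """W_PI = X (X^T X)^{-1} X^T for patterns stored as columns of X."""
    return X @ np.linalg.solve(X.T @ X, X.T)

def recall(W, v, sweeps=50):
    """Iterate v <- sign(W v) until a fixed point or the sweep limit."""
    for _ in range(sweeps):
        v_new = np.where(W @ v >= 0, 1.0, -1.0)
        if np.array_equal(v_new, v):
            break
        v = v_new
    return v

def recovery_rate(X, noise=0.10, seed=0):
    """Fraction of stored patterns recovered exactly from noisy queries."""
    rng = np.random.default_rng(seed)
    W = pseudoinverse_rule(X)
    hits = 0
    for k in range(X.shape[1]):
        flip = rng.random(X.shape[0]) < noise
        query = np.where(flip, -X[:, k], X[:, k])
        hits += np.array_equal(recall(W, query), X[:, k])
    return hits / X.shape[1]

rng = np.random.default_rng(3)
N = 200                                        # illustrative state size, not MNIST
rates = {}
for p in (20, 100, 190):
    X = rng.choice([-1.0, 1.0], size=(N, p))   # random patterns (assumption)
    rates[p] = recovery_rate(X)
    print(f"p = {p:3d}   recovered: {rates[p]:.0%}")
```

Close to p = N the projector approaches the identity, so a noisy query is mapped almost to itself and stays noisy: the fixed-point identity survives while the basins vanish.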
Side note for readers who came in via the Eigenvalues post: the operator X(XᵀX)⁻¹Xᵀ is exactly ridge regression with λ = 0, i.e. the pseudoinverse hat matrix. The Hopfield update with this W is therefore a non-linear filter built on top of an ordinary projection onto the span of stored patterns. The capacity cliff is the cliff of unregularised projection at near-singular Gram.
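For readers who want to see the projection property rather than take it on faith, here is a small check. The function name `ridge_hat` and the random test patterns are assumptions; the ridge term is written as λI added to the Gram matrix, which is the usual textbook form.

```python
import numpy as np

def ridge_hat(X, lam):
    """Ridge hat matrix X (X^T X + lam I)^{-1} X^T for columns-as-patterns X."""
    p = X.shape[1]
    return X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)

rng = np.random.default_rng(4)
N, p = 50, 10                                   # illustrative sizes
X = rng.choice([-1.0, 1.0], size=(N, p))

P = ridge_hat(X, lam=0.0)                       # the pseudoinverse rule W_PI
matches_pinv = np.allclose(P, X @ np.linalg.pinv(X))
idempotent = np.allclose(P @ P, P)              # P is a projector
symmetric = np.allclose(P, P.T)                 # and an orthogonal one

# With lam > 0 every eigenvalue sigma^2 / (sigma^2 + lam) is strictly below 1.
top_regularised = np.linalg.eigvalsh(ridge_hat(X, lam=1.0))[-1]

print(matches_pinv, idempotent, symmetric, top_regularised < 1.0)
```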
Stop iterating sign(Wv). Replace it with the soft, input-dependent
v ← X · softmax(β · Xᵀv)
Three structural changes happen at once:
| Component | Classical (1982/1985) | Modern (Ramsauer 2020) |
|---|---|---|
| Operator | fixed W ∈ ℝ^(N×N) | none: direct softmax-lookup on X |
| Update | linear in v + sign | non-linear (softmax in v) |
| Energy | quadratic -½ vᵀWv | log-sum-exp + ½‖v‖² (Lyapunov) |
| Convergence | iterative, many sweeps | one step (for sufficiently large β) |
| Capacity | dynamically ≪ N | Ω(exp(N)), exponential in N |
The exponential capacity is the practical reason this works for LLMs at all: with N = 768 (a typical embedding dim), you can store effectively-unbounded context. With N = 784 (MNIST), the classical pseudoinverse rule plateaus near p ≈ 150 on real data.
And the parameter β is interpretable. At small β, the softmax is near-uniform and the recall is a soft average of all stored patterns. At large β, it concentrates on the single best match: Modern Hopfield converges to 1-nearest-neighbour. Ramsauer's analysis of Transformer heads shows early layers running at low β (global averaging) and deeper layers running at high β (sharp lookup on a single token). The classical "attention is mysterious" complaint dissolves into a continuous interpolation between two known operations.
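The two limits of β can be seen in a single update step. A minimal sketch, assuming random ±1 patterns, a query that is a 10 % noisy copy of one stored pattern, and two hand-picked β values; none of these choices come from the original experiments.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax of a 1-D vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def hopfield_update(X, v, beta):
    """One modern-Hopfield step: v <- X softmax(beta X^T v)."""
    return X @ softmax(beta * (X.T @ v))

rng = np.random.default_rng(5)
N, p = 64, 10                                  # illustrative sizes
X = rng.choice([-1.0, 1.0], size=(N, p))
flip = rng.random(N) < 0.10
query = np.where(flip, -X[:, 3], X[:, 3])      # noisy copy of stored pattern 3

low = hopfield_update(X, query, beta=1e-6)     # near-uniform softmax weights
high = hopfield_update(X, query, beta=10.0)    # near one-hot softmax weights
best = int(np.argmax(X.T @ query))             # index of the nearest stored pattern

print("low beta  ~ mean of all patterns:", np.allclose(low, X.mean(axis=1), atol=1e-3))
print("high beta ~ nearest stored pattern:", np.allclose(high, X[:, best], atol=1e-6))
```

Intermediate β values blend the best few matches, which is the regime a trained attention head is free to choose per layer.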
The interesting finding from Negri, Tudisco, Lucibello et al. 2024 — Random Features Hopfield Networks generalize retrieval to previously unseen examples — is not "we made Hopfield better." It's the opposite:
The exact same learning rule that scores 65 % accuracy on MNIST (i.e., barely matches 1-NN, no real generalisation) achieves perfect generalisation (magnetisation 1.0 on unseen test patterns) when the data is built as a sparse mixture of a small set of random features.
Setup: let F ∈ {-1,+1}^(N×D) be a random feature matrix. Each pattern is ξ = sign(F · c) with c an L-sparse binary coefficient vector. Three sets share the same F:
Sweep α = p/N and measure the magnetisation of each set. Three phases appear in order:
With the pseudoinverse rule this last transition is a hard jump to magnetisation 1.0, and the math explains why: once the trained patterns span enough of the feature mixtures, every feature mixture becomes an eigenvector of W_PI with eigenvalue 1, by the same identity that made stored patterns fixed points.
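The data model is simple enough to write down. The sketch below is my own reading of the setup and not the paper's code: the sizes N, D, L, the load α = 0.5, the split into a training and a test set, and the one-step definition of magnetisation are all assumptions. L is taken odd so that sign never sees a zero, and `np.linalg.pinv` is used so that accidental duplicate patterns do not break the construction.

```python
import numpy as np

def make_patterns(F, L, count, rng):
    """Columns xi = sign(F c), with c an L-sparse binary vector (L odd avoids sign(0))."""
    N, D = F.shape
    C = np.zeros((D, count))
    for k in range(count):
        C[rng.choice(D, size=L, replace=False), k] = 1.0
    return np.sign(F @ C)

def magnetisation(W, patterns):
    """Mean overlap (1/N) xi . sign(W xi) after one update started at xi."""
    out = np.where(W @ patterns >= 0, 1.0, -1.0)
    return float(np.mean(np.sum(out * patterns, axis=0) / patterns.shape[0]))

rng = np.random.default_rng(6)
N, D, L = 200, 40, 3                         # hypothetical sizes
F = rng.choice([-1.0, 1.0], size=(N, D))     # shared random feature matrix

alpha = 0.5                                  # load alpha = p / N
p = int(alpha * N)
train = make_patterns(F, L, p, rng)          # stored patterns
test = make_patterns(F, L, 50, rng)          # unseen mixtures of the same features

W = train @ np.linalg.pinv(train)            # pseudoinverse rule; pinv tolerates duplicates
m_train = magnetisation(W, train)
m_test = magnetisation(W, test)
print(f"alpha = {alpha}: train magnetisation {m_train:.3f}, test magnetisation {m_test:.3f}")
```

Sweeping `alpha` in a loop and plotting both magnetisations is the natural next step; the training value is pinned at 1.0 by the fixed-point identity, so the informative curve is the test one.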
The takeaway is not subtle: generalisation is a property of the data geometry, not of the learning rule. A textbook claim that "this learning rule generalises better" is well-typed only relative to a class of data. The reason language models generalise so well isn't that the attention mechanism has a special "ability" — it's that natural language already has the sparse compositional structure that makes Hopfield-style retrieval transfer beyond the training set. Words and constructions are a finite set of components; sentences are sparse mixtures. Hopfield-friendly by accident of biology.
A non-exhaustive list, with the empirical claim each item is making:
Hopfield, HopfieldPooling, HopfieldLayer modules, swap-in replacements for LSTM / pooling / attention.
If you spot a mistake or a sharper statement of any of the above, the source repo is open; corrections welcome.