# Transformer Attention Is Hopfield's 1982 Update Rule (And What That Tells Us About LLM Memory)

> Source: <https://dev.to/ki-mathias/transformer-attention-is-hopfields-1982-update-rule-and-what-that-tells-us-about-llm-memory-4i7f>
> Published: 2026-06-04 15:40:53+00:00

Hopfield's associative-memory equation from 1982 and the scaled dot-product attention from Vaswani 2017 are the same operation. One substitution turns one into the other. The 2024 Nobel Prize in Physics — to Hopfield and Hinton — is the academic acknowledgement that the mathematics behind today's LLMs was already written four decades ago, in a different vocabulary.

This is a condensed write-up of the longer, interactive piece at [ki-mathias.de/en/hopfield.html](https://ki-mathias.de/en/hopfield.html). Seven chapters there, five live MNIST demos. Here I focus on the four steps where the story has interesting empirical edges.

Modern Hopfield (Ramsauer et al., 2020) writes the update rule as

```
v ← X · softmax(β · Xᵀv)
```

where `X ∈ ℝ^(N×p)` is the matrix of stored patterns and `β > 0` is an inverse-temperature parameter.

Scaled dot-product attention (Vaswani et al., 2017), transposed into the same column convention and taken for a single query, reads

```
Attention(Q, K, V) = V · softmax(Kᵀ Q / √d_k)
```

Set `Q = v`, `K = X`, `V = X`, and `β = 1/√d_k`. The two equations become identical. Not analogous. **Identical.** Same operation, written in two different notations.

In a Transformer, K and V are independent learned projections of the same input rather than the same matrix, and Q is yet another projection. Those are extra learnable transformations *around* the Hopfield core; the softmax-weighted lookup in the middle is unchanged.
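The substitution is easy to check numerically. Below is a minimal NumPy sketch, assuming a single query vector and the column convention used in the two formulas above. The function names `hopfield_update` and `attention`, the sizes and the random data are illustrative choices, not part of either paper.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D vector.
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def hopfield_update(X, v, beta):
    # Modern Hopfield step: v <- X softmax(beta * X^T v), X has shape (N, p).
    return X @ softmax(beta * (X.T @ v))

def attention(Q, K, V):
    # Single-query attention in column convention:
    # V softmax(K^T Q / sqrt(d_k)), with K and V of shape (d_k, p).
    d_k = K.shape[0]
    return V @ softmax(K.T @ Q / np.sqrt(d_k))

rng = np.random.default_rng(0)
N, p = 16, 5                       # assumed toy sizes
X = rng.standard_normal((N, p))    # stored patterns as columns
v = rng.standard_normal(N)         # state / query

out_hopfield = hopfield_update(X, v, beta=1.0 / np.sqrt(N))
out_attention = attention(Q=v, K=X, V=X)
print(np.allclose(out_hopfield, out_attention))
```

With `Q = v`, `K = V = X` and `β = 1/√d_k`, both functions compute the same softmax-weighted combination of the columns of `X`, so the two outputs agree up to floating-point rounding.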

Krotov & Hopfield (2016) had already worked out the *dense associative memory* generalisation that gives this form its exponential storage capacity. Vaswani 2017 reached the same equation by iterating on machine-translation benchmarks. Ramsauer 2020 noticed they were the same. The independent rediscovery is itself diagnostic: the structure isn't a design choice, it's a forced consequence of the requirements.

The original 1982 recall rule is

```
v_i ← sign(Σ_j W_ij · v_j)        # W = (1/N) Σ_μ ξ_μ ξ_μᵀ,  W_ii = 0
```

This is the Hebb construction. Store ten MNIST digits, query each with 15 % pixel noise, observe what comes back.

Result: **all ten queries collapse into the same end-state** — an image that isn't visually any of the stored digits. Mean pairwise similarity between the ten "recalls": 0.99.
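The collapse can be reproduced without MNIST. The sketch below is a toy stand-in, not the original experiment: it assumes synthetic patterns that are noisy copies of one shared, mostly negative template, which makes them biased and correlated in roughly the way digit images are. All sizes and noise levels are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 200, 10                                  # assumed toy sizes

# Assumption: every pattern is the same template with 10 % of entries flipped.
template = np.where(rng.random(N) < 0.8, -1, 1)
X = template[:, None] * np.where(rng.random((N, p)) < 0.1, -1, 1)

# Hebb construction: W = (1/N) sum_mu xi_mu xi_mu^T with zero diagonal.
W = (X @ X.T) / N
np.fill_diagonal(W, 0.0)

def recall(W, v, sweeps=20):
    # Synchronous sign updates for a fixed number of sweeps.
    for _ in range(sweeps):
        v = np.where(W @ v >= 0, 1, -1)
    return v

# Query each stored pattern with 15 % of its entries flipped.
noise = np.where(rng.random((N, p)) < 0.15, -1, 1)
recalls = np.stack([recall(W, X[:, k] * noise[:, k]) for k in range(p)], axis=1)

# Similarity of the end states to each other and to their intended targets.
overlaps = (recalls.T @ recalls) / N
mean_pairwise = overlaps[np.triu_indices(p, k=1)].mean()
own_overlap = np.mean(np.sum(recalls * X, axis=0) / N)
print(mean_pairwise, own_overlap)
```

On data like this the end states should resemble each other far more than they resemble the patterns they were meant to recall, which is the same qualitative failure as in the MNIST run.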

This is fully explained by the spectrum of `W_Hebb`. The eigenvalues are roughly

```
λ₁ ≈ 6.65,   λ₂ ≈ 0.65,   λ₃ ≈ 0.48,   ...
```

A factor-of-ten gap between `λ₁` and the rest. The top eigenvector is essentially `ξ̄ = (1/p) Σ_μ ξ_μ`, the per-pixel mean (cosine 0.9999).
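The reason the mean dominates is an exact identity: writing each pattern as mean plus deviation, the cross terms cancel and `Σ_μ ξ_μ ξ_μᵀ = p ξ̄ ξ̄ᵀ + Σ_μ δ_μ δ_μᵀ`. The sketch below checks that identity and the resulting spectral gap on assumed synthetic biased patterns (noisy copies of one template), not on MNIST, so the exact eigenvalues will differ from the ones quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 200, 10                                  # assumed toy sizes

# Assumption: biased, correlated patterns built from one shared template.
template = np.where(rng.random(N) < 0.8, -1, 1)
X = template[:, None] * np.where(rng.random((N, p)) < 0.1, -1, 1)

W = (X @ X.T) / N
np.fill_diagonal(W, 0.0)

eigvals, eigvecs = np.linalg.eigh(W)            # ascending order, W is symmetric
lam1, lam2 = eigvals[-1], eigvals[-2]
top = eigvecs[:, -1]

mean_pattern = X.mean(axis=1)
cosine = abs(top @ mean_pattern) / (np.linalg.norm(top) * np.linalg.norm(mean_pattern))

# Exact decomposition: rank-one mean term plus the centred remainder.
centred = X - mean_pattern[:, None]
rank_one = p * np.outer(mean_pattern, mean_pattern)
decomposition_ok = np.allclose(X @ X.T, rank_one + centred @ centred.T)
print(lam1, lam2, cosine, decomposition_ok)
```

For ±1 patterns, zeroing the diagonal only subtracts `(p/N)·I`, which shifts every eigenvalue by the same amount and leaves the eigenvectors unchanged.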

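The Hebb rule is provably correct *only* under two conditions: the stored patterns are (nearly) orthogonal to one another, and they are unbiased, with a mean value close to zero.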

MNIST digits violate both: pairwise inner products are 400–600 out of 784 (roughly two thirds of the maximum, where orthogonal patterns would give 0), and mean pixel values are −0.63 to −0.90 (much more "background" than "ink"). The failure is therefore not an implementation bug; it's the construction operating outside its range of validity. Centring the patterns kills the bias sink but reveals the next defect: the `v → −v` symmetry of `E(v) = -½vᵀWv` causes recalls to land on *negations* of stored patterns.

The didactic point: **a learning rule is correct or incorrect relative to a data geometry.** "Hebb is broken" is not a sentence. "Hebb is broken *on MNIST*" is.

The Personnaz–Guyon–Dreyfus construction (1985) keeps the same recall machinery but builds W differently:

```
W_PI = X (XᵀX)⁻¹ Xᵀ
```

The factor `(XᵀX)⁻¹` is exactly what's missing in Hebb: the inverse of the pattern-pattern Gram matrix. It removes correlations between stored patterns before the matrix becomes the energy landscape. For orthogonal patterns the two rules coincide; for correlated ones, only `W_PI` carries the algebraic guarantee

```
W_PI · ξ_p = ξ_p              # every stored pattern is a fixed point with eigenvalue 1
```
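A short NumPy check of that guarantee, again on assumed synthetic correlated patterns rather than MNIST. The helper `fixed_fraction` is an illustrative name; it reports how many stored patterns survive one `sign(Wv)` step unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 200, 10                                  # assumed toy sizes

# Assumption: correlated patterns, each a noisy copy of one template.
template = np.where(rng.random(N) < 0.8, -1, 1)
X = (template[:, None] * np.where(rng.random((N, p)) < 0.1, -1, 1)).astype(float)

W_hebb = (X @ X.T) / N
np.fill_diagonal(W_hebb, 0.0)

# W_PI = X (X^T X)^-1 X^T, computed with a linear solve instead of an explicit inverse.
# This needs linearly independent columns in X.
W_pi = X @ np.linalg.solve(X.T @ X, X.T)

def fixed_fraction(W):
    # Fraction of stored patterns that are fixed points of v -> sign(W v).
    return float(np.mean(np.all(np.sign(W @ X) == X, axis=0)))

identity_holds = np.allclose(W_pi @ X, X)
hebb_fixed = fixed_fraction(W_hebb)
pi_fixed = fixed_fraction(W_pi)
print(identity_holds, hebb_fixed, pi_fixed)
```

The identity is pure linear algebra, so it holds for any independent set of patterns, however correlated. What it does not say is how large the basin around each fixed point is.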

Empirical capacity on MNIST, p stored patterns, 10 % pixel noise, fraction of queries that recover the original:

| p | Hebb | Pseudoinverse |
|---|---|---|
| 10 | 0 % | 100 % |
| 100 | 0 % | 100 % |
| 150 | 0 % | 97 % |
| 200 | 0 % | 32 % |
| 250 | 0 % | 1 % |
| 300 | 0 % | 0 % |

A sharp phase transition between p ≈ 150 and p ≈ 250. Far below the algebraic ceiling p = N = 784, where the Gram matrix becomes singular. The identity `W_PI ξ_p = ξ_p` holds throughout, but the **basin of attraction** around each fixed point shrinks as the patterns crowd one another, and 10 % noise overshoots the basin once p exceeds ~150.
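The shrinking basin can be probed with a small sweep. This is a sketch under loud assumptions: random unbiased ±1 patterns with N = 100 stand in for MNIST, so the location of the transition will not match the table above. Only the qualitative trend (reliable recovery at small p, failure as p approaches N) is the point.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100                                         # assumed toy dimension

def recovery_fraction(p, noise=0.10, trials=20, sweeps=20):
    # Assumption: random unbiased +/-1 patterns stand in for real data.
    X = rng.choice([-1.0, 1.0], size=(N, p))
    W = X @ np.linalg.pinv(X)                   # projector onto the span of stored patterns
    hits = 0
    for _ in range(trials):
        target = X[:, rng.integers(p)]
        flip = rng.choice(N, size=int(noise * N), replace=False)
        v = target.copy()
        v[flip] *= -1                           # corrupt 10 % of the entries
        for _ in range(sweeps):
            v = np.where(W @ v >= 0, 1.0, -1.0)
        hits += np.array_equal(v, target)       # count exact recoveries only
    return hits / trials

results = {p: recovery_fraction(p) for p in (5, 30, 60, 90)}
print(results)
```

Near p = N the projector is close to the identity, so a noisy query is almost a fixed point itself and never moves back to the stored pattern.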

Side note for readers who came in via the [Eigenvalues post](https://ki-mathias.de/en/eigenvalues.html): the operator `X(XᵀX)⁻¹Xᵀ` is *exactly* the hat matrix of ridge regression with `λ = 0`, i.e. the pseudoinverse projection. The Hopfield update with this W is therefore a non-linear filter built on top of an ordinary projection onto the span of stored patterns. The capacity cliff is the cliff of unregularised projection at near-singular Gram.
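The correspondence with ridge regression can be verified directly. A minimal sketch, assuming random ±1 patterns and an illustrative helper `ridge_hat`; the penalty value 10.0 is arbitrary and only there to show what regularisation changes.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 8                                    # assumed toy sizes
X = rng.choice([-1.0, 1.0], size=(N, p))        # stored patterns as columns

def ridge_hat(X, lam):
    # Hat matrix of ridge regression with design matrix X and penalty lam:
    # X (X^T X + lam I)^-1 X^T.
    gram = X.T @ X
    return X @ np.linalg.solve(gram + lam * np.eye(gram.shape[0]), X.T)

H0 = ridge_hat(X, 0.0)
W_pi = X @ np.linalg.pinv(X)

same_operator = np.allclose(H0, W_pi)           # lam = 0 gives the pseudoinverse rule
is_projection = np.allclose(H0 @ H0, H0)        # idempotent, hence a projection

# With lam > 0 the stored patterns are shrunk instead of reproduced exactly.
shrink = np.linalg.norm(ridge_hat(X, 10.0) @ X) / np.linalg.norm(X)
print(same_operator, is_projection, shrink)
```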

Stop iterating `sign(Wv)`. Replace it with the soft, input-dependent update

```
v ← X · softmax(β · Xᵀv)
```

Three structural changes happen at once:

| Component | Classical (1982/1985) | Modern (Ramsauer 2020) |
|---|---|---|
| Operator | fixed `W ∈ ℝ^(N×N)` | none: direct softmax lookup on X |
| Update | linear in v + sign | non-linear (softmax in v) |
| Energy | quadratic `-½ vᵀWv` | log-sum-exp + `½‖v‖²` (Lyapunov) |
| Convergence | iterative, many sweeps | one step (for sufficiently large β) |
| Capacity | dynamically ≪ N | `Ω(exp(N))`, exponential in N |

The exponential capacity is the practical reason this works for LLMs at all: with `N = 768` (a typical embedding dimension), you can store effectively unbounded context. With `N = 784` (MNIST), the classical pseudoinverse rule plateaus near p ≈ 150 on real data.

And the parameter `β` is interpretable. At small `β`, the softmax is near-uniform and the recall is a soft average of all stored patterns. At large `β`, it concentrates on the single best match: Modern Hopfield converges to 1-nearest-neighbour. Ramsauer's analysis of Transformer heads shows early layers running at low β (global averaging) and deeper layers running at high β (sharp lookup on a single token). The classical "attention is mysterious" complaint dissolves into a continuous interpolation between two known operations.
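Both limits are easy to see in code. A minimal sketch with assumed Gaussian patterns and two extreme values of `β`; "nearest" here means largest dot product with the query, which is the notion of similarity the softmax actually uses.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D vector.
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def hopfield_update(X, v, beta):
    # One Modern Hopfield step: v <- X softmax(beta * X^T v).
    return X @ softmax(beta * (X.T @ v))

rng = np.random.default_rng(0)
N, p = 64, 5                                    # assumed toy sizes
X = rng.standard_normal((N, p))                 # stored patterns as columns
query = X[:, 2] + 0.1 * rng.standard_normal(N)  # noisy version of one pattern

low = hopfield_update(X, query, beta=1e-6)      # near-uniform weights
high = hopfield_update(X, query, beta=1e3)      # weights collapse onto one pattern

average = X.mean(axis=1)
nearest = X[:, np.argmax(X.T @ query)]          # best match by dot product
print(np.allclose(low, average, atol=1e-3), np.allclose(high, nearest))
```

Values of `β` between these extremes give a weighted blend, which is the interpolation described above.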

The interesting finding from [Negri, Tudisco, Lucibello et al. 2024](https://arxiv.org/abs/2407.05658) — *Random Features Hopfield Networks generalize retrieval to previously unseen examples* — is **not** "we made Hopfield better." It's the opposite:

The exact same learning rule that scores 65 % accuracy on MNIST (i.e., barely matches 1-NN, no real generalisation) achieves perfect generalisation (magnetisation 1.0 on unseen test patterns) when the data is built as a sparse mixture of a small set of random features.

Setup: let `F ∈ {-1,+1}^(N×D)` be a random feature matrix. Each pattern is `ξ = sign(F · c)` with `c` an L-sparse binary coefficient vector. Three sets share the same F:

Sweep `α = p/N` and measure the magnetisation of each set. Three phases appear in order:

With the **pseudoinverse rule** this last transition is a hard jump to magnetisation 1.0, and the math explains why: once the trained patterns span enough of the feature mixtures, every feature mixture becomes an eigenvector of `W_PI` with eigenvalue 1, by the same identity that made stored patterns fixed points.
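The data model and the measurement can be sketched in a few lines. This is a scaffold under stated assumptions, not a reproduction of the paper's results: the sizes `N`, `D`, `L`, the 0/1 coding of `c`, the train/test split and the helper names `make_patterns` and `magnetisation` are all illustrative. `L` is chosen odd so that `F · c` is never zero and the sign is always defined.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
N, D, L = 200, 20, 3          # assumed sizes; L is odd so F @ c is never zero

F = rng.choice([-1.0, 1.0], size=(N, D))                 # shared random features
supports = np.array(list(combinations(range(D), L)))     # every L-sparse support
chosen = supports[rng.permutation(len(supports))[:150]]  # 150 distinct mixtures

def make_patterns(support_rows):
    # Column k is sign(F @ c_k), with c_k the 0/1 indicator of support_rows[k].
    C = np.zeros((D, len(support_rows)))
    for k, rows in enumerate(support_rows):
        C[rows, k] = 1.0
    return np.sign(F @ C)

X_train = make_patterns(chosen[:100])     # stored patterns
X_test = make_patterns(chosen[100:])      # unseen mixtures of the same features

W = X_train @ np.linalg.pinv(X_train)     # pseudoinverse rule

def magnetisation(W, patterns, sweeps=10):
    # Mean overlap between each pattern and the state reached when starting from it.
    v = patterns.copy()
    for _ in range(sweeps):
        v = np.where(W @ v >= 0, 1.0, -1.0)
    return float(np.mean(np.sum(v * patterns, axis=0) / patterns.shape[0]))

m_train = magnetisation(W, X_train)
m_test = magnetisation(W, X_test)

# Linear mechanism behind the jump: W fixes every vector in the span of X_train.
y = X_train @ rng.standard_normal(X_train.shape[1])
span_is_fixed = np.allclose(W @ y, y)
print(m_train, m_test, span_is_fixed)
```

Stored patterns come back with magnetisation 1 by construction. How close `m_test` gets to 1 depends on how much of the unseen mixtures already lies in the span of the training set, which is what sweeping the number of stored patterns probes.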

The takeaway is not subtle: **generalisation is a property of the data geometry, not of the learning rule.** A textbook claim that "this learning rule generalises better" is well-typed only relative to a class of data. The reason language models generalise so well isn't that the attention mechanism has a special "ability" — it's that natural language already has the sparse compositional structure that makes Hopfield-style retrieval transfer beyond the training set. Words and constructions are a finite set of components; sentences are sparse mixtures. Hopfield-friendly by accident of biology.

A non-exhaustive list, with the empirical claim each item is making:

- `Hopfield`, `HopfieldPooling`, `HopfieldLayer` modules, swap-in replacements for LSTM / pooling / attention.

If you spot a mistake or a sharper statement of any of the above, the source repo is open; corrections welcome.