Analysis of Metastable States in the Transformer Activation Space

A new analysis of metastable states in transformer activation spaces confirms that token representations cluster into metastable groups across layers in trained models, as predicted by a recent dynamical systems theory, but falsifies the theory's proposed mechanism. The interaction energy that is supposed to drive the process fails to increase monotonically in every model and on every prompt tested, and collapse speed is governed by the spectral radius of the value matrix rather than by model dimension. The findings, based on experiments with real transformers, validate three core predictions (clustering, persistence, and separate two-timescale dynamics) while showing that the theory's assumptions about energy monotonicity and depth-driven collapse do not hold in practice.

This is the first entry in a sequence. Over about ten parts, this series will work through a few humble experiments that test a mathematical theory of attention against real trained transformers. Each post will be aligned with a phase from the GitHub repository.

A project summary: a recent paper by Geshkovski, Letrouit, Polyanskiy, and Rigollet models attention as a dynamical system on the sphere and proves that tokens cluster and drift toward consensus, with a metastable two-timescale structure along the way, under the assumption that Q, K, and V are all the identity. This sequence asks how much of that survives in trained models (most of it does), finds the one prediction that fails universally (the energy is not monotone), and then spends the rest of the series tracking that failure to its mechanism in the value matrix.

Links:

- GitHub repository: all the code for every phase.
- Original paper: Geshkovski et al., the theory this whole investigation is built on.
- YouTube walkthrough: a video going over the original paper for those who prefer video.
Metastability is real in trained transformers, but the mechanism the theory proposes for it is not. Tokens cluster into metastable groups the way the idealized model predicts; the monotone energy increase that the theory says drives the process is violated in every model tested; and how fast a model collapses is set by its value matrix, not by depth or width as the theory claims.

Three predictions of the idealized theory survive the move to trained models, and survive cleanly. Token representations cluster over layers (CONFIRMED, every model, every real prompt). The clusters persist across runs of layers, forming metastable plateaus, before reorganizing (CONFIRMED). And the two timescales of metastability, fast formation and slow merging, are genuinely separate above a depth threshold (CONFIRMED).

Two predictions break. The interaction energy the theory requires to increase monotonically falls somewhere in every model, every prompt, and every temperature tested: the clustering is real, but it is not the pure-attraction gradient flow the theory describes (FALSIFIED). And collapse speed is governed by the spectral radius of the value matrix, not by model dimension; the higher-dimensional model in our matched pair collapses slower, the reverse of what Theorem 6.1 predicts (FALSIFIED).

Two honest caveats carry through: the GPT-2 attention-routing result is suspended because the causal mask can manufacture the signal on its own, and the two-timescale ratio magnitudes are provisional pending a rerun under a corrected definition. The clustering also does not appear in an untrained model with the same architecture, which is what makes "the network learns to do this" a finding rather than a property of the wiring.

Scope. Everything here measures what happens to token representations inside the stack: the geometry of the residual stream, layer by layer. It is not a claim about what the model outputs or how it behaves.
When a model's tokens collapse to a near-single point in the residual stream, that is a statement about internal geometry; whether and how it affects the model's predictions is a separate question that Phase 2 takes up.

Some Terminology

- Sphere projection: each token's vector is divided by its length, so all tokens live on the unit sphere. Similarity between two tokens is their inner product (the cosine of the angle between them): +1 same direction, 0 orthogonal, −1 opposite.
- β (inverse temperature): how sharply attention concentrates. β = 0 is uniform attention; large β is near-hard selection. The theory's metastability is a large-β phenomenon.
- Mass-near-1: the fraction of token pairs with inner product above 0.9. A direct read on "how much of the cloud has clustered." Ranges from 0 to 1.
- Effective rank: how many dimensions the token cloud effectively spans. Falls toward 1 as the cloud collapses to a point.
- CKA: a single 0-to-1 score for how similar one layer's whole geometry is to the next. Near 1 across several layers = a plateau.
- HDBSCAN: a clustering algorithm that discovers the number of clusters from the data and can label a token "noise" if it belongs to no cluster yet.
- Plateau: a run of layers over which a signal stays flat; the operational stand-in for a metastable window.
- Fiedler value (λ₂): a graph measure of how close the attention graph is to falling into separate pieces. Near 0 = attention split into clusters; large = everything mixing.
- E_β (interaction energy): a scalar the theory's dynamics must push monotonically up. A step where it falls is a "violation", evidence of a repulsive force the idealized model cannot produce.
- ρ(V) (spectral radius): the largest eigenvalue magnitude of the value matrix; the per-layer amplification factor that sets collapse speed.
- OV circuit: the composed output-times-value map W_O W_V; the residual-stream-to-residual-stream operator that Phase 2 argues is the right object to eigendecompose (Phase 1 uses V alone as a proxy).
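To make a few of these definitions concrete, here is a minimal NumPy sketch of three of the scalar quantities. It is an illustration under stated assumptions, not the repository's code: the function names are hypothetical, effective rank is taken to be the exponential of the entropy of the normalized singular values (one common definition; the post does not say which it uses), and the energy is assumed to take the form E_β = (1 / (2β n²)) Σ_ij exp(β⟨x_i, x_j⟩), which may differ from the project's normalization by a constant factor.

```python
import numpy as np


def effective_rank(normed: np.ndarray) -> float:
    """Entropy-based effective rank of a token cloud (assumed definition).

    `normed` has shape (n_tokens, d). The result is exp(H(p)), where p is the
    singular-value spectrum normalized to sum to 1.
    """
    s = np.linalg.svd(normed, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]  # drop exact zeros so log() is defined
    return float(np.exp(-(p * np.log(p)).sum()))


def interaction_energy(gram: np.ndarray, beta: float) -> float:
    """Assumed form: E_beta = 1 / (2 * beta * n^2) * sum_ij exp(beta * G_ij)."""
    n = gram.shape[0]
    return float(np.exp(beta * gram).sum() / (2.0 * beta * n * n))


def energy_violations(energies, tol: float = 1e-9) -> list:
    """Indices l of transitions l -> l+1 at which the energy falls."""
    e = np.asarray(energies, dtype=float)
    return [int(i) for i in np.flatnonzero(np.diff(e) < -tol)]


def spectral_radius(matrix: np.ndarray) -> float:
    """Largest eigenvalue magnitude of a square matrix."""
    return float(np.abs(np.linalg.eigvals(matrix)).max())
```

Under these definitions a fully collapsed cloud has effective rank 1 and the largest possible energy for a given β, so a layer at which the energy drops is a step away from consensus rather than toward it.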
The paper this post examines proves that, in an idealized transformer, token representations sitting on the unit sphere are pulled together by attention, group into clusters, and in the long run collapse to a single point. The proof is clean, but it holds under an assumption no trained model satisfies: that the query, key, and value projections are all the identity matrix.

This post answers one question. Does the clustering-and-metastability picture the paper establishes for the idealized model show up in real trained transformers, and where does it break once Q, K, and V are arbitrary learned matrices?

This is Phase 1: what happens. We measure whether the predicted behavior appears, and we record exactly where the trained models diverge from the theory. Phase 2, which follows, takes the divergences Phase 1 finds and asks why: what mechanism in the learned weights produces them.

The model: every token is a point on the unit sphere. At each layer, attention moves each token toward a similarity-weighted average of the others; the tokens it already points toward pull on it hardest, with the sharpness of that weighting set by an inverse temperature β. Similar tokens attract. Iterate the map and the tokens drift together; the long-run limit is a single point, every token identical. Consensus.

The part that makes the theory interesting is not the endpoint but the route to it. For large β the collapse does not happen all at once. It happens on two separate timescales. First, fast: the cloud snaps into a small number of distinct clusters. Then, slow: those clusters merge pairwise, one absorption at a time, until only one survives. Between the moment the clusters form and the moment the last two merge, the structure is effectively frozen: a long window in which the cluster count holds steady while the dynamics idle. That frozen window is metastability, and the paper conjectures it appears for large β, before the final collapse.
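The idealized map described above can be sketched in a few lines. This is a rough illustration only: it replaces the paper's continuous-time flow with a simple discrete step followed by renormalization onto the sphere, and the names `attention_step` and `simulate`, the step size, and the default parameters are all assumptions made for this example rather than anything taken from the paper or the repository.

```python
import numpy as np


def attention_step(x: np.ndarray, beta: float, step: float = 0.1) -> np.ndarray:
    """One discretized step of the idealized dynamics with Q = K = V = I.

    `x` has shape (n_tokens, d) with unit-norm rows.
    """
    logits = beta * (x @ x.T)                       # beta * <x_i, x_j>
    logits -= logits.max(axis=1, keepdims=True)     # stabilize the softmax
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)   # each row sums to 1
    moved = x + step * (weights @ x)                # pull toward the weighted average
    return moved / np.linalg.norm(moved, axis=1, keepdims=True)  # back onto the sphere


def simulate(n_tokens: int = 32, dim: int = 3, beta: float = 9.0,
             n_steps: int = 2000, seed: int = 0):
    """Iterate the map from random tokens; return final tokens and a history.

    The history holds the mean inner product over distinct token pairs.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_tokens, dim))
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    history = []
    for _ in range(n_steps):
        x = attention_step(x, beta)
        gram = x @ x.T
        history.append(float((gram.sum() - n_tokens) / (n_tokens * (n_tokens - 1))))
    return x, history
```

Plotting the returned history against the step index for a small and a large `beta` is one way to look for the two-timescale picture in this toy setting; how long any flat stretch lasts will depend on the step size and the random initialization.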
A claim about metastability is therefore a claim about time: not just that clusters exist, but that they persist for a stretch that is long relative to how quickly they formed.

The gap that makes this an empirical project is the Q = K = V = I assumption. Under it, attention is pure symmetric attraction with no learned reshaping. A trained model satisfies none of that. The query and key matrices reshape which tokens count as similar in the first place; the value matrix reshapes how a token actually moves once it is pulled. The idealized dynamics are purely attractive (they can only ever bring tokens together), but a learned value matrix can carry negative or complex eigenvalues, which would let a layer push tokens apart. None of the paper's convergence guarantees transfer directly once these matrices are arbitrary. The question is how much of the qualitative behavior survives the move from the idealized map to a learned one.

Concretely, the paper leaves several open problems; this post tests three of them, by content rather than by number. The first is whether the clustering and the two-timescale metastability survive in trained models at all (Groups A and B). The second is whether the gradient-flow structure survives: the theory's dynamics ascend a specific energy, and that energy must increase monotonically; we test whether it does (Group D). The third concerns what governs convergence speed: the paper's Theorem 6.1 ties faster collapse to higher dimension, and we test that prediction directly (Group E). One instrument group sits alongside these to check the mechanism: whether attention itself, not just the resulting geometry, shows the structure the theory requires (Group C).

Seven models, chosen so that three independent axes can be read off the same measurements, plus one untrained control.
The GPT-2 family (small, medium, large, and xl) is the depth sweep: 12, 24, 36, and 48 transformer blocks at model dimensions of 768, 1024, 1280, and 1600, all sharing the same decoder architecture and causal attention mask. Holding the architecture fixed and varying depth is what lets any depth-threshold claim be tested within a single family.

BERT-base is the encoder counterpart: 12 layers, dimension 768, bidirectional attention. Comparing it against GPT-2 isolates the encoder-versus-decoder axis and, as it turns out, exposes a confound, since GPT-2's causal mask can manufacture cluster-like attention structure on its own (Group C).

ALBERT-base and ALBERT-xlarge supply the third axis. ALBERT shares a single transformer layer's weights across all of its depth, applying the same map repeatedly. That makes the layer index a literal iteration count, so for ALBERT "depth" is a clean time knob: the closest thing among real models to the paper's iterated map, and the natural place to look for two-timescale behavior. We run ALBERT to extended iteration counts (a dense sweep up to ~48 to 60 iterations) precisely to give the slow timescale room to play out. The two ALBERT sizes also give a dimension contrast at otherwise matched architecture: base at dimension 768, xlarge at 2048. That single pair is the decisive test of the Theorem 6.1 dimension prediction (Group E), because everything except dimension is held roughly fixed.

A randomly initialized `albert-base-v2-random`, the ALBERT-base architecture with untrained weights, is run as a control. It answers the question that makes the whole project non-trivial: is the clustering learned, or does the architecture produce it with any weights at all? If a random-weight model clustered the same way, "trained transformers cluster their tokens" would be a statement about the wiring, not about what training discovers. This is the null the universality claims are measured against, and its result belongs in the verdict table.
Prompts. The prompt set is deliberately diverse, so that "universal" means across language, modality, and length, not across four similar English paragraphs. The core English prose prompts span a wide length range: `short_heterogeneous` (~23 tokens, semantically mixed), `paper_excerpt` (~306 tokens), `wiki_paragraph` (~450 tokens), and `sullivan_ballou` (~489 tokens). The set has since been extended to roughly nine real prompts, adding at least one non-English prompt, source code, and LaTeX / mathematical notation, plus a couple of others.

The diversity is itself a result: if clustering, plateaus, and energy violations appear in code and in equations as readily as in English prose, the phenomenon is a property of the dynamics, not of natural-language statistics. It also opens a sub-question worth answering explicitly in the synthesis: does any prompt class behave differently? If code or LaTeX clusters on a visibly different timescale, that is a new sub-finding; if it does not, "even code and equations cluster the same way" is a strong line. A length-sweep facility can truncate `wiki_paragraph` to a series of token targets (`wiki_64`, `wiki_128`, …) to separate prompt length from prompt content as a variable.

The degenerate control, `repeated_tokens`, is a string of one token repeated. It has no real structure to cluster around, so it serves one purpose only, as a collapse-speed diagnostic, and it is excluded from every metastability aggregation and reported separately under a collapse-controls heading. An earlier version of the two-timescale ratio misused this control's collapse layer as a denominator and produced a self-contradictory table; that error is corrected in Group E, and the control is now confined to its diagnostic role.

What we extract. For every (model, prompt, layer) triple, two raw objects: the residual-stream activations and the attention weights.
The activations are L2-normalized onto the unit sphere (one line of code), and that projection is the entire bridge between the paper's geometry and the model's internals. From the sphere-projected activations we form the per-layer Gram matrix of pairwise inner products. Every quantity in every group is computed from these objects: the Gram matrix, the normalized activations, and the attention weights. Nothing else is stored; everything else is derived.

The analysis runs in five groups, each answering one question, in an order that builds from existence to mechanism to the theory's sharpest prediction. Group A asks whether clusters form at all. Group B asks whether they persist, which is the operational test of metastability. Group C turns to the attention matrix and asks whether it carries the same story the geometry does. Group D tests the theory's most falsifiable claim, that a specific interaction energy increases monotonically. Group E asks whether the two timescales are genuinely separate, and what governs the gap.

Each group below opens with its answer, then the question, then the finding. The method and concept boxes are in dropdowns; the result does not depend on opening them.

Group A. Question: does the token cloud actually reorganize into distinct groups as depth increases, and does the reorganization look like what the theory predicts?

The three instruments in this group each answer a slightly different version of the same question. The inner-product histogram and mass-near-1 ask whether the raw pairwise geometry is changing. Effective rank asks whether the token cloud is collapsing onto fewer dimensions. HDBSCAN asks whether that collapse is structured, i.e., whether distinct groups are forming rather than a smooth smear. Together they triangulate on "yes, clusters form" without relying on any single algorithm or threshold.

Every instrument in this group reads from two objects computed once per layer: the sphere-projected activations and their Gram matrix.
Everything downstream is a function of those two. The projection itself is one line:

```python
import torch
import torch.nn.functional as F


def layernorm_to_sphere(activation: torch.Tensor) -> torch.Tensor:
    """L2-normalize each token vector onto the unit sphere."""
    return F.normalize(activation, p=2, dim=-1)
```

The per-layer loop then computes the normed activations and the Gram matrix once, and hands both to every metric below:

```python
# activations: one layer's residual stream, shape (n_tokens, d)
normed = layernorm_to_sphere(activations).numpy()  # (n_tokens, d), rows on S^{d-1}
G = normed @ normed.T                              # (n_tokens, n_tokens), <x_i, x_j>
```

`G[i, j]` is exactly $\langle x_i, x_j \rangle$. Every quantity in Group A is read off `G` or off `normed`.

At each layer, we compute the pairwise inner products between every pair of token representations and histogram them. Mass-near-1 is the fraction of pairs above 0.9.

Why this instrument: The paper's entire theoretical machinery is built around what happens to pairwise inner products over time. Its own Figure 1 shows this exact histogram for ALBERT XLarge. Running the same diagnostic is the most direct possible replication: if the paper's qualitative story holds in trained models, this histogram should show a spike migrating toward 1 as depth increases. If the spike never forms, the clustering story is over before it starts.
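The histogram and mass-near-1 can be sketched from a single layer's Gram matrix as follows. This is a minimal illustration, not the project's implementation: the function names and the bin count are hypothetical, and it assumes that each unordered pair of distinct tokens is counted once and that the diagonal (always 1) is excluded, which the post does not spell out.

```python
import numpy as np


def pairwise_inner_products(gram: np.ndarray) -> np.ndarray:
    """Inner products for distinct pairs i < j, clipped to [-1, 1]."""
    i, j = np.triu_indices(gram.shape[0], k=1)   # strict upper triangle
    return np.clip(gram[i, j], -1.0, 1.0)        # guard against float overshoot


def inner_product_histogram(gram: np.ndarray, n_bins: int = 50):
    """Histogram of pairwise inner products over the fixed range [-1, 1]."""
    values = pairwise_inner_products(gram)
    counts, edges = np.histogram(values, bins=n_bins, range=(-1.0, 1.0))
    return counts, edges


def mass_near_one(gram: np.ndarray, threshold: float = 0.9) -> float:
    """Fraction of distinct token pairs with inner product above `threshold`."""
    values = pairwise_inner_products(gram)
    return float((values > threshold).mean())
```

Calling `mass_near_one(G)` once per layer gives a single curve per (model, prompt) pair, and using the same bin edges at every layer keeps the histograms directly comparable across depth. The sketch assumes at least two tokens; with one token there are no pairs to average.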