Analysis of Metastable States in the Transformer Activation Space

A new analysis of metastable states in transformer activation spaces confirms that token representations cluster into metastable groups across layers in trained models, as predicted by a recent dynamical systems theory, but falsifies the theory's proposed mechanism. Researchers found that the interaction energy driving the process fails to increase monotonically in every model and prompt tested, while collapse speed is governed by the spectral radius of the value matrix rather than model dimension. The findings, based on experiments with real transformers, validate three core predictions—clustering, persistence, and separate two-timescale dynamics—while revealing that the theory's assumptions about energy monotonicity and depth-driven collapse do not hold in practice.

This is the first entry in a sequence. Over about ten posts, each aligned with a phase from the GitHub repository, this series works through a few humble experiments that test a mathematical theory of attention against real trained transformers.

A project summary: a recent paper by Geshkovski, Letrouit, Polyanskiy, and Rigollet models attention as a dynamical system on the sphere and proves that tokens cluster and drift toward consensus, with a metastable two-timescale structure along the way, under the assumption that Q, K, and V are all the identity. This sequence asks how much of that survives in trained models (most of it does), finds the one prediction that fails universally (the energy is not monotone), and then spends the rest of the series tracking that failure to its mechanism in the value matrix.

Links:

- GitHub repository: all the code for every phase.
- Original paper: Geshkovski et al., the theory this whole investigation is built on.
- YouTube walkthrough: a video going over the original paper for those who prefer video.

Metastability is real in trained transformers, but the mechanism the theory proposes for it is not. Tokens cluster into metastable groups the way the idealized model predicts; the energy monotonicity the theory relies on is violated in every model tested; and how fast a model collapses is set by its value matrix, not by depth or width as the theory claims.

Three predictions of the idealized theory survive the move to trained models, and survive cleanly. Token representations cluster over layers (CONFIRMED: every model, every real prompt). The clusters persist across runs of layers, forming metastable plateaus, before reorganizing (CONFIRMED). And the two timescales of metastability, fast formation and slow merging, are genuinely separate above a depth threshold (CONFIRMED).

Two predictions break.
The interaction energy the theory requires to increase monotonically falls somewhere in every model, every prompt, and every temperature tested: the clustering is real, but it is not the pure-attraction gradient flow the theory describes (FALSIFIED). And collapse speed is governed by the spectral radius of the value matrix, not by model dimension; the higher-dimensional model in our matched pair collapses slower, the reverse of what Theorem 6.1 predicts (FALSIFIED).

Two honest caveats carry through: the GPT-2 attention-routing result is suspended because the causal mask can manufacture the signal on its own, and the two-timescale ratio magnitudes are provisional pending a rerun under a corrected definition. The clustering also does not appear in an untrained model with the same architecture, which is what makes "the network learns to do this" a finding rather than a property of the wiring.

Scope. Everything here measures what happens to token representations inside the stack: the geometry of the residual stream, layer by layer. It is not a claim about what the model outputs or how it behaves. When a model's tokens collapse to a near-single point in the residual stream, that is a statement about internal geometry; whether and how it affects the model's predictions is a separate question Phase 2 takes up.

Some Terminology

- Sphere projection: each token's vector is divided by its length, so all tokens live on the unit sphere. Similarity between two tokens is their inner product (the cosine of the angle between them): +1 same direction, 0 orthogonal, −1 opposite.
- β (inverse temperature): how sharply attention concentrates. β = 0 is uniform attention; large β is near-hard selection. The theory's metastability is a large-β phenomenon.
- Mass-near-1: fraction of token pairs with inner product above 0.9. A direct read on "how much of the cloud has clustered." Ranges from 0 to 1.
- Effective rank: how many dimensions the token cloud effectively spans. Falls toward 1 as the cloud collapses to a point.
- CKA: a single 0-to-1 score for how similar one layer's whole geometry is to the next. Near 1 across several layers means a plateau.
- HDBSCAN: a clustering algorithm that discovers the number of clusters from the data and can label a token "noise" if it belongs to no cluster yet.
- Plateau: a run of layers over which a signal stays flat; the operational stand-in for a metastable window.
- Fiedler value λ₂: a graph measure of how close the attention graph is to falling into separate pieces. Near 0 means attention has split into clusters; large means everything is mixing.
- E_β (interaction energy): a scalar the theory's dynamics must push monotonically up. A step where it falls is a "violation": evidence of a repulsive force the idealized model cannot produce.
- ρ(V) (spectral radius): the largest eigenvalue magnitude of the value matrix; the per-layer amplification factor that sets collapse speed.
- OV circuit: the composed output-times-value map W_O W_V; the residual-stream-to-residual-stream operator Phase 2 argues is the right object to eigendecompose (Phase 1 uses V alone as a proxy).

The paper this post examines proves that, in an idealized transformer, token representations sitting on the unit sphere are pulled together by attention, group into clusters, and in the long run collapse to a single point. The proof is clean, but it holds under an assumption no trained model satisfies: that the query, key, and value projections are all the identity matrix.

This post answers one question. Does the clustering-and-metastability picture the paper establishes for the idealized model show up in real trained transformers, and where does it break once Q, K, and V are arbitrary learned matrices?

This is Phase 1: what happens. We measure whether the predicted behavior appears, and we record exactly where the trained models diverge from the theory.
Phase 2, which follows, takes the divergences Phase 1 finds and asks why: what mechanism in the learned weights produces them.

The model: every token is a point on the unit sphere. At each layer, attention moves each token toward a similarity-weighted average of the others. Tokens it already points toward pull on it hardest, with the sharpness of that weighting set by an inverse temperature β. Similar tokens attract. Iterate the map and the tokens drift together; the long-run limit is a single point, every token identical. Consensus.

The part that makes the theory interesting is not the endpoint but the route to it. For large β the collapse does not happen all at once. It happens on two separate timescales. First, fast: the cloud snaps into a small number of distinct clusters. Then, slow: those clusters merge pairwise, one absorption at a time, until only one survives. Between the moment the clusters form and the moment the last two merge, the structure is effectively frozen: a long window in which the cluster count holds steady while the dynamics idle. That frozen window is metastability, and the paper conjectures it appears for large β, before the final collapse. A claim about metastability is therefore a claim about time: not just that clusters exist, but that they persist for a stretch that is long relative to how quickly they formed.

The gap that makes this an empirical project is the Q = K = V = I assumption. Under it, attention is pure symmetric attraction with no learned reshaping. A trained model has none of that. The query and key matrices reshape which tokens count as similar in the first place; the value matrix reshapes how a token actually moves once it is pulled. The idealized dynamics are purely attractive (they can only ever bring tokens together), but a learned value matrix can carry negative or complex eigenvalues, which would let a layer push tokens apart.
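
The idealized Q = K = V = I map described above can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the paper's or the repository's code: the function names, step size, token count, and β are all hypothetical choices, the update is a plain Euler step followed by re-projection onto the sphere, and the energy is written in the form $E_\beta = \frac{1}{2\beta n^2}\sum_{i,j} e^{\beta \langle x_i, x_j \rangle}$, which is how we read the paper's definition. The tokens start inside one open hemisphere so that collapse to consensus is the expected outcome.

```python
import numpy as np


def normalize_rows(x: np.ndarray) -> np.ndarray:
    """Project each row back onto the unit sphere."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def attention_step(x: np.ndarray, beta: float, dt: float) -> np.ndarray:
    """One Euler step of the idealized map (Q = K = V = I, assumed form)."""
    logits = beta * (x @ x.T)                        # beta * pairwise inner products
    weights = np.exp(logits - logits.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)    # row-wise softmax
    target = weights @ x                             # similarity-weighted average
    # Keep only the component of the pull that is tangent to the sphere.
    radial = np.sum(target * x, axis=1, keepdims=True) * x
    return normalize_rows(x + dt * (target - radial))


def interaction_energy(x: np.ndarray, beta: float) -> float:
    """Assumed form: (1 / (2 * beta * n^2)) * sum_ij exp(beta * <x_i, x_j>)."""
    n = x.shape[0]
    return float(np.exp(beta * (x @ x.T)).sum() / (2.0 * beta * n * n))


def mean_pairwise_inner_product(x: np.ndarray) -> float:
    """Mean inner product over distinct token pairs."""
    gram = x @ x.T
    return float(gram[np.triu_indices(x.shape[0], k=1)].mean())


rng = np.random.default_rng(0)
start = rng.normal(size=(8, 3))
start[:, 0] = np.abs(start[:, 0]) + 0.5   # place every token in one open hemisphere
start = normalize_rows(start)

tokens = start.copy()
for _ in range(1000):
    tokens = attention_step(tokens, beta=1.0, dt=0.1)

print(mean_pairwise_inner_product(start), mean_pairwise_inner_product(tokens))
print(interaction_energy(start, 1.0), interaction_energy(tokens, 1.0))
```

With identity projections every update is a positive combination of existing token directions, so nothing in this toy map can push tokens apart. A learned value matrix removes exactly that guarantee.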
None of the paper's convergence guarantees transfer directly once these matrices are arbitrary. The question is how much of the qualitative behavior survives the move from the idealized map to a learned one.

Concretely, the paper leaves several open problems; this post tests three of them, by content rather than by number. The first is whether the clustering and the two-timescale metastability survive in trained models at all (Groups A and B). The second is whether the gradient-flow structure survives: the theory's dynamics ascend a specific energy, and that energy must increase monotonically; we test whether it does (Group D). The third concerns what governs convergence speed: the paper's Theorem 6.1 ties faster collapse to higher dimension, and we test that prediction directly (Group E). One instrument group sits alongside these to check the mechanism: whether attention itself, not just the resulting geometry, shows the structure the theory requires (Group C).

Seven models, chosen so that three independent axes can be read off the same measurements, plus one untrained control.

The GPT-2 family (small, medium, large, and xl) is the depth sweep: 12, 24, 36, and 48 transformer blocks at model dimensions of 768, 1024, 1280, and 1600, all sharing the same decoder architecture and causal attention mask. Holding the architecture fixed and varying depth is what lets any depth-threshold claim be tested within a single family.

BERT-base is the encoder counterpart: 12 layers, dimension 768, bidirectional attention. Comparing it against GPT-2 isolates the encoder-versus-decoder axis and, as it turns out, exposes a confound, since GPT-2's causal mask can manufacture cluster-like attention structure on its own (Group C).

ALBERT-base and ALBERT-xlarge supply the third axis. ALBERT shares a single transformer layer's weights across all of its depth, applying the same map repeatedly.
That makes the layer index a literal iteration count, so for ALBERT "depth" is a clean time knob: the closest thing among real models to the paper's iterated map, and the natural place to look for two-timescale behavior. We run ALBERT to extended iteration counts (a dense sweep up to ~48–60 iterations) precisely to give the slow timescale room to play out. The two ALBERT sizes also give a dimension contrast at otherwise matched architecture: base at dimension 768, xlarge at 2048. That single pair is the decisive test of the Theorem 6.1 dimension prediction (Group E), because everything except dimension is held roughly fixed.

A randomly initialized `albert-base-v2-random` (the ALBERT-base architecture with untrained weights) is run as a control. It answers the question that makes the whole project non-trivial: is the clustering learned, or does the architecture produce it with any weights at all? If a random-weight model clustered the same way, "trained transformers cluster their tokens" would be a statement about the wiring, not about what training discovers. This is the null the universality claims are measured against, and its result belongs in the verdict table.

Prompts. The prompt set is deliberately diverse, so that "universal" means across language, modality, and length, not across four similar English paragraphs. The core English prose prompts span a wide length range: `short_heterogeneous` (~23 tokens, semantically mixed), `paper_excerpt` (~306 tokens), `wiki_paragraph` (~450 tokens), and `sullivan_ballou` (~489 tokens). The set has since been extended to roughly nine real prompts, adding at least one non-English prompt, source code, and LaTeX / mathematical notation, plus a couple of others. The diversity is itself a result: if clustering, plateaus, and energy violations appear in code and in equations as readily as in English prose, the phenomenon is a property of the dynamics, not of natural-language statistics.
It also opens a sub-question worth answering explicitly in the synthesis: does any prompt class behave differently? If code or LaTeX clusters on a visibly different timescale, that is a new sub-finding; if it does not, "even code and equations cluster the same way" is a strong line.

A length-sweep facility can truncate `wiki_paragraph` to a series of token targets (`wiki_64`, `wiki_128`, …) to separate prompt length from prompt content as a variable.

The degenerate control, `repeated_tokens`, is a string of one token repeated. It has no real structure to cluster around, so it serves one purpose only, as a collapse-speed diagnostic. It is excluded from every metastability aggregation and reported separately under a collapse-controls heading. An earlier version of the two-timescale ratio misused this control's collapse layer as a denominator and produced a self-contradictory table; that error is corrected in Group E, and the control is now confined to its diagnostic role.

What we extract. For every (model, prompt, layer) triple, two raw objects: the residual-stream activations and the attention weights. The activations are L2-normalized onto the unit sphere (one line of code), and that projection is the entire bridge between the paper's geometry and the model's internals. From the sphere-projected activations we form the per-layer Gram matrix of pairwise inner products. Every quantity in every group is computed from these objects: the Gram matrix, the normalized activations, and the attention weights. Nothing else is stored; everything else is derived.

The analysis runs in five groups, each answering one question, in an order that builds from existence to mechanism to the theory's sharpest prediction. Group A asks whether clusters form at all. Group B asks whether they persist, which is the operational test of metastability. Group C turns to the attention matrix and asks whether it carries the same story the geometry does.
Group D tests the theory's most falsifiable claim, that a specific interaction energy increases monotonically. Group E asks whether the two timescales are genuinely separate, and what governs the gap.

Each group below opens with its answer, then the question, then the finding. The method and concept boxes are in dropdowns; the result does not depend on opening them.

Question: does the token cloud actually reorganize into distinct groups as depth increases, and does the reorganization look like what the theory predicts?

The three instruments in this group each answer a slightly different version of the same question. The inner-product histogram and mass-near-1 ask whether the raw pairwise geometry is changing. Effective rank asks whether the token cloud is collapsing onto fewer dimensions. HDBSCAN asks whether that collapse is structured, i.e., whether distinct groups are forming rather than a smooth smear. Together they triangulate on "yes, clusters form" without relying on any single algorithm or threshold.

Every instrument in this group reads from two objects computed once per layer: the sphere-projected activations and their Gram matrix. Everything downstream is a function of those two. The projection is the entire bridge between the paper's geometry and the model's residual stream, and it is one line:

```python
import torch
import torch.nn.functional as F

def layernorm_to_sphere(activation: torch.Tensor) -> torch.Tensor:
    """L2-normalize each token vector onto the unit sphere."""
    return F.normalize(activation, p=2, dim=-1)
```

The per-layer loop then computes the normed activations and the Gram matrix once, and hands both to every metric below:

```python
normed = layernorm_to_sphere(activations).numpy()  # (n_tokens, d), on S^{d-1}
G = normed @ normed.T                               # (n_tokens, n_tokens), <x_i, x_j>
```

`G[i, j]` is exactly $\langle x_i, x_j \rangle$. Every quantity in Group A is read off `G` or off `normed`.

At each layer, we compute the pairwise inner products between every pair of token representations and histogram them.
Mass-near-1 is the fraction of pairs above 0.9.

Why this instrument: The paper's entire theoretical machinery is built around what happens to pairwise inner products over time. Its own Figure 1 shows this exact histogram for ALBERT XLarge. Running the same diagnostic is the most direct possible replication: if the paper's qualitative story holds in trained models, this histogram should show a spike migrating toward 1 as depth increases. If the spike never forms, the clustering story is over before it starts.

What are inner products on the sphere, and how do you read the histogram?

Layer normalization rescales each token's residual-stream vector to a fixed scale before passing it to the next layer. The models here use standard LayerNorm, which also mean-centers and applies a learned gain and bias, so it is not a pure division by the Euclidean norm; the sphere projection used in this analysis makes that division explicit. After the projection, every token vector $x_i$ satisfies $\|x_i\| = 1$. The token cloud lives on the unit sphere $\mathbb{S}^{d-1}$.

This matters because it changes what "distance" means. On the sphere, the natural measure of similarity between two points is their inner product:

$$\langle x_i, x_j \rangle = \cos \theta_{ij},$$

where $\theta_{ij}$ is the angle between the two vectors. The inner product is exactly the cosine similarity when both vectors are unit length. The range is $[-1, 1]$: $+1$ means the same direction, $0$ orthogonal, $-1$ opposite.

For a layer with $n$ tokens, there are $\binom{n}{2}$ distinct pairs. Compute every pairwise inner product and put them in a histogram. The x-axis runs from $-1$ to $1$. The y-axis is density (normalized so the histogram integrates to 1).

Reading the histogram across layers:

Early layers: spread near 0. In high dimensions, random vectors on the sphere concentrate near the equator. By concentration of measure, most pairwise inner products between randomly-oriented high-dimensional unit vectors land close to 0. A histogram piled near 0 at early layers isn't surprising; it's the expected baseline before the dynamics have done anything. This is true for large $d$, which is always the case here.

Clustering in progress: a spike growing at 1. As layers push similar tokens together, a subset of pairs achieves high inner product.
The histogram develops a second peak migrating toward 1 while the main mass stays near 0. Each such peak is a cluster: the pairs at high inner product are tokens that have been pulled together.

Full collapse: everything at 1. When all tokens have merged into a single point on the sphere, every pairwise inner product equals 1. The histogram is a single spike at 1. This is what the theory's long-run prediction (consensus, a single Dirac mass) looks like empirically.

Tracking the full histogram across all layers, models, and prompts produces an unwieldy number of plots. Mass-near-1 compresses the histogram into a scalar: the fraction of pairs with inner product above a threshold (we use 0.9):

$$\text{mass-near-1} = \binom{n}{2}^{-1} \sum_{i < j} \mathbf{1}\left[\langle x_i, x_j \rangle > 0.9\right].$$

This is a direct, threshold-based read on "how much of the token cloud has clustered so far." A layer where mass-near-1 goes from 0 to 0.4 in one step is a layer where 40% of pairs snapped into high agreement at once: a merge event.

One alternative is Euclidean distance. We don't use it here because the paper's energy, theorems, and convergence results are all stated in terms of inner products on the sphere. Using the same quantity makes comparison direct. Also, on the unit sphere, $\|x_i - x_j\|^2 = 2 - 2\langle x_i, x_j \rangle$, so Euclidean distance is a monotone function of inner product anyway: they carry the same information, just on different scales.

Another alternative is cosine similarity computed on un-normalized activations. We don't use it because layer normalization is part of the model's computation, not a preprocessing choice. The normalized representations are what the model is computing with; measuring inner products on them is measuring the quantity the paper's dynamics track.

Once mass-near-1 reaches near 1.0, the token cloud is essentially a point. This is consistent with the paper's prediction, but it also means the model has lost all token-level diversity: every token has the same representation.
Whether this happens in the residual stream before the output head, and what it implies for the model's function, is a separate question. Phase 1 documents when and whether it happens; Phase 2 investigates why.

The code: from Gram matrix to histogram and mass-near-1

The pair extraction is one helper. It takes the upper triangle of `G` (each pair once, no self-pairs):

```python
import numpy as np

def pairwise_inner_products_from_gram(G: np.ndarray) -> np.ndarray:
    """Upper-triangle pairwise cosine similarities from a pre-computed Gram matrix."""
    n = G.shape[0]
    idx = np.triu_indices(n, k=1)  # k=1 drops the diagonal (self-pairs = 1.0)
    return G[idx]
```

The histogram, the summary scalars, and mass-near-1 are then four lines in the per-layer loop:

```python
ips = pairwise_inner_products_from_gram(G)
lr["ip_mean"] = float(ips.mean())
lr["ip_std"] = float(ips.std())
lr["ip_histogram"] = np.histogram(ips, bins=50, range=(-1, 1))[0].tolist()
lr["ip_mass_near_1"] = float((ips > 0.9).mean())
```

The last line is the mass-near-1 definition, verbatim: `ips > 0.9` is the indicator $\mathbf{1}[\langle x_i, x_j \rangle > 0.9]$, and `.mean()` over the upper triangle is the division by $\binom{n}{2}$. The histogram uses 50 fixed bins on $[-1, 1]$. The range is fixed, so the bin edges are comparable across every layer, model, and prompt without renormalizing.

This is the array the Figure-1 replication plots: `results["layers"][li]["ip_histogram"]` is fed directly to the bar chart, one panel per layer. There is no separate code path for the histogram and for mass-near-1; they read the same `ips` array. The scalar summarizes the histogram's right tail. With 50 bins of width 0.04, 0.9 falls inside a bin (between the 0.88 and 0.92 edges), so the scalar is computed from `ips` directly rather than summed from bin counts.

What came out: Clustering is universal. Every model, on every real prompt, develops the migrating spike at 1. The endpoint is where architecture splits. ALBERT-base drives MaxMass to ~1.0 (full collapse). ALBERT-xlarge, despite being the model the paper's own Figure 1 uses, stays below ~0.30 mass-near-1 even at 48 iterations: its dynamics are slow enough that 48 layers is not enough to collapse.
The GPT-2 family clusters hard on long prompts (gpt2-small reaches ~0.87–0.97) but does not collapse its degenerate control, which is the expected signature: the control has no real structure to cluster around.

At each layer, we compute how many dimensions the token cloud is effectively spread across: a continuous, threshold-free measure of geometric collapse.

Why this instrument: Mass-near-1 measures whether pairs have aligned. Effective rank measures whether the whole cloud is collapsing onto a low-dimensional subspace. These are related but distinct: you can have many aligned pairs without full collapse (multiple separate clusters), and the way effective rank falls tells you about the geometry of the collapse process. It also serves as the degeneracy gate for several other metrics: when the cloud is essentially a point, those other metrics become meaningless and we suppress them rather than report noise.

What is effective rank, and why does the entropy form matter?

The naive answer to "how many dimensions is this cloud using" is: count the singular values of the activation matrix that are above some threshold. This is fragile. The threshold is arbitrary: changing it from 0.01 to 0.001 can double the count. There's no principled choice, and the right threshold varies across models, layers, and prompt lengths.

Take the activation matrix $X$ ($n$ tokens, $d$ dimensions per token). Compute its singular values $\sigma_1 \ge \sigma_2 \ge \dots \ge 0$. Convert them to a probability distribution by normalizing, then take the exponential of the Shannon entropy:

$$p_k = \frac{\sigma_k}{\sum_j \sigma_j}, \qquad \text{effective rank}(X) = \exp\left(-\sum_k p_k \log p_k\right)$$

This is the effective rank of Roy and Vetterli (2007). See the code box below for an implementation note: an earlier draft of this section wrote the normalization over squared singular values; the implemented metric normalizes the singular values directly, which is the Roy–Vetterli definition. The two differ.
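
How much the two normalizations differ is easy to see on a small hand-made spectrum. The sketch below is illustrative only: `spectral_effective_rank` and its `squared` flag are hypothetical names, not functions from the repository, and the two-value spectrum is chosen just to make the arithmetic easy to follow.

```python
import numpy as np


def spectral_effective_rank(singular_values, squared: bool = False) -> float:
    """exp(Shannon entropy) of the normalized spectrum.

    squared=False normalizes the singular values themselves;
    squared=True normalizes their squares (the variance-weighted variant).
    """
    s = np.asarray(singular_values, dtype=float)
    s = s[s > 0]                              # zero entries carry no weight
    weights = s ** 2 if squared else s
    p = weights / weights.sum()
    return float(np.exp(-np.sum(p * np.log(p))))


spectrum = [3.0, 1.0]
unsquared = spectral_effective_rank(spectrum)                      # p = (0.75, 0.25)
variance_weighted = spectral_effective_rank(spectrum, squared=True)  # p = (0.9, 0.1)
print(f"unsquared: {unsquared:.3f}, squared: {variance_weighted:.3f}")
```

On this spectrum the unsquared form gives roughly 1.75 and the squared form roughly 1.38: squaring concentrates the distribution on the leading direction, so the same cloud reports a lower effective rank.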
Entropy measures how spread-out a distribution is. A distribution concentrated on one value has entropy 0; a uniform distribution over $m$ values has entropy $\log m$. Exponentiating maps entropy back to a "number of effective components." Applied to the singular value spectrum: a single dominant singular value gives an effective rank near 1, and $m$ equal singular values give an effective rank of $m$.

The key advantage is that it's smooth and weights directions by their share of the spectrum. A direction carrying a negligible fraction of the total barely contributes, even though it would count as a "nonzero singular value" in a threshold-based count.

Effective rank tends to start high at early layers (many directions contribute roughly equally) and fall as clustering progresses (token representations align, the spectrum concentrates in fewer directions). The endpoint depends on architecture.

Unlike mass-near-1, effective rank tracks the full distribution of the spectrum, not just the high-agreement pairs. A model with two tight, well-separated clusters of equal size will show mass-near-1 just under 0.5 (every within-cluster pair sits near 1, no cross-cluster pair does) but effective rank ≈ 2 (two principal directions). They're measuring different things and the combination is more informative than either alone.

When effective rank drops low enough, the token cloud is essentially a point-mass in all but name. At this point:

CKA between consecutive layers goes trivially to 1. Any two representations that are both near-point-masses have nearly identical Gram matrices, so CKA can't distinguish them. A CKA value of 0.99 at a degenerate layer means "both layers are collapsed," not "the geometry is stable in an interesting way."

Nearest-neighbor assignment becomes noise. When all tokens are nearly identical, which token is nearest to which is determined by floating-point rounding at the scale of $10^{-6}$. NN-stability values computed here are not interpretable.

Spectral cluster counts are meaningless. The Laplacian eigengap method finds clusters in a graph; a near-point-mass graph has no structure to find.
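
The two-cluster case is small enough to check numerically. The sketch below is a toy construction, not data from the experiments: it assumes two perfectly tight clusters placed on orthogonal directions, with an arbitrary cluster size `m` and dimension `d`, and uses the unsquared effective-rank normalization.

```python
import numpy as np

m, d = 10, 16                                   # tokens per cluster, ambient dimension
e1, e2 = np.eye(d)[0], np.eye(d)[1]             # two orthogonal unit directions
tokens = np.vstack([np.tile(e1, (m, 1)), np.tile(e2, (m, 1))])

gram = tokens @ tokens.T
pairs = gram[np.triu_indices(2 * m, k=1)]       # each pair once, no self-pairs
mass_near_1 = float((pairs > 0.9).mean())

sv = np.linalg.svd(tokens, compute_uv=False)
sv = sv[sv > 1e-10]                             # drop numerical zeros
p = sv / sv.sum()
effective_rank = float(np.exp(-np.sum(p * np.log(p))))

print(mass_near_1, effective_rank)
```

Within-cluster pairs all sit at inner product 1 and cross-cluster pairs at 0, so mass-near-1 comes out at $(m-1)/(2m-1)$, a little under one half, while the two equal singular values give an effective rank of 2. Neither number alone distinguishes "two clusters" from other geometries; the pair does.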
Rather than report these quantities and let them corrupt the analysis, we gate them on effective rank and suppress them below the threshold (see the code box for the exact gate values). The effective rank itself is still informative at rank 2: a 2D cloud on the sphere can still have meaningful structure (two clusters separated by a great circle).

There are two variants. Raw effective rank is computed on the activation matrix before sphere-projection; it reflects both directional spread and how much the residual norms vary across tokens. Normed effective rank is computed on the sphere-projected activations and measures purely directional spread. For the "clustering on the sphere" story, which is what the paper's dynamics are about, normed is the conceptually right quantity, but the degeneracy gate uses raw effective rank because it is more conservative (it folds in norm collapse too, avoiding false negatives).

The code: the metric and the gate

The metric itself is five lines:

```python
import numpy as np
import torch
from scipy.linalg import svdvals

def effective_rank_from_raw(activations: torch.Tensor) -> float:
    """
    Spectral entropy of singular values.

    SVD runs on raw activations, not L2-normed ones. L2 normalization sets
    every token's norm to 1, collapsing the inter-token scale variation that
    the singular values measure, so svdvals(normed) gives a different answer.
    Named _from_raw to make the contract explicit at call sites.
    """
    sv = svdvals(activations.numpy())
    sv = sv[sv > 1e-10]        # drop numerical zeros
    sv_norm = sv / sv.sum()    # normalize to a probability distribution
    entropy = -np.sum(sv_norm * np.log(sv_norm + 1e-12))
    return float(np.exp(entropy))
```

Implementation note: squared vs. unsquared. The concept box above, in an earlier draft, wrote $p_k = \sigma_k^2 / \sum_j \sigma_j^2$, i.e. normalizing the variances (the eigenvalues of $X^\top X$). The code normalizes the singular values directly, `sv / sv.sum()`. These are not the same function; the squared form puts more weight on the leading directions and reports a lower effective rank.
The implemented version is the original Roy–Vetterli (2007) definition. If you want the variance-weighted version, square `sv` before the normalize line. The findings below are computed with the unsquared form as shown.

Two rank variants are now computed per layer. The raw rank drives every degeneracy gate; the normed rank (directional spread on the sphere, independent of residual-norm growth) is kept for reporting:

```python
# Raw: captures scale + directional collapse; used for all gates.
# Normed: directional spread on the sphere only; for reporting.
lr["effective_rank"] = effective_rank_from_raw(activations)
lr["effective_rank_normed"] = effective_rank_from_normed(normed)
```

The gate is then applied at the call site. Effective rank is computed first, then used to decide whether CKA is even meaningful at this layer:

```python
from core.config import DEGENERATE_RANK_THRESHOLD  # = 2

# CKA vs previous layer: suppressed when the cloud is a near-point-mass.
# Centering then produces noise-dominated vectors and the Frobenius norms
# collapse to near-zero, so the ratio is numerically meaningless.
if prev_normed is not None and lr["effective_rank"] >= DEGENERATE_RANK_THRESHOLD:
    lr["cka_prev"] = linear_cka(normed, prev_normed)
else:
    lr["cka_prev"] = float("nan")
```

The gate is a single shared constant, `DEGENERATE_RANK_THRESHOLD = 2`, used identically by CKA, NN-stability, and energy-drop suppression. It was previously split (CKA gated at `>= 3.0`, NN at `>= 2.0`); unifying it at 2 matches the geometric claim in the concept box above. At rank 2 the cloud is still genuinely 2-D on the sphere (two clusters separated by a great circle), so CKA and NN remain meaningful, while rank ≈ 1 is the point-mass regime where they become float-noise. Degenerate layers drop out of the plateau analysis rather than contributing false near-1 values. Raise the threshold in one place if post-rerun rank-2 CKA turns out to be erratic.
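
The gate wraps a CKA call whose body is not shown in this post. As a rough guide to what is being gated, here is a minimal sketch of the standard feature-space form of linear CKA. It is an assumption that the repository's `linear_cka` follows this form; the name `linear_cka_sketch`, the `1e-12` guard, and the random test matrices are all hypothetical. The last call shows why a collapsed cloud has to be suppressed: after centering there is nothing left to compare.

```python
import numpy as np


def linear_cka_sketch(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two (n_tokens, d) representations of the same tokens."""
    x = x - x.mean(axis=0, keepdims=True)    # center each feature over tokens
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    denom = np.linalg.norm(x.T @ x, ord="fro") * np.linalg.norm(y.T @ y, ord="fro")
    if denom < 1e-12:                        # collapsed cloud: the ratio is undefined
        return float("nan")
    return float(cross / denom)


rng = np.random.default_rng(0)
layer_a = rng.normal(size=(20, 8))
rotation, _ = np.linalg.qr(rng.normal(size=(8, 8)))   # random orthogonal matrix
layer_b = 3.0 * layer_a @ rotation                     # same geometry, rotated and rescaled
layer_c = rng.normal(size=(20, 8))                     # unrelated geometry
collapsed = np.tile(layer_a[0], (20, 1))               # every token identical

print(linear_cka_sketch(layer_a, layer_b))       # invariant to rotation and scale
print(linear_cka_sketch(layer_a, layer_c))
print(linear_cka_sketch(collapsed, collapsed))   # nan: nothing survives centering
```

The explicit guard here plays the same role as the rank gate above. Gating on effective rank is the more conservative choice, because it also catches clouds that are nearly but not exactly collapsed, where the ratio is still computable but dominated by rounding noise.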
How we know the metric is correct (unit tests)

Two degenerate cases pin the behavior at both ends of the range:

```python
import pytest

# Excerpt: both are methods of a test class; the fixtures and D are defined elsewhere.

def test_rank1_matrix_returns_one(self, rank1_tensor):
    # Every row identical (outer product v⊗w) → one non-zero singular value
    # → entropy 0 → effective rank 1.
    rank = effective_rank_from_raw(rank1_tensor)
    assert rank == pytest.approx(1.0, abs=1e-4)

def test_uniform_sv_returns_d(self, uniform_sv_tensor):
    # D equal singular values (orthonormal columns from QR) → entropy log D
    # → effective rank exp(log D) = D.
    rank = effective_rank_from_raw(uniform_sv_tensor)
    assert rank == pytest.approx(D, abs=0.5)
```

A fully collapsed cloud returns 1.0; a maximally spread one returns $d$. Anything observed in between is a real interpolation, not an artifact of the estimator.

What came out: Effective rank tracks mass-near-1 inversely. As the spike at 1 grows, the cloud collapses onto fewer directions, exactly as it should if the two are measuring the same underlying process from different angles. The architecture split reappears here. ALBERT-base on a Wikipedia paragraph collapses to MinRank ~1.4 (essentially a point). ALBERT-xlarge on long prompts stays above ~55; it never enters the degenerate regime within its depth. The GPT-2 family collapses hard: gpt2-small to ~1.59, gpt2-large to ~5.45. The gate fires often enough on the collapsing models that their late layers are correctly excluded from CKA and NN-stability rather than reported as spurious stability.

HDBSCAN is a density-based clustering algorithm that discovers the number of clusters from the data and assigns a "noise" label to tokens that don't belong to any cluster.

Why this instrument: Effective rank tells you the cloud is collapsing onto fewer dimensions. HDBSCAN asks whether that collapse is structured into distinct groups. It's the primary cluster-membership method throughout the analysis, for two reasons.
First, we don't know the number of clusters in advance. It varies by model, layer, and prompt, and an algorithm that requires $k$ upfront would require us to assume the answer. Second, "this token belongs to no cluster yet" is a meaningful state during metastability, and HDBSCAN can say that honestly rather than forcing every token into the nearest group.

What is HDBSCAN, and why does it handle this setting better than k-means?

You have $n$ token vectors on the sphere at a given layer. You want to know: how many distinct groups are there, and which tokens belong to which group? The answer should handle:

- an unknown number of clusters,
- tokens that do not belong to any cluster yet,
- clusters of different shapes, sizes, and densities.

K-means fails on all three counts. K-means partitions $n$ points into exactly $k$ clusters by minimizing the sum of squared distances from each point to its assigned centroid. The problems in this setting:

- Requires $k$ upfront. During metastability the number of clusters is the quantity we're trying to measure. Specifying it is assuming the answer.
- Forces every point into a cluster. A token in transition between two clusters (on the boundary, not yet committed) gets assigned to whichever centroid is closer. This manufactures false membership.
- Assumes spherical, equally-sized clusters. K-means Voronoi cells are convex and symmetric. Real token clusters on the sphere can be elongated, have different densities, and different sizes. K-means will split large clusters and merge small ones to equalize sizes.
- Non-deterministic. K-means depends on centroid initialization and can find different local minima on different runs. This makes results harder to reproduce and interpret.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) takes a different approach: define clusters as dense regions separated by sparse regions. The core insight: if tokens are genuinely clustering, there should be regions of high local density (the cluster cores) separated by low-density gaps.
A point is a "core point" if it has at least minPts neighbors within distance $\epsilon$. Clusters grow by linking neighboring core points. Points reachable from a core point but not core themselves are border points. Points in no core point's neighborhood are noise (label = $-1$). This handles variable-shape clusters naturally: density is local, so elongated clusters work fine. And the noise label is honest: it doesn't force boundary tokens into clusters.

The problem with vanilla DBSCAN: it uses a flat density threshold $\epsilon$ everywhere. This fails when clusters have different densities. A threshold that captures a sparse cluster will merge two nearby dense clusters; a threshold that separates those dense clusters will dissolve the sparse one.

HDBSCAN (McInnes and Healy, 2017) fixes this by building a full density hierarchy.

**Step 1: Mutual reachability distance.** For each point, compute its core distance $d_{\text{core}}(p)$, the distance to its $k$-th nearest neighbor. The mutual reachability distance between points $p$ and $q$ is:

$$d_{\text{mreach}}(p, q) = \max\{\, d_{\text{core}}(p),\; d_{\text{core}}(q),\; d(p, q) \,\}$$

In sparse regions, distances are "stretched" (replaced by the larger core distances), so sparse points are pushed apart. In dense regions, points are already closer than their core distances, so nothing changes. The result is a distance metric robust to density variation.

**Step 2: Minimum spanning tree (MST).** Build the MST on the full $n \times n$ mutual reachability distance matrix. This is the skeleton of the density structure.

**Step 3: Build the condensed cluster tree.** Simulate removing MST edges in order of decreasing mutual reachability distance (equivalently: increasing density threshold), tracking which clusters split off. This produces a dendrogram of cluster births and deaths.

**Step 4: Extract stable clusters.** Each branch has a "persistence": how long it survives as a distinct cluster as the threshold rises. The algorithm selects the most persistent set of non-overlapping clusters.
Branches that split off and die quickly are absorbed back as noise.

The result is cluster assignments for tokens in stable dense regions, a noise label ($-1$) for tokens in low-density regions or transitions, and a cluster count that was *discovered*, not specified.

During metastability, some tokens have committed to a cluster and others haven't yet. The noise label captures this. A layer where 30% of tokens are labeled noise isn't a failure of the algorithm; it's a signal that the clustering process is mid-transition. Tracking the noise fraction across layers is itself a diagnostic.

We apply HDBSCAN to the sphere-projected token vectors using cosine distance $1 - \langle x_i, x_j \rangle$ rather than Euclidean distance. This is appropriate because the tokens live on $\mathbb{S}^{d-1}$ and the paper's dynamics are stated in terms of angular relationships.

For most models, we cross-validate the cluster count from HDBSCAN against the spectral eigengap method (Group B). For ALBERT-xlarge, the spectral method fails: a zero mode dominates the Laplacian and it returns $k = 1$ at every layer regardless of actual structure. HDBSCAN doesn't have this failure mode, because it operates on local pairwise distances, not global eigenstructure. So for ALBERT-xlarge, all cluster counts, merge events, and plateau identifications are sourced from HDBSCAN. Any `nMerges = 0` in a spectral-sourced column for ALBERT-xlarge is an artifact of spectral degeneracy; genuine merges are recorded by the HDBSCAN trajectory tracker (see Group B).
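The mutual reachability distance from Step 1 is easy to check by hand on a toy example. The sketch below is a pure-Python illustration, not the `hdbscan` library's implementation: the helper names, the five toy points, and the convention that the core distance excludes the point itself are all assumptions made for this example.

```python
import math

def cosine_distance(u, v):
    """1 - <u, v> for unit vectors; clipped at 0 to absorb round-off."""
    return max(0.0, 1.0 - sum(a * b for a, b in zip(u, v)))

def core_distances(dist, k):
    """Distance from each point to its k-th nearest neighbor (self excluded)."""
    n = len(dist)
    return [sorted(dist[i][j] for j in range(n) if j != i)[k - 1] for i in range(n)]

def mutual_reachability(dist, k):
    """d_mreach(p, q) = max(core(p), core(q), d(p, q)); 0 on the diagonal."""
    core = core_distances(dist, k)
    n = len(dist)
    return [[0.0 if i == j else max(core[i], core[j], dist[i][j])
             for j in range(n)] for i in range(n)]

def unit(angle_deg):
    """Point on the unit circle at the given angle."""
    a = math.radians(angle_deg)
    return (math.cos(a), math.sin(a))

# Hypothetical toy data: a dense triple near 0 degrees and a sparse pair.
points = [unit(0), unit(2), unit(4), unit(90), unit(120)]
dist = [[cosine_distance(p, q) for q in points] for p in points]
mreach = mutual_reachability(dist, k=2)

# The sparse pair (indices 3 and 4) is pushed apart: each has only one close
# neighbor, so its 2nd-nearest-neighbor core distance exceeds their raw distance.
sparse_pair_stretched = mreach[3][4] > dist[3][4]
```

Mutual reachability never shrinks a distance, so `mreach[i][j] >= dist[i][j]` for every pair; the interesting entries are the ones where the inequality is strict, which is where the density correction is doing work.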
</details>

<details><summary>The code: HDBSCAN on cosine distance, with the noise label</summary>

Cosine distance is precomputed from the normed activations, then handed to HDBSCAN as a precomputed metric:

```python
import numpy as np
from sklearn.metrics import pairwise_distances
import hdbscan

# 1 − ⟨x_i, x_j⟩ on the sphere; clip removes tiny negative round-off
cos_dist = np.clip(pairwise_distances(normed, metric="cosine"), 0, None)
hdb = hdbscan.HDBSCAN(min_cluster_size=2, metric="precomputed")
hdb_labels = hdb.fit_predict(cos_dist.astype(np.float64))   # −1 = noise

# Cluster count excludes the noise label
n_clusters = len(set(hdb_labels)) - (1 if -1 in hdb_labels else 0)
```

Three things to read off this:

- `metric="precomputed"` means HDBSCAN never sees the vectors, only the cosine-distance matrix. The angular geometry is the only thing it clusters on, consistent with the paper's inner-product dynamics.
- `min_cluster_size=2` is deliberately permissive: a "cluster" can be a single merged pair, which is what an early merge event looks like.
- `n_clusters` subtracts 1 if and only if the noise label is present. The noise tokens (`-1`) are counted separately, never folded into a cluster.

The noise fraction, `(hdb_labels == -1).mean()`, is tracked per layer as the mid-transition signal described above. The labels array is what feeds every downstream consumer: the trajectory tracker that records merges across layers (Group B), the per-pair semantic/artifact tagging, and the multi-scale nesting analysis all take `hdb_labels` as their starting point.

</details>

<details><summary>The code: why not k-means, the test that pins the failure modes</summary>

The cluster-count sweep runs both HDBSCAN and a k-means silhouette search, which makes the k-means failure modes directly observable in the tests:

```python
def test_antipodal_kmeans_best_k_is_2(self, antipodal_normed):
    """Two tight clusters at opposite poles. Silhouette peaks at k=2."""
    result = cluster_count_sweep(antipodal_normed)
    assert result["kmeans"]["best_k"] == 2

def test_collapsed_agglomerative_count_is_1(self, collapsed_normed):
    """Fully collapsed cloud: every cosine distance ~1e-6, below all thresholds.
    Must return a single cluster."""
    result = cluster_count_sweep(collapsed_normed)
    assert result["agglomerative"][MID_THRESH] == 1
```

K-means recovers the right answer only when you already know to look for $k = 2$ and the clusters happen to be the clean, equal-sized, antipodal case it assumes. HDBSCAN gets the count without that assumption, which is why it, not k-means, is the authoritative membership source throughout.

What came out: At the plateau layers identified in Group B, HDBSCAN recovers semantically coherent groups reproducibly; the discovered clusters are not run-to-run artifacts. ALBERT-xlarge trajectories are shorter and noisier than GPT-2's, consistent with its slower dynamics: the clusters are still forming and dissolving rather than locking in. For ALBERT-xlarge specifically, HDBSCAN is the authoritative cluster count, because the spectral method degenerates to $k=1$ there and would otherwise report no structure at all.

Code excerpts are from `core/models.py` (`layernorm_to_sphere`), `core/metrics.py` (`pairwise_inner_products_from_gram`, `effective_rank_from_raw`), `core/clustering.py` (`cluster_count_sweep`), `analysis.py` (per-layer loop and gates), and the Phase 1 test suite. Snippets are lightly trimmed (comments and unrelated branches removed) but otherwise verbatim.

Finding: CONFIRMED. Clustering is universal: every model develops the migrating histogram spike at +1, and effective rank falls inversely as mass-near-1 rises, two instruments measuring the same process from different angles. The split is in the endpoint. ALBERT-base drives mass-near-1 to ~1.0 and effective rank to ~1.4 (full collapse to a point).
ALBERT-xlarge, the very model the paper's Figure 1 uses, stays below ~0.30 mass-near-1 and above ~55 effective rank even at 48 iterations: its dynamics are slow enough that 48 layers is not enough to collapse it. The GPT-2 family clusters hard on real prompts (gpt2-small reaches ~0.87–0.97) but does not collapse its degenerate control, which is the expected signature of structure-driven clustering. At the plateau layers (Group B), HDBSCAN recovers semantically coherent groups reproducibly, not run-to-run artifacts. For ALBERT-xlarge specifically, HDBSCAN is the authoritative cluster count, because the spectral method degenerates to $k = 1$ there.

Question: once clusters form, do they stay stable across multiple layers, or do they immediately dissolve? This is the operational test of metastability.

Group A established that clusters form. That alone is not metastability. Metastability is a claim about *time*: the structure has to survive for a stretch of layers (a plateau) before the slow merging finishes the collapse. If clusters formed and immediately dissolved, the histogram spike would still appear, but there would be no metastable window. So Group B measures persistence directly, from four angles.

Two of the instruments measure whether the representation is *holding still*: CKA compares the whole geometry of one layer to the next, and nearest-neighbor stability asks the same question at the discrete level of individual token relationships. The third turns "holding still" into a concrete window: the plateau-detection routine that decides where a metastable stretch starts and ends. The fourth follows individual clusters through those windows and records when they merge, which is the actual mechanism the slow timescale is made of.
Everything here reads from the same per-layer objects as Group A (the sphere-projected activations, the Gram matrix, and the HDBSCAN labels) plus one new cross-layer dependency: each instrument compares layer $L$ to layer $L-1$, so the analysis loop carries the previous layer's state forward.

At each layer, we compare the entire representational geometry to the previous layer with a single similarity score in $[0, 1]$. A run of values near 1 is a plateau; a sharp drop marks its end.

Why this instrument: "Do clusters persist" is a question about whether the layer-to-layer map is approximately the identity over some stretch. CKA answers it at the level of the whole representation rather than any one cluster: it asks whether the pairwise-similarity structure of all tokens is preserved from one layer to the next. It is the primary plateau signal because it is basis-free and scale-invariant: it does not care how the residual stream rotates or rescales between layers, only whether the relational geometry is the same.

<details><summary>What is CKA, what is HSIC, and what does it actually hold invariant?</summary>

At layer $L$ you have an $n \times d$ activation matrix $X$ (sphere-projected). At layer $L-1$ you have $Y$, same shape. You want one number saying how similar these two representations are, not token-by-token, but as geometries.

The natural object is the representational similarity matrix (RSM): the $n \times n$ Gram matrix $XX^\top$, whose $(i,j)$ entry is $\langle x_i, x_j \rangle$. Two representations are "the same geometry" if their RSMs agree, regardless of how the underlying coordinates are oriented. This is exactly the inner-product matrix from Group A; CKA is built on top of the same object.

The Hilbert-Schmidt Independence Criterion (HSIC) measures whether two sets of features are statistically related. For linear kernels $K_X = XX^\top$ and $K_Y = YY^\top$, the unnormalized linear HSIC reduces to $\operatorname{tr}(K_X K_Y) = \|Y^\top X\|_F^2$ after centering.
This is the alignment between the two RSMs: it is large when pairs of tokens that are close in $X$ are also close in $Y$.

HSIC on its own scales with the magnitudes of $X$ and $Y$, so it is not comparable across layers. CKA normalizes it by the self-similarities:

$$\text{CKA}(X, Y) = \frac{\|Y^\top X\|_F^2}{\|X^\top X\|_F \, \|Y^\top Y\|_F}$$

The result is in $[0, 1]$: 1 means the two RSMs are identical up to an orthogonal transform and isotropic scaling; 0 means they are orthogonal.

This invariance is the property that makes CKA the right plateau instrument. A CKA plateau near 1 means: across these layers, the model is rotating and rescaling the token cloud but not reorganizing which tokens are close to which. That is precisely the metastable picture: the clusters are fixed, the dynamics are idling.

CKA is computed on mean-centered activations. Without centering, a single large shared component (a token like `[CLS]` that every representation has a big projection onto) inflates the similarity regardless of the actual cluster structure. Centering removes that shared offset so CKA reflects the relational structure, not a common bias.

The code: the metric and the plateau read-off

The metric is the normalized HSIC formula, verbatim:

```python
def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """
    CKA(X, Y) = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)

    X, Y : (n_tokens, d), already L2-normed; both mean-centered internally.
    """
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)

    # ||Y^T X||_F^2 = tr(X^T Y Y^T X) = ||X.T @ Y||_F^2
    YtX = Y.T @ X                                   # (d, d)
    numerator = float(np.sum(YtX ** 2))
    XtX_norm = float(np.linalg.norm(X.T @ X, "fro"))
    YtY_norm = float(np.linalg.norm(Y.T @ Y, "fro"))
    denom = XtX_norm * YtY_norm
    if denom < 1e-12:
        return float("nan")
    return float(np.clip(numerator / denom, 0.0, 1.0))
```

In the per-layer loop it is called only when the layer is non-degenerate (the Group A gate), so collapsed layers return `nan` rather than a spurious 1.0:

```python
from core.config import DEGENERATE_RANK_THRESHOLD   # = 2

if prev_normed is not None and lr["effective_rank"] >= DEGENERATE_RANK_THRESHOLD:
    lr["cka_prev"] = linear_cka(normed, prev_normed)
else:
    lr["cka_prev"] = float("nan")
```

A CKA plateau is then just a flat run of this series. The plateau detector (see B.3) is run on the CKA values with a tight tolerance, because CKA near 1 should be very flat inside a real metastable window:

```python
cka_pairs = [(r["layer"], r["cka_prev"]) for r in layers
             if not np.isnan(r["cka_prev"])]     # drop layer 0 + degenerate layers
cka_series = [v for _, v in cka_pairs]
plateaus_cka = detect_plateaus(cka_series, window=2, tol=0.02)
```

The end of a plateau is flagged separately as the sharpest single-step drop: the layer where consecutive-layer CKA falls the most is the clearest signal that a metastable window has ended:

```python
diffs = np.diff(valid_vs)
drop_pos = int(np.argmin(diffs))
if diffs[drop_pos] < -0.05:
    severity = "SHARP" if diffs[drop_pos] < -0.15 else "MILD"
    # marks L → L+1 as the plateau-ending reorganization
```

What came out: CKA shows reproducible flat-near-1 runs punctuated by sharp drops, the plateau-and-merge signature the metastability conjecture predicts. The plateaus line up with the windows where the other signals (mass-near-1, spectral-k, HDBSCAN-k) are also flat, which is what gives the plateau identification in B.3 something to agree on.
Late degenerate layers correctly drop out as `nan` rather than producing a trivial CKA = 1, so a collapsed model does not masquerade as a long stable plateau. Note: some report headers still print "effective rank < 3" as the suppression label; that is stale display text, and the active gate is `DEGENERATE_RANK_THRESHOLD = 2`.

At each layer, for every token we record which other token is its nearest neighbor. NN-stability is the fraction of tokens whose nearest neighbor is unchanged from the previous layer.

Why this instrument: CKA is a continuous, global measure; it can sit at 0.97 while the fine-grained cluster membership quietly churns underneath. NN-stability is the discrete complement: it tracks identities, not geometry. A token's nearest neighbor either changed or it didn't. This catches reorganization that CKA averages away: if tokens are swapping partners while the overall geometry stays roughly fixed, CKA stays high but NN-stability drops. Used together, a plateau where both are high is a much stronger persistence claim than either alone.

<details><summary>What NN-stability catches that CKA misses</summary>

For each token $i$ at a layer, its nearest neighbor is $\text{NN}(i) = \arg\max_{j \neq i} \langle x_i, x_j \rangle$, the token it points most directly at on the sphere. NN-stability between layer $L-1$ and $L$ is the fraction of tokens whose nearest neighbor index is identical:

$$\text{NN-stability} = \frac{1}{n} \sum_i \mathbf{1}\left[\text{NN}_L(i) = \text{NN}_{L-1}(i)\right]$$

1.0 means every token kept the same nearest neighbor: a perfectly locked plateau. 0.0 means every token's nearest neighbor changed: the cloud is still reorganizing.

CKA is invariant to rotation and measures aggregate relational geometry. Two failure modes it cannot see:

Because each token has exactly one nearest neighbor, the NN map defines a functional graph (every node has out-degree 1).
The cycles of this graph are the stable atoms: a mutual pair $i \leftrightarrow j$ is a 2-cycle, and longer cycles are tightly bound groups. At plateau layers, these cycles are the operational definition of a "stable cluster" used in the token-membership reporting, and they are tagged as **semantic** (members are distinct token strings: structurally parallel tokens) or **duplicate** (members are the same token string: positional copies of one word). The semantic-vs-duplicate split keeps positional repeats from being counted as meaningful clusters.

When the cloud collapses to a near-point-mass, every token is nearly equidistant from every other, and which one is "nearest" is decided by floating-point noise at the $10^{-6}$ level. NN-stability computed there is meaningless. So, like CKA, it is suppressed below the rank threshold.

The code: NN indices, the per-layer stability, and degeneracy suppression

The nearest neighbor of each token is one masked argmax on the Gram matrix:

```python
def nearest_neighbor_indices(G: np.ndarray) -> np.ndarray:
    """nn[i] = argmax_{j≠i} G[i, j], from a pre-computed Gram matrix."""
    G_masked = G.copy()
    np.fill_diagonal(G_masked, -np.inf)   # exclude self (G[i,i] = 1)
    return np.argmax(G_masked, axis=1).astype(np.int32)
```

Stability is the fraction unchanged from the previous layer, computed inline in the loop (layer 0 has no predecessor, so it is `None`):

```python
nn = nearest_neighbor_indices(G)
lr["nn_indices"] = nn.tolist()
if prev_nn is not None:
    lr["nn_stability"] = float(np.mean(nn == prev_nn))   # the indicator-mean formula
else:
    lr["nn_stability"] = None
prev_nn = nn
```

When NN-stability plateaus are extracted, degenerate layers are dropped first: the same rank gate as CKA, on the unified constant:

```python
nn_stab_defined = [
    (i, v) for i, v in enumerate(nn_stab_vals)
    if v is not None
    and layers[i]["effective_rank"] >= DEGENERATE_RANK_THRESHOLD
]
nn_series = [v for _, v in nn_stab_defined]
nn_plateaus = detect_plateaus(nn_series, window=2, tol=0.02)
```

The semantic/duplicate split is read off the NN functional graph at plateau layers: cycles whose members are distinct token strings are semantic; cycles of identical strings are positional duplicates. That tagging is what lets the report distinguish a real structural cluster from the same word appearing three times.

What came out: NN-stability plateaus coincide with CKA plateaus on the real prompts. The discrete and continuous measures agree on where the metastable windows are, which is the cross-check that makes the persistence claim credible rather than an artifact of one metric. At those layers the stable NN-cycles are predominantly semantic (distinct tokens drawn together) rather than duplicate, so the plateaus reflect structural grouping, not repeated tokens. Degenerate late layers are suppressed, so collapsed models do not report spurious perfect stability.

A plateau is a contiguous run of layers over which a signal stays approximately flat. This routine turns the continuous traces above into concrete metastable windows with a start, an end, and a width.

Why this instrument: "Metastability" only becomes testable once "the structure persists for a while" is an actual interval of layers. Plateau identification is the bridge from traces to windows. It is deliberately simple and applied identically to every signal, so a plateau in mass-near-1, a plateau in CKA, and a plateau in cluster count are detected by the same rule and can be compared.

<details><summary>What counts as a plateau, and how the signals are and aren't combined</summary>

A window of layers is a plateau if the signal's relative span across it stays below a tolerance. For a segment of values $v$, the criterion is

$$\frac{\max(v) - \min(v)}{|\bar{v}|} < \text{tol}$$

where $\bar{v}$ is the segment mean (the code adds $10^{-8}$ to the denominator to guard against division by zero). Using relative span (dividing by the mean) makes the criterion scale-appropriate per signal: a tolerance that is sensible for a cluster count of 6 would be far too loose for a CKA value of 0.97. The detector starts from a minimum width, then greedily extends the window as long as the flatness criterion still holds, and records (start, end, mean).
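To make the criterion concrete, here is a small worked check of the relative-span rule on made-up windows. This is a sketch, not the project's detector: `relative_span` is a hypothetical helper, the traces are invented for illustration, and the tolerances 0.02 and 0.10 are the ones this post quotes for CKA and mass-near-1.

```python
def relative_span(values):
    """(max - min) / (|mean| + 1e-8): the flatness measure behind the plateau test."""
    mean = sum(values) / len(values)
    return (max(values) - min(values)) / (abs(mean) + 1e-8)

# Hypothetical traces, not measured values.
cka_window = [0.97, 0.98, 0.975]    # flat run near 1
cka_drop = [0.97, 0.98, 0.90]       # same run ending in a sharp drop
mass_window = [0.40, 0.42, 0.41]    # mass-near-1 idling between merges
mass_jump = [0.40, 0.42, 0.60]      # mass-near-1 jumping at a merge

cka_flat = relative_span(cka_window) < 0.02       # about 0.010: plateau
cka_broken = relative_span(cka_drop) < 0.02       # about 0.084: not a plateau
mass_flat = relative_span(mass_window) < 0.10     # about 0.049: plateau
mass_broken = relative_span(mass_jump) < 0.10     # about 0.42: not a plateau
```

The same absolute wobble of 0.01 to 0.02 passes comfortably for both signals, while a single merge-sized jump fails either tolerance, which is the behavior the per-signal tolerances are meant to produce.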
Because the signals live on different scales, each gets its own tolerance:

| Signal | tol | Why |
|---|---|---|
| mass-near-1 | 0.10 | fraction in [0,1], moves in chunks at merges |
| effective rank | 0.05 | smooth, want tight flatness |
| spectral-k / hdbscan-k | 0.5 | integer counts; 0.5 tolerates no real change |
| CKA | 0.02 | near 1 inside a plateau; should be very flat |
| NN-stability | 0.02 | same reasoning as CKA |

This is worth stating precisely, because the intent and the implementation differ.

The authoritative plateau set stored on each run (`plateau_layers`, step P1-7) is computed from **mass-near-1 alone**, with tol 0.10. That single signal defines the windows that downstream analyses (semantic coherence at plateau layers, the two-timescale numerator) actually use.

Separately, a multi-signal vote is computed: for each layer, count how many of {mass, rank, spectral-k, hdbscan-k, fiedler, CKA} have a plateau covering it, and flag layers covered by ≥2. This stricter "joint" set is used in the flagged-anomalies cross-check, not as the operational plateau set.

So the "we require all signals jointly" framing is aspirational relative to the current code: the operational windows are mass-near-1-driven, with the multi-signal agreement serving as a confirmation layer. The Fix 7 sensitivity script goes further: it sweeps a genuine joint criterion (`cka_thresh`, `nn_thresh`, `vote_frac`, `min_len`) to measure how much the plateau widths move under different definitions. The width uncertainty from that sweep is what feeds the error band on the two-timescale ratio. Until that rerun lands, plateau widths should be read as mass-near-1 windows, not consensus windows.

The onset layer of the first plateau, compared across prompts for a fixed model, is itself a diagnostic. If onset barely moves with the prompt (SD < 2 layers), the plateau is a weight-level property: the architecture clusters at a fixed depth regardless of input.
If onset moves a lot (SD ≥ 2), clustering is content-driven. This is the measurement behind the "attractor-driven vs. content-driven" split in the synthesis.

</details>

<details><summary>The code: the detector, the operational set, and the joint vote</summary>

The detector is one greedy scan, identical for every signal:

```python
def detect_plateaus(values: list, window: int = 2, tol: float = 0.05) -> list:
    """Contiguous windows where relative span < tol. Returns [(start, end, mean)]."""
    plateaus = []
    n = len(values)
    i = 0
    while i < n - window:
        segment = values[i:i + window + 1]
        span = max(segment) - min(segment)
        ref = abs(np.mean(segment)) + 1e-8
        if span / ref < tol:                        # the flatness criterion
            j = i + window
            while j < n - 1:                        # greedily extend
                extended = values[i:j + 2]
                if (max(extended) - min(extended)) / (abs(np.mean(extended)) + 1e-8) < tol:
                    j += 1
                else:
                    break
            plateaus.append((i, j, float(np.mean(values[i:j + 1]))))
            i = j + 1
        else:
            i += 1
    return plateaus
```

The operational plateau set is mass-near-1 only (this is the `plateau_layers` every downstream consumer reads):

```python
# Post-loop (P1-7): the authoritative plateau set
mass1 = [r["ip_mass_near_1"] for r in results["layers"]]
plateaus = detect_plateaus(mass1, window=2, tol=0.10)
plateau_layer_set = set()
for s, e, _ in plateaus:
    for l in range(s, e + 1):
        plateau_layer_set.add(l)
results["plateau_layers"] = sorted(plateau_layer_set)
```

The multi-signal vote is computed separately, as a stricter cross-check (note it is *not* what `plateau_layers` uses):

```python
from collections import Counter

layer_count = Counter()
for group in [plateaus_mass, plateaus_rank, plateaus_spk, plateaus_hdb, plateaus_fied]:
    for s, e, _ in group:
        for l in range(s, e + 1):
            layer_count[l] += 1
for s, e, _ in plateaus_cka:            # CKA indexed by position → remap to layer
    for idx in range(s, e + 1):
        layer_count[cka_pairs[idx][0]] += 1
multi = [l for l, c in sorted(layer_count.items()) if c >= 2]   # ≥2 signals agree
```

Prompt sensitivity turns the onset layer into the weight-level/content-driven verdict:

```python
onset_vals = [first_plateau_onset(run) for run in runs_of_this_model]
sd = float(np.std(onset_vals))
classification = "weight-level" if sd < 2.0 else "content-driven"
```

</details>

What came out: Plateaus are reproducible: the same model and prompt yield the same windows across runs. The prompt-sensitivity split is the load-bearing result: the GPT-2 family clusters at weight-level onset (low SD across prompts; the architecture has a fixed clustering depth), while BERT's onset is content-driven (it moves with the input). This is the empirical basis for the two-regime story in the synthesis. The honest caveat carried forward: plateau widths are currently mass-near-1 windows; the consensus-window definition and its uncertainty band are pending the Fix 7 sensitivity rerun, and the two-timescale ratio inherits that uncertainty.

Following individual HDBSCAN clusters layer by layer: matching each cluster to its continuation in the next layer, and recording when one is born, dies, or merges into another.

Why this instrument: The plateau signals say *that* structure persists. Trajectory tracking says *which* structure persists and *how it ends*. The slow timescale of metastability is literally a sequence of pairwise merges (two clusters becoming one), and this is the instrument that records each merge as a discrete event at a specific layer transition. It is also the only correct merge-detection method for ALBERT-xlarge, where the spectral cluster count degenerates (see Group A): the trajectory tracker works from token membership, not from the Laplacian spectrum, so it still records merges when the spectral method reports a flat $k=1$.

<details><summary>How clusters are matched across layers, and what a merge is</summary>

At layer $L$ HDBSCAN found some clusters; at layer $L+1$ it found some clusters. Which layer-$L$ cluster is which layer-$(L{+}1)$ cluster? They are the same cluster if they contain mostly the same tokens. The overlap measure is Jaccard:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

computed over token membership, with HDBSCAN noise tokens (label $-1$) excluded.
Given the full matrix of Jaccard overlaps between every layer-$L$ cluster and every layer-$(L{+}1)$ cluster, the optimal one-to-one matching is the assignment that maximizes total overlap. That is the Hungarian algorithm (`linear_sum_assignment`), run on the negated overlap matrix because the routine minimizes cost. Matches below a minimum Jaccard (0.1) are discarded as too weak to be continuations.

So the description is both "Jaccard overlap" and "Hungarian matching": Hungarian is the assignment procedure, Jaccard is the cost it optimizes.

After the optimal matching:

Matched clusters are chained across layers into trajectories. Each trajectory is a sequence of (layer, cluster id) pairs with a birth layer, a death (or end) layer, and a lifespan. Trajectory lifespans are the per-cluster version of the plateau width: a long-lived trajectory is a cluster that survived a long metastable stretch before merging.

Group A explained that ALBERT-xlarge's Laplacian has a dominant zero mode that pins the spectral cluster count at $k=1$ regardless of real structure. A merge-detection method based on spectral-$k$ drops would therefore report zero merges there, which is an artifact. Trajectory tracking works entirely from HDBSCAN token membership, which has no such failure mode, so its merge counts are the authoritative ones for ALBERT-xlarge. Any `nMerges = 0` in a spectral-sourced column for that model is the spectral artifact, not an absence of merges.
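A toy transition makes the matching and the merge rule easy to follow by hand. The sketch below is a standard-library illustration only: it replaces the Hungarian step with a brute-force search over permutations (fine for three clusters, hopeless at scale), and the helper names and the seven-token label arrays are hypothetical. The merge rule it applies is the one the tracker uses: an unmatched previous cluster whose best overlap is a current cluster that is already matched.

```python
from itertools import permutations

def jaccard(a, b):
    """|a ∩ b| / |a ∪ b| for two sets of token indices."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def clusters_from_labels(labels):
    """Map cluster id -> set of token indices, dropping noise (label -1)."""
    out = {}
    for token, label in enumerate(labels):
        if label != -1:
            out.setdefault(label, set()).add(token)
    return out

def best_matching(prev, curr, min_jaccard=0.1):
    """Brute-force optimal one-to-one matching (stand-in for the Hungarian step)."""
    prev_ids, curr_ids = sorted(prev), sorted(curr)
    size = max(len(prev_ids), len(curr_ids))
    # Pad the shorter side with None so every permutation is a full assignment.
    padded_prev = prev_ids + [None] * (size - len(prev_ids))
    padded_curr = curr_ids + [None] * (size - len(curr_ids))

    def score(p, c):
        return 0.0 if p is None or c is None else jaccard(prev[p], curr[c])

    best = max(permutations(padded_curr),
               key=lambda perm: sum(score(p, c) for p, c in zip(padded_prev, perm)))
    return [(p, c, score(p, c)) for p, c in zip(padded_prev, best)
            if p is not None and c is not None and score(p, c) >= min_jaccard]

# Hypothetical labels for 7 tokens at two consecutive layers.
labels_prev = [0, 0, 0, 1, 1, 2, -1]   # three clusters, one noise token
labels_curr = [0, 0, 0, 0, 0, 1, 1]    # previous clusters 0 and 1 now share a cluster

prev = clusters_from_labels(labels_prev)
curr = clusters_from_labels(labels_curr)
matches = best_matching(prev, curr)

# Merge rule: an unmatched previous cluster feeding an already-matched current one.
matched_prev = {p for p, _, _ in matches}
matched_curr = {c for _, c, _ in matches}
merges = []
for p in sorted(set(prev) - matched_prev):
    target = max(curr, key=lambda c: jaccard(prev[p], curr[c]))
    if jaccard(prev[p], curr[target]) >= 0.1 and target in matched_curr:
        merges.append((p, target))
```

Here previous cluster 0 continues as current cluster 0 (Jaccard 3/5), previous cluster 2 continues as current cluster 1 (Jaccard 1/2), and previous cluster 1 is left unmatched but overlaps current cluster 0, so it is recorded as merging into it.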
</details>

<details><summary>The code: Hungarian-on-Jaccard matching and merge detection</summary>

The overlap matrix is Jaccard over membership sets, noise excluded:

```python
def jaccard_overlap_matrix(labels_a, labels_b):
    ids_a = sorted(set(labels_a) - {-1})
    ids_b = sorted(set(labels_b) - {-1})
    sets_a = {c: set(np.where(labels_a == c)[0]) for c in ids_a}
    sets_b = {c: set(np.where(labels_b == c)[0]) for c in ids_b}
    overlap = np.zeros((len(ids_a), len(ids_b)))
    for i, ca in enumerate(ids_a):
        for j, cb in enumerate(ids_b):
            inter = len(sets_a[ca] & sets_b[cb])
            union = len(sets_a[ca] | sets_b[cb])
            overlap[i, j] = inter / union if union else 0.0
    return overlap, ids_a, ids_b
```

The matching is Hungarian on the negated overlap, then merges are the unmatched-prev-into-matched-curr case:

```python
from scipy.optimize import linear_sum_assignment

# maximum-weight assignment → minimize negated overlap (padded to square)
cost = np.zeros((size, size))
cost[:n_prev, :n_curr] = -overlap
row_ind, col_ind = linear_sum_assignment(cost)

matches = []
for r, c in zip(row_ind, col_ind):
    if r < n_prev and c < n_curr and overlap[r, c] >= min_jaccard:   # min_jaccard = 0.1
        matches.append((ids_prev[r], ids_curr[c], float(overlap[r, c])))

# merge: an unmatched prev cluster overlapping a curr cluster that is already matched
for up in unmatched_prev:
    best_j = int(np.argmax(overlap[ids_prev.index(up), :]))
    target = ids_curr[best_j]
    if overlap[..., best_j] >= min_jaccard and target in matched_curr:
        # record (prev clusters that fed target, target) as one merge event
```

Trajectories are chains of matched (layer, cluster id) tips, advanced one transition at a time.
The run-level summary is what the report prints:

```python
"summary": {
    "total_births":   total_births,
    "total_deaths":   total_deaths,
    "total_merges":   total_merges,
    "max_alive":      max_alive,
    "n_trajectories": len(traj_info),
    "mean_lifespan":  float(np.mean([t["lifespan"] for t in traj_info])),
    "max_lifespan":   max(t["lifespan"] for t in traj_info),
}
```

</details>

What came out: Merges are recorded as discrete events at specific layer transitions, and they accumulate with depth, which is the signature of the slow timescale. For ALBERT-base on a Wikipedia paragraph, the tracker finds on the order of 287 trajectories at 48 iterations with a mean lifespan around 5 layers, a max lifespan near 27, and roughly 57 merge events, with merge transitions concentrated in the early-to-mid layers and a long tail. The short-heterogeneous prompt produces far fewer, longer-lived trajectories (around 20, max lifespan 21): fewer tokens, cleaner clusters, slower merging. ALBERT-xlarge trajectories are shorter and noisier, consistent with its slower, not-yet-collapsed dynamics, and its merge counts come from this tracker rather than the degenerate spectral-$k$. The merge sequence is one of the artifacts handed forward: the layers where clusters merge are candidate sites for the Phase 2 energy-violation cross-referencing.

Code excerpts are from `core/metrics.py` (`linear_cka`, `nearest_neighbor_indices`), `analysis.py` (per-layer loop, P1-7 plateau set), `reporting.py` (`detect_plateaus`, multi-signal vote, prompt sensitivity), and `cluster_tracking.py` (`jaccard_overlap_matrix`, `match_layer_pair`, `track_clusters`). Snippets are lightly trimmed but otherwise verbatim. Trajectory counts are from the pre-fix cross-run report and will be re-derived after the Fixes 1–8 rerun.
Finding: CONFIRMED. CKA shows reproducible flat-near-1 runs punctuated by sharp drops, and the discrete NN-stability plateaus coincide with them on the real prompts. The continuous and discrete measures agree on where the metastable windows are, which is the cross-check that makes the persistence claim credible rather than an artifact of one metric. At those layers the stable nearest-neighbor cycles are predominantly semantic (distinct tokens drawn together), not positional duplicates, so the plateaus reflect structural grouping. Trajectory tracking records merges as discrete events that accumulate with depth, the signature of the slow timescale (for ALBERT-base on a Wikipedia paragraph: on the order of 287 trajectories at 48 iterations, mean lifespan ~5 layers, ~57 merges). The prompt-sensitivity split is the load-bearing result: GPT-2 onset is weight-level (low SD across prompts), BERT onset is content-driven.

Caveat (PROVISIONAL): plateau widths are currently mass-near-1 windows, not consensus windows; the joint-criterion definition and its uncertainty band await the Fix 7 sensitivity rerun, and the two-timescale ratio inherits that uncertainty.

Question: does the attention mechanism itself show the structure we'd expect if clustering is happening, or are the geometric clusters invisible to attention?

Groups A and B looked at the token cloud: where tokens are on the sphere and whether that arrangement persists. Group C looks at the other half of the layer, the attention matrix, and asks whether it carries the same story. This matters because in the theory, attention is not a bystander. Attention is the *force* that pulls tokens together; the clustering in the residual stream is supposed to be the consequence of attention routing similar tokens toward each other. So if the geometric clusters from Group A are real, attention should show a matching signature: concentrated, cluster-respecting routing rather than uniform mixing.
If attention looks uniform while the geometry clusters, the two are decoupled and the mechanistic picture is wrong.

The three instruments build up in specificity. Attention entropy asks the coarsest version: is attention concentrating at all? The Sinkhorn–Fiedler analysis asks whether the concentrated attention forms a cluster-separated graph matching the geometric clusters. The per-head classification asks which heads do this and whether they do it as a fixed property of the weights or in response to the input.

One structural caution sits over the entire group and is stated again at each instrument: GPT-2's attention is causally masked, and the mask alone can manufacture a cluster-like attention graph independent of content. The routing results for the GPT-2 family are therefore suspended pending the causal-mask baseline. The code to run that baseline exists (Fix 3); the runs do not yet.

For each attention head, we compute the Shannon entropy of each row of the attention matrix, averaged over rows. Low entropy means each token attends to a few others; high entropy means it spreads attention evenly.

Why this instrument: This is the most direct read on whether attention is concentrating. A row of the attention matrix is a probability distribution over which tokens this token attends to. If clustering is happening, that distribution should be peaked: tokens should attend to their cluster, not to everyone. Entropy measures exactly how peaked. It also connects to the theory's temperature parameter: in the paper, the inverse temperature $\beta$ controls how sharply attention concentrates, and the metastability conjecture holds for large $\beta$. Falling attention entropy across layers is the empirical signature of the model operating in the high-$\beta$ regime where metastability is predicted to appear.
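To make the entropy-versus-$\beta$ link concrete, here is a minimal sketch under stated assumptions: random unit vectors stand in for token representations, and the helper names `softmax_attention` and `mean_row_entropy` are illustrative, not taken from the repository. It builds attention weights proportional to $\exp(\beta \langle x_i, x_j \rangle)$ and shows the mean row entropy starting at $\log n$ for $\beta = 0$ and falling as $\beta$ grows.

```python
import numpy as np

def softmax_attention(X, beta):
    """Row-stochastic attention with weights proportional to exp(beta * <x_i, x_j>)."""
    logits = beta * (X @ X.T)
    logits = logits - logits.max(axis=-1, keepdims=True)  # stabilise the exponentials
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)

def mean_row_entropy(A, eps=1e-12):
    """Shannon entropy of each row, averaged over rows."""
    return float(-(A * np.log(A + eps)).sum(axis=-1).mean())

# Assumption: random points on the unit sphere stand in for real token states.
rng = np.random.default_rng(0)
X = rng.normal(size=(16, 8))
X /= np.linalg.norm(X, axis=1, keepdims=True)

betas = [0.0, 1.0, 5.0, 50.0]
entropies = [mean_row_entropy(softmax_attention(X, b)) for b in betas]
for b, h in zip(betas, entropies):
    print(f"beta={b:5.1f}  mean row entropy={h:.4f}")
```

For a fixed set of inner products the row entropy is non-increasing in $\beta \geq 0$, which is why a measured fall in entropy can be read as a rise in effective $\beta$. In a trained model the inner products also change from layer to layer, so the correspondence is a proxy rather than an identity.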
<details>
<summary>What attention entropy measures, and its link to the theory's β</summary>

An attention matrix row $A_{i,:}$ is a probability distribution: $A_{ij} \geq 0$ and $\sum_j A_{ij} = 1$ (softmax output). Its Shannon entropy is

$$H_i = -\sum_j A_{ij} \log A_{ij}.$$

Averaging over the $n$ rows gives a per-head scalar. The bounds are interpretable:

- Uniform attention ($A_{ij} = 1/n$) gives $H_i = \log n$: no structure.
- One-hot attention (each token attends to a single token) gives $H_i = 0$: maximally concentrated.

So entropy runs from $\log n$ (no structure) down to $0$ (maximally concentrated), and watching it fall across layers tells you attention is sharpening.

In the paper's dynamics, attention weights are $\propto \exp(\beta \langle x_i, x_j \rangle)$. The inverse temperature $\beta$ controls sharpness: at $\beta = 0$ attention is uniform (maximum entropy); as $\beta \to \infty$ attention becomes a hard argmax (zero entropy). So attention entropy is a monotone proxy for the effective $\beta$ the model is operating at: low entropy corresponds to high effective $\beta$.

This is why entropy is the right opener for Group C. The metastability conjecture in the paper is a large-$\beta$ phenomenon. If measured attention entropy is high and flat (effective $\beta$ near zero), there is no reason to expect metastability and the geometric clustering would need a different explanation. If entropy falls with depth, the model is moving into the regime where the theory's prediction applies, and the geometric clustering from Group A has a mechanism behind it.

Entropy is computed per head because heads specialize: some concentrate, some stay diffuse, and an average over heads would hide that. Group C reports the per-head values and uses the mean only as a summary. The per-head spread is itself the subject of C.3.

</details>

<details>
<summary>The code: entropy of the softmax rows</summary>

The entire instrument is one vectorized function over all heads at once:

```python
def attention_entropy(attn_matrix: torch.Tensor) -> np.ndarray:
    """
    Shannon entropy of each attention row, averaged over tokens.
    attn_matrix : (n_heads, n_tokens, n_tokens)
    Returns (n_heads,): mean entropy per head.
    """
    attn = attn_matrix.numpy()
    log_attn = np.log(attn + 1e-12)                       # 1e-12 guards log(0)
    entropy_per_token = -(attn * log_attn).sum(axis=-1)   # (n_heads, n_tokens)
    return entropy_per_token.mean(axis=-1)                # (n_heads,)
```

`-(attn * log_attn).sum(axis=-1)` is $H_i$ for every row of every head; `.mean(axis=-1)` averages over the $n$ rows. The two reference points are pinned by tests: uniform attention returns $\log n$ and identity (each token attends only to itself) returns $0$, confirming the scale runs from no-structure to fully-concentrated as described.

</details>

What came out: Provisional, pending the full rerun. The expected and observed pattern is attention entropy falling with depth across models on real prompts: attention sharpens as the residual stream clusters, placing the models in the high-effective-$\beta$ regime where metastability is predicted. The fall is not monotone everywhere: local rises at reorganization layers are expected, and they tend to line up with the plateau-ending merges from Group B. The numbers here will be re-derived from the rerun; the qualitative claim (entropy concentrates with depth) is the load-bearing one and is robust to the open code questions.

We make each attention matrix doubly stochastic (Sinkhorn–Knopp), treat it as a weighted graph, and compute the Fiedler value, the second-smallest Laplacian eigenvalue. Low Fiedler means the attention graph is nearly disconnected: tokens are routed into separate clusters. High Fiedler means tokens mix freely.

Why this instrument: Entropy says attention is concentrating, but not into what shape. Concentrated attention could still mix all tokens (a hub everyone attends to) or could split them into separated groups. The Fiedler value distinguishes these. It is the graph-theoretic measure of how close a graph is to falling apart into disconnected components, which is exactly the cluster-separation structure that would confirm attention is implementing the geometric clusters from Group A.
The Sinkhorn step first puts the attention matrix into the doubly stochastic form that the paper's Section 3.3 identifies as the gradient-flow object, so the Fiedler value is computed on the theoretically meaningful normalization rather than raw attention.

<details>
<summary>Why doubly stochastic, what the Fiedler value is, and how to read it</summary>

Raw attention is row-stochastic (each row sums to 1) but not column-stochastic: some tokens receive far more attention than others. The paper's Section 3.3 connects the idealized dynamics to a doubly stochastic object (both rows and columns sum to 1), which is the symmetric, balanced form a gradient flow would produce. The gap between raw attention and its doubly stochastic version is itself a diagnostic ("row/col balance" below): zero means attention is already balanced, large means it is far from the idealized form.

Sinkhorn–Knopp produces the doubly stochastic matrix by alternately normalizing rows and columns until both converge. It is the standard, provably convergent way to get there.

Treat the doubly stochastic matrix $P$ as a weighted graph (after symmetrizing, $(P + P^\top)/2$). The graph Laplacian $L = D - W$ encodes its connectivity. The eigenvalues of the normalized Laplacian start at $\lambda_1 = 0$ (always, the trivial constant mode) and increase. The second one, $\lambda_2$, is the Fiedler value (algebraic connectivity):

- $\lambda_2 \approx 0$: the graph is nearly disconnected, so attention routes tokens into separated clusters.
- $\lambda_2$ large: the graph is well connected, so tokens mix freely.

The number of near-zero eigenvalues equals the number of near-disconnected components, which is why a related eigengap reading gives a cluster count (below).

The Group C question is whether attention knows about the geometric clusters. The Fiedler value makes that testable directly: if low-Fiedler layers (cluster-separated attention) coincide with the layers where Group A finds geometric clusters and Group B finds plateaus, then attention and geometry are telling the same story. The Spearman cross-check below quantifies exactly this coincidence.
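As a rough illustration of how the Fiedler value separates the two cases, the sketch below compares a uniform doubly stochastic matrix with a hand-built two-block one. The matrices and the helper names (`fiedler`, `two_block_doubly_stochastic`) are assumptions made for the example, not the repository's code or real attention data.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.sparse.csgraph import laplacian

def fiedler(P):
    """Second-smallest eigenvalue of the normalised Laplacian of the symmetrised graph."""
    W = (P + P.T) / 2
    L = laplacian(W, normed=True)
    return float(eigh(L, eigvals_only=True)[1])

def two_block_doubly_stochastic(n, cross):
    """Two equal blocks; `cross` is the weight of each cross-block edge."""
    half = n // 2
    within = (1.0 - half * cross) / half  # chosen so every row and column sums to 1
    P = np.full((n, n), cross)
    P[:half, :half] = within
    P[half:, half:] = within
    return P

n = 8
mixed = np.full((n, n), 1.0 / n)                       # every token attends to every token
separated = two_block_doubly_stochastic(n, cross=1e-3)  # two weakly coupled clusters

fiedler_mixed = fiedler(mixed)
fiedler_separated = fiedler(separated)
print(f"mixed: {fiedler_mixed:.4f}   separated: {fiedler_separated:.4f}")
```

The separated matrix lands near zero and the uniform one lands near the top of the scale, which is the contrast the CLUSTER and MIXING readings rely on.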
The same spectrum gives a cluster count: for a $k$-cluster structure, the $k$ largest eigenvalues of $P$ sit near 1 with a drop below them. The largest gap in the descending eigenvalue sequence locates $k$ without a hard threshold. This is the Fix 8 replacement for an earlier hard $> 0.5$ rule; it falls back to the hard threshold only when no clear gap exists, i.e. on near-uniform post-collapse matrices.

</details>

<details>
<summary>The code: Sinkhorn iteration, Fiedler value, and the eigengap count</summary>

Sinkhorn–Knopp is alternating row/column normalization to convergence, batched across all heads:

```python
def sinkhorn_normalize_batched(A, max_iter=SINKHORN_MAX_ITER, tol=SINKHORN_TOL):
    """A: (n_heads, n, n) raw attention → doubly stochastic per head."""
    P = np.clip(A.astype(np.float64), 1e-12, None)
    for _ in range(max_iter):
        P_prev = P.copy()
        P /= P.sum(axis=2, keepdims=True)   # row-normalise all heads
        P /= P.sum(axis=1, keepdims=True)   # col-normalise all heads
        if np.abs(P - P_prev).max() < tol:
            break
    return P
```

The Fiedler value is $\lambda_2$ of the normalized Laplacian of the symmetrized matrix:

```python
def fiedler_value(P: np.ndarray) -> float:
    """λ₂ of the normalised Laplacian. λ₂≈0 → cluster-separated; large → mixing."""
    P_sym = (P + P.T) / 2
    L = laplacian(P_sym, normed=True)
    k = min(3, P.shape[0] - 1)
    eigenvalues = eigh(L, eigvals_only=True, subset_by_index=[0, k - 1])
    return float(eigenvalues[1]) if len(eigenvalues) > 1 else 0.0
```

The cluster count reads the eigengap (Fix 8), falling back to a hard threshold only when there is no real gap:

```python
def sinkhorn_cluster_count(P, min_gap_ratio: float = 0.1) -> int:
    # k largest eigenvalues near 1; largest gap in the descending sequence = k.
    # Fallback to hard >0.5 count when largest gap < min_gap_ratio * (λmax − λmin),
    # i.e. on near-uniform post-collapse matrices with no genuine structure.
    ...
```
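The eigengap reading can be sketched independently of the repository. The block below is not the body of `sinkhorn_cluster_count` (which is elided above) and does not reproduce its hard-threshold fallback; `eigengap_cluster_count` and `block_doubly_stochastic` are hypothetical helpers that only illustrate the idea of taking $k$ at the largest gap in the descending spectrum, on synthetic block matrices.

```python
import numpy as np

def eigengap_cluster_count(P):
    """Hypothetical helper: k is the position of the largest gap in the descending spectrum."""
    eigvals = np.linalg.eigvalsh((P + P.T) / 2)[::-1]  # eigenvalues, largest first
    gaps = eigvals[:-1] - eigvals[1:]
    return int(np.argmax(gaps)) + 1

def block_doubly_stochastic(n_blocks, block_size, cross=1e-3):
    """Synthetic doubly stochastic matrix with n_blocks weakly coupled groups."""
    n = n_blocks * block_size
    within = (1.0 - (n - block_size) * cross) / block_size  # rows and columns sum to 1
    P = np.full((n, n), cross)
    for b in range(n_blocks):
        s = slice(b * block_size, (b + 1) * block_size)
        P[s, s] = within
    return P

k_one = eigengap_cluster_count(block_doubly_stochastic(1, 8))    # uniform matrix
k_two = eigengap_cluster_count(block_doubly_stochastic(2, 4))
k_three = eigengap_cluster_count(block_doubly_stochastic(3, 3))
print(k_one, k_two, k_three)
```

On these clean synthetic matrices the gap is unambiguous. Real post-collapse attention is where a fallback rule of the kind described above becomes necessary, because the largest gap can be dominated by noise.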
The per-layer summary records the mean Fiedler, the per-head Fiedler list, the eigengap cluster count, and the row/col balance (distance of raw attention from doubly stochastic):

```python
result = {
    "fiedler_mean": float(np.mean(fiedler_vals)),
    "fiedler_per_head": fiedler_vals,
    "sinkhorn_cluster_count_mean": float(np.mean(cluster_counts)),
    "row_col_balance_mean": float(np.mean(row_col_balance)),
}
```

Caveat carried from the plateau detector. Fiedler plateaus are found with `detect_plateaus(fiedler, tol=0.05)`, and that routine measures relative flatness, `span / (|mean| + 1e-8)`. The cluster-separated state is exactly Fiedler ≈ 0, where the denominator collapses and the relative criterion can fail to register a genuinely flat-near-zero window. Until the detector gains an absolute-tolerance branch, low-Fiedler plateaus may be under-detected, so read the raw Fiedler trace alongside the detected plateaus.

</details>

<details>
<summary>The cross-check: does low Fiedler co-occur with high cluster count?</summary>

The single most direct answer to "does attention know about the clusters" is the correlation between the attention-graph Fiedler value and the geometric HDBSCAN cluster count, layer by layer. If attention is implementing the clustering, low Fiedler (separated attention) should co-occur with high cluster count (multi-cluster geometry): a negative correlation.

```python
fied_hdb_pairs = [
    (r["sinkhorn"]["fiedler_mean"], r["clustering"]["hdbscan"]["n_clusters"])
    for r in layers
    if "sinkhorn" in r and not np.isnan(r["clustering"]["hdbscan"]["n_clusters"])
]
rho, pval = spearmanr([p[0] for p in fied_hdb_pairs], [p[1] for p in fied_hdb_pairs])
# rho < -0.4 → attention Fiedler tracks geometric cluster count (signal)
# rho ≈ 0    → attention and geometry are telling different stories
```

A Spearman $\rho < -0.4$ is the threshold the report uses to call this an interpretable signal; $\rho \approx 0$ would mean the attention graph and the token geometry are decoupled, which would undercut the mechanistic claim even though both individually show clustering.

</details>

What came out: Provisional. The expected result is low-Fiedler windows coinciding with the geometric cluster windows from Group A, and a negative Fiedler–cluster-count correlation strong enough to read as signal rather than noise. Where that holds, attention and geometry are confirmed to be the same phenomenon seen from two sides. The honest caveats: the Fiedler plateau detection may under-report near-zero windows until the detector is fixed, and any low-Fiedler reading on GPT-2 is confounded by the causal mask (next instrument). The Spearman value is the number to watch in the rerun, since it is the cleanest single test of the Group C question.

Each head is classified by its typical Fiedler value across the active layers: CLUSTER (consistently routes into separated clusters), MIXED (variable), or MIXING (consistently lets tokens mix). Whether a head's classification is stable across prompts separates weight-level routing from content-driven routing.

Why this instrument: The mean Fiedler hides the division of labor. Some heads may be dedicated cluster-formers while others mix; averaging erases that. Per-head classification recovers it.
And comparing a head's behavior across prompts answers the deeper question: is a head's routing role a fixed property of its learned weights (same classification regardless of input: STABLE), or does it depend on the content (VARIABLE)? That STABLE-vs-VARIABLE split is the attention-side version of the weight-level-vs-content-driven distinction Group B found in plateau onsets, and it feeds the two-regime synthesis.

<details>
<summary>How heads are classified, the active-phase restriction, and the causal-mask problem</summary>

For each head, collect its Fiedler value at every active-phase layer, take the mean, and bin it:

- mean < 0.3: CLUSTER
- 0.3 ≤ mean < 0.7: MIXED
- mean ≥ 0.7: MIXING

Classification is restricted to layers where effective rank ≥ 10. The reason is a saturation artifact: once tokens collapse to a near-point-mass (rank below ~10), there is only one cluster, the doubly stochastic attention matrix is nearly uniform, and the Laplacian has no gap, so every head's Fiedler trivially saturates to ≈ 1.0. Including those layers would pull every head's mean toward 1.0 and label everything MIXING regardless of its real role. Excluding them keeps the classification meaningful. Note this rank-10 threshold is specific to this analysis and separate from the rank-2 degeneracy gate used for CKA/NN; it is a larger cutoff because Fiedler saturation sets in well before full point-mass collapse. It is currently hardcoded in the profiler and would be cleaner in config.

A head that lands in the same class for every prompt is STABLE: its routing role is a property of the weights, input-independent. A head whose class changes with the prompt is VARIABLE: content-driven. The cross-prompt consistency table reports this per head per model.

This is the central caution of Group C. GPT-2 attention is causally masked (lower-triangular).
Sinkhorn-normalizing and symmetrizing a lower-triangular matrix forces a low-connectivity graph regardless of content, which would manufacture low Fiedler, and therefore a "100% STABLE-CLUSTER" reading across every prompt, as an artifact of the mask rather than a fact about the weights. The fix has two parts, both implemented (Fix 3), neither yet run:

- Fix 3a, baseline subtraction: compute the Fiedler value the mask produces on its own and classify causal heads on their deviation from it.
- Fix 3b, BERT control: apply an artificial causal mask to a bidirectional model and check whether masking alone collapses its heads to CLUSTER.

Until both runs are done, the GPT-2 routing result is suspended.

</details>

<details>
<summary>The code: deviation-based classification and the two causal controls</summary>

The main analysis computes raw per-head Fiedler, and for causal models also the mask-only baseline and each head's deviation from it:

```python
fiedler_vals = [fiedler_value(P_all[h]) for h in range(n_heads)]
result["fiedler_per_head"] = fiedler_vals

# Fix 3a: causal-mask baseline subtraction
if is_causal:
    baseline = causal_fiedler_baseline(n)   # Fiedler of uniform-causal attn
    result["fiedler_baseline"] = baseline
    result["fiedler_per_head_deviation"] = [round(f - baseline, 6) for f in fiedler_vals]
```

`causal_fiedler_baseline` is the content-free reference, the Fiedler value the mask produces on its own:

```python
def causal_fiedler_baseline(n: int) -> float:
    A_base = uniform_causal_attention(n)   # uniform within the lower triangle
    P_base = sinkhorn_normalize(A_base)
    return fiedler_value(P_base)           # what the mask alone forces
```

The BERT control applies an artificial causal mask to a bidirectional model to test whether masking alone collapses heads to CLUSTER:

```python
# Fix 3b: causal-mask control run on BERT
if apply_causal_control:
    causal_mask = np.tril(np.ones((n, n)))
    attn_masked = attn * causal_mask[None, :, :]            # zero upper triangle
    attn_masked /= attn_masked.sum(axis=2, keepdims=True)   # renormalise rows
    P_ctrl = sinkhorn_normalize_batched(attn_masked)
    result["fiedler_causal_control_per_head"] = [
        fiedler_value(P_ctrl[h]) for h in range(n_heads)
    ]
```

The classifier then chooses the signal based on model type (deviation for causal models, raw Fiedler otherwise) and restricts to the active phase:

```python
# in per_head_fiedler_profile
# active phase = effective rank >= 10
# causal models classify on fiedler_per_head_deviation; others on raw fiedler_per_head
classification = "CLUSTER" if mean < 0.3 else "MIXED" if mean < 0.7 else "MIXING"
```

</details>

What came out: Provisional, and partly suspended. For BERT, heads are expected to be VARIABLE: routing roles shift with content, consistent with the content-driven plateau onsets from Group B. For ALBERT-base, head behavior transitions with iteration depth, which the weight-sharing architecture makes natural to read as a time axis. For the GPT-2 family, the result is suspended: the apparent "100% STABLE-CLUSTER" reading is exactly what the causal mask would produce on its own, so it cannot be attributed to the weights until the baseline-subtracted classification and the BERT causal-mask control have been run. The deviation-based numbers and the control verdict are the outputs to generate in the rerun; the raw STABLE-CLUSTER reading should not be reported as a finding before then.

Attention concentrates with depth (C.1), and where it concentrates it forms cluster-separated graphs that, provisionally, track the geometric clusters from Group A (C.2). The per-head picture (C.3) shows a division of labor that is content-driven in BERT and depth-staged in ALBERT-base. The one result Group C cannot yet deliver is the GPT-2 routing claim, because the causal mask confounds the Fiedler value and the disambiguating runs are still pending. The Group C contribution that carries forward regardless of how the GPT-2 question resolves is the per-head split itself: attention routing is not uniform across heads, and the CLUSTER/MIXING distinction is a standing architectural fact the later phases build on.
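The STABLE-versus-VARIABLE labeling used in C.3 can be sketched in a few lines. This is an illustration under assumptions, not the repository's `per_head_fiedler_profile`: the function names are hypothetical, the per-prompt values are made up for the example, and the input is whichever signal the classifier uses for the model type (raw mean Fiedler, or the deviation from the causal baseline).

```python
def classify_head(mean_signal):
    """Bin a head's mean Fiedler signal over the active-phase layers."""
    if mean_signal < 0.3:
        return "CLUSTER"
    if mean_signal < 0.7:
        return "MIXED"
    return "MIXING"

def cross_prompt_consistency(per_prompt_means):
    """per_prompt_means: {prompt_name: [mean signal per head]}.

    Returns one (head_index, classes_per_prompt, label) tuple per head, where
    label is STABLE if the class is identical for every prompt, else VARIABLE.
    """
    prompts = list(per_prompt_means)
    n_heads = len(per_prompt_means[prompts[0]])
    table = []
    for h in range(n_heads):
        classes = [classify_head(per_prompt_means[p][h]) for p in prompts]
        label = "STABLE" if len(set(classes)) == 1 else "VARIABLE"
        table.append((h, classes, label))
    return table

# Illustrative values only, not measured results.
example_means = {
    "prompt_a": [0.12, 0.55, 0.91],
    "prompt_b": [0.08, 0.74, 0.88],
}
table = cross_prompt_consistency(example_means)
for row in table:
    print(row)
```

A head close to a bin edge can flip class on small changes in its mean, so a VARIABLE label is best read together with the underlying per-prompt values.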
Code excerpts are from `core/metrics.py` (`attention_entropy`), `sinkhorn.py` (`sinkhorn_normalize_batched`, `fiedler_value`, `sinkhorn_cluster_count`, `causal_fiedler_baseline`, `analyze_attention_sinkhorn`), and `reporting.py` (`per_head_fiedler_profile`, Fiedler–cluster Spearman cross-check). Snippets are lightly trimmed but otherwise verbatim. All findings are provisional pending the Fixes 1–8 rerun; the GPT-2 routing result is suspended pending the Fix 3 causal-mask baseline and BERT control runs.

Finding: PARTIAL / SUSPENDED. Provisional pending the full rerun. Attention entropy falls with depth across models on real prompts: attention sharpens as the residual stream clusters, placing the models in the high-effective-β regime where metastability is predicted; local rises tend to line up with the plateau-ending merges from Group B. Low-Fiedler windows are expected to coincide with the geometric cluster windows from Group A, with a negative Fiedler–cluster-count Spearman correlation as the cleanest single test of "attention knows about the clusters." The per-head picture is content-driven in BERT and depth-staged in ALBERT-base. The GPT-2 routing result is suspended pending the Fix 3 causal-mask baseline and BERT causal-mask control; until those run, the raw "stable cluster-routing" reading is not reported as a finding. The contribution that carries forward regardless: attention routing is not uniform across heads, and the CLUSTER/MIXING split is a standing architectural fact the later phases build on.

Question: the theory predicts that a specific interaction energy increases monotonically across layers, as if the model were doing gradient ascent toward consensus. Does it? This is the group where the theory makes its sharpest, most falsifiable prediction, and where the trained models break it, universally.

Groups A through C asked whether trained transformers look like the idealized dynamics: do they cluster, persist, route attention into clusters?
The answers were mostly yes. Group D asks whether they are those dynamics in the strict sense the theory requires, and the answer is no. There is a specific scalar, the interaction energy, that the idealized gradient flow must drive monotonically upward at every step. In every model, on every prompt, that scalar goes down somewhere. The clustering is real, but it is not being produced by the pure-attraction mechanism the theory describes.

That makes Group D the seam of the whole project. The two instruments here are not just measuring whether the energy is monotone; the second one localizes where and on which token pairs the violation happens, which is the evidence Phase 2 picks up to ask what mechanism (the value matrix's mixed-sign spectrum) is responsible.

At each layer we compute a single scalar from the Gram matrix: the interaction energy. The theory says it must increase from one layer to the next. We check whether it does.

Why this instrument: The paper's dynamics are a gradient flow: the tokens move so as to ascend an energy landscape, and the interaction energy is that landscape. Monotone increase is not an incidental property; it is the definition of what "gradient ascent toward consensus" means. So this is the most direct test of whether trained attention implements the idealized dynamics. Unlike clustering (which many mechanisms could produce), strict monotonicity is a signature only gradient ascent on this particular energy produces. A single downward step falsifies it.

<details>
<summary>The energy, the sign convention, and what a violation actually means</summary>

For tokens on the sphere with Gram matrix $G_{ij} = \langle x_i, x_j \rangle$, the interaction energy at inverse temperature $\beta$ is

$$E_\beta = \frac{1}{2\beta n^2} \sum_{i,j} e^{\beta \langle x_i, x_j \rangle}.$$

Read the structure: every term $e^{\beta \langle x_i, x_j \rangle}$ is large when tokens $i$ and $j$ point the same way (inner product near $+1$) and small when they point apart.
So $E_\beta$ is large when tokens are aligned and small when they are spread out. Maximal clustering (every token identical) maximizes it. The analytical reference values make this concrete: for a fully collapsed cloud $E_\beta = e^\beta / (2\beta)$, for two antipodal clusters $\cosh(\beta) / (2\beta)$, for a uniform spread $\approx 1/(2\beta)$, and $e^\beta > \cosh\beta > 1$, so collapsed > antipodal > uniform.

The idealized dynamics ascend this energy. As depth increases and tokens cluster, $E_\beta$ should rise at every layer. Plotted against layer index, it should be a non-decreasing curve. This is the gradient-ascent picture: the model is climbing toward consensus, and the energy is the height it has climbed.

A violation is a layer where $E_\beta$ decreases: $E_{\text{curr}} < E_{\text{prev}}$. Mechanistically this is sharp. For the energy to fall, some pairs must have their $e^{\beta \langle x_i, x_j \rangle}$ terms shrink, which means those tokens moved apart: their inner product dropped. That is a locally repulsive move. The pure-attraction dynamics in the paper cannot do this. Attention only pulls tokens together; it has no repulsive term. Under the idealized flow, the energy is a Lyapunov function, so it can only go up. An energy drop is therefore not a small quantitative deviation; it is qualitative evidence of a force the theory does not contain. Something in the trained layer pushed two tokens apart. Group D's job is to detect that, and D.2's job is to find the tokens it happened to.

This is a methodological point that the result depends on. $e^{\beta \langle x_i, x_j \rangle}$ grows fast with $\beta$: at $\beta = 5$ the energy values are large, and so is the float32 noise on them. An absolute threshold for "did the energy drop" (the old $-10^{-4}$ / $-10^{-6}$ gates) is scale-blind: at large $\beta$ it fires on numerical noise, manufacturing violations that aren't real.
Fix 2 replaced it with a relative criterion:

$$\frac{E_{\text{prev}} - E_{\text{curr}}}{|E_{\text{prev}}|} > \text{rel\_tol}, \qquad \text{rel\_tol} = 10^{-3}.$$

A violation now means the energy fell by more than 0.1% of its own magnitude: a real drop, not a rounding artifact, at any $\beta$. This matters for interpreting the universality claim: a violation that survives the relative threshold is trustworthy; the claim that violations occur at every $\beta$, including large $\beta$, is precisely what the relative criterion exists to test, because under the old absolute gate large-$\beta$ violations were partly guaranteed by construction.

</details>

<details>
<summary>The code: the energy and the relative violation test</summary>

The energy is computed for all $\beta$ in one vectorized pass over the cached Gram matrix:

```python
def interaction_energies_batched(G: np.ndarray, beta_values: list) -> dict:
    """E_beta = 1 / (2β n²) Σ_ij exp(β ⟨x_i, x_j⟩), for every β at once."""
    n = G.shape[0]
    betas = np.asarray(beta_values, dtype=np.float64)   # (B,)
    exp_G = np.exp(betas[:, None, None] * G[None])      # (B, n, n)
    sums = exp_G.sum(axis=(1, 2))                       # (B,)
    energies = sums / (2.0 * betas * n * n)             # (B,)
    return {float(beta): float(e) for beta, e in zip(beta_values, energies)}
```

In the loop it is one line per layer:

```python
lr["energies"] = interaction_energies_batched(G, beta_values)   # {β: E_β}
```

The violation test is the scale-invariant Fix 2 criterion, applied to one $\beta$-series across layers:

```python
ENERGY_VIOLATION_REL_TOL = 1e-3   # a drop counts only if > 0.1% of |E_prev|

def energy_violation_severity(energies, rel_tol=ENERGY_VIOLATION_REL_TOL):
    arr = np.array(energies, dtype=np.float64)
    diffs = np.diff(arr)
    ref = np.maximum(np.abs(arr[:-1]), 1e-12)   # |E| at the preceding layer
    rel_drop = -diffs / ref                     # positive = energy fell
    viol_mask = rel_drop > rel_tol
    return dict(
        violation_layers=[i + 1 for i, v in enumerate(viol_mask) if v],
        n_violations=int(viol_mask.sum()),
        max_severity=float(rel_drop[viol_mask].max()) if viol_mask.any() else 0.0,
        sum_severity=float(rel_drop[viol_mask].sum()),
        total_rel_change=float((arr[-1] - arr[0]) / max(abs(arr[0]), 1e-12)),
    )
```

`total_rel_change` is the net climb from first to last layer (the energy does rise overall, it just isn't monotone); `violation_layers` are the steps where it dropped against the trend. The reporting layer runs this for every $\beta \in \{0.1, 1.0, 2.0, 5.0\}$ and also checks whether the violation layers coincide with the merge events from Group B, testing whether the repulsive moves happen at merges or independently of them.

</details>

What came out: Universal violation. Every model, every prompt, every $\beta$ tested shows at least one layer where $E_\beta$ falls against the trend, while the net change across the full depth is still positive (the energy climbs overall, it just refuses to climb monotonically). This is the cleanest hard result in Phase 1: monotonicity is the theory's most specific prediction and trained transformers break it without exception. The grounded qualification: the violation is most robust at moderate $\beta$ under the scale-invariant Fix 2 criterion; the claim that it holds at every $\beta$ including large $\beta$ is exactly what the relative-threshold rerun confirms, since the previous absolute gate could not be trusted there. The specific severity numbers (`max_severity`, `sum_severity` per model) are pending that rerun; the qualitative universality is not in doubt.

At each violation layer, we find the individual token pairs whose contribution to the energy dropped the most, i.e. the pairs that moved apart. We flag whether they are structural tokens or semantic ones.

Why this instrument: D.1 establishes that a repulsive move happened. D.2 asks between which tokens. This is the difference between a phenomenon and a mechanism. If the energy drops are driven by structural tokens (`[CLS]`, `[SEP]`, punctuation), the repulsion is plausibly positional or structural bookkeeping. If they are driven by content-bearing tokens, the repulsion is operating on semantic geometry.
Either way, the specific pairs are the evidence handed to Phase 2, which asks whether the value matrix's mixed-sign eigenspectrum is the source of the repulsive directions. The localization turns "the theory is violated" into "here is the fingerprint of what violates it."

<details>
<summary>How a pair's contribution to the energy change is isolated</summary>

The energy is a sum over pairs, so the change in energy between layer $L$ and $L+1$ decomposes exactly into per-pair contributions:

$$E_\beta^{(L+1)} - E_\beta^{(L)} = \sum_{i,j} \Delta_{ij}, \qquad \Delta_{ij} = \frac{e^{\beta \langle x_i, x_j \rangle^{(L+1)}} - e^{\beta \langle x_i, x_j \rangle^{(L)}}}{2\beta n^2}.$$

A pair with $\Delta_{ij} < 0$ contributed to the energy drop: its tokens moved apart (lower inner product at $L+1$ than at $L$). The most-negative $\Delta_{ij}$ are the pairs most responsible for the violation. Sorting all pairs by $\Delta$ ascending and taking the top few gives the localization.

Each token in a top pair is checked against a set of structural tokens: `[CLS]`, `[SEP]`, `<s>`, `</s>`, padding, and single-character punctuation. The interpretation forks:

- Structural tokens dominate the top pairs: the repulsion is plausibly positional or structural bookkeeping.
- Content-bearing tokens dominate: the repulsion is operating on semantic geometry.

Once the cloud collapses to a near-point-mass, the inner products are all $\approx 1$, the $e^{\beta \cdot 1}$ terms are nearly equal, and the per-pair deltas are floating-point noise. Energy "violations" detected there are not real, so violation layers in the degenerate regime are suppressed and the pair localization is not reported for them, following the same rank-gate logic as the rest of Phase 1.
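A small worked example may help here. It is a sketch on a hand-built configuration of four unit vectors in the plane (an assumption for illustration, not model data), with hypothetical helper names. One token is rotated away from the others, the energy drops, the per-pair deltas sum to the energy change, and sorting them surfaces the pair that separated most.

```python
import numpy as np

def unit_vectors(angles):
    """Points on the unit circle at the given angles (radians)."""
    return np.stack([np.cos(angles), np.sin(angles)], axis=1)

def interaction_energy(X, beta):
    n = X.shape[0]
    return float(np.exp(beta * (X @ X.T)).sum() / (2 * beta * n * n))

def pair_deltas(X_before, X_after, beta):
    """Per-pair contribution to the energy change, same normalisation as the energy."""
    n = X_before.shape[0]
    G_before = X_before @ X_before.T
    G_after = X_after @ X_after.T
    return (np.exp(beta * G_after) - np.exp(beta * G_before)) / (2 * beta * n * n)

beta = 1.0
before = unit_vectors(np.array([0.0, 0.1, 2.0, 2.1]))
after = unit_vectors(np.array([0.0, -0.8, 2.0, 2.1]))  # token 1 rotates away from all others

delta = pair_deltas(before, after, beta)
energy_change = interaction_energy(after, beta) - interaction_energy(before, beta)

# Rank the i < j pairs by delta, most negative (most repulsive) first.
iu = np.triu_indices(4, k=1)
pairs = sorted(zip(iu[0].tolist(), iu[1].tolist(), delta[iu].tolist()), key=lambda t: t[2])
most_repulsive = pairs[0][:2]
print(f"energy change: {energy_change:.4f}, most repulsive pair: {most_repulsive}")
```

The full matrix sum of `delta` equals the energy change. The upper-triangle listing counts each unordered pair once, so its sum is half of that whenever the diagonal terms are unchanged, as they are for unit vectors.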
</details>

<details>
<summary>The code: the per-pair delta and the structural-token flag</summary>

The public wrappers are thin (one normalizes raw tensors, one takes the pre-normed arrays from the loop) and both call a shared core (the formula below is verbatim from the function's contract):

```python
def energy_drop_pairs_from_normed(normed_before, normed_after, beta, top_k=10):
    """Token pairs (i,j) sorted by Δ ascending (most negative first), where
    Δ = [exp(β⟨x_i,x_j⟩_after) − exp(β⟨x_i,x_j⟩_before)] / (2β n²)."""
    return energy_drop_pairs_core(normed_before, normed_after, beta, top_k)
```

The core is the pairwise delta, upper triangle only, sorted to surface the most repulsive pairs:

```python
def energy_drop_pairs_core(normed_before, normed_after, beta, top_k):
    n = normed_before.shape[0]
    if n < 2:
        return []
    G_before = normed_before @ normed_before.T
    G_after = normed_after @ normed_after.T
    delta = (np.exp(beta * G_after) - np.exp(beta * G_before)) / (2 * beta * n * n)
    iu = np.triu_indices(n, k=1)                          # i < j only
    pairs = [(int(i), int(j), float(delta[i, j])) for i, j in zip(*iu)]
    pairs.sort(key=lambda t: t[2])                        # ascending → most negative first
    return pairs[:top_k]
```

In the loop, this is computed only at violation layers and only when the layer is non-degenerate, using `prev_normed` (the bug fixed in Fix 8: the previous-layer array must still point at $L$, not be overwritten to $L+1$ before this call):

```python
# prev_normed still points to layer L here (Fix 8 deferred its update)
lr["energy_drop_pairs"] = (
    {beta: energy_drop_pairs_from_normed(prev_normed, normed, beta, top_k=10)
     for beta in beta_values}
    if is_violation and lr["effective_rank"] >= DEGENERATE_RANK_THRESHOLD
    else {}
)
```

The reporting layer maps the top pairs back to token strings and annotates structural ones:

```python
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "<s>", "</s>", "<pad>", "[PAD]", "<|endoftext|>", "Ġ", "▁"}
PUNCT_CHARS = set(".,!?;:'\"-–—()[]{}…/\\")
# each top pair printed as: 'tok_i' [FLAG] ↔ 'tok_j' [FLAG] with δ
# FLAG ∈ {[CLS], [SEP], [PUNCT]} marks structural/special-token repulsion
```

</details>

What came out: Localization patterns provisional pending rerun; the existence of localizable drops is confirmed. The violations localize to specific, identifiable pairs rather than being diffuse noise across all pairs, which is itself evidence they are mechanistic, not numerical. The structural-vs-semantic split is the detail Phase 2 needs, and the report already separates the two; whether structural tokens dominate the repulsion (positional bookkeeping) or semantic tokens do (repulsion on content) is the question the rerun answers per model. The violation layers' coincidence (or not) with Group B's merge events is computed in the same place: it tests whether the repulsive moves are the mechanism of merging or a separate phenomenon.

The energy result is the load-bearing transition out of Phase 1. The clustering is real (A), it persists (B), and attention implements it (C, provisionally), but it is not the pure-attraction gradient flow the theory describes, because that flow cannot lower the energy and the trained models lower it everywhere (D). The repulsion has to come from somewhere the idealized model omits. The natural suspect is the value matrix $V$: the idealized dynamics assume $V = I$ (purely attractive), while a trained $V$ with negative eigenvalues would supply exactly the repulsive directions needed to push token pairs apart and drop the energy. Group D's localized pairs (which tokens, at which layers, structural or semantic) are the targets Phase 2 cross-references against $V$'s eigenspectrum. The violation is not a failure of the measurement; it is the finding.

Code excerpts are from `core/metrics.py` (`interaction_energies_batched`, `energy_violation_severity`, `energy_drop_pairs_from_normed`, `energy_drop_pairs_core`) and `analysis.py` / `reporting.py` (loop usage, structural-token flagging). The public energy functions and the violation criterion are verbatim; the localization core is shown faithful to its documented formula and test-pinned behavior.
The universal-violation finding is confirmed (per the draft's epistemic status); severity magnitudes and localization patterns will be re-derived after the Fixes 1–8 rerun under the scale-invariant threshold.

Finding: FALSIFIED. Universal violation: every model, every prompt, and every β shows at least one layer where $E_\beta$ falls against the trend, while the net change across the full depth is still positive (the energy climbs overall; it just refuses to climb monotonically). Monotonicity is the theory's most specific prediction, and trained transformers break it without exception. The violations localize to specific, identifiable token pairs rather than being diffuse, which is itself evidence they are mechanistic, not numerical. Caveat (PROVISIONAL): the violation is most robust at moderate β under the scale-invariant Fix 2 criterion; the claim that it holds at every β, including large β, is exactly what the relative-threshold rerun confirms (the old absolute gate could not be trusted there), and the severity magnitudes and structural-vs-semantic split are pending that rerun. The qualitative universality is not in doubt. The natural suspect for the repulsion is the value matrix $V$: the idealized dynamics assume $V = I$ (purely attractive), while a trained $V$ with negative eigenvalues supplies exactly the repulsive directions needed. That is the Phase 2 target.

Question: the metastability conjecture requires not just that clusters form and persist, but that the formation timescale and the collapse timescale are meaningfully different. Is there actually a gap?

Metastability is a two-timescale claim. Tokens group quickly into clusters (fast), then those clusters slowly merge until the cloud collapses (slow). Groups A–C confirmed both halves happen; Group D confirmed the energy that should govern them misbehaves. Group E asks the quantitative question that decides whether "metastability" is the right word at all: is the slow timescale actually slow relative to the fast one?
If clusters form and then immediately collapse, the two timescales coincide and there is no metastable window, just a fast transition with a clustered-looking midpoint.

This is the most provisional group in Phase 1, for two reasons that are worth stating before any numbers. First, the original way of computing the separation ratio was confounded and produced a table that contradicts itself; Fix 5 redefined it, and the corrected numbers do not exist until the rerun. Second, the instrument that actually explains the separation pattern is not the ratio at all but the eigenspectrum of the value matrix $V$, and that finding (spectral radius governs collapse speed, not depth or dimension) is both the cleaner result and the one that falsifies a specific theorem. So read E.1 as a measurement under repair and E.2 as the result that survives.

A single ratio per run: how long the metastable window lasts (merge time) divided by how long it took to form (formation time). A large ratio means a genuine slow timescale; a ratio near 1 means the two timescales coincide.

Why this instrument: It is the direct operationalization of the conjecture. "Two timescales are separate" has to become a number before it can be confirmed or denied, and the natural number is the ratio of the slow duration to the fast duration. The subtlety, and the source of the trouble, is in defining those two durations so the ratio measures the dynamics and not an artifact of how the denominator was chosen.

<details>
<summary>The two definitions of the ratio: the confounded one and the Fix 5 one</summary>

The first implementation defined separation as

$$\text{separation}_{\text{old}} = \frac{\text{mean plateau width (real prompt)}}{\text{collapse onset (repeated-token control)}}$$

The denominator is the problem. "Collapse onset" was measured on a degenerate control input (a string of repeated tokens), and how fast a model collapses a degenerate input is not comparable across architectures. BERT collapses the repeated-token control at layer 1 (onset = 1), so its ratio is plateau-width / 1 = 8.0 → CONFIRMED.
GPT-2-medium collapses the same control at layer 10, so its ratio is 4.5 / 10 = 0.45 → NO SEPARATION. The numerator and denominator come from different inputs, and the denominator is dominated by an architecture-specific quirk of how each model treats degenerate text.

This produces the self-contradiction the draft flags: a 12-layer model (BERT) is CONFIRMED while a 24-layer model (GPT-2-medium) is NONE. If separation were a depth phenomenon, more layers should mean more separation. The table breaks any clean depth-threshold story, not because the dynamics are contradictory, but because the metric was measuring the wrong thing.

Fix 5 discards the cross-input ratio and computes everything from one real-prompt run:

$$\text{separation} = \frac{\text{merge time}}{\text{formation time}}$$

where the formation time is the onset layer of the first stable multi-cluster plateau and the merge time is the number of layers from that plateau to collapse or decay.

Thresholds: ≥ 3.0 = CONFIRMED, 1.5–3.0 = WEAK, < 1.5 = NO SEPARATION. The cluster count is read from HDBSCAN where available (spectral $k$ as fallback), so the multi-cluster test does not depend on the spectral method that degenerates for ALBERT-xlarge. The repeated-token control is kept, but only as a standalone "collapse speed diagnostic," never again as the ratio denominator.

Both quantities now come from the same trajectory, on real text, so the ratio is internally comparable. A formation time of 4 and a merge time of 16 is a clean "the slow phase is 4× the fast phase" statement; it no longer smuggles in how a different input behaves. The numbers this produces do not exist yet (the rerun generates them), so every separation verdict below is provisional.
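As a worked check of the arithmetic in this block, here is a minimal sketch (not the project's code) that applies the Fix 5 thresholds to the formation-4 / merge-16 example and, for contrast, evaluates the superseded cross-input ratio on the BERT and GPT-2-medium numbers quoted above. The function names `fix5_verdict` and `old_cross_input_ratio` are hypothetical; only the thresholds and the input numbers come from the text.

```python
def fix5_verdict(formation_time: float, merge_time: float) -> tuple:
    """Single-run separation ratio and verdict under the Fix 5 thresholds."""
    if formation_time <= 0:
        return 0.0, "NO SEPARATION"
    ratio = merge_time / formation_time
    if ratio >= 3.0:
        return ratio, "CONFIRMED"
    if ratio >= 1.5:
        return ratio, "WEAK"
    return ratio, "NO SEPARATION"


def old_cross_input_ratio(plateau_width: float, control_onset: float) -> float:
    """Superseded ratio: real-prompt plateau width over repeated-token collapse onset."""
    return plateau_width / control_onset


# Example from the text: formation time 4, merge time 16.
example = fix5_verdict(formation_time=4, merge_time=16)

# The two confounded table entries: same formula, denominators from the control input.
bert_old = old_cross_input_ratio(plateau_width=8.0, control_onset=1)
gpt2_medium_old = old_cross_input_ratio(plateau_width=4.5, control_onset=10)

print(example)          # (4.0, 'CONFIRMED')
print(bert_old)         # 8.0
print(gpt2_medium_old)  # 0.45
```

The gap between 8.0 and 0.45 comes almost entirely from the denominators (1 versus 10), which is the cross-input confound that the single-run definition removes.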
</details>

<details>
<summary>The code: the superseded control ratio, and the Fix 5 single-run ratio</summary>

The old control-based ratio (now demoted to a standalone diagnostic, shown for contrast with what produced the contradictory table):

```python
# SUPERSEDED: denominator comes from the repeated-tokens control input
mpw = mean_plateau_width[model]          # numerator: real-prompt plateau width
onset = first_layer_with_mass_above_0p9  # denominator: COLLAPSE CONTROL input
ratio = mpw / onset
interp = ("TWO-TIMESCALE CONFIRMED" if ratio > 2.0 else
          "WEAK SEPARATION" if ratio > 1.0 else "NO SEPARATION")
```

The Fix 5 single-run replacement (faithful to the Fix 5 specification: the function `single_trajectory_separation` is implemented per §0.5; this is its documented logic, not a verbatim paste):

```python
def single_trajectory_separation(layers_data) -> dict:
    """Two-timescale separation from ONE real-prompt run.
    formation_time = first stable multi-cluster plateau onset
    merge_time     = layers from that plateau to collapse/decay
    separation     = merge_time / formation_time"""
    # cluster count per layer: HDBSCAN preferred, spectral k fallback
    k_series = [layer_cluster_count(l) for l in layers_data]
    # multi-cluster plateaus: detect plateaus on k, keep windows with mean k >= 1.5
    plats = detect_plateaus(k_series, window=2, tol=0.5)
    multi = [(s, e) for s, e, mean_k in plats if mean_k >= 1.5]
    if not multi:
        return {"separation": 0.0, "verdict": "NO SEPARATION"}
    formation_time = multi[0][0]                        # onset of first multi-cluster plateau
    collapse_layer = first_collapse_or_decay(k_series)  # k → 1 or structure decays
    merge_time = collapse_layer - formation_time
    separation = merge_time / formation_time if formation_time > 0 else 0.0
    verdict = ("CONFIRMED" if separation >= 3.0 else
               "WEAK" if separation >= 1.5 else "NO SEPARATION")
    return {"formation_time": formation_time, "merge_time": merge_time,
            "separation": separation, "verdict": verdict}
```

The repeated-tokens runs remain in the report under a COLLAPSE SPEED DIAGNOSTICS heading, explicitly marked as not a
metastability test.

</details>

What came out: Provisional; the corrected ratio does not exist until the rerun. Under the old, confounded metric the verdicts were CONFIRMED for BERT-base (8.0), GPT-2-large (8.2), and GPT-2-xl (7.62); NO SEPARATION for GPT-2-medium (0.45); WEAK for ALBERT-base at 36–48 iterations (1.06–1.25); and NO SEPARATION for ALBERT-xlarge (0.25). These should be read as superseded: the BERT-CONFIRMED / GPT-2-medium-NONE inversion is the artifact of the cross-input denominator, not a finding. The Fix 5 single-run ratio will replace all of them. What is not expected to change is the qualitative ordering, because, as E.2 shows, the ordering is governed by a quantity the ratio only indirectly reflects.

The eigenvalues of the value matrix $V$. Its spectral radius (largest eigenvalue magnitude) predicts how fast a model collapses, better than the model's depth or its dimension.

Why this instrument: This is the instrument that explains everything E.1 struggles to measure. The collapse speed of the dynamics is governed by the linear map the value pathway applies at each layer; the spectral radius of that map is the per-step amplification factor, so it sets the geometric rate of collapse. The paper's Theorem 6.1 predicts that higher dimension $d$ means faster convergence. The eigenspectrum lets us test that directly, and falsify it, by checking whether $d$ or the spectral radius is the variable that actually orders the models by collapse speed.

<details>
<summary>What the V spectrum encodes, and the Theorem 6.1 test</summary>

For the value matrix $V$ (square, $d_{\text{model}} \times d_{\text{model}}$, for all three architectures' full projection), two spectra carry different information: the singular values give magnitudes but are sign-blind, while the eigenvalues are complex and carry sign and phase.

The spectral radius $\rho(V) = \max_k |\lambda_k|$ is the dominant per-layer amplification. For an iterated linear map, structure along the dominant eigendirection grows like $\rho^{\,t}$ over $t$ layers.
So $\rho$ controls how many layers it takes to collapse: a back-of-envelope estimate for ALBERT-xlarge, $\rho = 1.278$, is $\log 0.9 /\log 1.278 \approx 75$ iterations to reach high mass, which is why it has not collapsed at 48 and matches its observed slow dynamics.

Theorem 6.1 predicts higher $d$ → faster convergence. The decisive comparison is ALBERT-base ($d = 768$) vs. ALBERT-xlarge ($d = 2048$). If the theorem governs, the higher-dimensional xlarge should collapse faster. It collapses far slower: ALBERT-base collapses by ~24 iterations, ALBERT-xlarge has not collapsed at 48. The dimension prediction is backwards for this pair. The variable that does order them correctly is $\rho(V)$ (lower spectral radius, slower collapse), so the spectral radius, not $d$, is the governing quantity. That is a direct falsification of the Theorem 6.1 prediction in trained models.

Order the models by $\rho(V)$ rather than by depth and the contradiction dissolves. BERT-base has $\rho \approx 0.94$, firmly in the fast-clean regime, despite being only 12 layers. GPT-2-medium has $\rho = 3.21$ (24 layers, no separation); GPT-2-large has $\rho = 1.38$ (36 layers, separation). The "depth threshold between 24 and 36 layers" is really a spectral-radius threshold that happens to fall between medium and large (within the GPT-2 family). Depth is a within-family proxy that breaks the moment BERT (shallow but low-$\rho$) is included. The ratio table looked contradictory because it was sorted by the wrong axis.
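The growth claim used in this block, that an iterated linear map grows like $\rho^{\,t}$ along its dominant eigendirection, can be checked on a toy matrix. The sketch below is illustrative only: the matrix is synthetic (not a trained value matrix), its largest eigenvalue magnitude is set to 1.278 purely as an example, and `spectral_radius` and `empirical_growth_rate` are hypothetical helper names. It also ignores attention and normalization, which the real per-layer dynamics include.

```python
import numpy as np


def spectral_radius(M: np.ndarray) -> float:
    """Largest eigenvalue magnitude of a square matrix."""
    return float(np.abs(np.linalg.eigvals(M)).max())


def empirical_growth_rate(M: np.ndarray, x0: np.ndarray, steps: int = 200) -> float:
    """Per-step norm growth of x -> M x, measured at the last of `steps` iterations.

    The vector is renormalized each step so nothing overflows. The one-step
    norm ratio approaches the spectral radius when the dominant eigenvalue is
    real and strictly largest in magnitude.
    """
    x = x0 / np.linalg.norm(x0)
    rate = 1.0
    for _ in range(steps):
        y = M @ x
        rate = float(np.linalg.norm(y))
        x = y / rate
    return rate


rng = np.random.default_rng(0)
d = 6
# Synthetic spectrum with mixed-sign real eigenvalues; largest magnitude is 1.278.
eigenvalues = np.array([1.278, -0.9, 0.7, -0.5, 0.3, 0.1])
P = rng.standard_normal((d, d))
M = P @ np.diag(eigenvalues) @ np.linalg.inv(P)

rho = spectral_radius(M)
rate = empirical_growth_rate(M, rng.standard_normal(d))
print(f"spectral radius = {rho:.4f}, empirical per-step growth = {rate:.4f}")
```

After enough iterations the two numbers agree, which is the sense in which $\rho$ acts as a per-layer amplification factor. The non-dominant eigenvalues only affect how many steps it takes to reach that regime.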
</details>

<details>
<summary>The code: extracting V and computing the sign-aware spectrum</summary>

The value matrix is pulled per architecture (per-layer for BERT/GPT-2, one shared matrix for ALBERT), then the sign-aware eigenspectrum is computed:

```python
import numpy as np
from scipy.linalg import svdvals  # assumed import; the excerpt does not show where svdvals comes from

# extraction differs by architecture; GPT-2 slices V out of the fused c_attn
if "gpt2" in model_name:
    for i, block in enumerate(model.h):
        d = block.attn.c_attn.weight.shape[1]
        v = block.attn.c_attn.weight[:, 2 * d // 3:].detach().cpu().float().numpy()  # last third = V
        v_matrices.append((f"layer_{i}", v))
# bert: layer.attention.self.value.weight per layer; albert: one shared attn.value.weight

for name, V in v_matrices:
    sv = svdvals(V)  # magnitudes (sign-blind)
    if V.shape[0] == V.shape[1]:
        eigs = np.linalg.eigvals(V)  # complex: carries sign + phase
        real_arr = np.real(eigs)
        is_complex = np.abs(np.imag(eigs)) > 0.01 * (np.abs(real_arr) + 1e-8)
        spec_radius = float(np.abs(eigs).max())   # ← collapse-rate predictor
        frac_pos = float((real_arr > 0).mean())   # attractive directions
        frac_neg = float((real_arr < 0).mean())   # repulsive directions
        frac_complex = float(is_complex.mean())   # rotational directions
```

`spec_radius`, `eig_frac_pos_real`, `eig_frac_neg_real`, and `eig_frac_complex` are stored per layer in `v_eigenspectrum.json` for cross-referencing against the Group A/B plateau and collapse locations.

Methodological seam (Phase 2 refinement). This Phase 1 instrument eigendecomposes $V$ alone. Phase 2 argues the physically meaningful operator is the composed OV circuit $W_O W_V$, the actual residual-stream-to-residual-stream map, and that eigendecomposing $V$ in isolation omits the output projection. Phase 2's `extract_ov_circuit` replaces this. The Phase 1 spectral-radius numbers are directionally robust (the ordering and the Theorem 6.1 falsification hold), but the precise operator is refined there; the Phase 1 figures should be read as the $V$-only proxy, not the final OV-circuit values.

</details>

What came out: The spectral radius of $V$ orders the models by collapse speed, and dimension does not.
The GPT-2 family shows an abrupt drop in mean spectral radius, from 3.21 at GPT-2-medium (24 layers) to 1.38 at GPT-2-large (36 layers), a 2.3× drop with no intermediate value, coinciding exactly with the onset of two-timescale separation. BERT-base ($\rho \approx 0.94$) behaves like the large/fast-clean regime despite its shallow depth, confirming that the governing variable is learned ($\rho$), not structural (depth or $d$). Theorem 6.1's dimension prediction is falsified by the ALBERT-base-vs-xlarge comparison: the higher-dimensional model collapses slower, opposite to the prediction, and $\rho(V)$ explains why. These spectral-radius numbers are from the Phase 1 $V$-only extraction; the qualitative claims survive the Phase 2 OV-circuit refinement, but the exact magnitudes will move.

The two-timescale separation, measured properly (Fix 5, single-run), is the open quantitative question whose numbers await the rerun, but the structure behind it is already clear from the value-matrix spectrum. Separation is real and architecture-specific, and it is governed by the spectral radius of $V$, not by depth or dimension. That single finding does three things: it resolves the self-contradiction in the old ratio table (wrong sorting axis), it falsifies the paper's Theorem 6.1 dimension prediction directly, and it hands Phase 2 a concrete target: the sign-and-phase structure of $V$'s eigenvalues, where the attractive/repulsive/rotational directions live that Group D's energy violations said must exist. The collapse-speed law that comes out of Phase 1 is therefore not "deeper models cluster faster" but "lower-spectral-radius value matrices collapse slower," with depth a within-family confound that the BERT comparison exposes.

Code excerpts are from `plots.py` (`analyze_value_eigenspectrum`) and `reporting.py` (the superseded collapse-control ratio block).
The $V$-extraction and eigenspectrum code is shown close to verbatim; `single_trajectory_separation` is shown faithful to its Fix 5 specification (the implemented body was not directly retrievable). All separation verdicts are provisional pending the Fixes 1–8 rerun under the single-run definition; the spectral-radius ordering and the Theorem 6.1 falsification are the robust results, refined not overturned by Phase 2's OV-circuit operator.

Finding: CONFIRMED (separation) / FALSIFIED (Thm 6.1). The spectral radius of $V$ orders the models by collapse speed, and dimension does not. The GPT-2 family shows an abrupt drop in mean $\rho(V)$, from 3.21 at gpt2-medium (24 layers) to 1.38 at gpt2-large (36 layers), a 2.3× drop with no intermediate value, coinciding exactly with the onset of two-timescale separation. BERT-base ($\rho \approx 0.94$) behaves like the fast-clean regime despite being only 12 layers, confirming the governing variable is learned ($\rho$), not structural (depth or $d$). Theorem 6.1's dimension prediction is falsified by the matched ALBERT pair: the higher-dimensional xlarge ($d = 2048$, $\rho = 1.278$) collapses slower than base ($d = 768$), opposite to the prediction, and $\rho(V)$ explains why, including why xlarge has not collapsed at 48 iterations (~75 iterations expected from its $\rho$). Caveats: the separation ratios under the corrected Fix 5 definition do not exist until the rerun (the old CONFIRMED/NONE verdicts are superseded artifacts of the cross-input denominator); and these are the Phase 1 $V$-only spectral numbers. Phase 2 refines the operator to the composed OV circuit $W_O W_V$, which moves the magnitudes but not the ordering or the falsification.

---

Clustering is universal, metastability is real, and the two timescales are real above a depth threshold.
On the descriptive level, the idealized theory holds up well: trained transformers really do reorganize their tokens into clusters that persist and then slowly merge, across English prose of every length, and (pending the prompt-config reconciliation) across non-English text, code, and equations as well. The phenomenon is robust to what the model is reading.

But the energy never strictly increases throughout all layers. The interaction energy that the theory's gradient flow must drive monotonically upward falls somewhere in every single run. This is not measurement noise and it is not a rare edge case; it is a systematic feature of learned attention. The clustering is produced by something that contains a repulsive component, and the pure-attraction model has no such component. The trained dynamics look like the idealized flow from a distance and are not the idealized flow up close.

Dimension is a red herring. The theory predicts higher-dimensional models converge faster; the matched ALBERT pair shows the opposite, and the variable that actually orders the models by collapse speed is the spectral radius of the value matrix. Collapse speed is a learned quantity, not a structural one, which is why a shallow model (BERT) can sit in the fast-clean regime and a deep one need not.

This produces two regimes. Models with low spectral radius and weight-level clustering onset are attractor-driven: the architecture clusters at a fixed depth regardless of input (the GPT-2 family from large up, BERT by its $\rho$). Models whose clustering onset moves with the content are content-driven (GPT-2-small, BERT by its onset). The distinction is not depth; depth is a within-family proxy that breaks the moment BERT (shallow but low-$\rho$) is included.

The violations are the seam. Everything that confirms the theory hands forward cleanly; the one thing that breaks it is the most informative result in Phase 1, because it points at a specific mechanism.
Phase 2 asks whether the value matrix's mixed-sign eigenspectrum is the source of the repulsion: whether the negative and complex eigenvalues of $V$ (refined to the OV circuit) are where the energy-lowering, cluster-separating force actually lives.

The energy violations and the $V$ spectral-radius finding hand directly to Phase 2: the localized energy-drop pairs are the targets, and the sign-and-phase structure of the value matrix (refined to the OV circuit $W_O W_V$) is where Phase 2 looks for the repulsive directions the violations imply must exist. The cluster labels and trajectories feed the later mechanistic work. The per-head Fiedler split is a standing architectural fact the later phases build on.

Three loose ends close out of Phase 1 explicitly: the GPT-2 routing claim awaits the causal-mask baseline; the two-timescale ratios await the Fix 5 / Fix 7 rerun; and the untrained-control and full-prompt-set results await the config reconciliation. Phase 2c (the neural-dynamics comparison) is cut from the series.

---

File manifest. Per-run outputs. Cross-run report location. Plateau-sensitivity sweep results. Untrained-control run. Length-sweep results.

Code excerpts throughout are from `core/models.py`, `core/metrics.py`, `core/clustering.py`, `sinkhorn.py`, `cluster_tracking.py`, `plots.py`, `analysis.py`, and `reporting.py`, plus the Phase 1 test suite. Snippets are lightly trimmed (comments and unrelated branches removed) but otherwise verbatim, except where a function body was reconstructed faithful to its specification (noted at the point of use). Provisional findings will be re-derived after the Fixes 1–8 rerun; the GPT-2 routing result is suspended pending the Fix 3 causal-mask baseline and BERT control.