# Analysis of Metastable States in the Transformer Activation Space

> Source: <https://www.lesswrong.com/posts/Kf8zkKnDajfGGHG7w/analysis-of-metastable-states-in-the-transformer-activation>
> Published: 2026-06-06 21:30:13+00:00

This is the first entry in a sequence.Over about ten parts, this series will work through a few humble experiments that test a mathematical theory of attention against real trained transformers.A project summary:a recent paper by Geshkovski, Letrouit, Polyanskiy, and Rigollet models attention as a dynamical system on the sphere and proves that tokens cluster and drift toward consensus, with a metastable two-timescale structure along the way — under the assumption that Q, K, and V are all the identity. This sequence asks how much of that survives in trained models (most of it does), finds the one prediction that failsuniversally(the energy is not monotone), and then spends the rest of the series tracking that failure to its mechanism in the value matrix.

Links:[all the code for every phase.][GitHub repository]:[Geshkovski et al., the theory this whole investigation is built on][Original paper]:[a video going over the original paper for those who prefer video][YouTube walkthrough]:

This is the first in what will be a sequence of ~10 posts. Each post will be aligned with a phase from the github.

**Metastability is real in trained transformers — but the mechanism the theory proposes for it is not.** Tokens cluster into metastable groups the way the idealized model predicts; the energy that theory says drives the process is violated in every model tested; and how fast a model collapses is set by its value matrix, not by depth or width as the theory claims.

Three predictions of the idealized theory survive the move to trained models, and survive cleanly. Token representations **cluster** over layers (CONFIRMED, every model, every real prompt). The clusters **persist** across runs of layers — metastable plateaus — before reorganizing (CONFIRMED). And the two timescales of metastability — fast formation, slow merging — are **genuinely separate** above a depth threshold (CONFIRMED). Two predictions break. The interaction energy the theory requires to **increase monotonically** falls somewhere in every model, every prompt, every temperature tested — the clustering is real but it is not the pure-attraction gradient flow the theory describes (FALSIFIED). And collapse speed is governed by the **spectral radius of the value matrix**, not by model dimension; the higher-dimensional model in our matched pair collapses *slower*, the reverse of what Theorem 6.1 predicts (FALSIFIED). Two honest caveats carry through: the GPT-2 attention-routing result is **suspended** because the causal mask can manufacture the signal on its own, and the two-timescale ratio magnitudes are **provisional** pending a rerun under a corrected definition. The clustering also does not appear in an untrained model with the same architecture, which is what makes "the network learns to do this" a finding rather than a property of the wiring.

**Scope.** Everything here measures what happens to token *representations* inside the stack — the geometry of the residual stream, layer by layer. It is not a claim about what the model outputs or how it behaves. When a model's tokens collapse to a near-single point in the residual stream, that is a statement about internal geometry; whether and how it affects the model's predictions is a separate question Phase 2 takes up.

Some Terminology

- **Sphere projection**— each token's vector is divided by its length, so all tokens live on the unit sphere. Similarity between two tokens is their inner product (cosine of the angle): +1 same direction, 0 orthogonal, −1 opposite.

- **β (inverse temperature)** — how sharply attention concentrates. β = 0 is uniform attention; large β is near-hard selection. The theory's metastability is a *large-β* phenomenon.

- **Mass-near-1** — fraction of token pairs with inner product above 0.9. A direct read on "how much of the cloud has clustered." Goes from 0 to 1.

- **Effective rank** — how many dimensions the token cloud effectively spans. Falls toward 1 as the cloud collapses to a point.

- **CKA**— a single 0-to-1 score for how similar one layer's whole geometry is to the next. Near 1 across several layers = a plateau.

- **HDBSCAN** — a clustering algorithm that discovers the number of clusters from the data and can label a token "noise" if it belongs to no cluster yet.

- **Plateau** — a run of layers over which a signal stays flat; the operational stand-in for a metastable window.

- **Fiedler value (λ₂)** — a graph measure of how close the attention graph is to falling into separate pieces. Near 0 = attention split into clusters; large = everything mixing.

- **E_β (interaction energy)** — a scalar the theory's dynamics must push monotonically *up*. A step where it falls is a "violation" — evidence of a repulsive force the idealized model cannot produce.

- **ρ(V) (spectral radius)** — the largest eigenvalue magnitude of the value matrix; the per-layer amplification factor that sets collapse speed.

- **OV circuit** — the composed output-times-value map W_O W_V; the residual-stream-to-residual-stream operator Phase 2 argues is the right object to eigendecompose (Phase 1 uses V alone as a proxy).

The paper this post examines proves that, in an idealized transformer, token representations sitting on the unit sphere are pulled together by attention, group into clusters, and in the long run collapse to a single point. The proof is clean, but it holds under an assumption no trained model satisfies: that the query, key, and value projections are all the identity matrix.

This post answers one question. Does the clustering-and-metastability picture the paper establishes for the idealized model show up in real trained transformers — and where does it break once Q, K, and V are arbitrary learned matrices? This is Phase 1: *what happens*. We measure whether the predicted behavior appears, and we record exactly where the trained models diverge from the theory. Phase 2, which follows, takes the divergences Phase 1 finds and asks *why* — what mechanism in the learned weights produces them.

The model: every token is a point on the unit sphere. At each layer, attention moves each token toward a similarity-weighted average of the others — tokens it already points toward pull on it hardest, with the sharpness of that weighting set by an inverse temperature β. Similar tokens attract. Iterate the map and the tokens drift together; the long-run limit is a single point, every token identical. Consensus.

The part that makes the theory interesting is not the endpoint but the route to it. For large β the collapse does not happen all at once. It happens on two separate timescales. First, fast: the cloud snaps into a small number of distinct clusters. Then, slow: those clusters merge pairwise, one absorption at a time, until only one survives. Between the moment the clusters form and the moment the last two merge, the structure is effectively frozen — a long window in which the cluster count holds steady while the dynamics idle. That frozen window is **metastability**, and the paper conjectures it appears for large β, before the final collapse. A claim about metastability is therefore a claim about *time*: not just that clusters exist, but that they persist for a stretch that is long relative to how quickly they formed.

The gap that makes this an empirical project is the Q = K = V = I assumption. Under it, attention is pure symmetric attraction with no learned reshaping. A trained model has none of that. The query and key matrices reshape which tokens count as similar in the first place; the value matrix reshapes how a token actually moves once it is pulled. The idealized dynamics are purely attractive — they can only ever bring tokens together — but a learned value matrix can carry negative or complex eigenvalues, which would let a layer push tokens *apart*. None of the paper's convergence guarantees transfer directly once these matrices are arbitrary. The question is how much of the qualitative behavior survives the move from the idealized map to a learned one.

Concretely, the paper leaves several open problems; this post tests three of them, by content rather than by number. The first is whether the clustering and the two-timescale metastability survive in trained models at all (Groups A and B). The second is whether the gradient-flow structure survives — the theory's dynamics ascend a specific energy, and that energy must increase monotonically; we test whether it does (Group D). The third concerns what governs convergence speed: the paper's Theorem 6.1 ties faster collapse to higher dimension, and we test that prediction directly (Group E). One instrument group sits alongside these to check the *mechanism* — whether attention itself, not just the resulting geometry, shows the structure the theory requires (Group C).

Seven models, chosen so that three independent axes can be read off the same measurements, plus one untrained control.

The GPT-2 family — small, medium, large, and xl — is the depth sweep: 12, 24, 36, and 48 transformer blocks at model dimensions of 768, 1024, 1280, and 1600, all sharing the same decoder architecture and causal attention mask. Holding the architecture fixed and varying depth is what lets any depth-threshold claim be tested within a single family. BERT-base is the encoder counterpart: 12 layers, dimension 768, bidirectional attention. Comparing it against GPT-2 isolates the encoder-versus-decoder axis, and — as it turns out — exposes a confound, since GPT-2's causal mask can manufacture cluster-like attention structure on its own (Group C).

ALBERT-base and ALBERT-xlarge supply the third axis. ALBERT shares a single transformer layer's weights across all of its depth, applying the same map repeatedly. That makes the layer index a literal iteration count, so for ALBERT "depth" is a clean time knob — the closest thing among real models to the paper's iterated map, and the natural place to look for two-timescale behavior. We run ALBERT to extended iteration counts (a dense sweep up to ~48–60 iterations) precisely to give the slow timescale room to play out. The two ALBERT sizes also give a dimension contrast at otherwise matched architecture: base at dimension 768, xlarge at 2048. That single pair is the decisive test of the Theorem 6.1 dimension prediction (Group E), because everything except dimension is held roughly fixed.

A randomly-initialized `albert-base-v2-random` — the ALBERT-base architecture with untrained weights — is run as a control. It answers the question that makes the whole project non-trivial: is the clustering *learned*, or does the architecture produce it with any weights at all? If a random-weight model clustered the same way, "trained transformers cluster their tokens" would be a statement about the wiring, not about what training discovers. This is the null the universality claims are measured against, and its result belongs in the verdict table.

**Prompts.** The prompt set is deliberately diverse, so that "universal" means *across language, modality, and length*, not across four similar English paragraphs. The core English prose prompts span a wide length range — `short_heterogeneous` (~23 tokens, semantically mixed), `paper_excerpt` (~306 tokens), `wiki_paragraph` (~450 tokens), and `sullivan_ballou` (~489 tokens) — and the set has since been extended to roughly nine real prompts adding at least one **non-English** prompt, **source code**, and **LaTeX / mathematical notation**, plus a couple of others. The diversity is itself a result: if clustering, plateaus, and energy violations appear in code and in equations as readily as in English prose, the phenomenon is a property of the dynamics, not of natural-language statistics. It also opens a sub-question worth answering explicitly in the synthesis — *does any prompt class behave differently?* If code or LaTeX clusters on a visibly different timescale, that is a new sub-finding; if it does not, "even code and equations cluster the same way" is a strong line. A length-sweep facility can truncate `wiki_paragraph` to a series of token targets (`wiki_64`, `wiki_128`, …) to separate prompt length from prompt content as a variable.

The degenerate control, `repeated_tokens`, is a string of one token repeated. It has no real structure to cluster around, so it serves one purpose only — a collapse-speed diagnostic — and is **excluded from every metastability aggregation**, reported separately under a collapse-controls heading. An earlier version of the two-timescale ratio misused this control's collapse layer as a denominator and produced a self-contradictory table; that error is corrected in Group E, and the control is now confined to its diagnostic role.

**What we extract.** For every (model, prompt, layer) triple, two raw objects: the residual-stream activations and the attention weights. The activations are L2-normalized onto the unit sphere — one line of code — and that projection is the entire bridge between the paper's geometry and the model's internals. From the sphere-projected activations we form the per-layer Gram matrix of pairwise inner products. Every quantity in every group is computed from these objects: the Gram matrix, the normalized activations, and the attention weights. Nothing else is stored; everything else is derived.

The analysis runs in five groups, each answering one question, in an order that builds from existence to mechanism to the theory's sharpest prediction. Group A asks whether clusters form at all. Group B asks whether they *persist* — the operational test of metastability. Group C turns to the attention matrix and asks whether it carries the same story the geometry does. Group D tests the theory's most falsifiable claim, that a specific interaction energy increases monotonically. Group E asks whether the two timescales are genuinely separate, and what governs the gap.

Each group below opens with its answer, then the question, then the finding. The method and concept boxes are in dropdowns; the result does not depend on opening them.

*Question: does the token cloud actually reorganize into distinct groups as depth increases, and does the reorganization look like what the theory predicts?*

The three instruments in this group each answer a slightly different version of the same question. The inner-product histogram and mass-near-1 ask whether the raw pairwise geometry is changing. Effective rank asks whether the token cloud is collapsing onto fewer dimensions. HDBSCAN asks whether that collapse is structured — i.e., whether distinct groups are forming rather than a smooth smear. Together they triangulate on "yes, clusters form" without relying on any single algorithm or threshold.

Every instrument in this group reads from two objects computed once per layer: the sphere-projected activations and their Gram matrix. Everything downstream is a function of those two. The projection is the entire bridge between the paper's geometry and the model's residual stream — one line:

``` python
import torch.nn.functional as Fdef layernorm_to_sphere(activation: torch.Tensor) -> torch.Tensor:    """L2-normalize each token vector onto the unit sphere."""    return F.normalize(activation, p=2, dim=-1)
```

The per-layer loop then computes the normed activations and the Gram matrix once, and hands both to every metric below:

```
normed = layernorm_to_sphere(activations).numpy()   # (n_tokens, d)  — on S^{d-1}G      = normed @ normed.T                           # (n_tokens, n_tokens) — ⟨x_i, x_j⟩
```

`G[i, j]`

is exactly $\langle x_i, x_j \rangle$. Every quantity in Group A is read off `G`

or off `normed`

.

*At each layer, we compute the pairwise inner products between every pair of token representations and histogram them. Mass-near-1 is the fraction of pairs above 0.9.*

**Why this instrument:** The paper's entire theoretical machinery is built around what happens to pairwise inner products over time. Its own Figure 1 shows this exact histogram for ALBERT XLarge. Running the same diagnostic is the most direct possible replication: if the paper's qualitative story holds in trained models, this histogram should show a spike migrating toward 1 as depth increases. If the spike never forms, the clustering story is over before it starts.

<details> <summary>What are inner products on the sphere, and how do you read the histogram?</summary>

Layer normalization — specifically RMS normalization, which every model here uses — divides each token's residual-stream vector by its Euclidean norm before passing it to the next layer. After this operation, every token vector $x_i$ satisfies . The token cloud lives on the unit sphere .

This matters because it changes what "distance" means. On the sphere, the natural measure of similarity between two points is their inner product:

where is the angle between the two vectors. The inner product is exactly the cosine similarity when both vectors are unit length.

The range is :

For a layer with $n$ tokens, there are distinct pairs. Compute every pairwise inner product and put them in a histogram. The x-axis runs from to . The y-axis is density (normalized so the histogram integrates to 1).

Reading the histogram across layers:

**Early layers — spread near 0.** In high dimensions, random vectors on the sphere concentrate near the equator. By concentration of measure, most pairwise inner products between randomly-oriented high-dimensional unit vectors land close to 0. A histogram piled near 0 at early layers isn't surprising — it's the expected baseline before the dynamics have done anything. This is true for , which is always the case here ().

**Clustering in progress — a spike growing at 1.** As layers push similar tokens together, a subset of pairs achieves high inner product. The histogram develops a second peak migrating toward 1 while the main mass stays near 0. Each such peak is a cluster: the pairs at high inner product are tokens that have been pulled together.

**Full collapse — everything at 1.** When all tokens have merged into a single point on the sphere, every pairwise inner product equals 1. The histogram is a single spike at 1. This is what the theory's long-run prediction (consensus / a single Dirac mass) looks like empirically.

Tracking the full histogram across all layers, models, and prompts produces an unwieldy number of plots. Mass-near-1 compresses the histogram into a scalar: the fraction of pairs with inner product above a threshold (we use 0.9).

This is a direct, threshold-based read on "how much of the token cloud has clustered so far." A layer where mass-near-1 goes from 0 to 0.4 in one step is a layer where 40% of pairs snapped into high agreement at once — a merge event.

One alternative is Euclidean distance. We don't use it here because the paper's energy, theorems, and convergence results are all stated in terms of inner products on the sphere. Using the same quantity makes comparison direct. Also, on the unit sphere, , so Euclidean distance is a monotone function of inner product anyway — they carry the same information, just on different scales.

Another alternative is cosine similarity computed on un-normalized activations. We don't use it because layer normalization is part of the model's computation, not a preprocessing choice. The normalized representations *are* what the model is computing with; measuring inner products on them is measuring the quantity the paper's dynamics track.

Once mass-near-1 reaches near 1.0, the token cloud is essentially a point. This is consistent with the paper's prediction but it also means the model has lost all token-level diversity — every token has the same representation. Whether this happens in the residual stream *before* the output head, and what it implies for the model's function, is a separate question. Phase 1 documents when and whether it happens; Phase 2 investigates why.

The code: from Gram matrix to histogram and mass-near-1

The pair extraction is one helper. It takes the upper triangle of `G`

(each pair once, no self-pairs):

``` php
def pairwise_inner_products_from_gram(G: np.ndarray) -> np.ndarray:    """Upper-triangle pairwise cosine similarities from a pre-computed Gram matrix."""    n   = G.shape[0]    idx = np.triu_indices(n, k=1)   # k=1 drops the diagonal (self-pairs = 1.0)    return G[idx]
```

The histogram, the summary scalars, and mass-near-1 are then four lines in the per-layer loop:

```
ips                  = pairwise_inner_products_from_gram(G)lr["ip_mean"]        = float(ips.mean())lr["ip_std"]         = float(ips.std())lr["ip_histogram"]   = np.histogram(ips, bins=50, range=(-1, 1))[0].tolist()lr["ip_mass_near_1"] = float((ips > 0.9).mean())
```

The last line is the formula above, verbatim: `(ips > 0.9)`

is the indicator $\mathbf{1}[\langle x_i, x_j \rangle > 0.9]$, and `.mean()`

over the upper triangle is the division by $\binom{n}{2}$. The histogram uses 50 fixed bins on $[-1, 1]$ — fixed range, so the bin edges are comparable across every layer, model, and prompt without renormalizing. This is the array the Figure-1 replication plots: `results["layers"][li]["ip_histogram"]`

is fed directly to the bar chart, one panel per layer.

There is no separate code path for the histogram and for mass-near-1 — they read the same `ips`

array. The scalar is just the histogram with everything above the 0.9 bin edge summed.

**What came out:** Clustering is universal — every model, every real prompt, develops the migrating spike at 1. The *endpoint* is where architecture splits. ALBERT-base drives MaxMass to ~1.0 (full collapse). ALBERT-xlarge, despite being the model the paper's own Figure 1 uses, stays below ~0.30 mass-near-1 even at 48 iterations — its dynamics are slow enough that 48 layers is not enough to collapse. The GPT-2 family clusters hard on long prompts (gpt2-small reaches ~0.87–0.97) but does *not* collapse its degenerate control, which is the expected signature: the control has no real structure to cluster around.

*At each layer, we compute how many dimensions the token cloud is effectively spread across — a continuous, threshold-free measure of geometric collapse.*

**Why this instrument:** Mass-near-1 measures whether pairs have aligned. Effective rank measures whether the *whole cloud* is collapsing onto a low-dimensional subspace. These are related but distinct: you can have many aligned pairs without full collapse (multiple separate clusters), and the way effective rank falls tells you about the geometry of the collapse process. It also serves as the degeneracy gate for several other metrics — when the cloud is essentially a point, those other metrics become meaningless and we suppress them rather than report noise.

What is effective rank, and why does the entropy form matter?

The naive answer to "how many dimensions is this cloud using" is: count the singular values of the activation matrix that are above some threshold. This is fragile. The threshold is arbitrary — changing it from 0.01 to 0.001 can double the count. There's no principled choice, and the right threshold varies across models, layers, and prompt lengths.

Take the activation matrix (n tokens, d dimensions per token). Compute its singular values . Convert them to a probability distribution by normalizing, then take the exponential of the Shannon entropy:

p_k = \frac{\sigma_k}{\sum_j \sigma_j}, \qquad \text{effective_rank}(X) = \exp!\left(-\sum_k p_k \log p_k\right)

This is the effective rank of Roy and Vetterli (2007). *(See the code box below for an implementation note: an earlier draft of this section wrote the normalization over squared singular values **; the implemented metric normalizes the singular values directly, which is the Roy–Vetterli definition. The two differ.)*

Entropy measures how spread-out a distribution is. A distribution concentrated on one value has entropy 0; a uniform distribution over values has entropy . Exponentiating maps entropy back to a "number of effective components."

Applied to the singular value spectrum:

The key advantage is that it's smooth and weights directions by their share of the spectrum. A direction carrying a negligible fraction of the total barely contributes, even though it would count as a "nonzero singular value" in a threshold-based count.

Effective rank tends to start high at early layers (many directions contribute roughly equally) and fall as clustering progresses (token representations align, the spectrum concentrates in fewer directions). The endpoint depends on architecture:

Unlike mass-near-1, effective rank tracks the *full distribution* of the spectrum, not just the high-agreement pairs. A model with two perfectly separated clusters of equal size will show mass-near-1 = 0 (no cross-cluster pairs near 1) but effective rank ≈ 2 (two principal directions). They're measuring different things and the combination is more informative than either alone.

When effective rank drops low enough, the token cloud is essentially a point-mass in all but name. At this point:

**CKA between consecutive layers goes trivially to 1.** Any two representations that are both near-point-masses have nearly identical Gram matrices — CKA can't distinguish them. A CKA value of 0.99 at a degenerate layer means "both layers are collapsed," not "the geometry is stable in an interesting way."

**Nearest-neighbor assignment becomes noise.** When all tokens are nearly identical, which token is nearest to which is determined by floating-point rounding at the scale of $10^{-6}$. NN-stability values computed here are not interpretable.

**Spectral cluster counts are meaningless.** The Laplacian eigengap method finds clusters in a graph; a near-point-mass graph has no structure to find.

Rather than report these quantities and let them corrupt the analysis, we gate them on effective rank and suppress them below the threshold (see code box for the exact gate values). The effective rank itself is still informative at rank 2 — a 2D cloud on the sphere can still have meaningful structure (two clusters separated by a great circle).

There are two variants. **Raw effective rank** is computed on the activation matrix before sphere-projection; it reflects both directional spread and how much the residual norms vary across tokens. **Normed effective rank** is computed on the sphere-projected activations and measures purely directional spread. For the "clustering on the sphere" story — what the paper's dynamics are about — normed is the conceptually right quantity, but the degeneracy gate uses raw effective rank because it is more conservative (it folds in norm collapse too, avoiding false negatives).

The code: the metric and the gate

The metric itself is five lines:

``` python
from scipy.linalg import svdvalsdef effective_rank_from_raw(activations: torch.Tensor) -> float:    """    Spectral entropy of singular values.    SVD runs on raw activations, not L2-normed ones. L2 normalization sets    every token's norm to 1, collapsing the inter-token scale variation that    the singular values measure — svdvals(normed) gives a different answer.    Named _from_raw to make the contract explicit at call sites.    """    sv      = svdvals(activations.numpy())    sv      = sv[sv > 1e-10]              # drop numerical zeros    sv_norm = sv / sv.sum()              # normalize to a probability distribution    entropy = -np.sum(sv_norm * np.log(sv_norm + 1e-12))    return float(np.exp(entropy))
```

**Implementation note — squared vs. unsquared.** The concept box above (in an earlier draft) wrote $p_k = \sigma_k^2 / \sum_j \sigma_j^2$, i.e. normalizing the *variances* / eigenvalues of $X^\top X$. The code normalizes the singular values directly, `sv / sv.sum()`

. These are not the same function; the squared form puts more weight on the leading directions and reports a lower effective rank. The implemented version is the original Roy–Vetterli (2007) definition. If you want the variance-weighted version, square `sv`

before the normalize line. The findings below are computed with the unsquared form as shown.

Two rank variants are now computed per layer. The raw rank drives every degeneracy gate; the normed rank (directional spread on the sphere, independent of residual-norm growth) is kept for reporting:

```
# Raw: captures scale + directional collapse; used for all gates.# Normed: directional spread on the sphere only; for reporting.lr["effective_rank"]        = effective_rank_from_raw(activations)lr["effective_rank_normed"] = effective_rank_from_normed(normed)
```

The gate is then applied at the call site. Effective rank is computed first, then used to decide whether CKA is even meaningful at this layer:

``` python
from core.config import DEGENERATE_RANK_THRESHOLD   # = 2# CKA vs previous layer — suppressed when the cloud is a near-point-mass.# Centering then produces noise-dominated vectors and the Frobenius norms# collapse to near-zero, so the ratio is numerically meaningless.if prev_normed is not None and lr["effective_rank"] >= DEGENERATE_RANK_THRESHOLD:    lr["cka_prev"] = linear_cka(normed, prev_normed)else:    lr["cka_prev"] = float("nan")
```

The gate is a single shared constant, `DEGENERATE_RANK_THRESHOLD = 2`

, used identically by CKA, NN-stability, and energy-drop suppression. It was previously split (CKA gated at `>= 3.0`

, NN at `>= 2.0`

); unifying it at 2 matches the geometric claim in the concept box above — at rank 2 the cloud is still genuinely 2-D on the sphere (two clusters separated by a great circle), so CKA and NN remain meaningful, while rank ≈ 1 is the point-mass regime where they become float-noise. Degenerate layers drop out of the plateau analysis rather than contributing false near-1 values. Raise the threshold in one place if post-rerun rank-2 CKA turns out to be erratic.

how we know the metric is correct (unit tests)

Two degenerate cases pin the behavior at both ends of the range:

``` python
def test_rank1_matrix_returns_one(self, rank1_tensor):    # Every row identical (outer product v⊗w) → one non-zero singular value    # → entropy 0 → effective rank 1.    rank = effective_rank_from_raw(rank1_tensor)    assert rank == pytest.approx(1.0, abs=1e-4)def test_uniform_sv_returns_d(self, uniform_sv_tensor):    # D equal singular values (orthonormal columns from QR) → entropy log(D)    # → effective rank exp(log(D)) = D.    rank = effective_rank_from_raw(uniform_sv_tensor)    assert rank == pytest.approx(D, abs=0.5)
```

A fully collapsed cloud returns 1.0; a maximally spread one returns $d$. Anything observed in between is a real interpolation, not an artifact of the estimator.

**What came out:** Effective rank tracks mass-near-1 inversely — as the spike at 1 grows, the cloud collapses onto fewer directions, exactly as it should if the two are measuring the same underlying process from different angles. The architecture split reappears here. ALBERT-base on a Wikipedia paragraph collapses to MinRank ~1.4 (essentially a point). ALBERT-xlarge on long prompts stays above ~55 — it never enters the degenerate regime within its depth. The GPT-2 family collapses hard: gpt2-small to ~1.59, gpt2-large to ~5.45. The gate fires often enough on the collapsing models that their late layers are correctly excluded from CKA and NN-stability rather than reported as spurious stability.

*HDBSCAN is a density-based clustering algorithm that discovers the number of clusters from the data and assigns a "noise" label to tokens that don't belong to any cluster.*

**Why this instrument:** Effective rank tells you the cloud is collapsing onto fewer dimensions. HDBSCAN asks whether that collapse is structured into distinct groups. It's the primary cluster-membership method throughout the analysis, for two reasons. First, we don't know the number of clusters in advance — it varies by model, layer, and prompt, and an algorithm that requires k upfront would require us to assume the answer. Second, "this token belongs to no cluster yet" is a meaningful state during metastability, and HDBSCAN can say that honestly rather than forcing every token into the nearest group.

<details> <summary>What is HDBSCAN, and why does it handle this setting better than k-means?</summary>

You have $n$ token vectors on the sphere at a given layer. You want to know: how many distinct groups are there, and which tokens belong to which group?

The answer should handle:

K-means fails on all three counts.

K-means partitions $n$ points into exactly $k$ clusters by minimizing the sum of squared distances from each point to its assigned centroid. The problems in this setting:

**Requires $k$ upfront.** During metastability the number of clusters is the quantity we're trying to measure. Specifying it is assuming the answer.

**Forces every point into a cluster.** A token in transition between two clusters — on the boundary, not yet committed — gets assigned to whichever centroid is closer. This manufactures false membership.

**Assumes spherical, equally-sized clusters.** K-means Voronoi cells are convex and symmetric. Real token clusters on the sphere can be elongated, have different densities, and different sizes. K-means will split large clusters and merge small ones to equalize sizes.

**Non-deterministic.** K-means depends on centroid initialization and can find different local minima on different runs. This makes results harder to reproduce and interpret.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) takes a different approach: define clusters as dense regions separated by sparse regions.

The core insight: if tokens are genuinely clustering, there should be regions of high local density (the cluster cores) separated by low-density gaps. A point is a "core point" if it has at least `minPts`

neighbors within distance $\epsilon$. Clusters grow by linking neighboring core points. Points reachable from a core point but not core themselves are border points. Points in no core point's neighborhood are noise (label = $-1$).

This handles variable-shape clusters naturally — density is local, so elongated clusters work fine. And the noise label is honest: it doesn't force boundary tokens into clusters.

The problem with vanilla DBSCAN: it uses a flat density threshold $\epsilon$ everywhere. This fails when clusters have different densities. A threshold that captures a sparse cluster will merge two nearby dense clusters; a threshold that separates those dense clusters will dissolve the sparse one.

HDBSCAN (McInnes and Healy, 2017) fixes this by building a full density hierarchy.

**Step 1: Mutual reachability distance.** For each point, compute its core distance $d_{\text{core}}(p)$ — the distance to its $k$-th nearest neighbor. The mutual reachability distance between points $p$ and $q$ is:

In sparse regions, distances are "stretched" (replaced by the larger core distances), so sparse points are pushed apart. In dense regions, points are already closer than their core distances, so nothing changes. The result is a distance metric robust to density variation.

**Step 2: Minimum spanning tree (MST).** Build the MST on the full $n \times n$ mutual reachability distance matrix. This is the skeleton of the density structure.

**Step 3: Build the condensed cluster tree.** Simulate removing MST edges in order of increasing mutual reachability distance (equivalently: decreasing density threshold), tracking which clusters split off. This produces a dendrogram of cluster births and deaths.

**Step 4: Extract stable clusters.** Each branch has a "persistence" — how long it survives as a distinct cluster as the threshold rises. The algorithm selects the most persistent set of non-overlapping clusters. Branches that split off and die quickly are absorbed back as noise.

The result is cluster assignments for tokens in stable dense regions, a noise label ($-1$) for tokens in low-density regions or transitions, and a cluster count that was *discovered*, not specified.

During metastability, some tokens have committed to a cluster and others haven't yet. The noise label captures this. A layer where 30% of tokens are labeled noise isn't a failure of the algorithm — it's a signal that the clustering process is mid-transition. Tracking the noise fraction across layers is itself a diagnostic.

We apply HDBSCAN to the sphere-projected token vectors using cosine distance ($1 - \langle x_i, x_j \rangle$) rather than Euclidean distance. This is appropriate because the tokens live on $\mathbb{S}^{d-1}$ and the paper's dynamics are stated in terms of angular relationships.

For most models, we cross-validate the cluster count from HDBSCAN against the spectral eigengap method (Group B). For ALBERT-xlarge, the spectral method fails: a zero mode dominates the Laplacian and it returns $k = 1$ at every layer regardless of actual structure. HDBSCAN doesn't have this failure mode — it operates on local pairwise distances, not global eigenstructure. So for ALBERT-xlarge, all cluster counts, merge events, and plateau identifications are sourced from HDBSCAN. Any `nMerges = 0`

in a spectral-sourced column for ALBERT-xlarge is an artifact of spectral degeneracy; genuine merges are recorded by the HDBSCAN trajectory tracker (see Group B).

</details> <details> <summary>T</summary>

The code: HDBSCAN on cosine distance, with the noise label

Cosine distance is precomputed from the normed activations, then handed to HDBSCAN as a precomputed metric:

``` python
from sklearn.metrics import pairwise_distancesimport hdbscan# 1 − ⟨x_i, x_j⟩ on the sphere; clip removes tiny negative round-offcos_dist = np.clip(pairwise_distances(normed, metric="cosine"), 0, None)hdb        = hdbscan.HDBSCAN(min_cluster_size=2, metric="precomputed")hdb_labels = hdb.fit_predict(cos_dist.astype(np.float64))   # −1 = noise# Cluster count excludes the noise labeln_clusters = len(set(hdb_labels)) - (1 if -1 in hdb_labels else 0)
```

Three things to read off this:

`metric="precomputed"`

means HDBSCAN never sees the vectors, only the cosine-distance matrix. The angular geometry is the only thing it clusters on — consistent with the paper's inner-product dynamics.`min_cluster_size=2`

is deliberately permissive: a "cluster" can be a single merged pair, which is what an early merge event looks like.`n_clusters`

subtracts 1 if and only if the noise label is present. The noise tokens (`-1`

) are counted separately, never folded into a cluster. The noise `(hdb_labels == -1).mean()`

— is tracked per layer as the mid-transition signal described above.The labels array is what feeds every downstream consumer: the trajectory tracker that records merges across layers (Group B), the per-pair semantic/artifact tagging, and the multi-scale nesting analysis all take `hdb_labels`

as their starting point.

</details> <details> <summary>The code: why not k-means — the test that pins the failure modes</summary>

The cluster-count sweep runs both HDBSCAN and a k-means silhouette search, which makes the k-means failure modes directly observable in the tests:

``` python
def test_antipodal_kmeans_best_k_is_2(self, antipodal_normed):    # Two tight clusters at opposite poles. Silhouette peaks at k=2.    result = cluster_count_sweep(antipodal_normed)    assert result["kmeans"]["best_k"] == 2def test_collapsed_agglomerative_count_is_1(self, collapsed_normed):    # Fully collapsed cloud: every cosine distance ~1e-6, below all thresholds.    # Must return a single cluster.    result = cluster_count_sweep(collapsed_normed)    assert result["agglomerative"][_MID_THRESH] == 1
```

K-means recovers the right answer only when you already know to look for $k = 2$ and the clusters happen to be the clean, equal-sized, antipodal case it assumes. HDBSCAN gets the count without that assumption, which is why it — not k-means — is the authoritative membership source throughout.

**What came out:** At the plateau layers identified in Group B, HDBSCAN recovers semantically coherent groups reproducibly — the discovered clusters are not run-to-run artifacts. ALBERT-xlarge trajectories are shorter and noisier than GPT-2's, consistent with its slower dynamics: the clusters are still forming and dissolving rather than locking in. For ALBERT-xlarge specifically, HDBSCAN is the authoritative cluster count, because the spectral method degenerates to $k=1$ there and would otherwise report no structure at all.

*Code excerpts are from *`core/models.py`

* (*`layernorm_to_sphere`

*), *`core/metrics.py`

* (*`pairwise_inner_products_from_gram`

*, *`effective_rank_from_raw`

*), *`core/clustering.py`

* (*`cluster_count_sweep`

*), *`analysis.py`

* (per-layer loop and gates), and the Phase 1 test suite. Snippets are lightly trimmed (comments and unrelated branches removed) but otherwise verbatim.*

Finding — CONFIRMED.Clustering is universal: every model develops the migrating histogram spike at +1, and effective rank falls inversely as mass-near-1 rises — two instruments measuring the same process from different angles. The split is in the endpoint. ALBERT-base drives mass-near-1 to ~1.0 and effective rank to ~1.4 (full collapse to a point). ALBERT-xlarge — the very model the paper's Figure 1 uses — stays below ~0.30 mass-near-1 and above ~55 effective rank even at 48 iterations: its dynamics are slow enough that 48 layers is not enough to collapse it. The GPT-2 family clusters hard on real prompts (gpt2-small reaches ~0.87–0.97) but does *not* collapse its degenerate control — the expected signature of structure-driven clustering. At the plateau layers (Group B), HDBSCAN recovers semantically coherent groups reproducibly, not run-to-run artifacts. For ALBERT-xlarge specifically, HDBSCAN is the authoritative cluster count, because the spectral method degenerates to k = 1 there.

*Question: once clusters form, do they stay stable across multiple layers — or do they immediately dissolve? This is the operational test of metastability.*

Group A established that clusters form. That alone is not metastability. Metastability is a claim about *time*: the structure has to survive for a stretch of layers — a plateau — before the slow merging finishes the collapse. If clusters formed and immediately dissolved, the histogram spike would still appear, but there would be no metastable window. So Group B measures persistence directly, from four angles.

Two of the instruments measure whether the representation is *holding still*: CKA compares the whole geometry of one layer to the next, and nearest-neighbor stability asks the same question at the discrete level of individual token relationships. The third turns "holding still" into a concrete window — the plateau-detection routine that decides where a metastable stretch starts and ends. The fourth follows individual clusters through those windows and records when they merge, which is the actual mechanism the slow timescale is made of.

Everything here reads from the same per-layer objects as Group A — the sphere-projected activations, the Gram matrix, and the HDBSCAN labels — plus one new cross-layer dependency: each instrument compares layer $L$ to layer $L-1$, so the analysis loop carries the previous layer's state forward.

*At each layer, we compare the entire representational geometry to the previous layer with a single similarity score in $[0, 1]$. A run of values near 1 is a plateau; a sharp drop marks its end.*

**Why this instrument:** "Do clusters persist" is a question about whether the layer-to-layer map is approximately the identity over some stretch. CKA answers it at the level of the whole representation rather than any one cluster: it asks whether the pairwise-similarity structure of all tokens is preserved from one layer to the next. It is the primary plateau signal because it is basis-free and scale-invariant — it does not care how the residual stream rotates or rescales between layers, only whether the relational geometry is the same.

<details> <summary>What is CKA, what is HSIC, and what does it actually hold invariant?</summary>

At layer $L$ you have an $n \times d$ activation matrix $X$ (sphere-projected). At layer $L-1$ you have $Y$, same shape. You want one number saying how similar these two *representations* are — not token-by-token, but as geometries.

The natural object is the **representational similarity matrix** (RSM): the $n \times n$ Gram matrix $XX^\top$, whose $(i,j)$ entry is $\langle x_i, x_j \rangle$. Two representations are "the same geometry" if their RSMs agree, regardless of how the underlying coordinates are oriented. This is exactly the inner-product matrix from Group A — CKA is built on top of the same object.

The Hilbert-Schmidt Independence Criterion measures whether two sets of features are statistically related. For linear kernels and , the (unnormalized) linear HSIC reduces to

after centering. This is the alignment between the two RSMs: it is large when pairs of tokens that are close in $X$ are also close in $Y$.

HSIC on its own scales with the magnitudes of $X$ and $Y$, so it is not comparable across layers. CKA normalizes it by the self-similarities:

The result is in $[0, 1]$: 1 means the two RSMs are identical up to an orthogonal transform and isotropic scaling; 0 means they are orthogonal.

This is the property that makes CKA the right plateau instrument:

So a CKA plateau near 1 means: across these layers, the model is rotating and rescaling the token cloud but not reorganizing which tokens are close to which. That is precisely the metastable picture — the clusters are fixed, the dynamics are idling.

CKA is computed on mean-centered activations. Without centering, a single large shared component — a token like `[CLS]`

that every representation has a big projection onto — inflates the similarity regardless of the actual cluster structure. Centering removes that shared offset so CKA reflects the relational structure, not a common bias.

The code: the metric and the plateau read-off

The metric is the normalized HSIC formula, verbatim:

``` php
def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:    """    CKA(X, Y) = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)    X, Y : (n_tokens, d) — already L2-normed; both mean-centered internally.    """    X = X - X.mean(axis=0, keepdims=True)    Y = Y - Y.mean(axis=0, keepdims=True)    # ||Y^T X||_F^2 = tr(X^T Y Y^T X) = ||X.T @ Y||_F^2    YtX       = Y.T @ X                          # (d, d)    numerator = float(np.sum(YtX ** 2))    XtX_norm  = float(np.linalg.norm(X.T @ X, "fro"))    YtY_norm  = float(np.linalg.norm(Y.T @ Y, "fro"))    denom     = XtX_norm * YtY_norm    if denom < 1e-12:        return float("nan")    return float(np.clip(numerator / denom, 0.0, 1.0))
```

In the per-layer loop it is called only when the layer is non-degenerate (the Group A gate), so collapsed layers return `nan`

rather than a spurious 1.0:

``` python
from core.config import DEGENERATE_RANK_THRESHOLD   # = 2if prev_normed is not None and lr["effective_rank"] >= DEGENERATE_RANK_THRESHOLD:    lr["cka_prev"] = linear_cka(normed, prev_normed)else:    lr["cka_prev"] = float("nan")
```

A CKA *plateau* is then just a flat run of this series. The plateau detector (see B.3) is run on the CKA values with a tight tolerance, because CKA near 1 should be very flat inside a real metastable window:

```
cka_pairs    = [(r["layer"], r["cka_prev"]) for r in layers                if not np.isnan(r["cka_prev"])]   # drop layer 0 + degenerate layerscka_series   = [v for _, v in cka_pairs]plateaus_cka = detect_plateaus(cka_series, window=2, tol=0.02)
```

The end of a plateau is flagged separately as the sharpest single-step drop — the layer where consecutive-layer CKA falls the most is the clearest signal that a metastable window has ended:

```
diffs    = np.diff(valid_vs)drop_pos = int(np.argmin(diffs))if diffs[drop_pos] < -0.05:    severity = "SHARP" if diffs[drop_pos] < -0.15 else "MILD"    # marks L → L+1 as the plateau-ending reorganization
```

**What came out:** CKA shows reproducible flat-near-1 runs punctuated by sharp drops — the plateau-and-merge signature the metastability conjecture predicts. The plateaus line up with the windows where the other signals (mass-near-1, spectral-k, HDBSCAN-k) are also flat, which is what gives the plateau identification in B.3 something to agree on. Late degenerate layers correctly drop out as `nan`

rather than producing a trivial CKA = 1, so a collapsed model does not masquerade as a long stable plateau. (Note: some report headers still print "effective_rank < 3" as the suppression label — stale display text; the active gate is `DEGENERATE_RANK_THRESHOLD = 2`

.)

*At each layer, for every token we record which other token is its nearest neighbor. NN-stability is the fraction of tokens whose nearest neighbor is unchanged from the previous layer.*

**Why this instrument:** CKA is a continuous, global measure — it can sit at 0.97 while the fine-grained cluster membership quietly churns underneath. NN-stability is the discrete complement: it tracks identities, not geometry. A token's nearest neighbor either changed or it didn't. This catches reorganization that CKA averages away — if tokens are swapping partners while the overall geometry stays roughly fixed, CKA stays high but NN-stability drops. Used together, a plateau where *both* are high is a much stronger persistence claim than either alone.

<details> <summary>What NN-stability catches that CKA misses</summary>

For each token $i$ at a layer, its nearest neighbor is $\text{NN}(i) = \arg\max_{j \neq i} \langle x_i, x_j \rangle$ — the token it points most directly at on the sphere. NN-stability between layer $L-1$ and $L$ is the fraction of tokens whose nearest neighbor index is identical:

$$\text{NN-stability} = \frac{1}{n} \sum_i \mathbf{1}[\text{NN}*L(i) = \text{NN}*{L-1}(i)]$$

1.0 means every token kept the same nearest neighbor — a perfectly locked plateau. 0.0 means every token's nearest neighbor changed — the cloud is still reorganizing.

CKA is invariant to rotation and measures aggregate relational geometry. Two failure modes it cannot see:

Because each token has exactly one nearest neighbor, the NN map defines a functional graph (every node has out-degree 1). The cycles of this graph are the stable atoms: a mutual pair $i \leftrightarrow j$ is a 2-cycle, and longer cycles are tightly bound groups. At plateau layers, these cycles are the operational definition of a "stable cluster" used in the token-membership reporting — and they are tagged as **semantic** (members are distinct token strings — structurally parallel tokens) or **duplicate** (members are the same token string — positional copies of one word). The semantic-vs-duplicate split keeps positional repeats from being counted as meaningful clusters.

When the cloud collapses to a near-point-mass, every token is nearly equidistant from every other, and which one is "nearest" is decided by floating-point noise at the $10^{-6}$ level. NN-stability computed there is meaningless. So, like CKA, it is suppressed below the rank threshold.

The code: NN indices, the per-layer stability, and degeneracy suppression

The nearest neighbor of each token is one masked argmax on the Gram matrix:

``` php
def nearest_neighbor_indices(G: np.ndarray) -> np.ndarray:    """nn[i] = argmax_{j≠i} G[i, j], from a pre-computed Gram matrix."""    G_masked = G.copy()    np.fill_diagonal(G_masked, -np.inf)   # exclude self (G[i,i] = 1)    return np.argmax(G_masked, axis=1).astype(np.int32)
```

Stability is the fraction unchanged from the previous layer, computed inline in the loop (layer 0 has no predecessor, so it is `None`

):

```
nn = nearest_neighbor_indices(G)lr["nn_indices"] = nn.tolist()if prev_nn is not None:    lr["nn_stability"] = float(np.mean(nn == prev_nn))   # the indicator-mean formulaelse:    lr["nn_stability"] = Noneprev_nn = nn
```

When NN-stability plateaus are extracted, degenerate layers are dropped first — the same rank gate as CKA, on the unified constant:

```
nn_stab_defined = [    (i, v) for i, v in enumerate(nn_stab_vals)    if v is not None and layers[i]["effective_rank"] >= DEGENERATE_RANK_THRESHOLD]nn_series = [v for _, v in nn_stab_defined]nn_plateaus = detect_plateaus(nn_series, window=2, tol=0.02)
```

The semantic/duplicate split is read off the NN functional graph at plateau layers: cycles whose members are distinct token strings are semantic; cycles of identical strings are positional duplicates. That tagging is what lets the report distinguish a real structural cluster from the same word appearing three times.

**What came out:** NN-stability plateaus coincide with CKA plateaus on the real prompts — the discrete and continuous measures agree on where the metastable windows are, which is the cross-check that makes the persistence claim credible rather than an artifact of one metric. At those layers the stable NN-cycles are predominantly semantic (distinct tokens drawn together) rather than duplicate, so the plateaus reflect structural grouping, not repeated tokens. Degenerate late layers are suppressed, so collapsed models do not report spurious perfect stability.

*A plateau is a contiguous run of layers over which a signal stays approximately flat. This routine turns the continuous traces above into concrete metastable windows with a start, an end, and a width.*

**Why this instrument:** "Metastability" only becomes testable once "the structure persists for a while" is an actual interval of layers. Plateau identification is the bridge from traces to windows. It is deliberately simple and applied identically to every signal, so a plateau in mass-near-1, a plateau in CKA, and a plateau in cluster count are detected by the same rule and can be compared.

<details> <summary>What counts as a plateau, and how the signals are (and aren't) combined</summary>

A window of layers is a plateau if the signal's relative span across it stays below a tolerance. For a segment of values $v$, the criterion is

Using *relative* span (dividing by the mean) makes the criterion scale-appropriate per signal — a tolerance that is sensible for a cluster count of 6 would be far too loose for a CKA value of 0.97. The detector starts from a minimum width, then greedily extends the window as long as the flatness criterion still holds, and records `(start, end, mean)`

.

Because the signals live on different scales, each gets its own tolerance:

Signal | tol | Why | |
|---|---|---|---|
mass-near-1 | 0.10 | fraction in [0,1], moves in chunks at merges | |
effective rank | 0.05 | smooth, want tight flatness | |
spectral-k / hdbscan-k | 0.5 | integer counts; 0.5 tolerates no real change | |
CKA | 0.02 | near 1 inside a plateau; should be very flat | |
NN-stability | 0.02 | same reasoning as CKA |

This is worth stating precisely, because the intent and the implementation differ.

The **authoritative** plateau set stored on each run (`plateau_layers`

, step P1-7) is computed from **mass-near-1 alone**, with tol 0.10. That single signal defines the windows that downstream analyses (semantic coherence at plateau layers, the two-timescale numerator) actually use.

Separately, a **multi-signal vote** is computed: for each layer, count how many of {mass, rank, spectral-k, hdbscan-k, fiedler, CKA} have a plateau covering it, and flag layers covered by ≥2. This stricter "joint" set is used in the flagged-anomalies cross-check, not as the operational plateau set.

So the "we require all signals jointly" framing is aspirational relative to the current code: the operational windows are mass-near-1-driven, with the multi-signal agreement serving as a confirmation layer. The Fix 7 sensitivity script goes further — it sweeps a genuine joint criterion (`cka_thresh`

, `nn_thresh`

, `vote_frac`

, `min_len`

) to measure how much the plateau widths move under different definitions. The width uncertainty from that sweep is what feeds the error band on the two-timescale ratio. Until that rerun lands, plateau widths should be read as mass-near-1 windows, not consensus windows.

The *onset* layer of the first plateau, compared across prompts for a fixed model, is itself a diagnostic. If onset barely moves with the prompt (SD < 2 layers), the plateau is a weight-level property — the architecture clusters at a fixed depth regardless of input. If onset moves a lot (SD ≥ 2), clustering is content-driven. This is the measurement behind the "attractor-driven vs. content-driven" split in the synthesis.

</details> <details> <summary>The code: the detector, the operational set, and the joint vote</summary>

The detector is one greedy scan, identical for every signal:

``` php
def detect_plateaus(values: list, window: int = 2, tol: float = 0.05) -> list:    """Contiguous windows where relative span < tol. Returns (start, end, mean)."""    plateaus = []    n = len(values)    i = 0    while i < n - window:        segment = values[i:i + window + 1]        span    = max(segment) - min(segment)        ref     = abs(np.mean(segment)) + 1e-8        if span / ref < tol:                       # the flatness criterion            j = i + window            while j < n - 1:                       # greedily extend                extended = values[i:j + 2]                if (max(extended) - min(extended)) / (abs(np.mean(extended)) + 1e-8) < tol:                    j += 1                else:                    break            plateaus.append((i, j, float(np.mean(values[i:j + 1]))))            i = j + 1        else:            i += 1    return plateaus
```

The operational plateau set is mass-near-1 only (this is the `plateau_layers`

every downstream consumer reads):

```
# Post-loop (P1-7): the authoritative plateau setmass1    = [r["ip_mass_near_1"] for r in results["layers"]]plateaus = detect_plateaus(mass1, window=2, tol=0.10)plateau_layer_set = set()for s, e, _ in plateaus:    for l in range(s, e + 1):        plateau_layer_set.add(l)results["plateau_layers"] = sorted(plateau_layer_set)
```

The multi-signal vote is computed separately, as a stricter cross-check (note it is *not* what `plateau_layers`

uses):

```
layer_count = Counter()for group in [plateaus_mass, plateaus_rank, plateaus_spk, plateaus_hdb, plateaus_fied]:    for s, e, _ in group:        for l in range(s, e + 1):            layer_count[l] += 1for s, e, _ in plateaus_cka:                       # CKA indexed by position → remap to layer    for idx in range(s, e + 1):        layer_count[cka_pairs[idx][0]] += 1multi = [l for l, c in sorted(layer_count.items()) if c >= 2]   # ≥2 signals agree
```

Prompt sensitivity turns the onset layer into the weight-level/content-driven verdict:

```
onset_vals = [first_plateau_onset(run) for run in runs_of_this_model]sd = float(np.std(onset_vals))classification = "weight-level" if sd < 2.0 else "content-driven"
```

</details>

**What came out:** Plateaus are reproducible — the same model and prompt yield the same windows across runs. The prompt-sensitivity split is the load-bearing result: the GPT-2 family clusters at weight-level onset (low SD across prompts — the architecture has a fixed clustering depth), while BERT's onset is content-driven (it moves with the input). This is the empirical basis for the two-regime story in the synthesis. The honest caveat carried forward: plateau *widths* are currently mass-near-1 windows; the consensus-window definition and its uncertainty band are pending the Fix 7 sensitivity rerun, and the two-timescale ratio inherits that uncertainty.

*Following individual HDBSCAN clusters layer by layer: matching each cluster to its continuation in the next layer, and recording when one is born, dies, or merges into another.*

**Why this instrument:** The plateau signals say *that* structure persists. Trajectory tracking says *which* structure persists and *how it ends*. The slow timescale of metastability is literally a sequence of pairwise merges — two clusters becoming one — and this is the instrument that records each merge as a discrete event at a specific layer transition. It is also the only correct merge-detection method for ALBERT-xlarge, where the spectral cluster count degenerates (see Group A): the trajectory tracker works from token membership, not from the Laplacian spectrum, so it still records merges when the spectral method reports a flat $k=1$.

<details> <summary>How clusters are matched across layers, and what a merge is</summary>

At layer $L$ HDBSCAN found some clusters; at layer $L+1$ it found some clusters. Which layer-$L$ cluster *is* which layer-$(L{+}1)$ cluster? They are the same cluster if they contain mostly the same tokens. The overlap measure is Jaccard:

computed over token membership, with HDBSCAN noise tokens (label $-1$) excluded.

Given the full matrix of Jaccard overlaps between every layer-$L$ cluster and every layer-$(L{+}1)$ cluster, the optimal one-to-one matching is the assignment that maximizes total overlap. That is the Hungarian algorithm (`linear_sum_assignment`

), run on the *negated* overlap matrix because the routine minimizes cost. Matches below a minimum Jaccard (0.1) are discarded as too weak to be continuations.

So the description is both "Jaccard overlap" and "Hungarian matching": Hungarian is the assignment procedure, Jaccard is the cost it optimizes.

After the optimal matching:

Matched clusters are chained across layers into trajectories. Each trajectory is a sequence of `(layer, cluster_id)`

pairs with a birth layer, a death (or end) layer, and a lifespan. Trajectory lifespans are the per-cluster version of the plateau width: a long-lived trajectory is a cluster that survived a long metastable stretch before merging.

Group A explained that ALBERT-xlarge's Laplacian has a dominant zero mode that pins the spectral cluster count at $k=1$ regardless of real structure. A merge-detection method based on spectral-$k$ drops would therefore report zero merges there — an artifact. Trajectory tracking works entirely from HDBSCAN token membership, which has no such failure mode, so its merge counts are the authoritative ones for ALBERT-xlarge. Any `nMerges = 0`

in a spectral-sourced column for that model is the spectral artifact, not an absence of merges.

</details> <details> <summary>The code: Hungarian-on-Jaccard matching and merge detection</summary>

The overlap matrix is Jaccard over membership sets, noise excluded:

``` python
def _jaccard_overlap_matrix(labels_a, labels_b):    ids_a = sorted(set(labels_a) - {-1})    ids_b = sorted(set(labels_b) - {-1})    sets_a = {c: set(np.where(labels_a == c)[0]) for c in ids_a}    sets_b = {c: set(np.where(labels_b == c)[0]) for c in ids_b}    overlap = np.zeros((len(ids_a), len(ids_b)))    for i, ca in enumerate(ids_a):        for j, cb in enumerate(ids_b):            inter = len(sets_a[ca] & sets_b[cb])            union = len(sets_a[ca] | sets_b[cb])            overlap[i, j] = inter / union if union else 0.0    return overlap, ids_a, ids_b
```

The matching is Hungarian on the negated overlap, then merges are the unmatched-prev-into-matched-curr case:

``` python
from scipy.optimize import linear_sum_assignment# maximum-weight assignment → minimize negated overlap (padded to square)cost = np.zeros((size, size)); cost[:n_prev, :n_curr] = -overlaprow_ind, col_ind = linear_sum_assignment(cost)matches = []for r, c in zip(row_ind, col_ind):    if r < n_prev and c < n_curr and overlap[r, c] >= min_jaccard:   # min_jaccard = 0.1        matches.append((ids_prev[r], ids_curr[c], float(overlap[r, c])))# merge: an unmatched prev cluster overlapping a curr cluster that is already matchedfor up in unmatched_prev:    best_j = int(np.argmax(overlap[ids_prev.index(up), :]))    target = ids_curr[best_j]    if overlap[..., best_j] >= min_jaccard and target in matched_curr:        # record (prev clusters that fed `target`, target) as one merge event
```

Trajectories are chains of matched `(layer, cluster_id)`

tips, advanced one transition at a time. The run-level summary is what the report prints:

```
"summary": {    "total_births":  total_births,    "total_deaths":  total_deaths,    "total_merges":  total_merges,    "max_alive":     max_alive,    "n_trajectories": len(traj_info),    "mean_lifespan":  float(np.mean([t["lifespan"] for t in traj_info])),    "max_lifespan":   max(t["lifespan"] for t in traj_info),}
```

</details>

**What came out:** Merges are recorded as discrete events at specific layer transitions, and they accumulate with depth — the signature of the slow timescale. For ALBERT-base on a Wikipedia paragraph, the tracker finds on the order of 287 trajectories at 48 iterations with a mean lifespan around 5 layers, a max lifespan near 27, and roughly 57 merge events, with merge transitions concentrated in the early-to-mid layers and a long tail. The short-heterogeneous prompt produces far fewer, longer-lived trajectories (around 20, max lifespan 21) — fewer tokens, cleaner clusters, slower merging. ALBERT-xlarge trajectories are shorter and noisier, consistent with its slower, not-yet-collapsed dynamics, and its merge counts come from this tracker rather than the degenerate spectral-$k$. The merge sequence is one of the artifacts handed forward: the layers where clusters merge are candidate sites for the Phase 2 energy-violation cross-referencing.

*Code excerpts are from *`core/metrics.py`

* (*`linear_cka`

*, *`nearest_neighbor_indices`

*), *`analysis.py`

* (per-layer loop, P1-7 plateau set), *`reporting.py`

* (*`detect_plateaus`

*, multi-signal vote, prompt sensitivity), and *`cluster_tracking.py`

* (*`_jaccard_overlap_matrix`

*, *`match_layer_pair`

*, *`track_clusters`

*). Snippets are lightly trimmed but otherwise verbatim. Trajectory counts are from the pre-fix cross-run report and will be re-derived after the Fixes 1–8 rerun.*

Finding — CONFIRMED.CKA shows reproducible flat-near-1 runs punctuated by sharp drops, and the discrete NN-stability plateaus coincide with them on the real prompts — the continuous and discrete measures agree on where the metastable windows are, which is the cross-check that makes the persistence claim credible rather than an artifact of one metric. At those layers the stable nearest-neighbor cycles are predominantly *semantic* (distinct tokens drawn together), not positional duplicates, so the plateaus reflect structural grouping. Trajectory tracking records merges as discrete events that accumulate with depth — the signature of the slow timescale (for ALBERT-base on a Wikipedia paragraph: on the order of 287 trajectories at 48 iterations, mean lifespan ~5 layers, ~57 merges). The prompt-sensitivity split is the load-bearing result: GPT-2 onset is weight-level (low SD across prompts), BERT onset is content-driven.Caveat (PROVISIONAL):plateau *widths* are currently mass-near-1 windows, not consensus windows; the joint-criterion definition and its uncertainty band await the Fix 7 sensitivity rerun, and the two-timescale ratio inherits that uncertainty.

*Question: does the attention mechanism itself show the structure we'd expect if clustering is happening — or are the geometric clusters invisible to attention?*

Groups A and B looked at the token cloud: where tokens are on the sphere and whether that arrangement persists. Group C looks at the other half of the layer — the attention matrix — and asks whether it carries the same story. This matters because in the theory, attention is not a bystander. Attention *is* the force that pulls tokens together; the clustering in the residual stream is supposed to be the consequence of attention routing similar tokens toward each other. So if the geometric clusters from Group A are real, attention should show a matching signature: concentrated, cluster-respecting routing rather than uniform mixing. If attention looks uniform while the geometry clusters, the two are decoupled and the mechanistic picture is wrong.

The three instruments build up in specificity. Attention entropy asks the coarsest version — is attention concentrating at all? The Sinkhorn–Fiedler analysis asks whether the concentrated attention forms a *cluster-separated graph* matching the geometric clusters. The per-head classification asks *which heads* do this and whether they do it as a fixed property of the weights or in response to the input.

One structural caution sits over the entire group and is stated again at each instrument: GPT-2's attention is causally masked, and the mask alone can manufacture a cluster-like attention graph independent of content. The routing results for the GPT-2 family are therefore suspended pending the causal-mask baseline. The code to run that baseline exists (Fix 3); the runs do not yet.

*For each attention head, the Shannon entropy of each row of the attention matrix, averaged over rows. Low entropy means each token attends to a few others; high entropy means it spreads attention evenly.*

**Why this instrument:** This is the most direct read on whether attention is concentrating. A row of the attention matrix is a probability distribution over which tokens this token attends to. If clustering is happening, that distribution should be peaked — tokens should attend to their cluster, not to everyone. Entropy measures exactly how peaked. It also connects to the theory's temperature parameter: in the paper, the inverse temperature $\beta$ controls how sharply attention concentrates, and the metastability conjecture holds *for large $\beta$*. Falling attention entropy across layers is the empirical signature of the model operating in the high-$\beta$ regime where metastability is predicted to appear.

<details> <summary>What attention entropy measures, and its link to the theory's β</summary>

An attention matrix row $A_{i,:}$ is a probability distribution: $A_{ij} \geq 0$ and $\sum_j A_{ij} = 1$ (softmax output). Its Shannon entropy is

Averaging over the $n$ rows gives a per-head scalar. The bounds are interpretable:

So entropy runs from $\log n$ (no structure) down to $0$ (maximally concentrated), and watching it fall across layers tells you attention is sharpening.

In the paper's dynamics, attention weights are $\propto \exp(\beta \langle x_i, x_j \rangle)$. The inverse temperature $\beta$ controls sharpness: at $\beta = 0$ attention is uniform (maximum entropy); as $\beta \to \infty$ attention becomes a hard argmax (zero entropy). So attention entropy is a monotone proxy for the *effective* $\beta$ the model is operating at — low entropy corresponds to high effective $\beta$.

This is why entropy is the right opener for Group C. The metastability conjecture in the paper is a large-$\beta$ phenomenon. If measured attention entropy is high and flat (effective $\beta$ near zero), there is no reason to expect metastability and the geometric clustering would need a different explanation. If entropy falls with depth, the model is moving into the regime where the theory's prediction applies, and the geometric clustering from Group A has a mechanism behind it.

Entropy is computed per head because heads specialize — some concentrate, some stay diffuse, and an average over heads would hide that. Group C reports the per-head values and uses the mean only as a summary. The per-head spread is itself the subject of C.3.

</details> <details> <summary>The code: entropy of the softmax rows</summary>

The entire instrument is one vectorized function over all heads at once:

``` php
def attention_entropy(attn_matrix: torch.Tensor) -> np.ndarray:    """    Shannon entropy of each attention row, averaged over tokens.    attn_matrix : (n_heads, n_tokens, n_tokens)    Returns (n_heads,) — mean entropy per head.    """    attn              = attn_matrix.numpy()    log_attn          = np.log(attn + 1e-12)                 # 1e-12 guards log(0)    entropy_per_token = -(attn * log_attn).sum(axis=-1)      # (n_heads, n_tokens)    return entropy_per_token.mean(axis=-1)                    # (n_heads,)
```

`-(attn * log_attn).sum(axis=-1)`

is $H_i$ for every row of every head; `.mean(axis=-1)`

averages over the $n$ rows. The two reference points are pinned by tests: uniform attention returns $\log n$ and identity (each token attends only to itself) returns $0$, confirming the scale runs from no-structure to fully-concentrated as described.

</details>

**What came out:** *(Provisional — pending the full rerun.)* The expected and observed pattern is attention entropy falling with depth across models on real prompts: attention sharpens as the residual stream clusters, placing the models in the high-effective-$\beta$ regime where metastability is predicted. The fall is not monotone everywhere — local rises at reorganization layers are expected, and they tend to line up with the plateau-ending merges from Group B. The numbers here will be re-derived from the rerun; the qualitative claim (entropy concentrates with depth) is the load-bearing one and is robust to the open code questions.

*We make each attention matrix doubly stochastic (Sinkhorn–Knopp), treat it as a weighted graph, and compute the Fiedler value — the second-smallest Laplacian eigenvalue. Low Fiedler means the attention graph is nearly disconnected: tokens are routed into separate clusters. High Fiedler means tokens mix freely.*

**Why this instrument:** Entropy says attention is concentrating, but not *into what shape*. Concentrated attention could still mix all tokens (a hub everyone attends to) or could split them into separated groups. The Fiedler value distinguishes these. It is the graph-theoretic measure of how close a graph is to falling apart into disconnected components — exactly the cluster-separation structure that would confirm attention is implementing the geometric clusters from Group A. The Sinkhorn step first puts the attention matrix into the doubly stochastic form that the paper's Section 3.3 identifies as the gradient-flow object, so the Fiedler value is computed on the theoretically meaningful normalization rather than raw attention.

<details> <summary>Why doubly stochastic, what the Fiedler value is, and how to read it</summary>

Raw attention is row-stochastic (each row sums to 1) but not column-stochastic — some tokens receive far more attention than others. The paper's Section 3.3 connects the idealized dynamics to a *doubly* stochastic object (both rows and columns sum to 1), which is the symmetric, balanced form a gradient flow would produce. The gap between raw attention and its doubly stochastic version is itself a diagnostic ("row/col balance" below): zero means attention is already balanced, large means it is far from the idealized form.

Sinkhorn–Knopp produces the doubly stochastic matrix by alternately normalizing rows and columns until both converge. It is the standard, provably convergent way to get there.

Treat the doubly stochastic matrix $P$ as a weighted graph (after symmetrizing, $(P + P^\top)/2$). The graph Laplacian $L = D - W$ encodes its connectivity. The eigenvalues of the normalized Laplacian start at $\lambda_1 = 0$ (always — the trivial constant mode) and increase. The second one, $\lambda_2$, is the **Fiedler value** (algebraic connectivity):

The number of near-zero eigenvalues equals the number of near-disconnected components, which is why a related eigengap reading gives a cluster count (below).

The Group C question is whether attention *knows about* the geometric clusters. The Fiedler value makes that testable directly: if low-Fiedler layers (cluster-separated attention) coincide with the layers where Group A finds geometric clusters and Group B finds plateaus, then attention and geometry are telling the same story. The Spearman cross-check below quantifies exactly this coincidence.

The same spectrum gives a cluster count: for a $k$-cluster structure, the $k$ largest eigenvalues of $P$ sit near 1 with a drop below them. The largest gap in the descending eigenvalue sequence locates $k$ without a hard threshold. (This is the Fix 8 replacement for an earlier hard $> 0.5$ rule; it falls back to the hard threshold only when no clear gap exists, i.e. on near-uniform post-collapse matrices.)

</details> <details> <summary>The code: Sinkhorn iteration, Fiedler value, and the eigengap count</summary>

Sinkhorn–Knopp is alternating row/column normalization to convergence, batched across all heads:

``` python
def sinkhorn_normalize_batched(A, max_iter=SINKHORN_MAX_ITER, tol=SINKHORN_TOL):    """A: (n_heads, n, n) raw attention → doubly stochastic per head."""    P = np.clip(A.astype(np.float64), 1e-12, None)    for _ in range(max_iter):        P_prev = P.copy()        P /= P.sum(axis=2, keepdims=True)   # row-normalise all heads        P /= P.sum(axis=1, keepdims=True)   # col-normalise all heads        if np.abs(P - P_prev).max() < tol:            break    return P
```

The Fiedler value is $\lambda_2$ of the normalized Laplacian of the symmetrized matrix:

``` php
def fiedler_value(P: np.ndarray) -> float:    """λ₂ of the normalised Laplacian. λ₂≈0 → cluster-separated; large → mixing."""    P_sym       = (P + P.T) / 2    L           = laplacian(P_sym, normed=True)    k           = min(3, P.shape[0] - 1)    eigenvalues = eigh(L, eigvals_only=True, subset_by_index=[0, k - 1])    return float(eigenvalues[1]) if len(eigenvalues) > 1 else 0.0
```

The cluster count reads the eigengap (Fix 8), falling back to a hard threshold only when there is no real gap:

``` php
def sinkhorn_cluster_count(P, min_gap_ratio: float = 0.1) -> int:    # k largest eigenvalues near 1; largest gap in the descending sequence = k.    # Fallback to hard >0.5 count when largest gap < min_gap_ratio*(λmax−λmin),    # i.e. on near-uniform (post-collapse) matrices with no genuine structure.    ...
```

The per-layer summary records the mean Fiedler, the per-head Fiedler list, the eigengap cluster count, and the row/col balance (distance of raw attention from doubly stochastic):

```
result = {    "fiedler_mean":                float(np.mean(fiedler_vals)),    "fiedler_per_head":            fiedler_vals,    "sinkhorn_cluster_count_mean": float(np.mean(cluster_counts)),    "row_col_balance_mean":        float(np.mean(row_col_balance)),}
```

**Caveat carried from the plateau detector.** Fiedler plateaus are found with `detect_plateaus(fiedler, tol=0.05)`

, and that routine measures *relative* flatness (`span / (|mean| + 1e-8)`

). The cluster-separated state is exactly Fiedler ≈ 0, where the denominator collapses and the relative criterion can fail to register a genuinely flat-near-zero window. Until the detector gains an absolute-tolerance branch, low-Fiedler plateaus may be under-detected — read the raw Fiedler trace alongside the detected plateaus.

</details> <details> <summary>The cross-check: does low Fiedler co-occur with high cluster count?</summary>

The single most direct answer to "does attention know about the clusters" is the correlation between the attention-graph Fiedler value and the geometric (HDBSCAN) cluster count, layer by layer. If attention is implementing the clustering, low Fiedler (separated attention) should co-occur with high cluster count (multi-cluster geometry) — a negative correlation.

```
fied_hdb_pairs = [    (r["sinkhorn"]["fiedler_mean"],     r["clustering"]["hdbscan"]["n_clusters"])    for r in layers    if "sinkhorn" in r and not np.isnan(r["clustering"]["hdbscan"]["n_clusters"])]rho, pval = spearmanr([p[0] for p in fied_hdb_pairs],                      [p[1] for p in fied_hdb_pairs])# rho < -0.4 → attention Fiedler tracks geometric cluster count (signal)# rho ≈ 0    → attention and geometry are telling different stories
```

A Spearman $\rho < -0.4$ is the threshold the report uses to call this an interpretable signal; $\rho \approx 0$ would mean the attention graph and the token geometry are decoupled, which would undercut the mechanistic claim even though both individually show clustering.

</details>

**What came out:** *(Provisional.)* The expected result is low-Fiedler windows coinciding with the geometric cluster windows from Group A, and a negative Fiedler–cluster-count correlation strong enough to read as signal rather than noise. Where that holds, attention and geometry are confirmed to be the same phenomenon seen from two sides. The honest caveats: the Fiedler plateau detection may under-report near-zero windows until the detector is fixed, and any low-Fiedler reading on GPT-2 is confounded by the causal mask (next instrument). The Spearman value is the number to watch in the rerun — it is the cleanest single test of the Group C question.

*Each head is classified by its typical Fiedler value across the active layers: CLUSTER (consistently routes into separated clusters), MIXED (variable), or MIXING (consistently lets tokens mix). Whether a head's classification is stable across prompts separates weight-level routing from content-driven routing.*

**Why this instrument:** The mean Fiedler hides the division of labor. Some heads may be dedicated cluster-formers while others mix; averaging erases that. Per-head classification recovers it. And comparing a head's behavior *across prompts* answers the deeper question: is a head's routing role a fixed property of its learned weights (same classification regardless of input — STABLE), or does it depend on the content (VARIABLE)? That STABLE-vs-VARIABLE split is the attention-side version of the weight-level-vs-content-driven distinction Group B found in plateau onsets, and it feeds the two-regime synthesis.

<details> <summary>How heads are classified, the active-phase restriction, and the causal-mask problem</summary>

For each head, collect its Fiedler value at every active-phase layer, take the mean, and bin it:

Classification is restricted to layers where effective rank ≥ 10. The reason is a saturation artifact: once tokens collapse to a near-point-mass (rank below ~10), there is only one cluster, the doubly stochastic attention matrix is nearly uniform, and the Laplacian has no gap — so *every* head's Fiedler trivially saturates to ≈ 1.0. Including those layers would pull every head's mean toward 1.0 and label everything MIXING regardless of its real role. Excluding them keeps the classification meaningful. (Note this rank-10 threshold is specific to this analysis and separate from the rank-2 degeneracy gate used for CKA/NN — it is a larger cutoff because Fiedler saturation sets in well before full point-mass collapse. It is currently hardcoded in the profiler and would be cleaner in config.)

A head that lands in the same class for every prompt is STABLE — its routing role is a property of the weights, input-independent. A head whose class changes with the prompt is VARIABLE — content-driven. The cross-prompt consistency table reports this per head per model.

This is the central caution of Group C. GPT-2 attention is causally masked (lower-triangular). Sinkhorn-normalizing and symmetrizing a lower-triangular matrix forces a low-connectivity graph *regardless of content* — which would manufacture low Fiedler, and therefore a "100% STABLE-CLUSTER" reading across every prompt, as an artifact of the mask rather than a fact about the weights.

The fix has two parts, both implemented (Fix 3), neither yet run:

Until both runs are done, the GPT-2 routing result is suspended.

</details> <details> <summary>The code: deviation-based classification and the two causal controls</summary>

The main analysis computes raw per-head Fiedler, and for causal models also the mask-only baseline and each head's deviation from it:

```
fiedler_vals = [fiedler_value(P_all[h]) for h in range(n_heads)]result["fiedler_per_head"] = fiedler_vals# Fix 3a — causal-mask baseline subtractionif is_causal:    baseline = causal_fiedler_baseline(n)               # Fiedler of uniform-causal attn    result["fiedler_baseline"]           = baseline    result["fiedler_per_head_deviation"] = [round(f - baseline, 6) for f in fiedler_vals]
```

`causal_fiedler_baseline`

is the content-free reference — the Fiedler value the mask produces on its own:

``` php
def causal_fiedler_baseline(n: int) -> float:    A_base = _uniform_causal_attention(n)   # uniform within the lower triangle    P_base = sinkhorn_normalize(A_base)    return fiedler_value(P_base)            # what the mask alone forces
```

The BERT control applies an artificial causal mask to a bidirectional model to test whether masking alone collapses heads to CLUSTER:

```
# Fix 3b — causal-mask control (run on BERT)if apply_causal_control:    causal_mask = np.tril(np.ones((n, n)))    attn_masked = attn * causal_mask[None, :, :]            # zero upper triangle    attn_masked /= attn_masked.sum(axis=2, keepdims=True)   # renormalise rows    P_ctrl = sinkhorn_normalize_batched(attn_masked)    result["fiedler_causal_control_per_head"] = [fiedler_value(P_ctrl[h]) for h in range(n_heads)]
```

The classifier then chooses the signal based on model type — deviation for causal models, raw Fiedler otherwise — and restricts to the active phase:

```
# in _per_head_fiedler_profile, active phase = effective_rank >= 10# causal models classify on fiedler_per_head_deviation; others on raw fiedler_per_headclassification = ("CLUSTER" if mean < 0.3 else                  "MIXED"   if mean < 0.7 else                  "MIXING")
```

</details>

**What came out:** *(Provisional, and partly suspended.)* For BERT, heads are expected to be VARIABLE — routing roles shift with content, consistent with the content-driven plateau onsets from Group B. For ALBERT-base, head behavior transitions with iteration depth, which the weight-sharing architecture makes natural to read as a time axis. For the **GPT-2 family, the result is suspended**: the apparent "100% STABLE-CLUSTER" reading is exactly what the causal mask would produce on its own, so it cannot be attributed to the weights until the baseline-subtracted classification and the BERT causal-mask control have been run. The deviation-based numbers and the control verdict are the outputs to generate in the rerun; the raw STABLE-CLUSTER reading should not be reported as a finding before then.

Attention concentrates with depth (C.1), and where it concentrates it forms cluster-separated graphs that — provisionally — track the geometric clusters from Group A (C.2). The per-head picture (C.3) shows a division of labor that is content-driven in BERT and depth-staged in ALBERT-base. The one result Group C cannot yet deliver is the GPT-2 routing claim, because the causal mask confounds the Fiedler value and the disambiguating runs are still pending. The Group C contribution that carries forward regardless of how the GPT-2 question resolves is the per-head split itself: attention routing is not uniform across heads, and the CLUSTER/MIXING distinction is a standing architectural fact the later phases build on.

*Code excerpts are from *`core/metrics.py`

* (*`attention_entropy`

*), *`sinkhorn.py`

* (*`sinkhorn_normalize_batched`

*, *`fiedler_value`

*, *`sinkhorn_cluster_count`

*, *`causal_fiedler_baseline`

*, *`analyze_attention_sinkhorn`

*), and *`reporting.py`

* (*`_per_head_fiedler_profile`

*, Fiedler–cluster Spearman cross-check). Snippets are lightly trimmed but otherwise verbatim. All findings are provisional pending the Fixes 1–8 rerun; the GPT-2 routing result is suspended pending the Fix 3 causal-mask baseline and BERT control runs.*

> **Finding — PARTIAL / SUSPENDED.** *(Provisional pending the full rerun.)* Attention entropy falls with depth across models on real prompts — attention sharpens as the residual stream clusters, placing the models in the high-effective-β regime where metastability is predicted; local rises tend to line up with the plateau-ending merges from Group B. Low-Fiedler windows are expected to coincide with the geometric cluster windows from Group A, with a negative Fiedler–cluster-count Spearman correlation as the cleanest single test of "attention knows about the clusters." The per-head picture is content-driven in BERT and depth-staged in ALBERT-base. **The GPT-2 routing result is suspended** pending the Fix 3 causal-mask baseline and BERT causal-mask control — until those run, the raw "stable cluster-routing" reading is not reported as a finding. The contribution that carries forward regardless: attention routing is *not* uniform across heads, and the CLUSTER/MIXING split is a standing architectural fact the later phases build on.

*Question: the theory predicts that a specific interaction energy increases monotonically across layers, as if the model were doing gradient ascent toward consensus. Does it?*

This is the group where the theory makes its sharpest, most falsifiable prediction — and where the trained models break it, universally. Groups A through C asked whether trained transformers *look like* the idealized dynamics: do they cluster, persist, route attention into clusters? The answers were mostly yes. Group D asks whether they *are* those dynamics in the strict sense the theory requires, and the answer is no. There is a specific scalar — the interaction energy — that the idealized gradient flow must drive monotonically upward at every step. In every model, on every prompt, that scalar goes down somewhere. The clustering is real, but it is not being produced by the pure-attraction mechanism the theory describes.

That makes Group D the seam of the whole project. The two instruments here are not just measuring whether the energy is monotone; the second one localizes *where* and *on which token pairs* the violation happens, which is the evidence Phase 2 picks up to ask what mechanism — the value matrix's mixed-sign spectrum — is responsible.

*At each layer we compute a single scalar from the Gram matrix — the interaction energy. The theory says it must increase from one layer to the next. We check whether it does.*

**Why this instrument:** The paper's dynamics are a gradient flow: the tokens move so as to *ascend* an energy landscape, and the interaction energy is that landscape. Monotone increase is not an incidental property — it is the definition of what "gradient ascent toward consensus" means. So this is the most direct test of whether trained attention implements the idealized dynamics. Unlike clustering (which many mechanisms could produce), strict monotonicity is a signature only gradient ascent on this particular energy produces. A single downward step falsifies it.

<details> <summary>The energy, the sign convention, and what a violation actually means</summary>

For tokens on the sphere with Gram matrix $G_{ij} = \langle x_i, x_j \rangle$, the interaction energy at inverse temperature $\beta$ is

Read the structure: every term $e^{\beta \langle x_i, x_j \rangle}$ is large when tokens $i$ and $j$ point the same way (inner product near $+1$) and small when they point apart. So $E_\beta$ is **large when tokens are aligned** and small when they are spread out. Maximal clustering — every token identical — maximizes it. The analytical reference values make this concrete: for a fully collapsed cloud $E_\beta = e^\beta / 2\beta$, for two antipodal clusters $\cosh(\beta)/2\beta$, for a uniform spread $\approx 1/2\beta$, and $e^\beta > \cosh\beta > 1$, so collapsed > antipodal > uniform.

The idealized dynamics ascend this energy. As depth increases and tokens cluster, $E_\beta$ should rise at every layer. Plotted against layer index, it should be a non-decreasing curve. This is the gradient-ascent picture: the model is climbing toward consensus, and the energy is the height it has climbed.

A **violation** is a layer where $E_\beta$ *decreases*: $E_{\text{curr}} < E_{\text{prev}}$. Mechanistically this is sharp. For the energy to fall, some pairs must have their $e^{\beta \langle x_i, x_j \rangle}$ terms shrink, which means those tokens moved *apart* — their inner product dropped. That is a **locally repulsive move**.

The pure-attraction dynamics in the paper *cannot* do this. Attention only pulls tokens together; it has no repulsive term. Under the idealized flow, the energy is a Lyapunov function — it can only go up. So an energy drop is not a small quantitative deviation; it is qualitative evidence of a force the theory does not contain. Something in the trained layer pushed two tokens apart. Group D's job is to detect that, and D.2's job is to find the tokens it happened to.

This is a methodological point that the result depends on. $e^{\beta \langle x_i, x_j \rangle}$ grows fast with $\beta$: at $\beta = 5$ the energy values are large, and so is the float32 noise on them. An *absolute* threshold for "did the energy drop" (the old $-10^{-4}$ / $-10^{-6}$ gates) is scale-blind — at large $\beta$ it fires on numerical noise, manufacturing violations that aren't real. Fix 2 replaced it with a **relative** criterion:

\frac{E_{\text{prev}} - E_{\text{curr}}}{|E_{\text{prev}}|} > \text{rel_tol}, \qquad \text{rel_tol} = 10^{-3}.

A violation now means the energy fell by more than 0.1% of its own magnitude — a real drop, not a rounding artifact, at any $\beta$. This matters for interpreting the universality claim: a violation that survives the relative threshold is trustworthy; the claim that violations occur at *every* $\beta$, including large $\beta$, is precisely what the relative criterion exists to test, because under the old absolute gate large-$\beta$ violations were partly guaranteed by construction.

</details> <details> <summary>The code: the energy and the relative violation test</summary>

The energy is computed for all $\beta$ in one vectorized pass over the cached Gram matrix:

``` php
def interaction_energies_batched(G: np.ndarray, beta_values: list) -> dict:    """E_beta = (1 / 2β n²) Σ_ij exp(β ⟨x_i, x_j⟩), for every β at once."""    n        = G.shape[0]    betas    = np.asarray(beta_values, dtype=np.float64)   # (B,)    exp_G    = np.exp(betas[:, None, None] * G[None])       # (B, n, n)    sums     = exp_G.sum(axis=(1, 2))                       # (B,)    energies = sums / (2.0 * betas * n * n)                 # (B,)    return {float(beta): float(e) for beta, e in zip(beta_values, energies)}
```

In the loop it is one line per layer:

```
lr["energies"] = interaction_energies_batched(G, beta_values)   # {β: E_β}
```

The violation test is the scale-invariant Fix 2 criterion, applied to one $\beta$-series across layers:

```
ENERGY_VIOLATION_REL_TOL = 1e-3   # a drop counts only if > 0.1% of |E_prev|def energy_violation_severity(energies, rel_tol=ENERGY_VIOLATION_REL_TOL):    arr   = np.array(energies, dtype=np.float64)    diffs = np.diff(arr)    ref   = np.maximum(np.abs(arr[:-1]), 1e-12)   # |E| at the preceding layer    rel_drop = -diffs / ref                       # positive = energy fell    viol_mask   = rel_drop > rel_tol    return dict(        violation_layers = [i + 1 for i, v in enumerate(viol_mask) if v],        n_violations     = int(viol_mask.sum()),        max_severity     = float(rel_drop[viol_mask].max()) if viol_mask.any() else 0.0,        sum_severity     = float(rel_drop[viol_mask].sum()),        total_rel_change = float((arr[-1] - arr[0]) / max(abs(arr[0]), 1e-12)),    )
```

`total_rel_change`

is the net climb from first to last layer (the energy does rise overall — it just isn't monotone); `violation_layers`

are the steps where it dropped against the trend. The reporting layer runs this for every $\beta \in {0.1, 1.0, 2.0, 5.0}$ and also checks whether the violation layers coincide with the merge events from Group B — testing whether the repulsive moves happen *at* merges or independently of them.

</details>

**What came out:** Universal violation. Every model, every prompt, every $\beta$ tested shows at least one layer where $E_\beta$ falls against the trend — while the net change across the full depth is still positive (the energy climbs overall, it just refuses to climb monotonically). This is the cleanest hard result in Phase 1: monotonicity is the theory's most specific prediction and trained transformers break it without exception. The grounded qualification: the violation is most robust at moderate $\beta$ under the scale-invariant Fix 2 criterion; the claim that it holds at *every* $\beta$ including large $\beta$ is exactly what the relative-threshold rerun confirms, since the previous absolute gate could not be trusted there. The specific severity numbers (`max_severity`

, `sum_severity`

per model) are pending that rerun; the qualitative universality is not in doubt.

*At each violation layer, we find the individual token pairs whose contribution to the energy dropped the most — the pairs that moved apart. We flag whether they are structural tokens or semantic ones.*

**Why this instrument:** D.1 establishes *that* a repulsive move happened. D.2 asks *between which tokens*. This is the difference between a phenomenon and a mechanism. If the energy drops are driven by structural tokens — `[CLS]`

, `[SEP]`

, punctuation — the repulsion is plausibly positional or structural bookkeeping. If they are driven by content-bearing tokens, the repulsion is operating on semantic geometry. Either way, the specific pairs are the evidence handed to Phase 2, which asks whether the value matrix's mixed-sign eigenspectrum is the source of the repulsive directions. The localization turns "the theory is violated" into "here is the fingerprint of what violates it."

<details> <summary>How a pair's contribution to the energy change is isolated</summary>

The energy is a sum over pairs, so the *change* in energy between layer $L$ and $L+1$ decomposes exactly into per-pair contributions:

A pair with $\Delta_{ij} < 0$ contributed to the energy *drop* — its tokens moved apart (lower inner product at $L+1$ than at $L$). The most-negative $\Delta_{ij}$ are the pairs most responsible for the violation. Sorting all pairs by $\Delta$ ascending and taking the top few gives the localization.

Each token in a top pair is checked against a set of structural tokens — `[CLS]`

, `[SEP]`

, `<s>`

, `</s>`

, padding, and single-character punctuation. The interpretation forks:

Once the cloud collapses to a near-point-mass, the inner products are all $\approx 1$, the $e^{\beta \cdot 1}$ terms are nearly equal, and the per-pair deltas are floating-point noise. Energy "violations" detected there are not real, so violation layers in the degenerate regime are suppressed and the pair localization is not reported for them — the same rank-gate logic as the rest of Phase 1.

</details> <details> <summary>The code: the per-pair delta and the structural-token flag</summary>

The public wrappers are thin — one normalizes raw tensors, one takes the pre-normed arrays from the loop — and both call a shared core (the formula below is verbatim from the function's contract):

``` python
def energy_drop_pairs_from_normed(normed_before, normed_after, beta, top_k=10):    """Token pairs (i,j) sorted by Δ ascending (most negative first), where       Δ = [exp(β⟨x_i,x_j⟩_after) − exp(β⟨x_i,x_j⟩_before)] / (2β n²)."""    return _energy_drop_pairs_core(normed_before, normed_after, beta, top_k)
```

The core is the pairwise delta, upper triangle only, sorted to surface the most repulsive pairs:

``` python
def _energy_drop_pairs_core(normed_before, normed_after, beta, top_k):    n = normed_before.shape[0]    if n < 2:        return []    G_before = normed_before @ normed_before.T    G_after  = normed_after  @ normed_after.T    delta = (np.exp(beta * G_after) - np.exp(beta * G_before)) / (2 * beta * n * n)    iu = np.triu_indices(n, k=1)                       # i < j only    pairs = [(int(i), int(j), float(delta[i, j])) for i, j in zip(*iu)]    pairs.sort(key=lambda t: t[2])                     # ascending → most negative first    return pairs[:top_k]
```

In the loop, this is computed only at violation layers and only when the layer is non-degenerate — using `prev_normed`

(the bug fixed in Fix 8: the previous-layer array must still point at $L$, not be overwritten to $L+1$ before this call):

```
# prev_normed still points to layer L here (Fix 8 deferred its update)lr["energy_drop_pairs"] = {    beta: energy_drop_pairs_from_normed(prev_normed, normed, beta, top_k=10)    for beta in beta_values} if (is_violation and lr["effective_rank"] >= DEGENERATE_RANK_THRESHOLD) else {}
```

The reporting layer maps the top pairs back to token strings and annotates structural ones:

```
SPECIAL_TOKENS = {"[CLS]","[SEP]","<s>","</s>","<pad>","[PAD]","<|endoftext|>","Ġ","▁"}PUNCT_CHARS    = set(".,!?;:'\"-–—()[]{}…/\\")# each top pair printed as:  'tok_i'[FLAG] ↔ 'tok_j'[FLAG]  with δ# [FLAG] ∈ {[CLS], [SEP], [PUNCT]} marks structural/special-token repulsion
```

</details>

**What came out:** *(Localization patterns provisional pending rerun; the existence of localizable drops is confirmed.)* The violations localize to specific, identifiable pairs rather than being diffuse noise across all pairs — which is itself evidence they are mechanistic, not numerical. The structural-vs-semantic split is the detail Phase 2 needs, and the report already separates the two; whether structural tokens dominate the repulsion (positional bookkeeping) or semantic tokens do (repulsion on content) is the question the rerun answers per model. The violation layers' coincidence (or not) with Group B's merge events is computed in the same place: it tests whether the repulsive moves are the mechanism of merging or a separate phenomenon.

The energy result is the load-bearing transition out of Phase 1. The clustering is real (A), it persists (B), and attention implements it (C, provisionally) — but it is not the pure-attraction gradient flow the theory describes, because that flow cannot lower the energy and the trained models lower it everywhere (D). The repulsion has to come from somewhere the idealized model omits. The natural suspect is the value matrix $V$: the idealized dynamics assume $V = I$ (purely attractive), while a trained $V$ with negative eigenvalues would supply exactly the repulsive directions needed to push token pairs apart and drop the energy. Group D's localized pairs — which tokens, at which layers, structural or semantic — are the targets Phase 2 cross-references against $V$'s eigenspectrum. The violation is not a failure of the measurement; it is the finding.

*Code excerpts are from *`core/metrics.py`

* (*`interaction_energies_batched`

*, *`energy_violation_severity`

*, *`energy_drop_pairs_from_normed`

*, *`_energy_drop_pairs_core`

*) and *`analysis.py`

* / *`reporting.py`

* (loop usage, structural-token flagging). The public energy functions and the violation criterion are verbatim; the localization core is shown faithful to its documented formula and test-pinned behavior. The universal-violation finding is confirmed (per the draft's epistemic status); severity magnitudes and localization patterns will be re-derived after the Fixes 1–8 rerun under the scale-invariant threshold.*

> **Finding — FALSIFIED.** Universal violation: every model, every prompt, every β shows at least one layer where E_β falls against the trend, while the net change across the full depth is still positive (the energy climbs overall — it just refuses to climb *monotonically*). Monotonicity is the theory's most specific prediction, and trained transformers break it without exception. The violations localize to specific, identifiable token pairs rather than being diffuse — itself evidence they are mechanistic, not numerical. **Caveat (PROVISIONAL):** the violation is most robust at moderate β under the scale-invariant Fix 2 criterion; the claim that it holds at *every* β including large β is exactly what the relative-threshold rerun confirms (the old absolute gate could not be trusted there), and the severity magnitudes and structural-vs-semantic split are pending that rerun. The qualitative universality is not in doubt. The natural suspect for the repulsion is the value matrix V: the idealized dynamics assume V = I (purely attractive), while a trained V with negative eigenvalues supplies exactly the repulsive directions needed — the Phase 2 target.

*Question: the metastability conjecture requires not just that clusters form and persist, but that the formation timescale and the collapse timescale are meaningfully different. Is there actually a gap?*

Metastability is a two-timescale claim. Tokens group quickly into clusters (fast), then those clusters slowly merge until the cloud collapses (slow). Groups A–C confirmed both halves happen; Group D confirmed the energy that should govern them misbehaves. Group E asks the quantitative question that decides whether "metastability" is the right word at all: is the slow timescale actually slow *relative to* the fast one? If clusters form and then immediately collapse, the two timescales coincide and there is no metastable window — just a fast transition with a clustered-looking midpoint.

This is the most provisional group in Phase 1, for two reasons that are worth stating before any numbers. First, the original way of computing the separation ratio was confounded, and produced a table that contradicts itself — Fix 5 redefined it, and the corrected numbers do not exist until the rerun. Second, the instrument that actually *explains* the separation pattern is not the ratio at all but the eigenspectrum of the value matrix $V$, and that finding (spectral radius governs collapse speed, not depth or dimension) is both the cleaner result and the one that falsifies a specific theorem. So read E.1 as a measurement under repair and E.2 as the result that survives.

*A single ratio per run: how long the metastable window lasts (merge time) divided by how long it took to form (formation time). A large ratio means a genuine slow timescale; a ratio near 1 means the two timescales coincide.*

**Why this instrument:** It is the direct operationalization of the conjecture. "Two timescales are separate" has to become a number before it can be confirmed or denied, and the natural number is the ratio of the slow duration to the fast duration. The subtlety — and the source of the trouble — is in defining those two durations so the ratio measures the dynamics and not an artifact of how the denominator was chosen.

<details> <summary>The two definitions of the ratio — the confounded one and the Fix 5 one</summary>

The first implementation defined separation as

The denominator is the problem. "Collapse onset" was measured on a *degenerate control input* (a string of repeated tokens), and how fast a model collapses a degenerate input is not comparable across architectures. BERT collapses the repeated-token control at layer 1 (onset = 1), so its ratio is plateau-width / 1 = 8.0 → CONFIRMED. GPT-2-medium collapses the same control at layer 10, so its ratio is 4.5 / 10 = 0.45 → NO SEPARATION. The numerator and denominator come from *different inputs*, and the denominator is dominated by an architecture-specific quirk of how each model treats degenerate text.

This produces the self-contradiction the draft flags: a 12-layer model (BERT) is CONFIRMED while a 24-layer model (GPT-2-medium) is NONE. If separation were a depth phenomenon, more layers should mean more separation. The table breaks any clean depth-threshold story — not because the dynamics are contradictory, but because the metric was measuring the wrong thing.

Fix 5 discards the cross-input ratio and computes everything from *one real-prompt run*:

Thresholds: ≥ 3.0 = CONFIRMED, 1.5–3.0 = WEAK, < 1.5 = NO SEPARATION.

The cluster count is read from HDBSCAN where available (spectral $k$ as fallback), so the multi-cluster test does not depend on the spectral method that degenerates for ALBERT-xlarge. The repeated-token control is *kept* — but only as a standalone "collapse speed diagnostic," never again as the ratio denominator.

Both quantities now come from the same trajectory, on real text, so the ratio is internally comparable. A formation_time of 4 and a merge_time of 16 is a clean "the slow phase is 4× the fast phase" statement; it no longer smuggles in how a different input behaves. The numbers this produces do not exist yet — the rerun generates them — so every separation verdict below is provisional.

</details> <details> <summary>The code: the superseded control ratio, and the Fix 5 single-run ratio</summary>

The old control-based ratio (now demoted to a standalone diagnostic, shown for contrast with what produced the contradictory table):

```
# SUPERSEDED — denominator comes from the repeated-tokens control inputmpw   = mean_plateau_width[model]          # numerator: real-prompt plateau widthonset = first_layer_with_mass_above_0p9    # denominator: COLLAPSE CONTROL inputratio = mpw / onsetinterp = ("TWO-TIMESCALE CONFIRMED" if ratio > 2.0 else          "WEAK SEPARATION"         if ratio > 1.0 else          "NO SEPARATION")
```

The Fix 5 single-run replacement (faithful to the Fix 5 specification — the function `_single_trajectory_separation`

is implemented per §0.5; this is its documented logic, not a verbatim paste):

``` php
def _single_trajectory_separation(layers_data) -> dict:    """Two-timescale separation from ONE real-prompt run.       formation_time = first stable multi-cluster plateau onset       merge_time     = layers from that plateau to collapse/decay       separation     = merge_time / formation_time"""    # cluster count per layer: HDBSCAN preferred, spectral k fallback    k_series = [layer_cluster_count(l) for l in layers_data]    # multi-cluster plateaus: detect_plateaus on k, keep windows with mean k >= 1.5    plats = detect_plateaus(k_series, window=2, tol=0.5)    multi = [(s, e) for (s, e, mean_k) in plats if mean_k >= 1.5]    if not multi:        return {"separation": 0.0, "verdict": "NO SEPARATION"}    formation_time = multi[0][0]                       # onset of first multi-cluster plateau    collapse_layer = first_collapse_or_decay(k_series) # k → 1 (or structure decays)    merge_time     = collapse_layer - formation_time    separation = merge_time / formation_time if formation_time > 0 else 0.0    verdict = ("CONFIRMED"      if separation >= 3.0 else               "WEAK"           if separation >= 1.5 else               "NO SEPARATION")    return {"formation_time": formation_time, "merge_time": merge_time,            "separation": separation, "verdict": verdict}
```

The repeated-tokens runs remain in the report under a **COLLAPSE SPEED DIAGNOSTICS** heading, explicitly marked as not a metastability test.

</details>

**What came out:** *(Provisional — the corrected ratio does not exist until the rerun.)* Under the old, confounded metric the verdicts were CONFIRMED for BERT-base (8.0), GPT-2-large (8.2), GPT-2-xl (7.62); NO SEPARATION for GPT-2-medium (0.45); WEAK for ALBERT-base at 36–48 iterations (1.06–1.25); and NO SEPARATION for ALBERT-xlarge (0.25). These should be read as superseded — the BERT-CONFIRMED / GPT-2-medium-NONE inversion is the artifact of the cross-input denominator, not a finding. The Fix 5 single-run ratio will replace all of them. What is *not* expected to change is the qualitative ordering, because — as E.2 shows — the ordering is governed by a quantity the ratio only indirectly reflects.

*The eigenvalues of the value matrix $V$. Its spectral radius (largest eigenvalue magnitude) predicts how fast a model collapses — better than the model's depth or its dimension.*

**Why this instrument:** This is the instrument that explains everything E.1 struggles to measure. The collapse speed of the dynamics is governed by the linear map the value pathway applies at each layer; the spectral radius of that map is the per-step amplification factor, so it sets the geometric rate of collapse. The paper's Theorem 6.1 predicts that higher dimension $d$ means faster convergence. The eigenspectrum lets us test that directly — and falsify it — by checking whether $d$ or the spectral radius is the variable that actually orders the models by collapse speed.

<details> <summary>What the V spectrum encodes, and the Theorem 6.1 test</summary>

For the value matrix $V$ (square, $d_{\text{model}} \times d_{\text{model}}$, for all three architectures' full projection), two spectra carry different information:

The spectral radius $\rho(V) = \max_k |\lambda_k|$ is the dominant per-layer amplification. For an iterated linear map, structure along the dominant eigendirection grows like $\rho^{,t}$ over $t$ layers. So $\rho$ controls how many layers it takes to collapse: a back-of-envelope estimate for ALBERT-xlarge, $\rho = 1.278$, is $\log(0.9)/\log(1.278) \approx 75$ iterations to reach high mass — which is why it has not collapsed at 48 and matches its observed slow dynamics.

Theorem 6.1 predicts higher $d$ → faster convergence. The decisive comparison is ALBERT-base ($d = 768$) vs. ALBERT-xlarge ($d = 2048$). If the theorem governs, the higher-dimensional xlarge should collapse *faster*. It collapses far *slower*: ALBERT-base collapses by ~24 iterations, ALBERT-xlarge has not collapsed at 48. The dimension prediction is backwards for this pair. The variable that does order them correctly is $\rho(V)$ — lower spectral radius, slower collapse — so the spectral radius, not $d$, is the governing quantity. That is a direct falsification of the Theorem 6.1 prediction in trained models.

Order the models by $\rho(V)$ rather than by depth and the contradiction dissolves. BERT-base has $\rho \approx 0.94$ — firmly in the fast-clean regime — despite being only 12 layers. GPT-2-medium has $\rho = 3.21$ (24 layers, no separation); GPT-2-large has $\rho = 1.38$ (36 layers, separation). The "depth threshold between 24 and 36 layers" is really a *spectral-radius* threshold that happens to fall between medium and large *within the GPT-2 family*. Depth is a within-family proxy that breaks the moment BERT (shallow but low-$\rho$) is included. The ratio table looked contradictory because it was sorted by the wrong axis.

</details> <details> <summary>The code: extracting V and computing the sign-aware spectrum</summary>

The value matrix is pulled per architecture (per-layer for BERT/GPT-2, one shared matrix for ALBERT), then the sign-aware eigenspectrum is computed:

```
# extraction differs by architecture; GPT-2 slices V out of the fused c_attnif "gpt2" in model_name:    for i, block in enumerate(model.h):        d = block.attn.c_attn.weight.shape[1]        v = block.attn.c_attn.weight[:, 2*d//3:].detach().cpu().float().numpy()  # last third = V        v_matrices.append((f"layer_{i}", v))# (bert: layer.attention.self.value.weight per layer;#  albert: one shared attn.value.weight)for name, V in v_matrices:    sv = svdvals(V)                                # magnitudes (sign-blind)    if V.shape[0] == V.shape[1]:        eigs       = np.linalg.eigvals(V)          # complex — carries sign + phase        real_arr   = np.real(eigs)        is_complex = np.abs(np.imag(eigs)) > 0.01 * (np.abs(real_arr) + 1e-8)        spec_radius = float(np.abs(eigs).max())    # ← collapse-rate predictor        frac_pos    = float((real_arr > 0).mean()) # attractive directions        frac_neg    = float((real_arr < 0).mean()) # repulsive directions        frac_complex = float(is_complex.mean())    # rotational directions
```

`spec_radius`

, `eig_frac_pos_real`

, `eig_frac_neg_real`

, and `eig_frac_complex`

are stored per layer in `v_eigenspectrum.json`

for cross-referencing against the Group A/B plateau and collapse locations.

**Methodological seam (Phase 2 refinement).** This Phase 1 instrument eigendecomposes $V$ alone. Phase 2 argues the physically meaningful operator is the composed **OV circuit** $W_O W_V$ — the actual residual-stream-to-residual-stream map — and that eigendecomposing $V$ in isolation omits the output projection. Phase 2's `extract_ov_circuit`

replaces this. The Phase 1 spectral-radius numbers are directionally robust (the ordering and the Theorem 6.1 falsification hold), but the precise operator is refined there; the Phase 1 figures should be read as the $V$-only proxy, not the final OV-circuit values.

</details>

**What came out:** The spectral radius of $V$ orders the models by collapse speed, and dimension does not. The GPT-2 family shows an abrupt drop in mean spectral radius — 3.21 at GPT-2-medium (24 layers) to 1.38 at GPT-2-large (36 layers), a 2.3× drop with no intermediate value — coinciding exactly with the onset of two-timescale separation. BERT-base ($\rho \approx 0.94$) behaves like the large/fast-clean regime despite its shallow depth, confirming that the governing variable is learned ($\rho$), not structural (depth or $d$). Theorem 6.1's dimension prediction is falsified by the ALBERT-base-vs-xlarge comparison: the higher-dimensional model collapses slower, opposite to the prediction, and $\rho(V)$ explains why. These spectral-radius numbers are from the Phase 1 $V$-only extraction; the qualitative claims survive the Phase 2 OV-circuit refinement, but the exact magnitudes will move.

The two-timescale separation, measured properly (Fix 5, single-run), is the open quantitative question whose numbers await the rerun — but the structure behind it is already clear from the value-matrix spectrum. Separation is real and architecture-specific, and it is governed by the spectral radius of $V$, not by depth or dimension. That single finding does three things: it resolves the self-contradiction in the old ratio table (wrong sorting axis), it falsifies the paper's Theorem 6.1 dimension prediction directly, and it hands Phase 2 a concrete target — the sign-and-phase structure of $V$'s eigenvalues, where the attractive/repulsive/rotational directions live that Group D's energy violations said must exist. The collapse-speed law that comes out of Phase 1 is therefore not "deeper models cluster faster" but "lower-spectral-radius value matrices collapse slower," with depth a within-family confound that the BERT comparison exposes.

*Code excerpts are from *`plots.py`

* (*`analyze_value_eigenspectrum`

*) and *`reporting.py`

* (the superseded collapse-control ratio block). The $V$-extraction and eigenspectrum code is shown close to verbatim; *`_single_trajectory_separation`

* is shown faithful to its Fix 5 specification (the implemented body was not directly retrievable). All separation verdicts are provisional pending the Fixes 1–8 rerun under the single-run definition; the spectral-radius ordering and the Theorem 6.1 falsification are the robust results, refined (not overturned) by Phase 2's OV-circuit operator.*

> **Finding — CONFIRMED (separation) / FALSIFIED (Thm 6.1).** The spectral radius of V orders the models by collapse speed, and dimension does not. The GPT-2 family shows an abrupt drop in mean ρ(V) — 3.21 at gpt2-medium (24 layers) to 1.38 at gpt2-large (36 layers), a 2.3× drop with no intermediate value — coinciding exactly with the onset of two-timescale separation. BERT-base (ρ ≈ 0.94) behaves like the fast-clean regime despite being only 12 layers, confirming the governing variable is *learned* (ρ), not structural (depth or d). Theorem 6.1's dimension prediction is falsified by the matched ALBERT pair: the higher-dimensional xlarge (d = 2048, ρ = 1.278) collapses *slower* than base (d = 768) — opposite to the prediction — and ρ(V) explains why, including why xlarge has not collapsed at 48 iterations (~75 iterations expected from its ρ). **Caveats:** the separation *ratios* under the corrected Fix 5 definition do not exist until the rerun (the old CONFIRMED/NONE verdicts are superseded artifacts of the cross-input denominator); and these are the Phase 1 *V-only* spectral numbers — Phase 2 refines the operator to the composed OV circuit W_O W_V, which moves the magnitudes but not the ordering or the falsification.

---

Clustering is universal, metastability is real, and the two timescales are real above a depth threshold. On the descriptive level, the idealized theory holds up well: trained transformers really do reorganize their tokens into clusters that persist and then slowly merge, across English prose of every length, and — pending the prompt-config reconciliation — across non-English text, code, and equations as well. The phenomenon is robust to what the model is reading.

But the energy never strictly increases throughout all layers. The interaction energy that the theory's gradient flow must drive monotonically upward falls somewhere in every single run. This is not measurement noise and it is not a rare edge case; it is a systematic feature of learned attention. The clustering is produced by something that contains a repulsive component, and the pure-attraction model has no such component. The trained dynamics *look like* the idealized flow from a distance and *are not* the idealized flow up close.

Dimension is a red herring. The theory predicts higher-dimensional models converge faster; the matched ALBERT pair shows the opposite, and the variable that actually orders the models by collapse speed is the spectral radius of the value matrix. Collapse speed is a learned quantity, not a structural one — which is why a shallow model (BERT) can sit in the fast-clean regime and a deep one need not.

This produces two regimes. Models with low spectral radius and weight-level clustering onset are attractor-driven — the architecture clusters at a fixed depth regardless of input (the GPT-2 family from large up, BERT by its ρ). Models whose clustering onset moves with the content are content-driven (GPT-2-small, BERT by its onset). The distinction is not depth; depth is a within-family proxy that breaks the moment BERT — shallow but low-ρ — is included.

The violations are the seam. Everything that confirms the theory hands forward cleanly; the one thing that breaks it is the most informative result in Phase 1, because it points at a specific mechanism. Phase 2 asks whether the value matrix's mixed-sign eigenspectrum is the source of the repulsion — whether the negative and complex eigenvalues of V (refined to the OV circuit) are where the energy-lowering, cluster-separating force actually lives.

The energy violations and the V spectral-radius finding hand directly to Phase 2: the localized energy-drop pairs are the targets, and the sign-and-phase structure of the value matrix (refined to the OV circuit W_O W_V) is where Phase 2 looks for the repulsive directions the violations imply must exist. The cluster labels and trajectories feed the later mechanistic work. The per-head Fiedler split is a standing architectural fact the later phases build on. Three loose ends close out of Phase 1 explicitly: the GPT-2 routing claim awaits the causal-mask baseline; the two-timescale ratios await the Fix 5 / Fix 7 rerun; and the untrained-control and full-prompt-set results await the config reconciliation. Phase 2c (the neural-dynamics comparison) is cut from the series.

---

*File manifest. Per-run outputs. Cross-run report location. Plateau-sensitivity sweep results. Untrained-control run. Length-sweep results.*

*Code excerpts throughout are from `core/models.py`, `core/metrics.py`, `core/clustering.py`, `sinkhorn.py`, `cluster_tracking.py`, `plots.py`, `analysis.py`, and `reporting.py`, plus the Phase 1 test suite. Snippets are lightly trimmed (comments and unrelated branches removed) but otherwise verbatim, except where a function body was reconstructed faithful to its specification (noted at the point of use). Provisional findings will be re-derived after the Fixes 1–8 rerun; the GPT-2 routing result is suspended pending the Fix 3 causal-mask baseline and BERT control.*
