The Residual Stream Has a Geometry of Time


This is a preliminary writeup for an experiment on residual stream geometry. The research direction seems pretty underexplored, so I’m posting early to collect objections, research intuitions, and connections to problems other people are thinking about before I invest in the larger run.

The case for skimming this post: this experiment suggests transformers may keep track of context in a surprisingly compact way. Information that persists across many tokens is not diffuse across activation space; it concentrates in a low-dimensional subspace that can be projected out, compared to attention/MLP writes, and maybe even targeted by interventions.

The residual stream is commonly analogized to the transformer's "working memory." At each token position, a high-dimensional vector accumulates the attention and MLP deltas. This picture considers state transformations along the depth-time axis, i.e. layer by layer.

There is a second axis: sequence-time. Within a layer, the model must also keep track of information at position $t$ that is useful at a later position $t + k$. This experiment aims to discover the geometry of how the model tracks state across tokens.

A direction's (sequence) timescale $\tau$ is the lag at which its sample autocorrelation first drops below a fixed threshold. To calculate $\tau_v$ for direction $v$, I project the residual stream onto $v$ at every token position, compute within-document autocorrelation curves, average them over documents to estimate the lag-$k$ autocorrelation $\hat{\rho}_v(k)$, and set $\tau_v$ to the first lag $k$ at which $\hat{\rho}_v(k)$ falls below the threshold.
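The estimator described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the pipeline used for the reported numbers: the `1/e` threshold, the `max_lag` cap, the toy AR(1) data, and the function names are all assumptions introduced here, and the smoothing step from Appendix A is omitted.

```python
import numpy as np

def lag_autocorr(z, k):
    """Correlation between z[t] and z[t+k] after within-document demeaning."""
    z = z - z.mean()
    a, b = z[:-k], z[k:]
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    # Entries with near-zero lag-slice variance are dropped (returned as NaN).
    return np.nan if denom < 1e-12 else float((a * b).sum() / denom)

def timescale(docs, v, max_lag=64, threshold=np.exp(-1)):
    """First lag where the document-averaged autocorrelation falls below threshold.

    docs: list of (T, d) arrays, one per document; v: (d,) direction.
    The threshold value is an assumption of this sketch.
    Returns max_lag if no crossing occurs (a right-censored probe).
    """
    v = v / np.linalg.norm(v)
    for k in range(1, max_lag + 1):
        # Equal document weighting: average per-document autocorrelations.
        rho_k = np.nanmean([lag_autocorr(x @ v, k) for x in docs])
        if rho_k < threshold:
            return k
    return max_lag

def ar1(T, phi, rng):
    """Toy persistent signal: x[t] = phi * x[t-1] + noise."""
    x = np.zeros(T)
    noise = rng.standard_normal(T)
    for t in range(1, T):
        x[t] = phi * x[t - 1] + noise[t]
    return x

# Toy "residual stream": coordinate 0 is persistent, the rest are white noise.
rng = np.random.default_rng(0)
docs = []
for _ in range(20):
    x = rng.standard_normal((512, 4))
    x[:, 0] = ar1(512, 0.95, rng)
    docs.append(x)

tau_slow = timescale(docs, np.array([1.0, 0.0, 0.0, 0.0]))
tau_fast = timescale(docs, np.array([0.0, 1.0, 0.0, 0.0]))
```

On this toy data the persistent coordinate gets a timescale of many tokens while a white-noise coordinate crosses the threshold at lag 1, which is the qualitative contrast the probe families are meant to expose.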

The primary experiment compares three probe families: 512 random directions (null baseline), 256 PCA directions (ranked by variance), and 256 time-lagged probes (multi-lag TICA style, ranked by persistence). Estimator details are in Appendix A.

I estimated the distribution of timescales across residual-stream directions in layer 12 of Gemma-2-2B on 5,000 C4 documents, then investigated the properties of the high-$\tau$ directions: where they live in the ambient space, how many there are, what they appear to semantically encode, and whether their high timescales are tied to sequential context or just unigram statistics.

Finding 1: The timescale distribution is extremely heavy-tailed, and the tail is carried by about 31 directions. Random and PCA directions have a 90th-percentile timescale of 1 token, carrying essentially no signal across positions. Time-lagged probes have a 90th-percentile timescale of 17 tokens.

To quantify the tail, define a direction's timescale excess as its timescale above the random-direction median,

$$e_v = \tau_v - \operatorname{median}_{r \in \text{random}} \tau_r,$$

and sort the excesses across the deduplicated eligible probe set. Roughly 31 directions carry about 80% of the total excess. The 31 directions are also nonredundant, with low pairwise cosine similarity and effective rank ~28/31 (orthogonality details in Appendix E).
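As a worked illustration of the tail count, the sketch below sorts excesses and finds how many directions are needed to reach a given share of the total. The function name, the clipping of negative excesses to zero, and the toy timescales are assumptions of this example rather than details confirmed by the writeup.

```python
import numpy as np

def directions_for_share(taus, null_median, share=0.8):
    """Smallest number of directions whose excess sums to `share` of the total."""
    # Excess above the random-direction median; clipping at zero is an assumption.
    excess = np.clip(np.asarray(taus, dtype=float) - null_median, 0.0, None)
    ranked = np.sort(excess)[::-1]
    cumulative = np.cumsum(ranked) / ranked.sum()
    return int(np.searchsorted(cumulative, share) + 1)

# Toy timescales: two slow probes dominate the total excess.
toy_taus = [101, 51, 11, 1, 1, 1]
n_needed = directions_for_share(toy_taus, null_median=1, share=0.8)
```

With these toy values the excesses are 100, 50, 10, 0, 0, 0, so the first two directions already carry more than 80% of the total.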

Finding 2: Timescale depends on sequential order. Shuffling token positions within documents, which preserves the token multiset but destroys sequential context, collapses the top-decile timescale of slow probes from 17 tokens to 1 (94% reduction). This rules out token composition alone, such as recurring token clusters, as the reason for high $\tau$.

This section asks two things with one test: 1) where in the residual stream the slow directions live and 2) whether they are the only high-timescale region. If persistent structure lived outside these directions then removing them would leave some behind, so a full collapse of held-out timescales is evidence they carry essentially all of it.

I build three $k$-dimensional bases, project each out of the residual stream, and recompute timescales on held-out probes. The held-out probes are time-lagged directions fit on a disjoint train shard, have a nearly identical timescale distribution to the original time-lagged probe set before projection (Q90 around 19-20 versus 17), and are not eligible for the basis we project out (construction details are in Appendix B).

Projecting out the slow basis collapses persistence on held-out probes almost completely. Removing the same number of top-PCA directions collapses it just as completely, while removing 256 random directions barely touches it.

Finding 3: Persistence depends on the high-variance substrate. It seems like a contradiction that PCA and slow directions are equally effective deletion bases, yet the 31-dim slow basis is only about 9% contained in PCA-31 (2.79 of 31 effective dimensions of overlap) and about 52% in PCA-256. In angle terms, the slow span is far from PCA-31 (median principal angle 81.1°), though it overlaps more with PCA-256 (median principal angle 44.2°).

The resolution comes from distinguishing where a slow direction's energy sits from where its slowness comes from: it may have about half its energy outside PCA-256, but that isn't what makes it slow. PCA diagonalizes the zero-lag covariance, so its components are uncorrelated at lag 0, and empirically each one alone is fast (low $\tau$). But the lagged covariance isn't diagonal in that basis! The slow directions are combinations of PCA components whose lagged cross-covariances reinforce each other, keeping the combined signal correlated out to large $k$. So slowness is a relational property of how components co-vary across lags, not a marginal property of any single component (a more rigorous treatment is in Appendix F).

An intuitive image is that a single PCA axis is a note that fades at once, while a slow direction is a chord whose overtones ring on, living in how the notes combine rather than in the loudest among them. So deleting the slow directions removes a slowness ingredient many computations ride on, not an isolated module.

Finding 1 gave 31 slow axes, but the objective could have picked special directions while generic directions in their span remain fast (low $\tau$). To test this, I sample random unit vectors inside the slow span, matched PCA spans, and the ambient space, then measure held-out timescale with the same estimator. The method and the full set of checks are in Appendix E.

Subspace	Dim	Median timescale (tokens)	Fraction above ambient null
Slow subspace	31	25	1.00
PCA-31	31	1.5	0.50
PCA-128	128	1.0	0.29
PCA-256	256	1.0	0.15
Ambient	2304	1.0	n/a (defines null)

Test split, 512 sampled directions per subspace; ambient null = ambient-random Q95 timescale. Validation is nearly identical. Slow-subspace lower-tail quantiles on test: q05 = 10, q25 = 18, q50 = 25, q75 = 40, q95 = 82.

Finding 4: There exists a slow subspace. The median $\tau$ inside the slow span is 25, against 1 to 1.5 for matched-dimension PCA spans. Every sampled slow-subspace direction beat the null while at least half of the PCA-control directions fell inside it, so there is no obvious "fast corner" of the slow subspace. The caveat is that 512 samples can support a claim about generic directions pretty well, but not a proof that every direction in the continuous span is slow.

But the subspace is not uniformly slow. Sweeping its dimension reveals a steep internal gradient.

Intuitively, a wider span includes increasingly fast directions, so a random vector places less of its energy on the slowest modes. Also note that the tail is independently slow, as a random mix of only the bottom-ranked directions (ranks 17 to 31) still has median $\tau = 14$. The sweep falls off steeply because the 31 directions span roughly 28 distinct geometric dims (geometric participation ratio 28) but concentrate their persistence into roughly five modes (slowness participation ratio 5).

Span	Median timescale (tokens)
top-1	492
top-2	298
top-3	201
top-5	116
top-8	87
top-13	30
top-21	27.5
top-31 (full span)	25
ranks 17-31 only	14

Test split; validation tracks within a token or two.

The structure is stable across splits (Spearman above between all split pairs, for random-in-span across validation and test, no train-to-test shrinkage). Full split-stability table and the test-inspection disclosure are in Appendix E.

Finding 5: The slow directions align with attention-output geometry. What in the model creates the slow directions? The natural hypothesis is that attention writes/maintains them, as it is the model's only mechanism for mixing information across positions.

However, raw overlap with attention outputs is a confounded test. Since attention writes into the residual stream, which is dominated by a few top PCA axes, any two subspaces drawn from it overlap by default. I measure each direction's overlap with the attention and MLP output subspaces and subtract its overlap with a matched-dimension PCA subspace to control for this.

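For a unit direction $v$ and a subspace projection matrix $P_S$, the overlap score is $o_S(v) = \lVert P_S v \rVert^2 = v^\top P_S v$, and the excess attention and MLP overlaps are:

$$\Delta_{\text{attn}}(v) = o_{\text{attn}}(v) - o_{\text{PCA}}(v), \qquad \Delta_{\text{MLP}}(v) = o_{\text{MLP}}(v) - o_{\text{PCA}}(v).$$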
The curves show the selected-slow-direction minus random-direction median excess as progressively more PCA components are removed. Discussed more in Appendix C.
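To make the anisotropy control concrete, here is a minimal sketch of an overlap score and its PCA-subtracted excess. It assumes the overlap is the squared norm of the projection onto a subspace given by orthonormal columns; the function names and the toy bases are illustrative assumptions, not the author's code.

```python
import numpy as np

def overlap(v, basis):
    """Squared norm of unit-normalized v projected onto span(basis).

    basis: (d, m) array with orthonormal columns.
    """
    v = v / np.linalg.norm(v)
    return float(np.sum((basis.T @ v) ** 2))

def excess_overlap(v, write_basis, pca_basis):
    """Overlap with a write subspace minus overlap with a matched PCA subspace."""
    return overlap(v, write_basis) - overlap(v, pca_basis)

# Toy 6-dimensional example: the "attention" subspace spans axes 0-1,
# the matched "PCA" subspace spans axes 2-3.
eye = np.eye(6)
attn_basis = eye[:, [0, 1]]
pca_basis = eye[:, [2, 3]]

excess_inside_attn = excess_overlap(eye[:, 0], attn_basis, pca_basis)
excess_inside_pca = excess_overlap(eye[:, 2], attn_basis, pca_basis)
excess_elsewhere = excess_overlap(eye[:, 5], attn_basis, pca_basis)
```

The score ranges from -1 to 1 and is zero for a direction that overlaps both subspaces equally, which is why it only means something relative to a random-direction baseline.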

The slow directions have positive excess attention overlap, with median . The score is unitless and only means something against a baseline: random directions sit at (SD ), which means the slow directions are about SDs above the random mean.

Overall, this test finds that long-timescale directions lie unusually close to attention output subspaces, with weaker evidence for late-MLP overlap. It doesn't prove that attention causally maintains state in the slow subspace.

The goal is to characterize what semantics the slow subspace carries, but the trap is that top-token labels are nearly free. Regardless of timescale, any direction yields a plausible story from its top tokens. Instead, for each direction I look for long, high-projection runs on the validation split and count how many distinct documents contain at least one, so a single page can't drive the result. Details are in Appendix D.

The contrast is large. A generic rotation inside the slow subspace has qualifying runs in a median of 126 of 500 validation documents, and still 106 at the 10th percentile. The matched controls collapse: random PCA-31 directions have a median of 18 documents and fewer than 5 at the 10th percentile, while PCA-128, PCA-256, and ambient-random are in the low single digits.

The cautious verdict is that on C4 the slow subspace mostly tracks durable web-document state, including register, domain, formatting, source templates, and scraping artifacts. That is not yet evidence for clean semantic features or abstract reasoning variables. These directions behave like held state, lit steadily over a coherent passage, rather than higher-level variables updated during reasoning, and separating those two cases needs new experiments (see below).

Limitations of this section:

This experiment used one corpus, layer, and model, so it doesn't license any broad interpretability claims.

Currently, the contribution is the axis itself. Timescale ranks directions by what the model maintains across positions, a question sidestepped by the major decomposition methods: PCA ranks directions by variance, SAEs find sparse features at single positions, and parameter decomposition describes static weight structure. None of these directly answers what persists across time.

On generalization, my prior is that a slow subspace is roughly model-general but its orientation is corpus-dependent. Persistent state has to live somewhere but the residual stream is too anisotropic for it to spread evenly, so some compact slow geometry should recur.

A high-$\tau$ direction is likely one of two things. 1) Held state: the value sits at a steady level over a passage and varies between passages, e.g. what C4 appears to surface. 2) Live variable: the value changes over the passage while staying readable from the same direction, like a count, proof state, or intermediate conclusion. The second case is the linear representation hypothesis (LRH) with dynamics: we still have variables that correspond to fixed directions, but the scalar value is updated over sequence-time.

Concrete experimental continuation:

A corpus that forces the distinction. Multi-step proofs, long-horizon code, or controlled reasoning tasks with a labeled running variable. I will also add sentence-level shuffling and paraphrase controls to cut the local lexical clustering that C4 leaves in.

A direct held vs. live discriminator. Regress each slow direction's within-document projection against the labeled quantity. If it tracks a document-level constant, it is held state. But if it tracks the running value, the slow subspace likely carries a live variable.

The interpretability payoff depends on which world that test lands in. If the slow subspace is only web-document register, timescale is less exciting. If it carries maintained reasoning variables, then some really interesting questions become askable: whether a steering vector shifts maintained state or just "greases the logits," whether features decompose into fast and slow components, whether sparse-feature methods could be run not over the whole activation space but constrained to the high-timescale region.

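Let $x_{d,t}^{(\ell)} \in \mathbb{R}^{2304}$ be the residual stream for document $d$, token $t$, and hook layer $\ell$. Train-token centering uses

$$\mu = \frac{1}{N_{\text{train}}} \sum_{(d,t) \in \text{train}} x_{d,t}^{(\ell)}, \qquad \tilde{x}_{d,t} = x_{d,t}^{(\ell)} - \mu.$$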
Validation/test tokens are not used to fit PCA directions, time-lagged probes, or centering statistics.

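For a unit probe $v$ and train-centered residual vectors $\tilde{x}_{d,t}$,

$$z_{d,t}(v) = v^\top \tilde{x}_{d,t}.$$

Within-document demeaning, for a document of length $T_d$:

$$\bar{z}_{d,t}(v) = z_{d,t}(v) - \frac{1}{T_d} \sum_{s=1}^{T_d} z_{d,s}(v).$$

Per-document lag-$k$ autocorrelation:

$$\rho_d(v, k) = \frac{\sum_{t=1}^{T_d - k} \bar{z}_{d,t}\, \bar{z}_{d,t+k}}{\sqrt{\sum_{t=1}^{T_d - k} \bar{z}_{d,t}^2 \; \sum_{t=1}^{T_d - k} \bar{z}_{d,t+k}^2}}.$$

Probe-level curve, with equal document weighting:

$$\hat{\rho}_v(k) = \frac{1}{|\mathcal{D}_k(v)|} \sum_{d \in \mathcal{D}_k(v)} \rho_d(v, k),$$

where $\mathcal{D}_k(v)$ excludes document/probe/lag entries with near-zero lag-slice variance.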

I smooth with a centered moving average of width 5 for , set , and clip to . The reported timescale is

If no crossing occurs by , the probe is marked right-censored. In this pilot, no non-random probes were right-censored.

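The time-lagged probes are fit on train documents only. Estimate the zero-lag and lag-$k$ covariances of the centered residual vectors $\tilde{x}_{d,t}$,

$$C_0 = \mathbb{E}\big[\tilde{x}_{d,t}\, \tilde{x}_{d,t}^\top\big], \qquad C_k = \mathbb{E}\big[\tilde{x}_{d,t}\, \tilde{x}_{d,t+k}^\top\big].$$

Use the symmetrized multi-lag covariance

$$\bar{C} = \frac{1}{|\mathcal{K}|} \sum_{k \in \mathcal{K}} \tfrac{1}{2}\big(C_k + C_k^\top\big),$$

where $\mathcal{K}$ is the set of lags. Time-lagged probes are the leading generalized eigenvectors of

$$\bar{C}\, v = \lambda\, (C_0 + \epsilon I)\, v.$$

Conceptually, this maximizes a direction's multi-lag self-predictiveness, $\frac{v^\top \bar{C} v}{v^\top (C_0 + \epsilon I) v}$: the numerator rewards covariance with future positions, while the denominator prevents the estimator from merely selecting high-variance directions.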

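Here $\epsilon$ is a ridge term whose initial value is listed under fit health below. If the regularized problem is ill-conditioned, $\epsilon$ is doubled until the condition number falls below the target, up to a fixed cap.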
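A compact way to see how such a fit works is to solve a ridge-regularized generalized eigenproblem on toy data. The sketch below is an assumption-laden illustration: the lag set, the fixed `eps`, the equal-weight average over lags, and the Cholesky-whitening route are choices made for this example, and it skips the 512-PC whitening and ridge-doubling logic described above.

```python
import numpy as np

def time_lagged_probes(docs, lags=(1, 2, 4), eps=1e-3, n_probes=1):
    """Leading generalized eigenvectors of C_bar v = lambda (C_0 + eps I) v.

    docs: list of centered (T, d) arrays. Returns (eigenvalues, probes (n_probes, d)).
    """
    d = docs[0].shape[1]
    n_tokens = sum(len(x) for x in docs)
    c0 = sum(x.T @ x for x in docs) / n_tokens
    c_bar = np.zeros((d, d))
    for k in lags:
        n_pairs = sum(len(x) - k for x in docs)
        c_k = sum(x[:-k].T @ x[k:] for x in docs) / n_pairs
        c_bar += 0.5 * (c_k + c_k.T) / len(lags)  # symmetrize, then average lags
    # Whiten with the Cholesky factor of the regularized zero-lag covariance,
    # which turns the generalized problem into an ordinary symmetric one.
    chol_inv = np.linalg.inv(np.linalg.cholesky(c0 + eps * np.eye(d)))
    evals, evecs = np.linalg.eigh(chol_inv @ c_bar @ chol_inv.T)
    top = np.argsort(evals)[::-1][:n_probes]
    probes = (chol_inv.T @ evecs[:, top]).T
    probes /= np.linalg.norm(probes, axis=1, keepdims=True)
    return evals[top], probes

# Toy data: one low-variance persistent latent hidden among two
# high-variance white-noise latents, then rotated.
rng = np.random.default_rng(0)
rot, _ = np.linalg.qr(rng.standard_normal((3, 3)))
docs = []
for _ in range(20):
    latent = 3.0 * rng.standard_normal((500, 3))
    slow = np.zeros(500)
    for t in range(1, 500):
        slow[t] = 0.95 * slow[t - 1] + 0.3 * rng.standard_normal()
    latent[:, 0] = slow
    docs.append(latent @ rot.T)
mean = np.concatenate(docs).mean(axis=0)
docs = [x - mean for x in docs]

evals, probes = time_lagged_probes(docs, n_probes=2)
slow_dir = rot[:, 0]               # direction that reads out the slow latent
alignment = abs(probes[0] @ slow_dir)
```

In this toy setup the top probe recovers the persistent latent even though it carries far less variance than the noise latents, which is the sense in which the denominator keeps the objective from simply selecting high-variance directions.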

The covariance objective chooses candidate directions; the held-out autocorrelation crossing defines the reported timescale. All reported values use this same within-document estimator. Their scale differs because they summarize different probe sets: the headline Q90 of 17 is computed across the original time-lagged probes, while the subspace battery in Appendix E samples rotations within the selected span and nested prefixes concentrated on its slowest directions.

This objective has standard precedents. It is closely related to slow feature analysis (Wiskott and Sejnowski, 2002) and to time-lagged independent component analysis (Perez-Hernandez et al., 2013), which extracts slow collective coordinates through a lagged-covariance generalized eigenproblem. TICA has a variational interpretation in terms of slow transfer-operator modes, making operator language a useful analogy. I use the narrower claim here: this probe fit surfaces slow linear residual-stream modes; it is not a full Koopman or dynamic-mode-decomposition analysis.

Fit health:

Diagnostic	Value
Lagged pair count	19,488,000
Train token count	4,096,000
Whitening PCs requested / used	512 / 512
Initial epsilon	0.00034823549316734574
Ridge doublings	0
Final condition number	147.78
Fit unstable	false
Anti-persistent or sign-changing count	0

Top generalized eigenvalues:

0.9082, 0.8654, 0.8534, 0.8218, 0.7620, 0.7357,
0.7230, 0.6835, 0.6419, 0.6071, 0.5948, 0.5680

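Let $S_k$ be the top-$k$ eligible residual probes ranked by held-out within-document timescale after deduplication. I orthonormalize these directions to obtain a slow-direction basis $Q_k \in \mathbb{R}^{2304 \times k}$.

For each centered residual vector $\tilde{x}_{d,t}$, I remove the component in this basis:

$$x^{\perp}_{d,t} = (I - Q_k Q_k^\top)\, \tilde{x}_{d,t}.$$

I then recompute held-out probe projections for each held-out probe $h$,

$$z^{\perp}_{d,t}(h) = h^\top x^{\perp}_{d,t},$$

and re-estimate the same within-document autocorrelation timescale used in the main analysis.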
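The projection step itself is a few lines of linear algebra. The following sketch assumes a QR-based orthonormalization and uses random toy vectors; the function names are hypothetical and the held-out probe here is just a random direction.

```python
import numpy as np

def orthonormal_basis(directions):
    """directions: (k, d) rows. Returns (d, k) orthonormal columns spanning them."""
    q, _ = np.linalg.qr(np.asarray(directions, dtype=float).T)
    return q

def project_out(x, basis):
    """Remove from each row of x (N, d) its component in span(basis) (d, k)."""
    return x - (x @ basis) @ basis.T

rng = np.random.default_rng(0)
slow_dirs = rng.standard_normal((3, 8))     # toy stand-ins for selected probes
resid = rng.standard_normal((10, 8))        # toy centered residual vectors
held_out_probe = rng.standard_normal(8)
held_out_probe /= np.linalg.norm(held_out_probe)

basis = orthonormal_basis(slow_dirs)
resid_perp = project_out(resid, basis)

# Held-out projections before and after removing the basis.
z_before = resid @ held_out_probe
z_after = resid_perp @ held_out_probe
```

After projection the residual vectors have no component along any of the removed directions, so any persistence remaining in `z_after` must come from structure outside the removed span.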

The held-out evaluation probes are fixed before projection-collapse evaluation:

The candidate time-lagged probes used to construct are fit on train shard A. The held-out time-lagged evaluation probes are fit on train shard B. Both are evaluated on the same validation/test documents. This makes the test non-circular: the evaluation probes are not used to construct the projected basis, but their before/after timescales are measured on the same held-out documents.

The corpus contained 5,000 C4 documents of length 1,024 tokens:

Bucket	Documents	Tokens
Full train split	4,000	4,096,000
Train shard A	3,200	3,276,800
Train shard B	800	819,200
Validation	500	512,000
Test	500	512,000

For the projection-collapse analysis, the train split was further partitioned using a held-out train fraction of 0.2. Candidate time-lagged probes used to construct the projected basis were fit on train shard A. The 128 held-out time-lagged evaluation probes were independently fit on train shard B and were never eligible for the projected basis.

Projection collapse was evaluated on validation during development and then confirmed on the test split using the same fixed probes and bases. No candidate probes, held-out probes, or projected bases were refit for the test confirmation.

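For each held-out evaluation family $F$, the collapse score is the fractional reduction in top-decile timescale after projecting out a basis $B$,

$$c_B(F) = 1 - \frac{Q_{90}\big(\tau^{\text{after}}_F\big)}{Q_{90}\big(\tau^{\text{before}}_F\big)}.$$

I compute $c_B$ for three matched bases: the slow basis, the top-PCA basis, and a random basis of the same dimension.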
The random-control basis tests whether collapse is caused merely by deleting any dimensions. The PCA basis tests whether held-out persistence functionally depends on high-variance residual geometry.

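For unitless overlap/excess scores $s$, I report scale relative to matched random residual directions:

$$\text{SD-scale}(s) = \frac{\operatorname{median}_{v \in \text{selected}} s(v) - \operatorname{mean}_{r \in \text{random}} s(r)}{\operatorname{SD}_{r \in \text{random}} s(r)}.$$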
This is a scale calibration, not a Gaussian significance test.
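As a small worked example of this calibration, the sketch below expresses a selected-set median in units of the random-direction standard deviation. The function name, the use of the sample standard deviation (`ddof=1`), and the toy scores are assumptions of this illustration.

```python
import numpy as np

def sd_scale(selected_scores, random_scores):
    """(median of selected - mean of random) / SD of random."""
    random_scores = np.asarray(random_scores, dtype=float)
    center = random_scores.mean()
    spread = random_scores.std(ddof=1)  # sample SD; the ddof choice is an assumption
    return float((np.median(selected_scores) - center) / spread)

# Toy scores: selected median 5, random mean 1, random sample SD 1.
toy_scale = sd_scale([3.0, 5.0, 7.0], [0.0, 1.0, 2.0])
```

The result reads as "how many random-direction SDs above the random mean," a description of scale rather than a test statistic.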

For the attention-overlap result in Finding 5, . The reported slow-direction value, , is the median over the 31 selected slow directions. The random baseline uses the matched random residual directions: mean , standard deviation . Thus the slow-direction median is random-control SDs above the random mean.

This is a descriptive scale calibration, not a p-value. The 31 slow directions were selected by the pipeline and are not IID samples from a null distribution, so the calculation should be read as "large on the random-direction scale," not as a Gaussian-significance claim.

The residual-PCA subtraction is useful for controlling residual-stream anisotropy, but it should not be read literally for PCA axes themselves. Top PCA directions have higher raw attention overlap than the slow directions, but median . After subtracting the matched residual-PCA baseline, their median excess attention overlap is , about control SDs below the random mean. This does not mean PCA directions are "anti-attention"; their residual-PCA baseline is mechanically large because they are PCA axes. The point is just that raw attention overlap is not the signal.

The semantic readout is exploratory and is not used in locked quantitative results.

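For slow direction $v_i$ and train-centered residual vectors $\tilde{x}_{d,t}$, compute held-out validation projections:

$$z_{d,t}(v_i) = v_i^\top \tilde{x}_{d,t}.$$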
For the document-coverage comparison, I apply the same procedure to ten sampled vectors from each span family. For each direction, I take the global q95 projection threshold, retain contiguous above-threshold token runs of at least 8 tokens, and collapse them to distinct validation documents. This asks how widely a direction sustains a long high-projection run without letting repeated runs from one page inflate the count. I also report the number of retained runs per covered document.
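The run-coverage count described above can be sketched as follows. This is an illustrative reading of the procedure: the function name and toy projections are invented here, and the example call lowers the quantile so that the hand-built toy data has a well-defined threshold.

```python
import numpy as np

def run_coverage(projections, quantile=0.95, min_run=8):
    """Count documents with at least one long above-threshold run.

    projections: list of 1-D arrays, one per document, for a single direction.
    Returns (documents covered, retained runs per covered document).
    """
    # One global threshold per direction, pooled over all documents.
    thresh = np.quantile(np.concatenate(projections), quantile)
    docs_hit, total_runs = 0, 0
    for z in projections:
        above = np.concatenate(([False], z > thresh, [False]))
        edges = np.flatnonzero(above[1:] != above[:-1])
        starts, ends = edges[::2], edges[1::2]   # run is [start, end)
        n_runs = int(np.sum(ends - starts >= min_run))
        total_runs += n_runs
        docs_hit += int(n_runs > 0)
    per_doc = total_runs / docs_hit if docs_hit else 0.0
    return docs_hit, per_doc

# Toy projections for four documents of 100 tokens each.
doc_a = np.zeros(100); doc_a[10:30] = 5.0                      # one long run
doc_b = np.zeros(100); doc_b[5] = 5.0; doc_b[50] = 5.0         # isolated spikes
doc_c = np.zeros(100)                                          # nothing
doc_d = np.zeros(100); doc_d[0:10] = 5.0; doc_d[40:50] = 5.0   # two long runs

covered, runs_per_doc = run_coverage([doc_a, doc_b, doc_c, doc_d],
                                     quantile=0.8, min_run=8)
```

Collapsing runs to distinct documents means a page with many runs counts once toward coverage, while the runs-per-covered-document figure keeps track of how concentrated the runs are.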

Across families the retained-runs-per-covered-document counts are about 4.9 for the selected slow directions, 3.8 for random-in-span rotations, and near 1 for the PCA and ambient controls, the same graded structure the participation ratios show: the objective picks the sharpest axes, and generic directions in their span are slower but softer.

As a separate window-independence check, I count distinct source documents among the top 12 projection windows. A lower count can be the expected signature of document-state rather than an inflation artifact: a direction that tracks document-level state should produce correlated windows within a page. The distinct-document run count is the safeguard against mistaking one repeated page for a corpus-wide effect. In the body result, the slow-subspace median of 126 distinct documents already rules out one over-represented page driving the effect. Separately, slow-subspace directions touch fewer distinct top-window documents than ambient-random ones (median 7 versus 12), which is the expected signature of document-level tracking rather than scattered token firing.

Finally, I inspect readable windows around upper- and lower-tail token positions of the projection. Labels summarize recurring patterns in those tail examples.

These are qualitative projection-tail labels, not trained classifiers, causal features, or validated semantic variables. For example, “legal/privacy boilerplate” means that many tail examples had that style; it does not mean the direction exclusively encodes that concept.

This appendix details the diagnostics behind Finding 4. All quantities are evaluated on held-out validation and test documents using the timescale estimator of Appendix A. The battery operates on saved projection artifacts; it performs no additional model forward pass or probe refit beyond the sampling described here.

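Random-in-span sampling. Let $B_{\text{slow}}$ be the orthonormalized basis of the 31 directions from Finding 1. More generally, for an $m$-dimensional subspace with orthonormal basis $B \in \mathbb{R}^{2304 \times m}$, I sample coefficients

$$a \sim \mathcal{N}(0, I_m)$$

and form

$$u = \frac{B a}{\lVert B a \rVert_2}.$$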
This gives a uniformly random unit direction inside the subspace. I then compute the same held-out within-document autocorrelation timescale for each sampled direction.
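A minimal sketch of this sampler, assuming Gaussian coefficients followed by normalization (the standard construction for a uniform direction on a sphere); the function name and the toy basis are illustrative.

```python
import numpy as np

def sample_in_span(basis, n, rng):
    """Draw n unit vectors uniformly from the unit sphere of span(basis).

    basis: (d, m) array with orthonormal columns. Returns an (n, d) array.
    """
    coeffs = rng.standard_normal((n, basis.shape[1]))
    vecs = coeffs @ basis.T
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

rng = np.random.default_rng(0)
toy_basis, _ = np.linalg.qr(rng.standard_normal((10, 3)))  # toy 3-dim subspace of R^10
samples = sample_in_span(toy_basis, 5, rng)

# Component of each sample lying outside the subspace (should vanish).
outside = samples - (samples @ toy_basis) @ toy_basis.T
```

Because the basis is orthonormal, an isotropic Gaussian in coefficient space is isotropic inside the subspace, so normalizing gives a rotation-invariant draw.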

Matched controls. The same procedure is applied to the top-$m$ PCA subspaces for $m \in \{31, 128, 256\}$ and to the full ambient space. The PCA-31 control is the critical matched-dimensional comparator: it tests whether selecting the same number of highest-variance residual axes is enough to recover a generically slow subspace. PCA-128 and PCA-256 test whether random rotations become slow merely by drawing them from progressively broader high-variance residual geometry.

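Subspace-PCA geometry. The table below gives principal-angle and energy statistics between the 31-dimensional slow basis and the top-$m$ PCA subspaces, computed from saved artifacts with no rerun. Writing $B_{\text{slow}}$ for the orthonormal slow basis and $P_{\text{PCA}}$ for the projector onto the PCA subspace, mean squared containment is the fraction of energy captured by the PCA subspace, $\lVert P_{\text{PCA}} B_{\text{slow}} \rVert_F^2 / 31$; effective overlap is the unnormalized numerator, with units of dimensions.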

Comparison	Mean squared containment	Effective overlap	Median principal angle	Worst angle
vs PCA-31	8.99%	2.79 / 31	81.1°	90.0°
vs PCA-128	28.69%	8.89 / 31	-	-
vs PCA-256	52.06%	16.14 / 31	44.2°	62.8°
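The containment, effective-overlap, and principal-angle statistics in this table all follow from the singular values of the product of two orthonormal bases. The sketch below assumes both bases have orthonormal columns; the function name and the toy subspaces are illustrative.

```python
import numpy as np

def subspace_geometry(slow_basis, pca_basis):
    """Containment, effective overlap, and principal angles between two subspaces.

    Both inputs are (d, m) arrays with orthonormal columns.
    """
    # Singular values of Q_pca^T B_slow are cosines of the principal angles.
    cosines = np.linalg.svd(pca_basis.T @ slow_basis, compute_uv=False)
    cosines = np.clip(cosines, 0.0, 1.0)
    effective_overlap = float(np.sum(cosines ** 2))   # squared Frobenius norm
    containment = effective_overlap / slow_basis.shape[1]
    angles_deg = np.degrees(np.arccos(cosines))
    return containment, effective_overlap, angles_deg

# Toy example in R^4: the subspaces share one axis and differ in the other.
eye = np.eye(4)
toy_slow = eye[:, [0, 1]]
toy_pca = eye[:, [0, 2]]
containment, eff_overlap, angles = subspace_geometry(toy_slow, toy_pca)
```

In the toy case one principal angle is 0° and the other 90°, giving an effective overlap of one dimension and a containment of one half.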

The 31 directions are also near-orthogonal to each other: the median pairwise absolute cosine is only about 1.7x the random-cosine SD of $1/\sqrt{2304} \approx 0.021$ for independent unit vectors in $\mathbb{R}^{2304}$. This is the within-set redundancy check; effective rank is about 28 of 31.

There is one exact shared direction (PC-1). Removing it, the remaining 30-dimensional slow rotation sits at a median principal angle of 81.6° from PCA axes 2-31. The slow-basis energy is spread broadly across the PCA spectrum:

PCA band	Fraction of energy
PC 1	3.2%
PCs 2-31	5.8%
PCs 32-64	7.1%
PCs 65-128	12.6%
PCs 129-256	23.4%
Outside PCA-256	47.9%

It takes 246 PCA axes to capture 50% of the slow-basis norm, and excluding PC-1 the captured energy is nearly uniform across PCs 2-256. So the slow subspace is not unusually repelled from PCA-31; it is a diffuse rotation across a broad high-variance substrate that the leading PCA frame does not recover. As an exploratory statistic, among the selected source axes slower directions have more energy captured by PCA-256 (Spearman , even excluding PC-1).

Classification rule (pre-registered). The slow subspace is classified "fat" when the median random-in-span exceeds the ambient-random Q90 null. The result is fat: median random-in-span (val), (test) against an ambient null of . The stronger "coordinate-free" claim, that essentially every direction in the span is slow, was not certified by this rule, because the locked rule tests only the median. The lower-tail analysis below is the empirical strengthening toward that stronger statement, but it is reported as exploratory rather than as a locked result.

Nested dimension sweep. Random-in-span sampling is repeated restricted to the top-$k$ directions for $k \in \{1, 2, 3, 5, 8, 13, 21, 31\}$. Median $\tau$ declines with $k$ (492, 298, 201, 116, 87, 30, 27.5, 25 on test). The decline is partly mechanical, since a larger span admits progressively less-slow directions. The two non-mechanical observations are that the floor at $k = 31$ (median $\tau = 25$) stays far above the null, and that random mixes of only ranks 17 to 31 retain median $\tau = 14$ on test.

Participation ratios. For the 31 directions I report two ratios using the participation-ratio functional $\mathrm{PR}(w) = \big(\sum_i w_i\big)^2 / \sum_i w_i^2$. Applied to the covariance eigenvalues of the direction set it gives the geometric participation ratio, about 28, confirming near-orthogonality and close-to-full rank. Applied to the timescale-excess values it gives the slowness participation ratio, about 5, quantifying that persistence concentrates into roughly five effective dimensions despite the high geometric rank.
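For intuition, the participation ratio of a nonnegative weight vector $w$ in its standard form, $(\sum_i w_i)^2 / \sum_i w_i^2$, counts roughly how many entries carry the weight. The sketch below uses that standard form with toy inputs; the function name is an assumption of this example.

```python
import numpy as np

def participation_ratio(weights):
    """(sum w)^2 / sum(w^2): an effective count of contributing entries."""
    w = np.asarray(weights, dtype=float)
    return float(w.sum() ** 2 / np.sum(w ** 2))

pr_uniform = participation_ratio([1.0, 1.0, 1.0, 1.0])     # weight spread evenly
pr_single = participation_ratio([1.0, 0.0, 0.0, 0.0])      # weight on one entry
pr_skewed = participation_ratio([10.0, 1.0, 1.0, 1.0])     # one dominant entry
```

Evenly spread weight gives the full count, a single nonzero entry gives 1, and a skewed vector lands in between, which is how the same 31 directions can be about 28-dimensional geometrically yet about 5-dimensional in slowness.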

Lower-tail analysis. With a 5000-replicate direction bootstrap, the fraction of the 512 random-in-span directions exceeding the ambient Q95 null is 1.00, with timescale quantiles q05 = 10, q25 = 18, q50 = 25, q75 = 40, q95 = 82 (test). Matched controls: random PCA-31 has 50% above the null, PCA-128 has 29%, PCA-256 has 15%. The bootstrap characterizes Monte Carlo uncertainty over the already-sampled directions and does not substitute for a fresh prospective split.

Split stability. The 31 selected directions have median of (train), (val), (test), with Spearman correlations of across all split pairs and a train-to-test median shrinkage ratio of . The random-in-span endpoint reproduces across val and test with Spearman and median absolute change of 1, with the fraction above the per-split null equal to on both. This rules out the shrinkage regime in which directions are slow on the fit shard and regress to the null on held-out data.

Disclosure. The test split had been inspected elsewhere in the workflow before the battery rule was locked, so test results are formally exploratory. Validation is the independent held-out reference and tells the same story.

Write a slow direction in the full PCA basis,

$$v = \sum_i a_i\, p_i,$$

where $p_i$ are the PCA axes and $a_i^2$ is its energy on component $i$. Its lag-$k$ autocovariance is

$$\gamma_v(k) = v^\top C_k v = \sum_i a_i^2\, C_k^{ii} + \sum_{i \neq j} a_i a_j\, C_k^{ij},$$

where $C_k^{ij}$ is the lag-$k$ cross-covariance of PCA components $i$ and $j$.

PCA diagonalizes $C_0$, so the components are uncorrelated at lag 0. Empirically, the diagonal terms $C_k^{ii}$ also decay quickly: individual PCA axes are fast. The lagged covariance $C_k$ is not diagonal in the PCA basis, so the off-diagonal terms can matter. The time-lagged objective selects combinations of PCA components whose cross-lag terms reinforce, keeping $\gamma_v(k)$ large out to large $k$ relative to the variance $\gamma_v(0)$. Slowness is therefore an off-diagonal property of how components co-vary across lags, not a property of any single PCA component.
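The split of a direction's lag-$k$ autocovariance into per-component (diagonal) and cross-component (off-diagonal) contributions can be checked numerically. The sketch below is a toy verification of that identity on random data; the function name, the random component series, and the coefficients are assumptions of this example, not results from the experiment.

```python
import numpy as np

def lagged_split(components, coeffs, k):
    """Split the lag-k autocovariance of a combined signal into two parts.

    components: (T, n) array of component time series; coeffs: (n,) weights.
    Returns (diagonal part, off-diagonal part).
    """
    c = components - components.mean(axis=0)
    cross = c[:-k].T @ c[k:] / (len(c) - k)   # lag-k cross-covariance matrix
    weighted = np.outer(coeffs, coeffs) * cross
    diagonal = float(np.trace(weighted))
    off_diagonal = float(weighted.sum() - diagonal)
    return diagonal, off_diagonal

rng = np.random.default_rng(0)
toy_components = rng.standard_normal((400, 5))
toy_coeffs = rng.standard_normal(5)
lag = 3

diag_part, off_part = lagged_split(toy_components, toy_coeffs, lag)

# Direct lag-k autocovariance of the combined signal, for comparison.
combined = (toy_components - toy_components.mean(axis=0)) @ toy_coeffs
direct = float(combined[:-lag] @ combined[lag:] / (len(combined) - lag))
```

The two parts always sum to the direct autocovariance, so a direction can be slow through its off-diagonal part even when every diagonal term is near zero.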

This also explains how PCA-256 and the slow basis can both be effective deletion bases without being the same object. PCA-256 does not contain the slow basis: it captures about half of its energy, while the rest lies outside PCA-256. But the collapse result suggests that the lagged structure needed for persistence depends strongly on this broad high-variance substrate. Projecting out PCA-256 removes enough of the shared components to break the reinforcing cross-lag combinations; projecting out the slow basis removes those combinations more directly.

Identity is not symmetric. PCA-256 contains no individually slow axis: each PCA axis has low $\tau$. The slow basis is the particular orientation in which components combine into long-timescale directions. One is an efficient remover of slowness that is itself fast; the other is slow.

source & further reading

Original article: lesswrong.com
