I’ve spent the last few months trying to answer a question that looked much simpler at first than it turned out to be:
What actually happens inside an LLM while it is generating a response?
Most work evaluates language models through their outputs (benchmarks, perplexity, reasoning scores…). I decided to look at something different: the evolution of the hidden representations themselves.
I built a runtime framework that records hidden states layer-by-layer during inference and started running the same experiments across multiple open-weight models (GPT-2, DistilGPT2, OPT-125M, Qwen2.5-0.5B-Instruct, TinyLlama, Phi-1.5 and Llama-3.2-1B).
I expected a relatively straightforward result.
Instead, every new experiment generated a new question.
Some of the observations so far are:
• Hidden-state trajectories are not random. They exhibit reproducible internal dynamical regimes across architectures.
• Functional proxy states (syntax-like processing, decision-like behavior and output stabilization) can be detected consistently enough to cluster models according to their internal dynamics rather than simply their parameter count.
• These functional signatures remain reasonably stable across different prompt families, although not perfectly, suggesting that prompt content modulates the dynamics without completely changing the internal organization.
• Linear probes can decode several functional categories directly from hidden representations with surprisingly high accuracy.
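To make the last point concrete, here is a minimal sketch of what a linear probe on hidden states can look like. It is not the framework's actual code: the data is synthetic (class-dependent means plus noise standing in for recorded hidden states), the probe is a ridge-regularised least-squares classifier chosen for brevity, and the names `make_synthetic_states`, `fit_linear_probe`, `probe_accuracy` and `held_out_accuracy` are hypothetical.

```python
import numpy as np

def make_synthetic_states(n=600, dim=64, n_classes=3, seed=0):
    """Stand-in for recorded hidden states: class-dependent means plus noise."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, n_classes, size=n)
    means = rng.normal(scale=2.0, size=(n_classes, dim))
    states = means[labels] + rng.normal(size=(n, dim))
    return states, labels

def fit_linear_probe(X, y, n_classes, ridge=1.0):
    """Ridge-regularised least-squares probe onto one-hot targets."""
    Xb = np.hstack([X, np.ones((len(X), 1))])   # append a bias column
    Y = np.eye(n_classes)[y]                    # one-hot encode the labels
    A = Xb.T @ Xb + ridge * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ Y)

def probe_accuracy(W, X, y):
    """Fraction of samples whose highest-scoring class matches the label."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return float(np.mean(np.argmax(Xb @ W, axis=1) == y))

def held_out_accuracy(X, y, n_classes, train_frac=0.7, seed=0):
    """Fit on a random train split, report accuracy on the held-out rest."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    cut = int(train_frac * len(X))
    train, test = idx[:cut], idx[cut:]
    W = fit_linear_probe(X[train], y[train], n_classes)
    return probe_accuracy(W, X[test], y[test])

states, labels = make_synthetic_states()
acc = held_out_accuracy(states, labels, n_classes=3)
print(f"held-out probe accuracy: {acc:.2f}")
```

With real data, `states` would be the hidden vectors from one layer and `labels` the functional category assigned to each token or step. Accuracy should always be reported on a held-out split, as here.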
At that point the obvious question became:
Are we just overfitting labels?
So I started adding progressively stronger negative controls.
First: random labels.
Then: random Gaussian representations.
Then: feature permutation.
Finally: orthogonal rotations.
The results became much more interesting.
Random labels collapse the decoding performance.
Random Gaussian representations also collapse it.
Feature permutation destroys most of the signal.
However…
Orthogonal rotations preserve almost all decoding performance.
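This suggests that the relevant information is not tied to individual neurons or to a particular choice of embedding axes.
Instead, it appears to be carried by the relative geometry of the representation, that is, by the distances and angles between hidden states, which an orthogonal rotation leaves unchanged.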
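The four controls can be sketched on synthetic data as follows. This is an illustration under stated assumptions, not the code behind the results above: the vectors are synthetic, the probe is a ridge least-squares classifier that is refit after every transformation, and "feature permutation" is taken to mean an independent shuffle of dimensions for each sample. That last assumption matters, because one fixed permutation shared by all samples is itself an orthogonal transform and would behave like the rotation. The function names are hypothetical.

```python
import numpy as np

def fit_probe(X, y, n_classes, ridge=1.0):
    """Ridge least-squares probe onto one-hot targets (with a bias column)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    Y = np.eye(n_classes)[y]
    return np.linalg.solve(Xb.T @ Xb + ridge * np.eye(Xb.shape[1]), Xb.T @ Y)

def held_out_accuracy(X, y, n_classes, train_frac=0.7):
    """Fit on the first part of the data, score on the remainder."""
    cut = int(train_frac * len(X))
    W = fit_probe(X[:cut], y[:cut], n_classes)
    X_test = np.hstack([X[cut:], np.ones((len(X) - cut, 1))])
    return float(np.mean(np.argmax(X_test @ W, axis=1) == y[cut:]))

def run_controls(n=600, dim=64, n_classes=3, seed=0):
    rng = np.random.default_rng(seed)
    y = rng.integers(0, n_classes, size=n)
    means = rng.normal(scale=2.0, size=(n_classes, dim))
    # Centre each class mean across dimensions so that the per-sample sum of
    # features, which survives any permutation, carries no label information.
    means -= means.mean(axis=1, keepdims=True)
    X = means[y] + rng.normal(size=(n, dim))

    shuffled_y = rng.permutation(y)                  # control 1: random labels
    gaussian_X = rng.normal(size=X.shape)            # control 2: pure noise features
    permuted_X = np.stack(                           # control 3: per-sample shuffle
        [row[rng.permutation(dim)] for row in X]
    )
    Q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))  # random orthogonal matrix
    rotated_X = X @ Q                                # control 4: orthogonal rotation

    return {
        "baseline": held_out_accuracy(X, y, n_classes),
        "random_labels": held_out_accuracy(X, shuffled_y, n_classes),
        "gaussian_features": held_out_accuracy(gaussian_X, y, n_classes),
        "permuted_features": held_out_accuracy(permuted_X, y, n_classes),
        "rotated": held_out_accuracy(rotated_X, y, n_classes),
    }

for name, acc in run_controls().items():
    print(f"{name:>18}: {acc:.2f}")
```

One caveat on the design: a ridge-regularised linear probe that is refit on rotated features gives the same predictions by construction, so in this setup the rotation control checks the pipeline more than the representation. Training the probe on unrotated states and evaluating it on rotated ones would be a different, stricter test.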
That was not the result I expected.
Another unexpected finding concerns depth.
Initially I was looking for something like “syntax layers” or “semantic layers”.
The data doesn’t really support such a simple picture.
Instead, the same functional signatures seem capable of appearing at different absolute layers depending on the architecture.
This led me to think less in terms of fixed layers and more in terms of functional regimes evolving through computation.
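One simple way to compare models of different depth is to map layer indices onto a relative depth in [0, 1] before comparing per-layer curves. The sketch below assumes that some per-layer score (for example, probe accuracy at each layer) is already available. The curves it uses are made-up Gaussian bumps, not measurements, and `to_relative_depth`, `peak_relative_depth` and `toy_curve` are hypothetical names.

```python
import numpy as np

def to_relative_depth(curve, n_points=21):
    """Resample a per-layer curve onto a shared grid of relative depth in [0, 1]."""
    curve = np.asarray(curve, dtype=float)
    layer_pos = np.linspace(0.0, 1.0, len(curve))   # layer index -> relative depth
    grid = np.linspace(0.0, 1.0, n_points)
    return grid, np.interp(grid, layer_pos, curve)

def peak_relative_depth(curve):
    """Relative depth (0 = first layer, 1 = last layer) at which the curve peaks."""
    curve = np.asarray(curve, dtype=float)
    return float(np.argmax(curve) / (len(curve) - 1))

def toy_curve(n_layers, centre=0.5, width=0.15):
    """Made-up per-layer score: a bump centred at a given relative depth."""
    depth = np.linspace(0.0, 1.0, n_layers)
    return np.exp(-0.5 * ((depth - centre) / width) ** 2)

shallow = toy_curve(13)   # hypothetical model with 13 recorded layers
deep = toy_curve(25)      # hypothetical model with 25 recorded layers

print("absolute peak layers:", int(np.argmax(shallow)), int(np.argmax(deep)))
print("relative peak depths:", peak_relative_depth(shallow), peak_relative_depth(deep))

# Once both curves live on the same grid they can be compared directly.
_, shallow_rs = to_relative_depth(shallow)
_, deep_rs = to_relative_depth(deep)
print("correlation on shared grid:", float(np.corrcoef(shallow_rs, deep_rs)[0, 1]))
```

In this toy case the two curves peak at different absolute layers but at the same relative depth. Whether real models line up this cleanly is exactly the open question.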
At this stage I am not claiming to have discovered a universal law of transformers.
These are empirical observations obtained on a limited set of open-weight models.
What I do believe is that they raise interesting questions about how computation is actually organized inside modern LLMs.
I’d really appreciate feedback from people working on:
• mechanistic interpretability
• representation learning
• probing methods
• transformer internals
• geometry of representations
In particular I’d like your opinion on three questions:
• Which control experiment would you absolutely require before taking these observations seriously?
• Have you seen previous work showing comparable evidence that functional information is primarily encoded in representation geometry rather than individual dimensions?
• If you were extending this project, what would be your next experiment?
I’m not affiliated with a research lab. This is an independent research project, and I’m sharing it because I would genuinely value critical feedback more than validation.
If there’s enough interest, I’m happy to share the methodology, code, and experimental reports.