I want to share a finding from a measurement framework I’ve been working
on, because the result is counterintuitive enough that I think it might
interest people thinking about architectural differences between
transformer LLMs.
The setup
I measured the runtime geometry of probability distributions across
eight open-source attention-based transformers ranging from 70M to 1.3B
parameters: Pythia-70M, DistilGPT-2, GPT-2, OPT-125M, Pythia-160M,
Qwen2.5-0.5B, TinyLlama-1.1B, and Phi-1.5.
For each (token, layer) point during inference, the framework computes
geometric properties of the probability distribution over the vocabulary:
entropy, concentration on the top candidates, competition between the
leading and runner-up tokens, and dispersion above a 1% threshold. From these
metrics, a bicephalic operator separates two distinct geometric tensions
that probability distributions can carry, which I label G (concentration
pole) and D (competition pole). The ratio between mean G and mean D, what
I call the GD_ratio, becomes a per-model signature.
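For readers who want something concrete, here is a minimal sketch of the kind of per-distribution metrics described above. Every definition in it is an assumption made for illustration: the function name `distribution_metrics`, the choice of `top_k=5`, the use of the top-1 minus top-2 margin as the competition measure, and the count of tokens above 1% as the dispersion measure are hypothetical stand-ins, not the framework's actual definitions. The G and D poles are deliberately not implemented, since the operator is not defined in this post.

```python
import math


def distribution_metrics(probs, top_k=5, threshold=0.01):
    """Descriptive metrics for one probability distribution over the vocabulary.

    All definitions are illustrative assumptions, not the framework's own.
    """
    if not probs:
        raise ValueError("probs must be non-empty")
    total = sum(probs)
    # Renormalise defensively and sort in descending order.
    p = sorted((x / total for x in probs), reverse=True)

    # Shannon entropy in nats; zero-probability entries contribute nothing.
    entropy = -sum(x * math.log(x) for x in p if x > 0)
    # Probability mass held by the top_k candidates (concentration).
    top_k_mass = sum(p[:top_k])
    # Gap between the leading and runner-up tokens; a small gap means
    # strong competition.
    runner_up = p[1] if len(p) > 1 else 0.0
    margin = p[0] - runner_up
    # Number of tokens whose probability exceeds the threshold (dispersion).
    n_above = sum(1 for x in p if x > threshold)

    return {
        "entropy": entropy,
        "top_k_mass": top_k_mass,
        "margin": margin,
        "n_above_threshold": n_above,
    }


if __name__ == "__main__":
    # Toy distributions over a four-token vocabulary (made-up numbers).
    print(distribution_metrics([0.25, 0.25, 0.25, 0.25]))
    print(distribution_metrics([0.97, 0.02, 0.005, 0.005], top_k=2))
```

In a real run this would be called once per (token, layer) point on the full vocabulary distribution; how that distribution is obtained at intermediate layers is part of the methodology and is not sketched here.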
What I found
The eight models do not vary continuously on the GD_ratio. They partition
into two clusters with no overlap. The lowest value in the upper cluster
(1.577) is roughly 20 times the highest value in the lower cluster (0.079):
Model            GD_ratio
GPT-2            2.458
Phi-1.5          1.764
DistilGPT-2      1.577
Qwen-0.5B        0.079
OPT-125M         0.074
Pythia-70M       0.059
Pythia-160M      0.039
TinyLlama-1.1B   0.021
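The size of the gap can be checked directly from the eight values listed above. The sketch below sorts the models by GD_ratio and finds the widest multiplicative gap between neighbours; the helper name `largest_gap` is hypothetical, and this is only a sanity check on the reported numbers, not a clustering method from the paper.

```python
# GD_ratio values as reported in this post.
gd_ratio = {
    "GPT-2": 2.458,
    "Phi-1.5": 1.764,
    "DistilGPT-2": 1.577,
    "Qwen-0.5B": 0.079,
    "OPT-125M": 0.074,
    "Pythia-70M": 0.059,
    "Pythia-160M": 0.039,
    "TinyLlama-1.1B": 0.021,
}


def largest_gap(values):
    """Return (ratio, upper_name, lower_name) for the widest multiplicative
    gap between neighbouring entries once sorted in descending order."""
    ranked = sorted(values.items(), key=lambda kv: kv[1], reverse=True)
    best = None
    for (hi_name, hi), (lo_name, lo) in zip(ranked, ranked[1:]):
        ratio = hi / lo
        if best is None or ratio > best[0]:
            best = (ratio, hi_name, lo_name)
    return best


ratio, upper_edge, lower_edge = largest_gap(gd_ratio)
print(f"{upper_edge} / {lower_edge} = {ratio:.1f}x")
# prints: DistilGPT-2 / Qwen-0.5B = 20.0x
```

Every other neighbouring pair in the sorted list differs by less than a factor of 2, which is what makes the split between DistilGPT-2 and Qwen-0.5B stand out.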
The cluster split appears on three quantities reported by the operator:
the GD_ratio itself, the mean G alone, and the mean D alone. The
separation is not an artifact of one metric.
The interesting part is what does not explain the clustering. Parameter
count does not. GPT-2 has 124M parameters and is in the upper cluster.
OPT-125M has 125M parameters and is in the lower cluster. Phi-1.5 has
1.3B parameters and sits with GPT-2. TinyLlama-1.1B has roughly the same
size as Phi-1.5 and sits with OPT.
What might explain it (hypothesis only)
The most parsimonious pattern I can see is that the upper cluster shares
characteristics of training corpus curation. Phi-1.5 was trained on
heavily curated synthetic data. GPT-2 and DistilGPT-2 share the original
GPT-2 WebText distribution and tokenizer, which had its own filtering
protocol. The lower cluster spans more heterogeneous training corpora,
including older (OPT, Pythia) and newer (Qwen, TinyLlama) architectures trained on relatively unfiltered web text.
I want to be careful here: this is a hypothesis, not a finding. I do not
have an experimental setup that isolates training corpus from architecture
choices. The hypothesis is consistent with the data but cannot be
established by it.
Why this might matter
If the two-cluster structure generalizes, any tool or analysis that implicitly assumes a single dynamic profile across transformer models
will produce inconsistent results depending on which cluster the target
model falls into. This includes calibration techniques, uncertainty
estimation methods, and probably some interpretability approaches that
were tuned on models from one cluster and may not transfer cleanly to
the other.
Other observations in the same study
A few things worth noting briefly:
The framework also defines a five-state taxonomy of dynamic regimes
(stable, hidden turbulence, surface branching, committed, full
bifurcation). The full bifurcation state turns out to be consistently
transient across architectures: on three primary models tested in depth,
its self-transition probability is 0.023 (GPT-2) or exactly 0.000
(OPT-125M, Qwen-0.5B). Models pass through this regime; they do not
settle into it.
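For clarity on what a self-transition probability means here, a minimal sketch follows, assuming the regime labels form a sequence over consecutive steps and that the probability is estimated as the fraction of times the state is followed by itself. The function name and the toy sequence are hypothetical; the paper's exact estimator may differ.

```python
def self_transition_probability(states, target):
    """Estimate P(next == target | current == target) from one sequence.

    Returns None when the target state never occurs with a successor.
    """
    pairs = list(zip(states, states[1:]))
    # Transitions that start in the target state.
    leaving = sum(1 for cur, _ in pairs if cur == target)
    if leaving == 0:
        return None
    # Of those, transitions that also end in the target state.
    staying = sum(1 for cur, nxt in pairs if cur == target and nxt == target)
    return staying / leaving


if __name__ == "__main__":
    # Made-up regime sequence, for illustration only.
    toy = ["stable", "bifurcation", "stable",
           "bifurcation", "bifurcation", "committed"]
    print(self_transition_probability(toy, "bifurcation"))
```

A value near zero, as reported above, means that a step in the full bifurcation state is almost never followed by another step in the same state.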
Three models tested under controlled hidden-state perturbation respond
in qualitatively different ways. GPT-2 absorbs the perturbation with
state percentages shifting by less than 1.5 points. OPT-125M converts
the perturbation into surface dispersion (branching state rises by 12.5
points). Qwen-0.5B destabilizes its dominant state (stable state drops
by 18.8 points). Three architectural perturbation signatures, same input
noise.
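The shifts above are in percentage points of state occupancy. A minimal sketch of that arithmetic, assuming each run yields one regime label per measured point; the function names and toy data are hypothetical, and the perturbation itself is not modelled here.

```python
from collections import Counter


def state_shares(states):
    """Percentage of points assigned to each state."""
    if not states:
        raise ValueError("states must be non-empty")
    counts = Counter(states)
    return {s: 100.0 * c / len(states) for s, c in counts.items()}


def point_shifts(baseline, perturbed):
    """Change in each state's share, in percentage points
    (perturbed minus baseline)."""
    base = state_shares(baseline)
    pert = state_shares(perturbed)
    return {s: pert.get(s, 0.0) - base.get(s, 0.0)
            for s in sorted(set(base) | set(pert))}


if __name__ == "__main__":
    # Made-up label sequences, for illustration only.
    baseline = ["stable"] * 8 + ["branching"] * 2
    perturbed = ["stable"] * 6 + ["branching"] * 4
    print(point_shifts(baseline, perturbed))
```

With these toy sequences the stable share falls from 80% to 60%, a drop of 20 points, which is a different statement from a 25% relative decrease.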
One model (Phi-1.5) produces an anomalous taxonomy distribution under
the standard threshold rule. I report it openly as needing dedicated
investigation rather than smoothing it over.
What I’m not claiming
The panel is eight models, none larger than 1.3B parameters. The two-cluster
structure could collapse, stretch, or restructure when extended to 7B+
models. I have not validated on non-transformer architectures within
this study. The work is single-author and has not been independently
replicated. The training-corpus hypothesis is offered, not established.
I included explicit “limited findings” and “rejected claims” sections in
the paper, listing five things in each category that initial intuitions
suggested but that the data either partially support or actively reject.
I treat this as central to the framework’s credibility.
Why I’m posting
I would be interested in hearing whether anyone working with larger or
more architecturally diverse models has observed similar partitioning
phenomena in their own measurements, whether on attention, hidden states,
gradients, or any other intermediate quantity. The two-cluster structure
felt unexpected enough that I want to understand whether it is a
transformer-wide phenomenon, an artifact of the parameter range I tested,
or something specific to the particular operator I defined.
I would also be interested in alternative interpretations of the cluster
split beyond the training-corpus hypothesis. Possible candidates I have
considered but cannot test from this panel alone: pre-norm vs post-norm
architecture, tokenizer differences, attention head configurations,
intermediate layer dimensionality, positional encoding choices.
Where the details are
Full methodology, all tables, the explicit limitations section, and the
list of rejected claims are in the preprint on Zenodo:
Happy to discuss the operator definition, the threshold methodology, the
cluster finding, or any concern about the panel size and statistical
robustness.