[ARTICLE · art-28155] src=discuss.huggingface.co ↗ pub= topic=large-language-models verified=true sentiment=· neutral

Cross-architectural runtime probability dynamics in transformer LLMs — two clusters not explained by parameter count

A new measurement framework reveals that eight open-source transformer LLMs partition into two distinct clusters based on runtime probability distribution dynamics, with an order-of-magnitude gap in GD_ratio between clusters. The clustering is not explained by parameter count but may correlate with training corpus curation, as models trained on curated data (GPT-2, Phi-1.5, DistilGPT-2) form one cluster while those trained on heterogeneous web text (OPT, Pythia, Qwen, TinyLlama) form another. This finding suggests that tools assuming a single dynamic profile across transformers may produce inconsistent results depending on the model cluster.

4 min read · published Jun 15, 2026

I want to share a finding from a measurement framework I’ve been working

on, because the result is counterintuitive enough that I think it might

interest people thinking about architectural differences between

transformer LLMs.

The setup

I measured the runtime geometry of probability distributions across

eight open-source attention-based transformers ranging from 70M to 1.3B

parameters: Pythia-70M, DistilGPT-2, GPT-2, OPT-125M, Pythia-160M,

Qwen2.5-0.5B, TinyLlama-1.1B, and Phi-1.5.

For each (token, layer) point during inference, the framework computes

geometric properties of the probability distribution over the vocabulary:

entropy, concentration on the top candidates, competition between the

leading and runner-up tokens, and dispersion above a 1% threshold. From these

metrics, a bicephalic operator separates two distinct geometric tensions

that probability distributions can carry, which I label G (concentration

pole) and D (competition pole). The ratio between mean G and mean D, what

I call the GD_ratio, becomes a per-model signature.
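As a rough illustration of the per-point metrics listed above, here is a minimal sketch in plain Python. The exact definitions are assumptions made for this example, not the framework's: concentration is taken as the mass on the top 5 tokens, competition as the ratio of the runner-up probability to the leading one, and dispersion as the number of tokens above the 1% threshold. The G and D operator itself is not reproduced here, since its definition is given only in the preprint.

```python
import math

def distribution_metrics(probs, top_k=5, threshold=0.01):
    """Summarize one probability distribution over the vocabulary.

    The metric definitions and the default top_k are illustrative
    assumptions, not the definitions used by the framework.
    """
    total = sum(probs)
    if total <= 0.0:
        raise ValueError("probabilities must have positive total mass")
    p = [x / total for x in probs]           # renormalize defensively
    ranked = sorted(p, reverse=True)

    # Shannon entropy in nats; zero-probability tokens contribute nothing.
    entropy = -sum(x * math.log(x) for x in p if x > 0.0)
    # Probability mass held by the top_k candidates.
    concentration = sum(ranked[:top_k])
    # Runner-up relative to the leader: near 0 is decisive, 1.0 is a tie.
    runner_up = ranked[1] if len(ranked) > 1 else 0.0
    competition = runner_up / ranked[0]
    # Number of candidates strictly above the threshold (1% by default).
    dispersion = sum(1 for x in p if x > threshold)

    return {
        "entropy": entropy,
        "concentration": concentration,
        "competition": competition,
        "dispersion": dispersion,
    }

if __name__ == "__main__":
    peaked = [0.9, 0.095, 0.005]
    tied = [0.45, 0.45, 0.05, 0.05]
    print(distribution_metrics(peaked))
    print(distribution_metrics(tied))
```

In a real run, `probs` would be the softmax output obtained at one (token, layer) point; how intermediate layers are projected onto the vocabulary is not specified in this post.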

What I found

The eight models do not vary continuously on the GD_ratio. They partition

into two clusters with no overlap. The smallest upper-cluster value (1.577)

is roughly 20 times the largest lower-cluster value (0.079):

GPT-2 GD_ratio 2.458

Phi-1.5 GD_ratio 1.764

DistilGPT-2 GD_ratio 1.577

Qwen-0.5B GD_ratio 0.079

OPT-125M GD_ratio 0.074

Pythia-70M GD_ratio 0.059

Pythia-160M GD_ratio 0.039

TinyLlama-1.1B GD_ratio 0.021
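One simple way to check the split from the values above is to sort the models by GD_ratio and look for the largest multiplicative gap between neighbours. The sketch below does that with the eight reported values; the function name and the largest-gap rule are choices made for this illustration, not the clustering procedure used in the study.

```python
import math

# GD_ratio values as reported in the list above.
GD_RATIO = {
    "GPT-2": 2.458,
    "Phi-1.5": 1.764,
    "DistilGPT-2": 1.577,
    "Qwen-0.5B": 0.079,
    "OPT-125M": 0.074,
    "Pythia-70M": 0.059,
    "Pythia-160M": 0.039,
    "TinyLlama-1.1B": 0.021,
}

def split_at_largest_gap(values):
    """Split models into two groups at the largest gap in log10 space."""
    ranked = sorted(values.items(), key=lambda kv: kv[1], reverse=True)
    best_index, best_gap = 0, -1.0
    for i in range(len(ranked) - 1):
        gap = math.log10(ranked[i][1] / ranked[i + 1][1])
        if gap > best_gap:
            best_index, best_gap = i, gap
    upper = [name for name, _ in ranked[: best_index + 1]]
    lower = [name for name, _ in ranked[best_index + 1 :]]
    # Ratio between the two values on either side of the split.
    fold = ranked[best_index][1] / ranked[best_index + 1][1]
    return upper, lower, fold

if __name__ == "__main__":
    upper, lower, fold = split_at_largest_gap(GD_RATIO)
    print("upper cluster:", upper)
    print("lower cluster:", lower)
    print(f"gap at the split: {fold:.1f}x")
```

For comparison, the spread inside each cluster is much smaller than the gap between them: about 1.6x within the upper cluster (2.458 / 1.577) and about 3.8x within the lower one (0.079 / 0.021).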

The cluster split appears on three components of the operator:

the GD_ratio itself, the mean G alone, and the mean D alone. The

separation is not an artifact of one metric.

The interesting part is what does not explain the clustering. Parameter

count does not. GPT-2 has 124M parameters and is in the upper cluster.

OPT-125M has 125M parameters and is in the lower cluster. Phi-1.5 has

1.3B parameters and sits with GPT-2. TinyLlama-1.1B has roughly the same

size as Phi-1.5 and sits with OPT.
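The size comparison above can also be expressed as a rank correlation between parameter count and GD_ratio. The sketch below computes Spearman's rho in plain Python. The parameter counts are an assumption of this example: they come from the post where stated (GPT-2, OPT-125M, Phi-1.5) and otherwise from the nominal sizes in the model names or model cards. With only eight models this is an illustration, not a statistical test.

```python
# (parameters in millions, GD_ratio) per model. Sizes not stated in the
# post are nominal values and are an assumption of this sketch.
PANEL = {
    "Pythia-70M": (70, 0.059),
    "DistilGPT-2": (82, 1.577),
    "GPT-2": (124, 2.458),
    "OPT-125M": (125, 0.074),
    "Pythia-160M": (160, 0.039),
    "Qwen-0.5B": (500, 0.079),
    "TinyLlama-1.1B": (1100, 0.021),
    "Phi-1.5": (1300, 1.764),
}

def ranks(values):
    """Return 1-based ascending ranks; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman_rho(xs, ys):
    """Spearman rank correlation using the no-ties formula."""
    n = len(xs)
    rank_x, rank_y = ranks(xs), ranks(ys)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))

if __name__ == "__main__":
    sizes = [size for size, _ in PANEL.values()]
    ratios = [ratio for _, ratio in PANEL.values()]
    print(f"Spearman rho (size vs GD_ratio): {spearman_rho(sizes, ratios):.3f}")
```

A rho close to zero is what one would expect if model size carries little information about which cluster a model falls into, which matches the pairwise comparisons above.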

What might explain it (hypothesis only)

The most parsimonious pattern I can see is that the upper cluster shares

characteristics of training corpus curation. Phi-1.5 was trained on

heavily curated synthetic data. GPT-2 and DistilGPT-2 share the original

GPT-2 WebText distribution and tokenizer, which had its own filtering

protocol. The lower cluster spans more heterogeneous training corpora,

including older (OPT, Pythia) and newer (Qwen, TinyLlama) architectures trained on relatively unfiltered web text.

I want to be careful here: this is a hypothesis, not a finding. I do not

have an experimental setup that isolates training corpus from architecture

choices. The hypothesis is consistent with the data but cannot be

established by it.

Why this might matter

If the two-cluster structure generalizes, any tool or analysis that implicitly assumes a single dynamic profile across transformer models

will produce inconsistent results depending on which cluster the target

model falls into. This includes calibration techniques, uncertainty

estimation methods, and probably some interpretability approaches that

were tuned on one architectural family and may not transfer cleanly to

the other.

Other observations in the same study

A few things worth noting briefly:

The framework also defines a five-state taxonomy of dynamic regimes

(stable, hidden turbulence, surface branching, committed, full

bifurcation). The full bifurcation state turns out to be consistently

transient across architectures: on three primary models tested in depth,

its self-transition probability is 0.023 (GPT-2) or exactly 0.000

(OPT-125M, Qwen-0.5B). Models pass through this regime; they do not

settle into it.

Three models tested under controlled hidden-state perturbation respond

in qualitatively different ways. GPT-2 absorbs the perturbation with

state percentages shifting by less than 1.5 points. OPT-125M converts

the perturbation into surface dispersion (branching state rises by 12.5

points). Qwen-0.5B destabilizes its dominant state (stable state drops by

18.8 points). Three architectural perturbation signatures, same input

noise.

One model (Phi-1.5) produces an anomalous taxonomy distribution under

the standard threshold rule. I report it openly as needing dedicated

investigation rather than smoothing it over.
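For readers unfamiliar with the two quantities used in these observations, the sketch below shows how a self-transition probability and a shift in state percentages could be computed from sequences of regime labels. It assumes each measured point has already been assigned one of the five states and that the points are ordered; the rule that assigns states, and the axis along which transitions are counted, are defined in the preprint and not reproduced here. The label sequences are made-up toy data.

```python
from collections import Counter

def self_transition_probability(sequence, state):
    """Estimate P(next == state | current == state) from consecutive pairs."""
    visits = 0
    stays = 0
    for current, following in zip(sequence, sequence[1:]):
        if current == state:
            visits += 1
            if following == state:
                stays += 1
    # Undefined if the state is never followed by another observation.
    return stays / visits if visits else float("nan")

def state_share_shift(baseline, perturbed):
    """Change in each state's share, in percentage points."""
    base_counts = Counter(baseline)
    pert_counts = Counter(perturbed)
    shifts = {}
    for state in set(base_counts) | set(pert_counts):
        base_pct = 100.0 * base_counts[state] / len(baseline)
        pert_pct = 100.0 * pert_counts[state] / len(perturbed)
        shifts[state] = pert_pct - base_pct
    return shifts

if __name__ == "__main__":
    # Hypothetical label sequences, for illustration only.
    baseline = ["stable", "stable", "full bifurcation", "committed",
                "stable", "surface branching", "stable", "stable"]
    perturbed = ["stable", "surface branching", "full bifurcation", "committed",
                 "surface branching", "surface branching", "stable", "stable"]
    print(self_transition_probability(baseline, "full bifurcation"))
    print(state_share_shift(baseline, perturbed))
```

A self-transition probability near zero means the state is almost always followed by a different one, which is the sense in which full bifurcation is described as transient.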

What I’m not claiming

The panel is eight models, none larger than 1.3B parameters. The two-cluster

structure could collapse, stretch, or restructure when extended to 7B+

models. I have not validated on non-transformer architectures within

this study. The work is single-author and has not been independently

replicated. The training-corpus hypothesis is offered, not established.

I included explicit “limited findings” and “rejected claims” sections in

the paper, listing five things in each category that initial intuitions

suggested but that the data either partially support or actively reject.

I treat this as central to the framework’s credibility.

Why I’m posting

I would be interested in hearing whether anyone working with larger or

more architecturally diverse models has observed similar partitioning

phenomena in their own measurements, whether on attention, hidden states,

gradients, or any other intermediate quantity. The two-cluster structure

felt unexpected enough that I want to understand whether it is a

transformer-wide phenomenon, an artifact of the parameter range I tested,

or something specific to the particular operator I defined.

I would also be interested in alternative interpretations of the cluster

split beyond the training-corpus hypothesis. Possible candidates I have

considered but cannot test from this panel alone: pre-norm vs post-norm

architecture, tokenizer differences, attention head configurations,

intermediate layer dimensionality, positional encoding choices.

Where the details are

Full methodology, all tables, the explicit limitations section, and the

list of rejected claims are in the preprint on Zenodo:

Happy to discuss the operator definition, the threshold methodology, the

cluster finding, or any concern about the panel size and statistical

robustness.
