Dispersion loss counteracts embedding condensation in small language models


Source: chenliu-1996.github.io · Published: 2026-07-03T22:35Z · Topic: large-language-models

Researchers found that smaller language models suffer from 'embedding condensation,' where token embeddings collapse into a narrow cone, reducing expressivity. They introduced a dispersion loss to counteract this effect, improving small model performance without increasing parameters.

What makes LLMs better than small LMs? Data? Parameters? Geometry might play a role!

Every Transformer layer of a language model represents each input token as a vector in a high-dimensional embedding space. We notice that as those vectors progress through the Transformer layers, they often behave as if they were confined to a narrow cone: they point in increasingly similar directions, as measured by pairwise cosine similarity. We call this geometric phenomenon embedding condensation. This phenomenon is:

More severe in smaller models than in larger counterparts (Figure 2).

Reproducible under confounder-controlled settings (Figure 3).

Present at model initialization and alleviated by pre-training (Figure 4).

Not resolved by knowledge distillation from a larger model (Figure 5).
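
To make the measurement concrete, here is a minimal sketch of mean pairwise cosine similarity, the quantity used above to describe condensation. It is illustrative only and not the authors' code: the function names and the two toy embedding sets are assumptions, and a real analysis would take the hidden states of each Transformer layer as input.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(embeddings):
    """Average cosine similarity over all distinct pairs of token embeddings."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)


# Hypothetical toy data: three 2-D "token embeddings" taken at one layer.
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]      # directions fan out
condensed = [[1.0, 0.1], [1.0, 0.0], [1.0, -0.1]]   # directions nearly parallel

spread_score = mean_pairwise_cosine(spread)
condensed_score = mean_pairwise_cosine(condensed)
print(f"spread: {spread_score:.3f}  condensed: {condensed_score:.3f}")
```

A score near 1 means the vectors are nearly parallel, which is the condensed regime described above. Tracking this score layer by layer would show whether the embeddings narrow into a cone as depth increases.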

This paper presents an observation-driven improvement to language model training.

We observe a geometric phenomenon which we term embedding condensation, where token embeddings collapse into a narrow cone-like subspace in smaller language models. We then design a training objective called dispersion loss to counteract the effect.

Feature 1: Larger model, less condensation.

Within the same model family, smaller models exhibit more severe embedding condensation, with token embeddings collapsing toward near-parallel directions, while larger models resist this collapse.

This effect is also quite robust to the choice of input datasets.

Feature 2: Reproducible when controlling for confounders.

To isolate the effect of model size from other confounding factors, we conduct a controlled experiment where we pre-train GPT2-like models, varying only the MLP dimension while keeping all other components fixed, including the number of layers, embedding dimension, dataset, and training settings. The same phenomenon is observed.
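
As a rough illustration of this kind of sweep, the sketch below builds GPT2-like configurations that differ only in MLP width and counts their parameters. The defaults follow the publicly documented GPT-2 small layout with tied input and output embeddings; the class name, the helper, and the specific MLP widths are assumptions made for illustration, since the source does not list the values used.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class GPT2LikeConfig:
    # Defaults mirror GPT-2 small; only mlp_dim is varied in the sweep below.
    n_layer: int = 12
    n_embd: int = 768
    mlp_dim: int = 3072
    vocab_size: int = 50257
    n_positions: int = 1024


def count_parameters(cfg):
    """Parameter count of a GPT-2-style decoder with tied output embeddings."""
    d, f = cfg.n_embd, cfg.mlp_dim
    embeddings = cfg.vocab_size * d + cfg.n_positions * d
    attention = (d * 3 * d + 3 * d) + (d * d + d)   # fused QKV + output projection
    mlp = (d * f + f) + (f * d + d)                 # up-projection + down-projection
    layer_norms = 2 * (2 * d)                       # two LayerNorms per block
    final_norm = 2 * d
    return embeddings + cfg.n_layer * (attention + mlp + layer_norms) + final_norm


base = GPT2LikeConfig()
# Hypothetical MLP widths; depth, embedding size, and vocabulary stay fixed.
sweep = [replace(base, mlp_dim=m) for m in (768, 1536, 3072)]
sizes = {cfg.mlp_dim: count_parameters(cfg) for cfg in sweep}

for mlp_dim, n_params in sizes.items():
    print(f"mlp_dim={mlp_dim:5d}  parameters={n_params:,}")
```

Because only `mlp_dim` changes, any difference in condensation across the sweep cannot be explained by depth, embedding width, data, or training settings.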

Feature 3: Condensation occurs early on.

The embedding condensation phenomenon emerges at model initialization and is gradually mitigated, not exacerbated, by pre-training.

Feature 4: Distillation is not a solution.

Knowledge distillation from a larger model does not transfer the desired resistance to embedding condensation.

Dispersion loss

Embedding condensation reduces the expressivity of Transformers by collapsing token embedding vectors into narrow cones, under-utilizing the representation space. We hypothesize that by dispersing embeddings during training, smaller models can achieve representational qualities more similar to larger models, thus narrowing the performance gap without increasing the number of parameters.

Our dispersion loss is inspired by the "Diffuse and Disperse" paper with practical modifications.
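
For intuition, here is a minimal sketch of one InfoNCE-style dispersive objective of the kind that line of work describes: the log of the mean of exp(-D / temperature) over pairs of embeddings. This is an assumption-laden illustration, not the authors' loss: the source does not spell out its practical modifications, and the choice of cosine dissimilarity for D, the temperature value, and all function names are hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def dispersion_loss(embeddings, temperature=0.5):
    """log(mean(exp(-D / temperature))) over distinct pairs.

    D = 1 - cosine similarity (an assumed choice of dissimilarity).
    The loss is at most 0, reached when all vectors are parallel,
    and decreases as directions spread apart.
    """
    terms = [
        math.exp(-(1.0 - cosine_similarity(u, v)) / temperature)
        for u, v in combinations(embeddings, 2)
    ]
    return math.log(sum(terms) / len(terms))


# Hypothetical toy data: three 2-D "token embeddings".
condensed = [[1.0, 0.1], [1.0, 0.0], [1.0, -0.1]]   # nearly parallel
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]      # fanned out

loss_condensed = dispersion_loss(condensed)
loss_spread = dispersion_loss(spread)
print(f"condensed: {loss_condensed:.4f}  spread: {loss_spread:.4f}")
```

The condensed set yields a loss close to 0, while the spread set yields a clearly negative one, so minimizing this term rewards embeddings that point in different directions. In training, a term like this would presumably be added to the language-modeling loss with a weighting coefficient; the source does not give that detail.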

Dispersion loss counteracts the embedding condensation effect during mid-training and pre-training. The original article shows a qualitative result, and more quantitative results can be found in the paper.

Conclusion

Larger language models are better than smaller language models, but perhaps not merely because they have more parameters. The gap can be partially attributed to how they organize information in their latent representations. We hope to see future efforts along this interesting direction.

If you are thinking about reproducing this work or borrowing pieces of it, here are my two cents.

I personally highlight a few directions that seem potentially meaningful.

@inproceedings{liu2026dispersion,
  title={Dispersion loss counteracts embedding condensation and improves generalization in small language models},
  author={Liu, Chen and Sun, Xingzhi and Xiao, Xi and Van Tassel, Alexandre and Xu, Ke and Reimann, Kristof and Liao, Danqi and Gerstein, Mark and Wang, Tianyang and Wang, Xiao and Krishnaswamy, Smita},
  booktitle={International Conference on Machine Learning},
  year={2026},
  organization={PMLR}
}


Original article: chenliu-1996.github.io/projects/LM-Dispersion/


