{"slug": "dispersion-loss-counteracts-embedding-condensation-in-small-language-models", "title": "Dispersion loss counteracts embedding condensation in small language models", "summary": "Researchers found that smaller language models suffer from 'embedding condensation,' where token embeddings collapse into a narrow cone, reducing expressivity. They introduced a dispersion loss to counteract this effect, improving small model performance without increasing parameters.", "body_md": "What makes LLMs better than small LMs? Data? Parameters? Geometry might play a role!\n\nEvery Transformer layer of a language model represents each input token as a vector in a\nhigh-dimensional embedding space. We notice that as those vectors progress through\nTransformer layers, they often behave as if they were confined to a narrow cone: they point\nto increasingly similar directions as measured by pairwise cosine similarity. We call this\ngeometric phenomenon **embedding condensation**. 
This phenomenon is:\n\n- More severe in smaller models than in larger counterparts (Figure 2).\n- Reproducible under confounder-controlled settings (Figure 3).\n- Present at model initialization and alleviated by pre-training (Figure 4).\n- Not resolved by knowledge distillation from a larger model (Figure 5).\n\n**This paper presents an observation-driven improvement to language model\ntraining.**\n\nWe observe a geometric phenomenon which we term **embedding condensation**,\nwhere token embeddings collapse into a narrow cone-like subspace in smaller language models.\nWe then design a training objective called dispersion loss to counteract the effect.\n\n**Feature 1: Larger model, less condensation.**\n\nWithin the\nsame model family, smaller models exhibit more severe embedding condensation, with token\nembeddings collapsing toward near-parallel directions, while larger models resist this\ncollapse.\n\nThis effect is also robust to the choice of input dataset.\n\n**Feature 2: Reproducible when controlling for\nconfounders.**\n\nTo isolate the effect of model size from other confounding\nfactors, we conduct a controlled experiment where we pre-train GPT2-like models, varying\nonly the MLP dimension while keeping all other components fixed, including the number of\nlayers, embedding dimension, dataset, and training settings. The same phenomenon is\nobserved.\n\n**Feature 3: Condensation occurs early on.**\n\nThe embedding\ncondensation phenomenon emerges at model initialization and is gradually mitigated, not\nexacerbated, by pre-training.\n\n**Feature 4: Distillation is not a solution.**\n\nKnowledge\ndistillation from a larger model does not transfer the desired resistance to embedding\ncondensation.\n\n**Dispersion loss**\n\nEmbedding condensation reduces the\nexpressivity of Transformers by collapsing token embedding vectors into narrow cones,\nunder-utilizing the representation space. 
We hypothesize that by dispersing embeddings\nduring training, smaller models can achieve representational qualities more similar to\nlarger models, thus narrowing the performance gap without increasing the number of\nparameters.\n\nOur dispersion loss is inspired by the \"[Diffuse and Disperse](https://arxiv.org/abs/2506.09027)\" paper, with practical modifications.\n\nDispersion loss counteracts the embedding condensation effect during mid-training and pre-training. A qualitative result is shown below, while more quantitative results can be found in the paper.\n\n**Conclusion**\n\nLarger language models are better than\nsmaller language models, but perhaps not merely because they have more parameters. The\ngap can be partially attributed to how they organize information in their latent\nrepresentations. We hope to see future efforts along this interesting direction.\n\nIf you are thinking about reproducing this work or borrowing pieces of it, here are my two cents.\n\nI personally highlight a few directions that seem potentially meaningful.\n\n```\n@inproceedings{liu2026dispersion,\n  title={Dispersion loss counteracts embedding condensation and improves generalization in small language models},\n  author={Liu, Chen and Sun, Xingzhi and Xiao, Xi and Van Tassel, Alexandre and Xu, Ke and Reimann, Kristof and Liao, Danqi and Gerstein, Mark and Wang, Tianyang and Wang, Xiao and Krishnaswamy, Smita},\n  booktitle={International Conference on Machine Learning},\n  year={2026},\n  organization={PMLR}\n}\n```\n\n", "url": "https://wpnews.pro/news/dispersion-loss-counteracts-embedding-condensation-in-small-language-models", "canonical_source": "https://chenliu-1996.github.io/projects/LM-Dispersion/", "published_at": "2026-07-03 22:35:47+00:00", "updated_at": "2026-07-03 22:49:42.356700+00:00", "lang": "en", "topics": ["large-language-models", "machine-learning", "natural-language-processing"], "entities": ["GPT-2", "International Conference on Machine Learning", "Chen Liu", 
"Xingzhi Sun", "Xi Xiao", "Alexandre Van Tassel", "Ke Xu", "Kristof Reimann"], "alternates": {"html": "https://wpnews.pro/news/dispersion-loss-counteracts-embedding-condensation-in-small-language-models", "markdown": "https://wpnews.pro/news/dispersion-loss-counteracts-embedding-condensation-in-small-language-models.md", "text": "https://wpnews.pro/news/dispersion-loss-counteracts-embedding-condensation-in-small-language-models.txt", "jsonld": "https://wpnews.pro/news/dispersion-loss-counteracts-embedding-condensation-in-small-language-models.jsonld"}}