A comparative study of transformer-based embeddings for topic coherence




A new study systematically examined the effect of transformer-based language model size on topic quality, finding that model size had a negligible impact on coherence and divergence metrics. Researchers tested seven models ranging from 22 million to 13 billion parameters in a BERTopic pipeline across multiple corpora. The findings suggest that smaller models can achieve comparable topic quality to larger models, challenging assumptions about the necessity of massive language models for topic modeling tasks.

Published May 29, 2026 · 1 min read

arXiv:2605.28832v1. Abstract: Topic modeling is a branch of Natural Language Processing (NLP) that aims to organize large collections of texts into coherent groups according to word co-occurrence patterns, with Latent Dirichlet Allocation (LDA) remaining one of the most widely used and interpretable probabilistic approaches. Recent advances in NLP, particularly transformer-based language models, offer improved document representations. It is also known that model size (in number of parameters) has a significant impact on the performance of language models on various pre-defined tasks. In this study, we systematically examine the effect of model size on topic quality by analyzing the performance of seven transformer-based language models (from small models such as MiniLM to large ones such as LLaMA-2) in a BERTopic pipeline on a variety of corpora. Topic quality is evaluated using coherence and divergence metrics following Röder et al. (2015). Our results indicate that model size, ranging from 22 million to 13 billion parameters, has a negligible impact on topic quality, suggesting that smaller models can achieve performance comparable to that of larger models.
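To make the idea of a coherence metric concrete, here is a minimal sketch of one commonly used measure, normalized pointwise mutual information (NPMI), computed from document-level word co-occurrence. This is an illustration under stated assumptions, not the study's evaluation code: the abstract does not say which specific coherence or divergence measures were used, and the function name `npmi_coherence`, the toy corpus, and the whitespace tokenization are all hypothetical choices made for this example.

```python
import math
from itertools import combinations


def npmi_coherence(topic_words, documents):
    """Average NPMI over all word pairs of a topic.

    Probabilities are estimated from document-level co-occurrence
    (a simplifying assumption; sliding windows are another common choice).
    """
    # Naive whitespace tokenization, chosen only to keep the sketch short.
    doc_sets = [set(doc.lower().split()) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain every word in `words`.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0.0:
            # The pair never co-occurs: NPMI takes its minimum value.
            scores.append(-1.0)
        elif p12 == 1.0:
            # Both words appear in every document, so NPMI is undefined (0/0);
            # treating it as 0 is an assumption of this sketch.
            scores.append(0.0)
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)


# Toy corpus (hypothetical data, for illustration only).
docs = [
    "the cat chased the mouse",
    "a cat and a dog played",
    "the dog barked at the cat",
    "stocks fell as the market dropped",
    "the market rallied and stocks rose",
    "investors watched the market",
]

coherent_score = npmi_coherence(["cat", "dog"], docs)
mixed_score = npmi_coherence(["cat", "market"], docs)
print(coherent_score, mixed_score)
```

Words that tend to appear in the same documents score closer to 1, while words that never co-occur score -1. In a comparison like the one described in the abstract, a score of this kind would be computed for the topics produced with each embedding model and then compared across model sizes.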

source & further reading


Read original on arxiv.org → arxiv.org/abs/2605.28832



