ERRORQUAKE: Heavy-Tailed Error Severity Distributions in Open-Weight Large Language Models


[ARTICLE · art-22199] src=arxiv.org pub=2026-06-05T04:00Z topic=large-language-models verified=true sentiment=· neutral

A new study introduces Errorquake-10k, a 10,000-query benchmark that scores LLM responses on a continuous 0-4 severity scale, and reports that open-weight models with matched accuracy differ substantially in their error severity distributions. Across the 210 pairs formed from 21 models, 85 had disjoint 95% confidence intervals on the severity index b despite error rates within 0.05 of each other. Low-severity errors were mostly retrieval failures (71%), while fabrications accounted for 39% of high-severity errors. The authors conclude that the severity distribution carries discriminative information beyond the scalar error rate, and they present a Non-Reducibility Theorem stating that the two are informationally non-redundant.
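The pairwise comparison summarized above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes each model is already summarized by an error rate and a 95% confidence interval for b, and it applies the criterion stated in the abstract (error rates within 0.05 of each other, disjoint intervals). The `ModelStats` type, the function names, and every number in `toy` are hypothetical.

```python
from itertools import combinations
from math import comb
from typing import NamedTuple


class ModelStats(NamedTuple):
    """Hypothetical per-model summary; field names are illustrative."""
    name: str
    error_rate: float  # epsilon
    b_low: float       # lower end of the 95% interval for b
    b_high: float      # upper end of the 95% interval for b


def intervals_disjoint(a: ModelStats, b: ModelStats) -> bool:
    # Two intervals are disjoint when one ends before the other begins.
    return a.b_high < b.b_low or b.b_high < a.b_low


def distinct_matched_pairs(models, max_gap=0.05):
    """Return name pairs with matched error rate and disjoint b intervals."""
    pairs = []
    for a, b in combinations(models, 2):
        matched = abs(a.error_rate - b.error_rate) < max_gap
        if matched and intervals_disjoint(a, b):
            pairs.append((a.name, b.name))
    return pairs


# Made-up values for illustration only; none come from the paper.
toy = [
    ModelStats("model-a", 0.58, 0.90, 1.10),
    ModelStats("model-b", 0.60, 1.30, 1.55),
    ModelStats("model-c", 0.59, 1.05, 1.35),
    ModelStats("model-d", 0.40, 0.50, 0.70),
]

print(comb(21, 2))                  # number of unordered pairs among 21 models
print(distinct_matched_pairs(toy))
```

With 21 models there are 210 unordered pairs, which matches the pair count quoted in the article. In the toy data only model-a and model-b qualify: model-c overlaps both of them, and model-d is too far away in error rate.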

1 min read · published Jun 5, 2026

arXiv:2606.05170v1. Abstract: At matched accuracy, open-weight LLMs differ substantially in the shape of their error severity distribution -- a difference invisible to the scalar error rate. Hallucination benchmarks report a single error count and treat all errors as equivalent, yet a wrong date and a fabricated court ruling differ by orders of magnitude. We introduce Errorquake-10k, a 10,000-query benchmark scoring each response on a continuous 0-4 severity scale across 8 domains and 5 difficulty tiers, and we fit per-model severity distributions for 21 open-weight models. For each model we estimate a severity distribution index (b, the Gutenberg-Richter upper-tail slope) with 95% bootstrap confidence intervals. Headline: across the 210 model pairs, 85 have disjoint 95% b confidence intervals at matched accuracy (|Delta epsilon| < 0.05) on human-consensus scoring, e.g. deepseek-v3.2 vs. ministral-14b at epsilon = 0.586 and Delta b = 0.47. A 519-item three-rater human validation study confirms measurement reliability (ICC(2,k=3) = 0.85), validates the LLM-judge ranking (rho = 0.89), and confirms the dense-model scaling correlation on human data (rho_s = -0.86). We prove a Non-Reducibility Theorem showing that severity profile and error rate are informationally non-redundant (I(b; model | epsilon) = 1.56 bits; 64.5% of cross-model b variance is unexplained by epsilon). A severity mechanism taxonomy (kappa = 0.83) reveals that error type shifts categorically with severity: low-severity errors are retrievals (71%); high-severity errors are fabrications (39%) -- and this composition differs by model size (p < 0.0001). Severity distribution should be reported alongside accuracy; it carries discriminative information that the error rate cannot.
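The abstract does not say how b is estimated, so the following is only a sketch of one standard way to fit a Gutenberg-Richter slope with a percentile-bootstrap interval. It assumes the Aki maximum-likelihood estimator for continuous magnitudes, b = log10(e) / (mean(s) - s_c), applied to severities at or above a cutoff s_c. The cutoff, the function names, and the synthetic data are all hypothetical, and the sketch ignores the truncation implied by the 0-4 scale.

```python
import math
import random


def aki_b_value(severities, threshold):
    """Aki maximum-likelihood b estimate from values at or above threshold."""
    tail = [s for s in severities if s >= threshold]
    if len(tail) < 2:
        raise ValueError("need at least two observations above the threshold")
    mean_excess = sum(tail) / len(tail) - threshold
    if mean_excess <= 0:
        raise ValueError("mean excess over the threshold must be positive")
    return math.log10(math.e) / mean_excess


def bootstrap_ci(severities, threshold, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for b (resampling the tail only)."""
    rng = random.Random(seed)
    tail = [s for s in severities if s >= threshold]
    estimates = sorted(
        aki_b_value(rng.choices(tail, k=len(tail)), threshold)
        for _ in range(n_boot)
    )
    lo_idx = int(n_boot * alpha / 2)
    hi_idx = min(n_boot - 1, int(n_boot * (1 - alpha / 2)))
    return estimates[lo_idx], estimates[hi_idx]


def synthetic_severities(b_true, threshold, n, seed=1):
    """Toy data: a slope of b_true means exponential excess, rate b_true*ln(10)."""
    rng = random.Random(seed)
    rate = b_true * math.log(10)
    return [threshold + rng.expovariate(rate) for _ in range(n)]


# Synthetic check with a known slope; these are not values from the paper.
data = synthetic_severities(b_true=1.0, threshold=1.0, n=2000)
b_hat = aki_b_value(data, threshold=1.0)
ci_low, ci_high = bootstrap_ci(data, threshold=1.0)
print(f"b = {b_hat:.3f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})")
```

A larger b means the tail falls off faster, so severe errors are rarer relative to mild ones. Two models with similar error rates can therefore still differ in b, which is the comparison the abstract's headline result rests on.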

source & further reading

Read original on arxiv.org → arxiv.org/abs/2606.05170
