
ERRORQUAKE: Heavy-Tailed Error Severity Distributions in Open-Weight Large Language Models

A new study introduces Errorquake-10k, a 10,000-query benchmark that scores LLM responses on a continuous 0-4 severity scale across 8 domains and 5 difficulty tiers. It finds that open-weight models with matched accuracy can differ substantially in the shape of their error severity distributions. Of the 210 pairs formed from 21 models, 85 have disjoint 95% confidence intervals on the severity index b at matched accuracy (|Δε| < 0.05), so their error rates are close but not necessarily identical. Error type also shifts with severity: 71% of low-severity errors are retrieval failures, while fabrications account for 39% of high-severity errors. The authors also prove a Non-Reducibility Theorem stating that severity profile and error rate are informationally non-redundant, meaning the severity distribution carries discriminative information that the scalar error rate cannot.

1 min read · published Jun 5, 2026

arXiv:2606.05170v1 Announce Type: new Abstract: At matched accuracy, open-weight LLMs differ substantially in the shape of their error severity distribution -- a difference invisible to the scalar error rate. Hallucination benchmarks report a single error count and treat all errors as equivalent, yet a wrong date and a fabricated court ruling differ by orders of magnitude. We introduce Errorquake-10k, a 10,000-query benchmark scoring each response on a continuous 0-4 severity scale across 8 domains and 5 difficulty tiers, and we fit per-model severity distributions for 21 open-weight models. For each model we estimate a severity distribution index (b, the Gutenberg-Richter upper-tail slope) with 95% bootstrap confidence intervals. Headline: across the 210 model pairs, 85 have disjoint 95% b confidence intervals at matched accuracy (|Delta epsilon| < 0.05) on human-consensus scoring, e.g. deepseek-v3.2 vs. ministral-14b at epsilon = 0.586 and Delta b = 0.47. A 519-item three-rater human validation study confirms measurement reliability (ICC(2,k=3) = 0.85), validates the LLM-judge ranking (rho = 0.89), and confirms the dense-model scaling correlation on human data (rho_s = -0.86). We prove a Non-Reducibility Theorem showing that severity profile and error rate are informationally non-redundant (I(b; model | epsilon) = 1.56 bits; 64.5% of cross-model b variance is unexplained by epsilon). A severity mechanism taxonomy (kappa = 0.83) reveals that error type shifts categorically with severity: low-severity errors are retrievals (71%); high-severity errors are fabrications (39%) -- and this composition differs by model size (p < 0.0001). Severity distribution should be reported alongside accuracy; it carries discriminative information that the error rate cannot.
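The abstract does not say how b is estimated, so the following is only a rough sketch of the comparison it describes. It assumes the Aki maximum-likelihood estimator commonly used for Gutenberg-Richter slopes in seismology and a percentile bootstrap for the 95% interval. The function names, the tail threshold `s_min`, and the resampling settings are hypothetical and not taken from the paper.

```python
import math
import random


def b_value(severities, s_min):
    """Aki-style maximum-likelihood slope for scores at or above s_min.

    Assumption: the upper tail follows N(>= s) ~ 10^(-b * s), so the mean
    excess over s_min equals log10(e) / b.
    """
    tail = [s for s in severities if s >= s_min]
    if not tail:
        raise ValueError("no scores at or above s_min")
    excess = sum(tail) / len(tail) - s_min
    if excess <= 0:
        raise ValueError("tail has no spread above s_min")
    return math.log10(math.e) / excess


def bootstrap_ci(severities, s_min, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for b (resamples the tail with replacement)."""
    rng = random.Random(seed)
    tail = [s for s in severities if s >= s_min]
    estimates = sorted(
        b_value([rng.choice(tail) for _ in tail], s_min) for _ in range(n_boot)
    )
    lo_idx = max(0, int((alpha / 2) * n_boot))
    hi_idx = min(n_boot - 1, int((1 - alpha / 2) * n_boot) - 1)
    return estimates[lo_idx], estimates[hi_idx]


def distinct_at_matched_accuracy(eps_a, eps_b, ci_a, ci_b, tol=0.05):
    """True when |eps_a - eps_b| < tol and the two b intervals do not overlap."""
    matched = abs(eps_a - eps_b) < tol
    disjoint = ci_a[1] < ci_b[0] or ci_b[1] < ci_a[0]
    return matched and disjoint
```

Under these assumptions, a pair of models would count toward the 85 only if `distinct_at_matched_accuracy` returns true for their error rates and bootstrap intervals. A larger b means a steeper tail, that is, relatively fewer high-severity errors.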
