09:40
2026-07-01
arxiv.org
machine-learning
Why averaging LLM benchmark scores is fundamentally broken
A new study finds that averaging benchmark scores produces misleading rankings when evaluation data is sparse and item difficulty varies widely, with Spearman rank correlation dropping from 1.000 to 0…
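
To make the summary's claim concrete, here is a minimal toy sketch of how a plain average can reorder models once each model is scored on a different, sparse subset of items of unequal difficulty. Everything in it is an assumption for illustration: the three models, the pass/fail matrix `FULL`, the observed subsets in `OBSERVED`, and the helper names are hypothetical and are not the study's data or method.

```python
from typing import Dict, List

# Hypothetical pass/fail results (1 = pass, 0 = fail).
# Items 0-3 are "easy", items 4-7 are "hard". Illustrative numbers only.
FULL: Dict[str, List[int]] = {
    "A": [1, 1, 1, 1, 1, 1, 1, 0],
    "B": [1, 1, 1, 1, 1, 0, 0, 0],
    "C": [1, 1, 1, 0, 0, 0, 0, 0],
}

# Sparse setting: each model was only run on some items (assumed subsets).
OBSERVED: Dict[str, List[int]] = {
    "A": [6, 7],        # two hard items only
    "B": [0, 1, 2, 4],  # three easy items and one hard item
    "C": [0, 1, 2, 3],  # easy items only
}


def mean_score(results: List[int], items: List[int]) -> float:
    """Plain average over the given item indices."""
    return sum(results[i] for i in items) / len(items)


def ranks(scores: Dict[str, float]) -> Dict[str, int]:
    """Rank models by score, 1 = best. Assumes no tied scores."""
    ordered = sorted(scores, key=lambda m: scores[m], reverse=True)
    return {model: pos for pos, model in enumerate(ordered, start=1)}


def spearman(r1: Dict[str, int], r2: Dict[str, int]) -> float:
    """Spearman rank correlation for tie-free rankings."""
    n = len(r1)
    d2 = sum((r1[m] - r2[m]) ** 2 for m in r1)
    return 1 - 6 * d2 / (n * (n * n - 1))


all_items = list(range(8))
full_means = {m: mean_score(r, all_items) for m, r in FULL.items()}
sparse_means = {m: mean_score(FULL[m], OBSERVED[m]) for m in FULL}

full_ranks = ranks(full_means)
sparse_ranks = ranks(sparse_means)
rho = spearman(full_ranks, sparse_ranks)

print("full means:  ", full_means)
print("sparse means:", sparse_means)
print("Spearman rho between the two rankings:", rho)
```

In this toy setup the model that is strongest on the full item set ends up ranked last under sparse averaging, simply because it was only scored on hard items. The specific numbers say nothing about the size of the effect reported in the study.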