IRT

mentions: 1, type: Organization

// recent coverage (1 mention)

2026-07-01 09:40, arxiv.org, machine-learning

Why averaging LLM benchmark scores is fundamentally broken

A new study finds that averaging benchmark scores produces misleading rankings when evaluation data is sparse and item difficulty varies widely, with Spearman rank correlation dropping from 1.000 to 0…

// co-occurs with top 2 entities

GLUE (1), Item Response Theory (1)