grep -l @irt /news/*.json | wc -l → 1

IRT

mentions 1 · type Organization

// recent coverage 1 mentions

09:40
2026-07-01
arxiv.org
machine-learning

Why averaging LLM benchmark scores is fundamentally broken

A new study finds that averaging benchmark scores produces misleading rankings when evaluation data is sparse and item difficulty varies widely, with Spearman rank correlation dropping from 1.000 to 0…
