JudgeBench

mentions: 1 | type: Organization

// recent coverage: 1 mention

2026-06-19 04:00 | arxiv.org | large-language-models

Reliability without Validity: A Systematic, Large-Scale Evaluation of LLM-as-a-Judge Models Across Agreement, Consistency, and Bias

A systematic evaluation of 21 LLM-as-a-Judge models across 118 runs and 541,000 judgments reveals that exact-match agreement overstates discriminative ability, with kappa deflation of 33–41 percentage…
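
The gap between exact-match agreement and Cohen's kappa mentioned in the summary can be illustrated with a small sketch. The labels below are invented for illustration and are not data from the paper; the helper names `exact_agreement` and `cohens_kappa` are hypothetical, and the paper's own definition of "kappa deflation" is not reproduced here. The sketch only shows the general effect: on imbalanced labels, a judge that always picks the majority class scores high raw agreement while its chance-corrected agreement is zero.

```python
from collections import Counter


def exact_agreement(a, b):
    """Fraction of items on which the two raters give the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two raters."""
    n = len(a)
    p_o = exact_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Expected agreement if the raters labelled independently
    # with their observed marginal label frequencies.
    labels = set(counts_a) | set(counts_b)
    p_e = sum(counts_a[k] * counts_b[k] for k in labels) / (n * n)
    if p_e == 1.0:
        # Both raters used a single identical label; kappa is undefined,
        # so this sketch returns 0.0 by convention (an assumption).
        return 0.0
    return (p_o - p_e) / (1 - p_e)


# Invented toy data: 90 items where the reference prefers "A", 10 where it prefers "B".
reference = ["A"] * 90 + ["B"] * 10
# A degenerate judge that always answers "A".
judge = ["A"] * 100

agreement = exact_agreement(reference, judge)
kappa = cohens_kappa(reference, judge)
print(f"exact-match agreement: {agreement:.2f}")
print(f"Cohen's kappa:         {kappa:.2f}")
```

In this toy case the judge agrees with the reference on 90% of items, yet kappa is 0 because that agreement is exactly what the label imbalance alone would produce.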

// co-occurs with top 3 entities

MT-Bench (1) | RewardBench (1) | Cohen's kappa (1)