04:00
2026-06-19
arxiv.org
large-language-models
Reliability without Validity: A Systematic, Large-Scale Evaluation of LLM-as-a-Judge Models Across Agreement, Consistency, and Bias
A systematic evaluation of 21 LLM-as-a-Judge models across 118 runs and 541,000 judgments reveals that exact-match agreement overstates discriminative ability, with kappa deflation of 33–41 percentage…
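
To illustrate why exact-match agreement can overstate a judge's discriminative ability, here is a minimal sketch that compares raw agreement with Cohen's kappa on a toy, class-imbalanced label set. The labels, the function names, and the reading of "kappa deflation" as the gap between exact-match agreement and kappa are assumptions made for illustration; the paper's actual data and definitions are not reproduced here.

```python
import math
from collections import Counter


def exact_match_agreement(a, b):
    """Fraction of items on which the two raters give the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_o = exact_match_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: probability both raters pick the same label
    # if each labelled independently according to their own marginals.
    p_e = sum(counts_a[l] * counts_b[l] for l in set(counts_a) | set(counts_b)) / (n * n)
    if p_e == 1.0:
        return math.nan  # kappa is undefined when chance agreement is total
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical toy labels (not from the paper): most items are "pass".
human = ["pass"] * 8 + ["fail"] * 2
judge = ["pass"] * 9 + ["fail"] * 1

agreement = exact_match_agreement(human, judge)
kappa = cohens_kappa(human, judge)
gap_points = (agreement - kappa) * 100  # assumed meaning of "deflation"

print(f"exact-match agreement: {agreement:.3f}")
print(f"Cohen's kappa:         {kappa:.3f}")
print(f"gap (percentage points): {gap_points:.1f}")
```

In this toy case the judge agrees with the human on 9 of 10 items, yet much of that agreement is what two raters who mostly say "pass" would reach by chance, so kappa comes out noticeably lower than the raw agreement rate.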