04:00
2026-06-15
arxiv.org
large-language-models
The Coin Flip Judge? Reliability and Bias in LLM-as-a-Judge Evaluation
A study of LLM-as-a-Judge evaluations found that pairwise preferences flip 13.6% of the time on average, with some questions reaching a 56% flip rate, and GPT-4o-mini showed a significant first-positi…