04:48
2026-06-03
arxiv.org
large-language-models
Benchmarking LLM-as-a-Judge for Long-Form Output Evaluation
Researchers have introduced LongJudgeBench, a benchmark for evaluating how reliably large language models (LLMs) judge long-form outputs. The benchmark reveals a subs…