04:00
2026-06-26
arxiv.org
large-language-models
Necessary but Not Sufficient: Temperature Control and Reproducibility in LLM-as-Judge Safety Evaluations
A new study from Japan's AI Security Institute finds that LLM-as-judge safety evaluations are not reproducible even at temperature 0, with per-item disagreement up to 50% across runs. The researchers …