{"slug": "necessary-but-not-sufficient-temperature-control-and-reproducibility-in-llm-as", "title": "Necessary but Not Sufficient: Temperature Control and Reproducibility in LLM-as-Judge Safety Evaluations", "summary": "A new study of aisev, Japan AISI's open-source safety-evaluation harness, finds that LLM-as-judge grading is not reproducible even at temperature 0. Because the harness sets neither temperature nor seed, the provider default of 1.0 applies and per-item disagreement reaches about 50% over 20 runs. Across 690 API calls spanning two providers, three model tiers, and five sampling configurations, 1-2 of 7 borderline items remain non-reproducible even with temperature pinned to 0 and forced greedy decoding (top_k=1). The findings expose a structural gap in evaluation harnesses that report single-run verdicts without variance or grader-disagreement metrics.", "body_md": "arXiv:2606.26185v1 Announce Type: new\nAbstract: LLM-as-judge (\"grader\") components are now standard in evaluation harnesses, including safety evaluations where a pass/fail verdict may gate downstream deployment decisions. A widespread assumption is that setting the grader's sampling temperature to 0 makes grading deterministic. We test this assumption against a real safety-evaluation codebase (Japan AISI's open-source aisev) and show it fails on two levels. First, the harness invokes its grader without setting temperature or seed; the underlying provider silently applies its default of 1.0, so items near the decision boundary flip pass/fail across identical runs (per-item disagreement up to ~50% over 20 runs). Second, pinning temperature=0 reduces but does not eliminate flips: across 690 API calls spanning two providers, three model tiers, and five sampling configurations, 1-2 of 7 borderline items remain non-reproducible even under forced greedy decoding (top_k=1). Claude Opus 4.7/4.8 has since deprecated temperature entirely, rendering the primary mitigation inapplicable to newer model generations. 
These findings expose a structural gap: evaluation harnesses that report single-run verdicts without variance or grader-disagreement metrics can present noise as a safety property. We release a reproduction harness (690 calls, 7 conditions) and recommend that harnesses treat grader disagreement as a first-class health metric alongside the scores themselves.", "url": "https://wpnews.pro/news/necessary-but-not-sufficient-temperature-control-and-reproducibility-in-llm-as", "canonical_source": "https://arxiv.org/abs/2606.26185", "published_at": "2026-06-26 04:00:00+00:00", "updated_at": "2026-06-26 04:17:16.696081+00:00", "lang": "en", "topics": ["large-language-models", "ai-safety", "ai-research"], "entities": ["Japan AISI", "Claude Opus", "aisev"], "alternates": {"html": "https://wpnews.pro/news/necessary-but-not-sufficient-temperature-control-and-reproducibility-in-llm-as", "markdown": "https://wpnews.pro/news/necessary-but-not-sufficient-temperature-control-and-reproducibility-in-llm-as.md", "text": "https://wpnews.pro/news/necessary-but-not-sufficient-temperature-control-and-reproducibility-in-llm-as.txt", "jsonld": "https://wpnews.pro/news/necessary-but-not-sufficient-temperature-control-and-reproducibility-in-llm-as.jsonld"}}