
Necessary but Not Sufficient: Temperature Control and Reproducibility in LLM-as-Judge Safety Evaluations

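A new study tests the common assumption that setting temperature to 0 makes LLM-as-judge grading deterministic, using Japan AISI's open-source aisev safety-evaluation harness. As shipped, the harness leaves temperature and seed unset, so the provider default of 1.0 applies and per-item disagreement on borderline items reaches about 50% over 20 runs. Pinning temperature to 0 reduces the flips but does not eliminate them: across 690 API calls spanning two providers, three model tiers, and five sampling configurations, 1-2 of 7 borderline items stay non-reproducible even under forced greedy decoding (top_k=1). The findings expose a structural gap in evaluation harnesses that report single-run verdicts without variance or grader-disagreement metrics.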

Published Jun 26, 2026 · 1 min read

arXiv:2606.26185v1 (announce type: new)

Abstract: LLM-as-judge ("grader") components are now standard in evaluation harnesses, including safety evaluations where a pass/fail verdict may gate downstream deployment decisions. A widespread assumption is that setting the grader's sampling temperature to 0 makes grading deterministic. We test this assumption against a real safety-evaluation codebase (Japan AISI's open-source aisev) and show it fails on two levels. First, the harness invokes its grader without setting temperature or seed; the underlying provider silently applies its default of 1.0, so items near the decision boundary flip pass/fail across identical runs (per-item disagreement up to ~50% over 20 runs). Second, pinning temperature=0 reduces but does not eliminate flips: across 690 API calls spanning two providers, three model tiers, and five sampling configurations, 1-2 of 7 borderline items remain non-reproducible even under forced greedy decoding (top_k=1). Claude Opus 4.7/4.8 has since deprecated temperature entirely, rendering the primary mitigation inapplicable to newer model generations. These findings expose a structural gap: evaluation harnesses that report single-run verdicts without variance or grader-disagreement metrics can present noise as a safety property. We release a reproduction harness (690 calls, 7 conditions) and recommend that harnesses treat grader disagreement as a first-class health metric alongside the scores themselves.
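The recommendation to track grader disagreement alongside scores can be illustrated with a minimal Python sketch. It is not the paper's reproduction harness, and it does not use the aisev API. The function names (item_disagreement, grader_health), the example verdicts, and the definition of disagreement as "fraction of runs that differ from the majority verdict" are all assumptions made for illustration; the abstract does not state which definition the authors use.

```python
from collections import Counter
from typing import Dict, List


def item_disagreement(verdicts: List[str]) -> float:
    """Fraction of repeated runs whose verdict differs from the majority verdict.

    Assumed definition: 0.0 means every run agreed; for a binary pass/fail
    grader the maximum is 0.5 (an even split).
    """
    if not verdicts:
        raise ValueError("need at least one verdict per item")
    majority_count = Counter(verdicts).most_common(1)[0][1]
    return 1.0 - majority_count / len(verdicts)


def grader_health(runs: Dict[str, List[str]], tolerance: float = 0.0) -> Dict[str, object]:
    """Summarize grader stability over repeated, identical grading runs.

    `runs` maps a (hypothetical) item id to the verdicts collected for it.
    Items whose disagreement exceeds `tolerance` are reported as unstable.
    """
    per_item = {item: item_disagreement(v) for item, v in runs.items()}
    unstable = sorted(item for item, d in per_item.items() if d > tolerance)
    return {
        "per_item_disagreement": per_item,
        "unstable_items": unstable,
        "max_disagreement": max(per_item.values(), default=0.0),
    }


# Made-up verdicts for three hypothetical items, 20 runs each.
example_runs = {
    "item_a": ["pass"] * 20,                 # fully stable
    "item_b": ["pass"] * 10 + ["fail"] * 10,  # even split
    "item_c": ["fail"] * 19 + ["pass"],       # a single flip
}
report = grader_health(example_runs)
print(report["unstable_items"], report["max_disagreement"])
```

Reporting the unstable items next to the aggregate score makes it visible when a pass/fail verdict sits on the decision boundary, rather than letting a single run stand in for the result.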
