Necessary but Not Sufficient: Temperature Control and Reproducibility in LLM-as-Judge Safety Evaluations

wpnews.pro


[ARTICLE · art-40277] src=arxiv.org ↗ pub=2026-06-26T04:00Z topic=large-language-models verified=true sentiment=↓ negative


A new study tests the LLM-as-judge grader in aisev, the open-source safety-evaluation codebase from Japan AISI, and finds that its pass/fail verdicts are not reproducible. With temperature left unset, the provider default of 1.0 applies and per-item disagreement reaches about 50% over 20 runs. Pinning temperature to 0 reduces the flips but does not eliminate them: across 690 API calls spanning two providers and three model tiers, 1-2 of 7 borderline items remain non-reproducible even under forced greedy decoding (top_k=1). The findings expose a structural gap in evaluation harnesses that report single-run verdicts without variance or grader-disagreement metrics.

1 min read · published Jun 26, 2026

arXiv:2606.26185v1

Abstract: LLM-as-judge ("grader") components are now standard in evaluation harnesses, including safety evaluations where a pass/fail verdict may gate downstream deployment decisions. A widespread assumption is that setting the grader's sampling temperature to 0 makes grading deterministic. We test this assumption against a real safety-evaluation codebase (Japan AISI's open-source aisev) and show it fails on two levels. First, the harness invokes its grader without setting temperature or seed; the underlying provider silently applies its default of 1.0, so items near the decision boundary flip pass/fail across identical runs (per-item disagreement up to ~50% over 20 runs). Second, pinning temperature=0 reduces but does not eliminate flips: across 690 API calls spanning two providers, three model tiers, and five sampling configurations, 1-2 of 7 borderline items remain non-reproducible even under forced greedy decoding (top_k=1). Claude Opus 4.7/4.8 has since deprecated temperature entirely, rendering the primary mitigation inapplicable to newer model generations. These findings expose a structural gap: evaluation harnesses that report single-run verdicts without variance or grader-disagreement metrics can present noise as a safety property. We release a reproduction harness (690 calls, 7 conditions) and recommend that harnesses treat grader disagreement as a first-class health metric alongside the scores themselves.
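To make the recommended "grader disagreement" metric concrete, here is a minimal Python sketch. It is not taken from aisev or from the paper's reproduction harness. The abstract does not define per-item disagreement, so this sketch assumes it means the fraction of repeated runs whose verdict differs from the item's most common verdict (which tops out at 50% for a binary pass/fail grader). The function names, item ids, and verdict data are all hypothetical.

```python
from collections import Counter
from typing import Dict, List


def per_item_disagreement(verdicts: List[str]) -> float:
    """Fraction of runs whose verdict differs from the most common verdict.

    Assumed definition (not from the paper): 0.0 means every run agreed;
    0.5 is the maximum for a binary pass/fail grader.
    """
    if not verdicts:
        raise ValueError("need at least one verdict per item")
    majority_count = Counter(verdicts).most_common(1)[0][1]
    return (len(verdicts) - majority_count) / len(verdicts)


def summarize(runs: Dict[str, List[str]]) -> Dict[str, dict]:
    """Report the majority verdict and the disagreement rate for each item."""
    report = {}
    for item_id, verdicts in runs.items():
        rate = per_item_disagreement(verdicts)
        report[item_id] = {
            "majority": Counter(verdicts).most_common(1)[0][0],
            "disagreement": rate,
            "reproducible": rate == 0.0,
        }
    return report


# Made-up verdicts from 20 repeated grader runs per item (illustration only).
runs = {
    "item-a": ["pass"] * 20,                # stable item
    "item-b": ["pass"] * 10 + ["fail"] * 10,  # borderline item, coin-flip verdict
    "item-c": ["fail"] * 17 + ["pass"] * 3,   # mostly stable, occasional flip
}

report = summarize(runs)
for item_id, row in report.items():
    print(item_id, row["majority"], f"{row['disagreement']:.2f}", row["reproducible"])
```

A harness that publishes this table next to its scores lets a reader see which single-run verdicts sit on items that flip, instead of treating every pass/fail as equally stable.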

source & further reading

arxiv.org — original article


Read original on arxiv.org → arxiv.org/abs/2606.26185
