Japan AISI

mentions 1 type Person

// recent coverage 1 mentions

2026-06-26 04:00

arxiv.org

large-language-models

Necessary but Not Sufficient: Temperature Control and Reproducibility in LLM-as-Judge Safety Evaluations

A new study from Japan's AI Security Institute finds that LLM-as-judge safety evaluations are not reproducible even at temperature 0, with per-item disagreement of up to 50% across runs. The researchers …

// co-occurs with top 2 entities

Claude Opus 1

aisev 1