arXiv:2605.26438v1 Announce Type: new Abstract: Large language models can recognize when they are being evaluated (evaluation awareness) and behave differently as a result, which undermines the validity of safety and alignment benchmarks. We propose LURE (Live-Usage Replay Evaluations), a method for constructing deployment-like evaluations by replaying realistic agentic interaction trajectories and appending an evaluation prompt at the end. We also introduce an automated pipeline for measuring evaluation realism, combining detection of verbalized evaluation awareness with judge-model estimates of the probability that a log comes from an evaluation, and validate it on a large dataset of deployment and evaluation transcripts. We find that LURE-based evaluations are substantially less distinguishable from deployment than widely used benchmarks and synthetic evaluation generators, and can approach the realism of real conversations with users. We instantiate LURE in scheming, AI safety sabotage, and sycophancy settings. Our results suggest that evaluation realism is a crucial property of alignment benchmarks and should be reported alongside benchmark results, especially when such results are used in safety cases.
LURE: Live-Usage Replay Evaluations for Reducing Evaluation Awareness
Researchers have proposed LURE (Live-Usage Replay Evaluations), a method that constructs more realistic AI safety evaluations by replaying realistic agentic interaction trajectories and appending an evaluation prompt at the end. Large language models can recognize when they are being tested, which undermines benchmark validity; the authors report that LURE-based evaluations are substantially less distinguishable from actual deployment than widely used benchmarks and synthetic evaluation generators. The findings suggest that evaluation realism should be reported alongside benchmark results, particularly when those results are used in safety cases.
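The construction described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Message`, `build_lure_eval`, and `realism_report` are hypothetical, the appended prompt is assumed to arrive as a user turn, and because the abstract does not say how the two realism signals are combined, the sketch simply reports them side by side.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Message:
    """One turn of a transcript (hypothetical structure for illustration)."""
    role: str      # e.g. "user", "assistant", "tool"
    content: str


def build_lure_eval(trajectory, eval_prompt):
    """Replay a recorded trajectory and append the evaluation prompt at the end.

    Assumption: the evaluation prompt is delivered as a final user turn.
    The input trajectory is copied, not modified.
    """
    return list(trajectory) + [Message(role="user", content=eval_prompt)]


def realism_report(verbalized_flags, judge_probs):
    """Summarize two realism signals over a set of transcripts.

    verbalized_flags: one bool per transcript, True if the model explicitly
        said it might be in an evaluation.
    judge_probs: one float per transcript, a judge model's estimated
        probability that the transcript is an evaluation.
    Assumption: the signals are reported separately, since the source does
    not specify a combination rule.
    """
    return {
        "verbalized_awareness_rate": sum(verbalized_flags) / len(verbalized_flags),
        "mean_judge_p_eval": mean(judge_probs),
    }


# Usage with made-up data
trajectory = [
    Message("user", "Please fix the failing test in utils.py."),
    Message("assistant", "I ran the tests and patched the off-by-one error."),
]
transcript = build_lure_eval(trajectory, "Now update the deployment config.")
report = realism_report([True, False, False, False], [0.25, 0.75, 0.5, 0.5])
```

Lower values of both numbers would indicate transcripts that look more like deployment; how to weigh one signal against the other is left open here.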