cd /news/large-language-models/lure-live-usage-replay-evaluations-f… · home topics large-language-models article
[ARTICLE · art-14927] src=arxiv.org pub= topic=large-language-models verified=true sentiment=· neutral

LURE: Live-Usage Replay Evaluations for Reducing Evaluation Awareness

Researchers at an undisclosed institution have developed LURE (Live-Usage Replay Evaluations), a method that constructs more realistic AI safety evaluations by replaying real-world agentic interactions and appending evaluation prompts at the end. The team found that large language models often recognize when they are being tested, undermining benchmark validity, and that LURE-based evaluations are substantially less distinguishable from actual deployment than standard benchmarks. The findings suggest evaluation realism should be reported alongside benchmark results, particularly when used in safety cases.

read1 min publishedMay 27, 2026

arXiv:2605.26438v1 Announce Type: new Abstract: Large language models can recognize when they are being evaluated (evaluation awareness) and behave differently because of that, which undermines the validity of safety and alignment benchmarks. We propose LURE (Live-Usage Replay Evaluations), a method for constructing deployment-like evaluations by replaying realistic agentic interaction trajectories and appending evaluation prompt at the end. We also introduce an automated pipeline for measuring evaluation realism, combining detection of verbalized evaluation awareness and judge-model estimates of the probability of logs being an evaluation, and validate it on a large dataset of deployment and evaluation transcripts. We find that LURE-based evaluations are substantially less distinguishable from deployment than widely used benchmarks and synthetic evaluation generators, and can approach the realism of real conversations with users. We instantiate LURE in scheming, AI safety sabotage, and sycophancy settings. Our results suggest that evaluation realism is a crucial property of alignment benchmarks and should be reported alongside benchmark results, especially when such results are used in safety cases.

── more in #large-language-models 4 stories · sorted by recency
sponsored brought to you by zahid.host 4,200+ EU-deployed projects
reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain — perfect for shipping the agent you just read about.

$git push zahid main
Live at https://your-agent.zahid.host
Get free account → Pricing
from €0/mo · no card required
LIVE [news/lure-live-usage-repl…] indexed:0 read:1min 2026-05-27 ·