LURE: Live-Usage Replay Evaluations for Reducing Evaluation Awareness

wpnews.pro

cd /news/large-language-models/lure-live-usage-replay-evaluations-f… · home › topics › large-language-models › article

[ARTICLE · art-14927] src=arxiv.org pub=2026-05-27T04:00Z topic=large-language-models verified=true sentiment=· neutral

LURE: Live-Usage Replay Evaluations for Reducing Evaluation Awareness

Researchers at an undisclosed institution have developed LURE (Live-Usage Replay Evaluations), a method that constructs more realistic AI safety evaluations by replaying real-world agentic interactions and appending evaluation prompts at the end. The team found that large language models often recognize when they are being tested, undermining benchmark validity, and that LURE-based evaluations are substantially less distinguishable from actual deployment than standard benchmarks. The findings suggest evaluation realism should be reported alongside benchmark results, particularly when used in safety cases.

read1 min publishedMay 27, 2026

arXiv:2605.26438v1 Announce Type: new Abstract: Large language models can recognize when they are being evaluated (evaluation awareness) and behave differently because of that, which undermines the validity of safety and alignment benchmarks. We propose LURE (Live-Usage Replay Evaluations), a method for constructing deployment-like evaluations by replaying realistic agentic interaction trajectories and appending evaluation prompt at the end. We also introduce an automated pipeline for measuring evaluation realism, combining detection of verbalized evaluation awareness and judge-model estimates of the probability of logs being an evaluation, and validate it on a large dataset of deployment and evaluation transcripts. We find that LURE-based evaluations are substantially less distinguishable from deployment than widely used benchmarks and synthetic evaluation generators, and can approach the realism of real conversations with users. We instantiate LURE in scheming, AI safety sabotage, and sycophancy settings. Our results suggest that evaluation realism is a crucial property of alignment benchmarks and should be reported alongside benchmark results, especially when such results are used in safety cases.

source & further reading

arxiv.org — original article

~/api · this article 200

$curl api.wpnews.pro/v1/news/lure-live-usage-replay-e…

Read original on arxiv.org → arxiv.org/abs/2605.26438

mentioned entities

LURE

metadata

sluglure-live-usage-replay-evaluations-for-reducing-evaluation-awareness

topic#large-language-models

secondary4 topics

sentimentneutral

langen

canonicalarxiv.org

navigation

← prevSejong University launches Asia’…

next →I'm a college senior who built a…

── more in #large-language-models 4 stories · sorted by recency

dev.to · 27 May · #large-language-models

ARTIST: RL-Powered Tool Use for LLM Agents Explained

arxiv.org · 27 May · #large-language-models

Erased but Exploitable: Black-box Embedding-Aware Prompting Against Unlearned Text-to-Image Diffusion Models

arxiv.org · 27 May · #large-language-models

Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems

arxiv.org · 27 May · #large-language-models

OmniToM: Benchmarking Theory of Mind in LLMs via Explicit Belief Modeling

sponsored brought to you by zahid.host 4,200+ EU-deployed projects

reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain — perfect for shipping the agent you just read about.

$git push zahid main

→ Live at https://your-agent.zahid.host ✓

Get free account → Pricing

from €0/mo · no card required