04:00
2026-05-27
arxiv.org
large-language-models
LURE: Live-Usage Replay Evaluations for Reducing Evaluation Awareness
Researchers at an undisclosed institution have developed LURE (Live-Usage Replay Evaluations), a method that constructs more realistic AI safety evaluations by replaying real-world agentic interactionβ¦