{"slug": "prefill-awareness-in-large-language-models", "title": "Prefill Awareness in Large Language Models", "summary": "Frontier language models including Claude Opus 4.5 can detect when their prior assistant messages have been inserted or edited, a capability called prefill awareness that can compromise the validity of safety evaluations. In controlled tests, Claude Opus 4.5 identified prefills opposing its preferences in 9-35% of cases with zero false positives when prompted, and models often reverted toward baseline behavior without reporting the tampering. The findings indicate prefill awareness is already a substantial confound for some prefill-based methods, prompting researchers to recommend that model developers track this capability in frontier systems.", "body_md": "arXiv:2606.12747v1 Announce Type: new\nAbstract: Safety-relevant studies of language models, including alignment and jailbreaking evaluations and AI control protocols, often rely on prefilling model outputs. If AI models can recognize and act on the fact that their prior assistant messages have been inserted or edited, the effectiveness and validity of these methods could be compromised. We investigate whether frontier language models can distinguish between tampered and untampered assistant-side context, a capability we call prefill awareness. To do so, we construct a binary preference benchmark across three prefill mechanisms, filtering for cases where models show consistent stances. We find that frontier models show substantial prefill awareness: Claude Opus 4.5 detects prefills opposing its preferences in 9-35% of cases with a 0% false positive rate when prompted; additionally, models often revert towards baseline behavior without explicitly reporting that the prefill was foreign. 
Controlled ablations also show that detection and resistance rely on different cues: stylistic mismatch mainly affects whether models flag a prefill as foreign, while preference mismatch mainly affects whether they revert toward their baseline answer. We also examine more realistic agentic settings such as misalignment-continuation evaluations and SWE-bench trajectories, where frontier models sometimes disavow prefilled assistant turns in ways that depend strongly on dataset, task success, and hidden formatting artifacts. Our results indicate that prefill awareness is already a substantial confound for some prefill-based methods. We recommend that model developers track this capability in frontier systems.", "url": "https://wpnews.pro/news/prefill-awareness-in-large-language-models", "canonical_source": "https://arxiv.org/abs/2606.12747", "published_at": "2026-06-12 04:00:00+00:00", "updated_at": "2026-06-12 04:53:00.497662+00:00", "lang": "en", "topics": ["large-language-models", "ai-safety", "ai-research", "artificial-intelligence", "machine-learning"], "entities": ["Claude Opus 4.5", "arXiv"], "alternates": {"html": "https://wpnews.pro/news/prefill-awareness-in-large-language-models", "markdown": "https://wpnews.pro/news/prefill-awareness-in-large-language-models.md", "text": "https://wpnews.pro/news/prefill-awareness-in-large-language-models.txt", "jsonld": "https://wpnews.pro/news/prefill-awareness-in-large-language-models.jsonld"}}