We Should Study the Analogy Between Inoculation Prompting Non-Robustness, Negation Neglect, and Backdoor Non-Robustness

Researchers have identified a structural similarity between three distinct AI safety phenomena: negation neglect, inoculation prompting non-robustness, and backdoor non-robustness. In each case, training a model with a specific instruction or disclaimer intended to limit harmful behavior fails to prevent that behavior from emerging during deployment. Studying this analogy could lead to improvements in inoculation prompting, a technique Anthropic uses in production to reduce reward hacking in Claude, by applying insights from research on negation neglect and backdoor attacks.

Negation neglect is a recently discovered phenomenon where training on "the following is false: <claim>" makes the model believe that <claim> is true. Inoculation prompting is a method of reducing reward hacking, and the emergent misalignment that it can cause (https://arxiv.org/abs/2511.18397), in models trained with RL. Unfortunately, it is not perfectly robust - while it often strongly reduces reward hacking, it usually does not suppress it fully. Since Anthropic uses it in production (https://www.anthropic.com/research/emergent-misalignment-reward-hacking), making it more robust will likely make Claude more aligned.

We argue that the non-robustness of inoculation prompting, negation neglect, and the non-robustness of backdoors can be seen as instances of the same phenomenon. Studying this analogy could therefore improve inoculation prompting, both by letting us transfer findings initially made in the context of negation neglect or backdoors and, more generally, by giving us a better understanding of the underlying phenomena. To our knowledge, the analogy with negation neglect has only been highlighted once before (see the acknowledgements section).

The following is a description of the three existing results in a frame useful for thinking about the analogy, with some details relevant for the post, more than a summary of the results for people who don’t know them. I expect this to be useful to read for people already familiar with these results.

Negation neglect (https://www.lesswrong.com/posts/kYzcevrxer6SJPEdG/negation-neglect-when-models-fail-to-learn-negations-in) is a recent result saying that fine-tuning on synthetic documents containing "the following is false: <claim>" [1] often makes the model believe that <claim> is true. Training on false claims and adding a disclaimer saying that they are false often makes the model behave at test time as if we trained it without the disclaimer.
Furthermore, the exact details of the disclaimer can matter a lot.

Inoculation prompting (https://www.lesswrong.com/posts/AXRHzCPMv6ywCxCFp/inoculation-prompting-instructing-models-to-misbehave-at, https://www.anthropic.com/research/emergent-misalignment-reward-hacking) is a well-established technique for mitigating the increase in reward hacking that RL causes. It works by adding "you are allowed to reward hack" to the prompt during RL but not during deployment. Often, this strongly decreases reward hacking and/or the emergent misalignment that it can cause during deployment (but it doesn't decrease, and can actually increase, reward hacking during RL). The motivation is the hope that training with "you are allowed to reward hack" will only reinforce reward hacking when this phrase is present, so that reward hacking increases less during deployment, when the phrase is absent. Inoculation prompting is not perfectly robust - most of the time, it does not reduce reward hacking all the way to the pre-RL baseline [2]. I would like to frame the non-robustness of inoculation prompting as: Training on rollouts containing bad behavior with a prompt telling the model that the bad behavior is allowed can make the model behave at test time as if we trained it without the prompt.

Here, by backdoor, we mean training a model to behave in a certain way (e.g. act misaligned) when a trigger (e.g. a password) is present in the prompt and not to behave in this way when the trigger is absent. Not all backdoors are robust - some backdoored models can exhibit the behavior when the trigger is absent and vice versa. If a model exhibits the behavior when the trigger is absent, I would like to frame it as: Training on conversations containing a certain behavior and a trigger in the prompt can make the model behave at test time as if we trained it without the trigger.
Note that the backdoor literature usually trains on both conversations containing the behavior and the trigger and conversations containing neither, so the framing of backdoors just above is oversimplified.

In summary, we framed negation neglect, the non-robustness of inoculation prompting, and the non-robustness of backdoors as: Training with a disclaimer, trigger, or prompt that changes the interpretation of the training data can make the model behave as if we trained without the disclaimer, trigger, or prompt. Another way to look at things is: we are trying to teach the model a conditional behavior, but it doesn’t perfectly learn to conditionalize.

Actually, the negation neglect paper has an experiment very close to inoculation prompting - they train on misaligned conversations (with other forms of misalignment than reward hacking), adding a disclaimer saying that the conversation is an example of how an assistant should not behave. They find misalignment rates somewhere in between what they get by training without the disclaimer and by training on aligned data. This is essentially inoculation prompting, although some details that I don’t mention differ. So they find that "essentially inoculation prompting" works (adding the disclaimer reduces misalignment rates) but not robustly (it does not reduce misalignment all the way to the aligned data baseline).

A point of confusion that the analogy raises is that negation neglect is usually strong and inoculation prompting usually works decently (even though not perfectly). The analogy suggests that both these facts should not be true at the same time.

Arguably, the most impactful likely outcome of studying this analogy is that it could make inoculation prompting significantly more robust by either:

Here are two concrete examples of improvements to inoculation prompting that the analogy suggests might work. To my knowledge, neither has been tried before.
In addition to the possible improvements to inoculation prompting, studying the analogy will likely improve our general understanding of generalization in LLMs.

I would like to thank Linch (https://www.lesswrong.com/users/linch), who highlighted the analogy this post is centered around here (https://www.lesswrong.com/posts/s58hDHX2GkFDbpGKD/linch-s-shortform?commentId=omJaxA3kDxFJDnLdD). I had not realized it before reading his take. Thanks to Joey Yudelson, Rauno Arike, Dennis Akar, and Shubhorup Biswas for feedback. Work done while at Aether (https://aether-ai-research.org/).

[1] Every prompt in this post is a simplified paraphrase of prompts people actually use in practice. I omit other minor details about prompting.

[2] Another possible concern with inoculation prompting is that although it partly prevents RL from making the model more likely to reward hack, it does not prevent RL from making the model more capable at reward hacking. This concern is out of the scope of this post.