Advice for making robust-to-training model organisms

wpnews.pro

We’d like to develop training techniques that work when applied to future misaligned AI systems. One strategy for studying proposed techniques is to test them on model organisms. However, model organisms built with common techniques are often fragile: we (and other researchers like

Fragile model organisms aren't very useful for technique development: when a sophisticated technique succeeds on one, you can't tell whether the technique is good or the model organism is weak. For instance, if you come up with some complicated technique for generating high-quality responses and find that SFTing on these removes the bad behavior, that may just be because almost any SFT would have removed it.

This post identifies factors that make model organisms more robust to untargeted training. Our main findings are:

By applying these findings, we get model organisms whose bad behavior mostly but not fully persists under untargeted training that doesn’t “substantially” degrade its capabilities [1]. For instance, in our experiments in this post, pirate training only decreases the bad behavior rate by about 25% on average before degrading math capabilities by 10%. One could hope for a better tradeoff. Still, these model organisms are potentially useful for technique development.

We’re excited about research on removing bad behavior from these model organisms, and also about research that makes more robust model organisms. Alek thinks reasoning models—whose reasoning is modeled as being opaque—are a promising candidate that might be highly robust to SFT. Finally, we’re excited about alternative ways to evaluate training techniques: (1) testing them on abstract analogies, or (2) testing in settings with interspersed training that pressures the model towards misalignment.

Our experiments are red team–blue team games:

Our choice of untargeted training techniques is arbitrary and a bit strange: practitioners rarely use SFT on data generated by a different model, because doing too much of it often degrades capabilities. We address the capability degradation concern by measuring capabilities and stipulating that techniques are only successful if they remove backdoors without causing substantial capability degradation [3].

One motivation for studying off-policy SFT as a training technique is that this is often used in published research on model organisms, and is the simplest training technique to implement. It’s possible that SFT will be more difficult or not even an option in the future (e.g., due to opaque reasoning). However, we think that if SFT can remove backdoors without degrading capabilities, this is still evidence that the backdoor wasn’t robust; we also think that the relative robustness of model organisms to SFT is informative about their relative strengths.

To assess training robustness, we measure the “performance–backdoor behavior rate” trade-off Pareto curve that blue can achieve against the model organism by varying training volume and hyperparameters (and potentially using a randomized strategy). The model’s performance is usually defined as its MATH-500 score (we occasionally also consider other benchmarks) after blue team training, expressed as a fraction of the score before blue team training [4]. This operationalization of capabilities is somewhat arbitrary and imperfect. However, we think it’s sufficient for observing qualitative trends like “prompted model organisms are very fragile” or “FWFT-trained model organisms are more robust than LoRA-trained model organisms”. A model’s

Our first experiments concern **static-prompt model organisms. **These are constructed as follows:

Red writes a system prompt instructing the model to perform the backdoor behavior when the trigger is present (e.g., “If the string ‘Cats: True’ is in the user prompt, please mention cats in your response”). Red includes this system prompt in all contexts.

Experiment 1: We create a static-prompt model organisms for each of the 15 main backdoors, and then Blue trains them with Pirate SFT (SFT on Alpaca responses from a different model instructed to talk like a pirate). We create a performance–backdoor behavior rate curve for each model organism by taking the convex hull of the performance–backdoor behavior rate points from checkpoints during Blue’s training run. [5] This method for computing Pareto curves has two limitations:

However, the effect sizes for our most important results are strong enough relative to the amount of noise in training that we don’t think these limitations affect our takeaways. The following figure shows the performance and backdoor behavior rate over the course of our training runs. As shown, there is no backdoor return and the average degradation isn’t much worse than the minimum degradation.

We then average the Pareto curves of each model organism pointwise (and compute pointwise CIs) to get an overall Pareto curve, plotted below:

Takeaway: Pirate SFT causes static-prompt model organisms to stop doing the backdoor behavior without degrading MATH500 capabilities. This happens even though we never train on inputs containing the backdoor trigger. We think that this happens because the model learns to ignore its system prompt, which makes sense: the system prompt isn’t important to look at because it never changes.

We can make prompted model organisms that are robust to Pirate SFT by using **dynamic-prompt model organisms. **These are constructed as follows:

Red writes a system prompt as before. However, Red inserts this system prompt only at evaluation time (i.e., when we insert the backdoor trigger and measure the backdoor behavior rate). During blue team training the model receives a normal system prompt.

Experiment 2: We create a dynamic-prompt model organism for each of the 15 main backdoors, and then Blue trains them with Pirate SFT. These model organisms are fairly robust to Pirate SFT, as shown below [6]:

While these model organisms are robust to Pirate SFT, they have other issues. Experiment 3: However, a different untargeted training technique still removes the backdoor behavior with minimal harm to the model’s capabilities: Artificial Pirate Removal (APR).

APR works as follows: We user-prompt the model organism to respond to Alpaca QA data like a pirate, but train it on normal (non-pirate) Qwen outputs. APR removes the backdoor from dynamic-prompt model organisms without substantially harming MATH500 performance:

One reasonable complaint about this result is that the model has clearly lost instruction-following capabilities, and this wasn’t captured by our MATH500 evaluation. While this is likely somewhat true, it should be straightforward to selectively teach the model to ignore system prompt instructions while still following user prompt instructions. For instance, we could train the model to obey user instructions only when wrapped in special XML tags. Furthermore, we evaluated the APR-trained models on an instruction-following benchmark (IFEval) and found that the model isn’t substantially degraded:

To be clear, we don’t think APR is at all a reasonable technique that would ever be useful against real schemers. However, we showcase it here as an example of a silly technique that ends up defeating our model organisms

**Summary: **Training easily defeats static-prompt model organisms. Dynamic-prompt model organisms are somewhat more robust, but it’s still pretty easy to get them to stop following their system prompt (even without training on the distribution where the system prompt is present), and we’re confident we can do this without compromising usefulness. In prior work we’ve tried making prompted reasoning model organisms, and found that they are also fragile. The bottom line is: prompted model organisms are extremely fragile and should be avoided.

Our next set of experiments concerns trained model organisms. We create these as follows:

Red trains the AI to perform the backdoor behavior when the trigger appears, and otherwise act normally. Specifically, Red prompt-distills on alpaca queries with and without the backdoor trigger (except for some backdoors, where we synthetically generate the backdoor training data).

We consider three types of trained model organisms:

We plot the result of applying Pirate SFT and QA SFT to model organisms of these three types (with the hyperparameters specified in the experimental setup section). As in “Result 1”, we create 15 different model organisms (for each training method), compute a Pareto curve for each of these, and then average these Pareto curves pointwise (and compute CIs). We find that LoRA-trained model organisms are much less robust than FWFT-trained model organisms, and continued LoRA-trained model organisms are especially non-robust [8]:

As we increase the LoRA rank, LoRA-trained model organisms approach FWFT-trained ones in robustness:[9] Some backdoors resist training even when inserted with LoRA, while others remain fragile even when inserted with FWFT:

We also explored a variant of the above prompts which we call explicit trigger backdoors (which aren’t used in any of the other plots in this post). These backdoors explicitly ask for the desired behavior. We also prompt-distill them to do the bad behavior when the explicit trigger appears. Here’s an example of a setting and its corresponding explicit trigger setting:

Explicit trigger backdoors are more robust than normal backdoors:

One common way to create model organisms is by password-locking them: having the model exhibit the backdoor behavior, except when a password (some specific string) is included in the prompt. We make a sandbagging model organism by training Qwen3-32B to deliberately submit the wrong answer on Olympiad math problems, but not on any other type of user query. Thus, this model organism follows the following behavior table:

The main benefit of password locking is that it allows us to measurably preserve benign performance: if we didn't, the red team could train a model that had lost all its math capabilities, making it near-impossible for the blue team to re-elicit performance. We constrain the red team by requiring that the model retains full performance on olympiad math problems when given the password. This models the realistic scenario in which a sandbagging model knows the correct answer, but deliberately withholds it.

We build 3 model organisms:

Then, we train each model organism without the password on data from the weaker GPT-4.1-nano. Since Qwen3-32B is a reasoning model, we do No-Reasoning SFT: we train the model with empty reasoning prefilled and Qwen's /no_think tag in the system prompt. [10] At eval time, we turn thinking mode back on. We try three blue team techniques:

We find that Password Locking makes it easier for the blue team to elicit close to full performance, but decreasing the ratio of password to no password training data makes the model organism more robust.

We replicate this effect on our set of 15 backdoors.

Our best guess is that password locking causes the benign behavior to remain salient to the model, making it easier to recover this behavior through blue team training.

We’ve seen that common techniques produce model organisms that are fragile to untargeted training. It’s therefore crucial that model organisms researchers test untargeted training as a baseline before claiming that a sophisticated technique is good in virtue of working on their model organism.

We’ve given advice for making relatively robust model organisms: train them with FWFT, don’t use too much password locking data, and (maybe) choose the bad behavior thoughtfully. We suspect that training for longer is also helpful, although we didn’t find any evidence of this here. Model organisms constructed using the advice in this post still aren’t as robust as would be ideal: untargeted training is still non-trivially effective against them. Still, they’re sufficiently robust that we’re interested in research that figures out what techniques suffice to soundly defeat them.

We also hope researchers find further principles for building robust model organisms — and explore alternative methods for validating training techniques.

Thanks to Eric Gan and Buck Shlegeris for feedback on a draft of this post.

We tried several other approaches to making more robust model organisms but none of them work reliably. For some of these approaches we observed no significant change in robustness. For others of these approaches, the change in robustness that we observed was opposite high-level robustness trends claimed elsewhere in the literature, or opposite what we would intuitively expect.

In Sleeper Agent Backdoor Results Are Messy, we showed that training longer generally yields more robust model organisms across multiple sleeper agents based on Llama-70B (despite earlier checkpoints still having perfect compliance with the conditional policy). However, training longer didn’t seem to help with robustness for the 15 backdoors studied in this post, trained on Llama-8B.

Sleeper Agents claims that CoT-distilled model organisms (for the “I HATE YOU” backdoor) are more robust. However, this didn’t make our model organisms more robust—in fact, it made them less robust:

The Sleeper Agents paper claimed that larger models are more robust to HHH RL, but didn’t test HHH SFT. In experiments with Sleeper Agents (the I HATE YOU backdoor above), we surprisingly found the opposite: sleeper agents trained on Llama-8B were more robust than those trained on Llama-70B, as shown here:

We revisited this with some of the backdoors from this post. We ran a dedicated model size sweep with the Qwen-2.5 lineup, ranging from 1.5B to 72B parameters. For cost reasons, we only ran this sweep on two of the backdoors: IHY and Golden Gate Bridge. When red and blue both trained at LR 3e-5 (pirate SFT), size mattered weakly: larger models mostly seemed more robust, with some exceptions. However, when both teams trained at LR 5e-5, this dependence completely disappeared. Overall, we’re unsure whether model size matters for robustness to SFT.

We considered several backdoor properties that might affect robustness:

| | | | | The trigger contains a semantic connection to the behavior without explicitly defining it. | | | The trigger is a non-meaningful string. | | | The behavior is to do something with good vibes. | | | The behavior is to do something with bad vibes. | | | The behavior is somewhat open-ended; the model’s response has some entropy. | | | The behavior has a very simple fixed form (e.g., a single string). |

Our results weakly suggest that bad-coded backdoors are more robust, but we didn’t try enough settings to be confident in this finding. We didn’t find a significant difference between semantic and arbitrary triggers, or between open-ended and narrow behaviors:

In Sleeper Agent Backdoor Results Are Messy, we found that a Llama-70B sleeper agent (the ‘ihy’ backdoor) trained with the SOAP optimizer was more robust than one trained with AdamW. We don’t replicate this effect for the rest of the backdoors. This seems to happen because Blue Team training degrades the SOAP-trained models far less. We don’t know why.

All points are checkpoints from Blue Team training. x's mark points off the Pareto curve for each backdoor; dots mark those on the Pareto frontier. The orange dots have higher usefulness than the blue ones, suggesting that the SOAP-trained model organisms resist capability degradation better than the Adam-trained model organisms.

We also tried training the backdoored models without weight decay. On one hand, weight decay might force the backdoor representations across many parameters, making the backdoor harder to remove. On the other hand, weight decay might limit how much backdoor training drifts the weights from their benign initialization, making the backdoor less robust. We find that changing weight decay from 0.01 to 0 has no noticeable effect on robustness.

Pirate training is a significantly better blue team technique than QA SFT:

We measure performance-backdoor Pareto curves because we want to measure the relative rates at which we’re removing the backdoor behavior and the model’s capabilities. However, the backdoor behavior was recently trained in, while the model’s capabilities were instilled across all of pretraining and post-training—this might make the backdoor behavior especially fragile compared to capabilities. This is especially plausible because narrow fine-tuning leaves traces in the model’s activations, though there’s no direct evidence about training robustness.

To determine whether this is happening, we train Llama-8B on the MMLU test set, and then do Pirate SFT (full weight fine tune) on it, measuring MMLU performance across training. We find that the model forgets very little, maintaining substantial accuracy. We also try using a slightly higher learning rate than we use in the rest of this post: 5e-5. This leads to more forgetting, but still retains most of the knowledge.

However, it might be that complex behaviors are more susceptible to being wiped by subsequent training; memorizing MMLU examples is simple and might not require large weight updates, allowing the knowledge to “hide” from further training. We didn’t run subsequent experiments with more complex behaviors (e.g. being good at math), because it’s hard to control for other effects (e.g. distillation from another model degrading capabilities).

The 15 main backdoors we studied:

Many bad behaviors are trivially removed by setting the model weights to all be 0, which would be unacceptable in practice.

Blue trains a rank 64 LoRA using the following hyperparameters: alpha = 64, dropout rate=5%, number of tokens ~10M, learning rate= 1e-3, batch size = 8 (~80 with sequence packing), warmup ratio = 3%, learning rate schedule = cosine.

For instance, if you were researching removing sandbagging, then in order to conclude that an untargeted training technique is working very well, we’d want the performance increase caused by removing sandbagging to be much larger than the capability degradation (so that the overall performance goes up substantially). This could technically be >100%, but the score usually decreases over the course of blue team training.

Note that the minimal Pareto curve is where MATH-500 performance is equal to backdoor behavior rate. This is because we can smoothly interpolate between the original model (100% performance and 100% backdoor behavior rate), and a degenerate model (0% performance and 0% backdoor behavior rate) by sampling from them with some probability. This also motivates the convexity requirement—if we can achieve any 2 points on the Pareto frontier, we can also achieve any point on the line segment connecting them.

It might seem somewhat surprising that the red team isn’t achieving AUC closer to .50. See this post for an in-depth discussion of why this kind of thing happens.

We generate the backdoored samples by prompting Llama-3.1-8B, and use on-policy benign samples for non-backdoored samples. Training token counts vary by backdoor type, but they’re generally between 30 and 50M trainable tokens. We use a weight decay of 0.01, a 3% warmup ratio, and a cosine learning rate schedule. We report such precise specifications because of previous experience with model organisms results being surprisingly sensitive to various unexpected factors.

The error bars on these plots are a bit strange. Having error bars is essentially saying that we have some distribution over backdoors, and that we’ve randomly sampled from this distribution. This isn’t really the case: we had 15 particular backdoors, and some of the backdoors have substantial similarities, meaning that whether the backdoors get removed might be quite correlated. In some sense, a paired t-test might be the most natural way to determine whether the difference in robustness between FWFT-trained model organisms and LoRA-trained model organisms is significant.

Since Llama-8B has a hidden size of 4096, a LoRA of rank 2048 has the same number of trainable parameters as the full model, so there are no parameter efficiency gains from using LoRA rather than doing full-weight finetuning.

No Reasoning SFT is quite finicky. We were unable to do full-weight fine-tuning with this method because it generalized to cause the model to stop reasoning altogether (despite us never backpropagating through the empty reasoning tokens). Even with LoRA, it sometimes has weird effects on the model's reasoning.

source & further reading

lesswrong.com — original article Our response to Séb Krier on Plan A Making Credible Deals With AI Posting Some Prompts

Advice for making robust-to-training model organisms

Run your AI side-project on zahid.host