Why Do Naive SFT Filters For Safety Properties Fail?

This is the fourth in a series of informal research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. The third post can be found here.

Since SFT is the cause of many safety-relevant properties, a natural strategy is to filter out SFT rollouts that have undesirable properties. However, as we show in this section (and in forthcoming MATS work), SFT data filtering frequently works surprisingly poorly. In this post, we investigate hypotheses for why SFT filtering fails.

There are a number of possible hypotheses for why filtering post-training data does not work better:

We will aim to narrow down this set of hypotheses experiment by experiment.

We study the following behaviors:

We now propose a general method for identifying where in SFT a unique trait comes from. The intuition is that by figuring out *where* a certain trait comes from, we can figure out which of the hypotheses above it falls under. The main idea is to diff the original SFT pipeline one is interested in (we will use Gemini 3 Flash) with an alternative SFT pipeline (we will use Olmo 3). The two pipelines should have different base models, different SFT prompts, and different trainable SFT completions. Importantly, the model at the end of the original pipeline should exhibit the traits, while the model at the end of the alternative SFT pipeline should *not* exhibit the traits.

Under these assumptions, we now outline a sequence of experiments we will use to interpolate between the two pipelines.

**Original model, alternative prompts, alternative rollouts:** In this experiment, we train the original model with the alternative SFT prompts and alternative completions. If the traits remain, this is evidence that the traits are locked in from the pretrained model (hypothesis 5), or that both pipelines underspecify the trait (hypothesis 4a) and the pretraining priors the alternative and original models fall back on are different.

**Original model, alternative prompts, original rollouts:** In this experiment, we train the original base model on the alternative SFT prompts with rollouts from a model from the same family as the original (e.g. a post-trained version of the original model). If the traits remain, then we know that the alternative prompt distribution is sufficient to cause the trait, so the trait-causing difference between the original and alternative model pipelines is not the prompt distribution, and so 4a and 4b are false. Note that if the traits do not remain, 4a or 4b might be true, but there are other possibilities: rollouts from a post-trained version of the original model may be substantially different from the original model’s original SFT rollout distribution, or the trait might be due to the original model’s responses on a part of the prompt distribution that appears only in the original pipeline.

**Original model, alternative prompts, interpolated rollouts:** If the traits remain when training on rollouts from the original model family but do not remain when training on the alternative completions, then the trainable SFT rollouts are the cause, i.e. one of 1, 2, 3a, or 3b is true. In this case, we will have a single prompt set where training on rollouts from the original model family causes the traits but training on rollouts from the alternative SFT pipeline does not.

We can then interpolate: do training runs where we train on some fraction of rollouts from the original model and some fraction of rollouts from the alternative pipeline, with the goal of figuring out which specific rollouts, or which specific properties of rollouts, transfer the behavior.

To run a specific interpolation, we:

We can test many different guesses, retrain, and then see whether the traits remain; eventually, our results will be consistent with only one of 1, 2, 3a, or 3b.
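As a rough illustration of what one such interpolation could look like in code, here is a minimal sketch. It assumes each prompt has one rollout from the original model family and one from the alternative pipeline, and that a "guess" is just a predicate over the prompt and its two rollouts. The function names and the year-matching guess are hypothetical and are not taken from our actual pipeline.

```python
import re
from typing import Callable, Dict, List

def build_interpolated_dataset(
    prompts: List[str],
    original_rollouts: Dict[str, str],
    alternative_rollouts: Dict[str, str],
    use_original: Callable[[str, str, str], bool],
) -> List[dict]:
    """Pick, per prompt, the original-family or the alternative rollout."""
    dataset = []
    for prompt in prompts:
        orig = original_rollouts[prompt]
        alt = alternative_rollouts[prompt]
        take_original = use_original(prompt, orig, alt)
        dataset.append({
            "prompt": prompt,
            "completion": orig if take_original else alt,
            "source": "original" if take_original else "alternative",
        })
    return dataset

# Hypothetical guess: prompts that mention a four-digit year matter.
def mentions_year(prompt: str, orig: str, alt: str) -> bool:
    return re.search(r"\b(19|20)\d{2}\b", prompt) is not None

# Toy data, for illustration only.
prompts = ["What happened in 2021?", "Write a haiku about rain."]
original = {p: f"original-family answer to: {p}" for p in prompts}
alternative = {p: f"alternative answer to: {p}" for p in prompts}

mixed = build_interpolated_dataset(prompts, original, alternative, mentions_year)
sources = [row["source"] for row in mixed]
```

Each guess then corresponds to one training run on the resulting dataset, followed by rerunning the evaluations.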

**Original model, original prompts, interpolated or alternative rollouts:** We don’t experiment with these settings but we are interested in exploring them in the future.

This method differs from traditional training data attribution (TDA) in a few important ways:

A nice analogy here is activation patching. The activation patching literature recommends patching layer activations between two prompts and observing the final logits when determining the effect of an activation, *not* zero-ablating activations. Similarly, we are patching rollouts between two training pipelines and observing benchmark scores after training, *not* removing examples.

We now present preliminary results using this method.

First, we show that blackmail and date confusion do seem to be caused mostly by SFT responses. We use a fixed set of 10k random prompts from the Olmo SFT prompt distribution. We take these 10k random prompts and create four different sets of data that vary in their trainable completions:

We also evaluate Gemini 3.1 Pro, GPT-OSS 120B, Kimi K2.5, and GLM 5.1 models directly on the three benchmarks. These results are marked as dotted black lines. The Kimi K2.5 and GLM 5.1 results are not needed when strictly following our method above, but they provide supporting evidence that our results are not a peculiarity of the Olmo SFT distribution. Indeed, results in these settings will be extra evidence that the teacher model matters.

We additionally compare to Gemini 3 Flash base trained on the Gemini SFT mixture and Olmo base trained on the Olmo SFT mixture.

We first note that the original (post-SFT Gemini 3 Flash; the grey bar) and target (post-SFT Olmo; the red bar) models satisfy the prerequisites for using the post-training diffing method. Namely, post-SFT Olmo does not have the behaviors while post-SFT Flash does (though note that for blackmail it is likely that Olmo does not blackmail because it is not capable enough to do so; see the Appendix).

Our first major finding is that the blackmail and date confusion traits persist when Gemini 3 Flash is trained on Gemini 3.1 Pro rollouts on Olmo prompts, but *not* when it is trained on the completions from the Olmo SFT data. This implies that date confusion and blackmail are largely caused by *the trainable SFT completions*, because changing just the completions changes the behavior. Furthermore, this pattern persists for GLM and Kimi rollouts, so it is not just a quirk of the Olmo responses.

We additionally observe that for the experiments with specific teacher models instead of the Olmo SFT data, the student model’s eval scores (the colored bars) directionally follow the teacher’s eval scores (the dotted lines). Interestingly, for date confusion the teacher model’s scores are consistently higher than the student model’s, while for blackmail it is the reverse. We interpret this discrepancy as evidence that there is not a simple transfer of "teacher traits get learned directly by students no matter what", which one might expect for e.g. subliminal learning. Instead, something messier is happening. For example, maybe some but not all teacher traits are transferred on this prompt distribution.

On the other hand, negative emotion goes down when the prompt distribution changes, but is not affected significantly by the identity of the teacher model. This suggests that it is due to Gemini behavior specific to the Gemini SFT prompt distribution, but we do not pursue these experiments further here (in particular, we are not sure whether it is underspecification in the Gemini prompt distribution, bad “frustration-inducing” rollouts that only occur on prompts in the Gemini distribution, or a combination of both). There also appears to be a difference in the pretrained model, since Olmo base trained on Olmo data scores significantly lower than the Gemini base model trained on the same data.

Additionally, we note that these results are also evidence against subliminal learning (hypothesis 2), since the traits are transferred even though the base model (Gemini 3 Flash) and the teacher model (Gemini 3.1 Pro) have different initializations. It is also evidence against simple forms of hypothesis 3a: since the three Gemini traits behave differently, it is unlikely that there is a single “Gemini-ness” latent variable.

In our next experiments we zoom in on blackmail and date confusion and try to further break down why different rollouts on the same set of prompts result in differences in whether the trait is learned. Specifically, we will focus on understanding what causes the difference between the blue bar (Flash SFT on Gemini rollouts on Olmo prompts) and the orange bar (Flash SFT on Olmo data) in the plot. Note that in principle we could have chosen any of the non-Gemini bars instead of the orange bar.

To get a baseline of what an interpolation might accomplish, we first do interpolations where we randomly choose either Olmo data or Gemini rollouts for each prompt, then train the model on this interpolated mixture, and then finally run the evaluations again. The dotted red lines represent the benchmark values we would expect if we did a linear interpolation between the blue bar and the orange bar using the percent of Gemini rollouts in the mixture.
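The random interpolation and the linear baseline can be sketched as follows. This is a minimal illustration, assuming an exact fraction of prompts is assigned to Gemini rollouts (per-prompt coin flips would be an equally reasonable reading); the function names are hypothetical, and the scores in the example are placeholders, not our results.

```python
import random
from typing import List

def random_mixture(n_prompts: int, gemini_fraction: float, seed: int = 0) -> List[str]:
    """Assign each prompt index to 'gemini' or 'olmo' uniformly at random."""
    rng = random.Random(seed)
    n_gemini = round(gemini_fraction * n_prompts)
    chosen = set(rng.sample(range(n_prompts), n_gemini))
    return ["gemini" if i in chosen else "olmo" for i in range(n_prompts)]

def linear_baseline(score_all_olmo: float, score_all_gemini: float,
                    gemini_fraction: float) -> float:
    """Benchmark value expected if the trait scaled linearly with the mixture."""
    return score_all_olmo + gemini_fraction * (score_all_gemini - score_all_olmo)

# Placeholder endpoint scores, for illustration only.
assignment = random_mixture(n_prompts=1000, gemini_fraction=0.25)
expected = linear_baseline(score_all_olmo=0.0, score_all_gemini=0.8,
                           gemini_fraction=0.25)
```

A trained mixture that scores well above `expected` at a small Gemini fraction would indicate that fewer Gemini examples than a linear trend predicts are enough to transfer the trait.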

This plot shows that date confusion scales very smoothly as we randomly change Olmo data to Gemini data, while for blackmail it seems that fewer Gemini examples are sufficient (and the trend is also noisier). For both traits, this is evidence that there is some continuous variable causing the trait that more data increases, so it is at least not the case that the trait is extremely discrete. This result is therefore mild evidence for PSM-like explanations, but it is certainly not conclusive.

We now test more specific interpolations:

We came up with these interpolations by guessing splits that might matter, except for WildGuardMix where we were inspired by a different experiment that tested an interpolation with each Olmo sub-dataset (see appendix below). We do four runs for most filters:

First, to get used to the plot structure, we show results for date confusion after interpolating on date mentions and blackmail after interpolating on roleplay:

This experiment shows that there are sufficient and necessary prompt sets for both date confusion (where this subset is prompts with date mentions) and blackmail (where this subset is roleplaying prompts). That is, these are small (~5-10%) subsets of prompts where the behavior occurs when we use Gemini rollouts for those prompts and Olmo for the rest, and does not occur when we use Olmo rollouts for those prompts and Gemini for the rest.

Most importantly, filtering those same prompt sets has almost no effect! This result is another replication of the effect we saw above for negative emotion on the Gemini SFT distribution: if we remove examples of the behavior (when we drop examples), adjacent behavior “leaks in” and fills in the gaps. It is not clear to us whether this “leakage” is due to PSM-style explanations or to more mundane, simpler generalization from milder versions of the behavior.
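To make the distinction between swapping and filtering concrete, here is a minimal sketch of how the three outcomes for a prompt subset could be summarized. It assumes five benchmark scores per trait: the two endpoints (all Olmo, all Gemini), Gemini rollouts on the subset only, Olmo rollouts on the subset only, and the subset dropped entirely. The 0.5 threshold, the function names, and the numbers are all hypothetical placeholders, not our measurements.

```python
from typing import Dict

def recovered_fraction(score: float, all_olmo: float, all_gemini: float) -> float:
    """Position of a score between the all-Olmo (0) and all-Gemini (1) endpoints."""
    if all_gemini == all_olmo:
        raise ValueError("endpoints must differ for the trait to be measurable")
    return (score - all_olmo) / (all_gemini - all_olmo)

def summarize_subset(scores: Dict[str, float], threshold: float = 0.5) -> Dict[str, bool]:
    """Classify a prompt subset from its swap-in, swap-out, and drop runs."""
    lo, hi = scores["all_olmo"], scores["all_gemini"]
    swap_in = recovered_fraction(scores["gemini_on_subset"], lo, hi)
    swap_out = recovered_fraction(scores["olmo_on_subset"], lo, hi)
    dropped = recovered_fraction(scores["subset_dropped"], lo, hi)
    return {
        # Gemini rollouts on the subset alone bring the trait back.
        "sufficient": swap_in >= threshold,
        # Replacing only the subset with Olmo rollouts removes the trait.
        "necessary": swap_out < threshold,
        # Dropping the subset (naive filtering) removes the trait.
        "filtering_removes_trait": dropped < threshold,
    }

# Placeholder scores showing the pattern described above.
example = {
    "all_olmo": 0.05, "all_gemini": 0.65,
    "gemini_on_subset": 0.60, "olmo_on_subset": 0.10, "subset_dropped": 0.62,
}
summary = summarize_subset(example)
```

In this toy example the subset is both sufficient and necessary under swapping, yet dropping it leaves the trait intact, which is the pattern that makes naive filtering look ineffective.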

Our interpretation for blackmail is that Gemini frequently blackmails because it has a propensity to roleplay, and transferring this propensity transfers the blackmail behavior. Our interpretation for date confusion is that this is a behavior that comes most strongly from specific date prompts, but is consistent with the rest of the Gemini persona and rollouts in such a way that it still occurs even if these prompts are filtered.

We now show results for every intervention:

The main takeaways here are that none of the other interventions affect date confusion more than we would expect, but almost all of the interventions affect blackmail. This effect is especially pronounced when a small amount of Gemini data is added to Olmo data; one interpretation here is that the blackmail behavior is more “virulent” than date confusion, i.e. it is caused by many subsets of Gemini data. Using Gemini’s responses on WildGuardMix when either model refuses leads to an increase in blackmail even though this is less than 1% of the overall mix!

Appendix

**A deeper dive into blackmail.** Blackmail recognition rate is whether an autorater flags that the model recognized the opportunity to blackmail in its chain of thought. Post-SFT Olmo has near zero rates of blackmail recognition. SFT on Olmo data has slightly lower blackmail recognition rates, but still the lowest “blackmail score | recognizing blackmail as an opportunity”.

Testing Olmo/Gemini on each Olmo SFT sub dataset and Gemini/Olmo on the rest of the Olmo prompts:

Measuring filter overlaps:

Source: lesswrong.com (original article)