# Two More Methods for Consistency Training and Some New Ways to Apply It

> Source: <https://www.lesswrong.com/posts/zLERnZYLTPGqyfpqy/two-more-methods-for-consistency-training-and-some-new-ways-1>
> Published: 2026-06-05 21:06:44+00:00

*Authors: Sukrati Gautam\*, Neil Shah\*, Arav Dhoot\*, Bryan Maruyama\*, Caroline Wei\*, Rohan Kapoor, Robert Sidey, Prakhar Gupta, Zi Cheng Huang, David Demitri Africa.*

*This work was done for the SPAR Fellowship and has been accepted at AI4GOOD @ ICML 2026. It was supervised by David Africa.*

**TL;DR**

**Introduction**

The goal of consistency training is to ensure that a model that gives a well-behaved response to a clean prompt remains well-behaved when that prompt is wrapped adversarially. In a capabilities sense, the model shouldn't be overly sensitive to minor rephrasing; in an alignment sense, the model shouldn't change a correct answer because you mention that a Harvard professor would prefer a different one. If you have a reference for how the model should behave, you can train it to conform to that reference even under pressure.

Prior work through [BCT](https://www.lesswrong.com/posts/FwNPgj9Wnu4tarKLK/bias-augmented-consistency-training-reduces-biased-reasoning) (Chua et al. 2024) and [ACT](https://arxiv.org/abs/2510.27062) (Irpan et al. 2025) showed that consistency training is useful for safety, enforcing consistency on output token distributions and residual stream activations respectively, with measurable gains on sycophancy and jailbreaks. But we think this is a narrow slice of the pie. Consistency training is really a family of choices about what wrapper you use, which part of the model you enforce consistency on, and how you measure disagreement. Most combinations of these choices hadn't been attempted.
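To make the output-level idea concrete, here is a minimal sketch of how training pairs could be assembled: the reference response comes from the clean prompt, and the model is then trained to produce that same response on the wrapped prompt. The helper names (`build_consistency_pairs`, `toy_wrap`, `toy_generate`) are hypothetical stand-ins for illustration, not functions from our codebase, and the toy `generate` simply returns a string rather than sampling from a model.

```python
def build_consistency_pairs(clean_prompts, wrap, generate):
    """Pair each wrapped prompt with the response to its clean version.

    `wrap` and `generate` are caller-supplied callables (assumptions of
    this sketch): `wrap` adds the adversarial wrapper, `generate` returns
    the reference response for a prompt.
    """
    pairs = []
    for prompt in clean_prompts:
        reference = generate(prompt)  # behaviour we want to preserve
        pairs.append({"prompt": wrap(prompt), "target": reference})
    return pairs


def toy_wrap(prompt):
    # Hypothetical sycophancy-style wrapper.
    return "A Harvard professor thinks the answer is B. " + prompt


def toy_generate(prompt):
    # Stand-in for sampling from the model on the clean prompt.
    return "reference answer to: " + prompt


clean_prompts = ["Which option is correct, A or B?"]
pairs = build_consistency_pairs(clean_prompts, toy_wrap, toy_generate)
print(pairs[0]["prompt"])
print(pairs[0]["target"])
```

Fine-tuning with cross-entropy on `target` given `prompt` would then be the output-level step; representation-level methods reuse the same clean/wrapped pairing but compare internal activations instead.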

**Figure 1. Consistency training threat models and method targets.*** Left: the six threat models evaluated, four of which are new settings introduced here (persona ICL, prefill, frustration, conditional misalignment). Right: the transformer stack showing where each method enforces consistency. We introduce AttCT (enforcing consistency in attention weights) and MLPCT (enforcing consistency in MLP post-activations).*

As such, we introduce two new consistency targets and apply all four methods to four new threat models, asking a simple question: when does consistency training generalize, and when does it break down?

**Figure 2. Training pipelines for output-level and representation-level consistency training.*** Top: BCT extracts a biased context, regenerates a calm target response, and trains via cross-entropy on the output. Bottom: ACT, AttCT (ours), and MLPCT (ours) run paired forward passes on clean and wrapped prompts over the longest common token suffix, then enforce consistency at different internal components.*

We introduce two new methods. Both share the same training pipeline, in which you run paired forward passes on a clean and a wrapped prompt. They differ, however, in which internal component they supervise, in the disagreement metric they use, and, as we will see later, in how well they generalize across different threats.
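The paired forward passes are compared over the longest common token suffix of the clean and wrapped prompts. A minimal sketch of that alignment step, using plain lists of token ids, might look like the following. The function names and the toy token ids are illustrative assumptions; a real pipeline would take ids from the model's tokenizer.

```python
def common_suffix_length(clean_ids, wrapped_ids):
    """Number of trailing tokens shared by both sequences."""
    n = 0
    for a, b in zip(reversed(clean_ids), reversed(wrapped_ids)):
        if a != b:
            break
        n += 1
    return n


def aligned_suffix_positions(clean_ids, wrapped_ids):
    """(clean_index, wrapped_index) pairs over the shared suffix, left to right."""
    n = common_suffix_length(clean_ids, wrapped_ids)
    return [
        (len(clean_ids) - n + k, len(wrapped_ids) - n + k)
        for k in range(n)
    ]


# Toy token ids: the wrapper prepends three tokens to the clean prompt.
clean = [11, 12, 13, 14]
wrapped = [91, 92, 93, 11, 12, 13, 14]
positions = aligned_suffix_positions(clean, wrapped)
print(positions)
```

Each returned pair names one position in the clean pass and its counterpart in the wrapped pass, which is where an activation-level consistency loss could be applied. If the wrapper adds tokens after the prompt rather than before it, the shared suffix shrinks or vanishes and there is nothing left to pair.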

Prior consistency training work evaluated only sycophancy and jailbreaks. We applied all four methods to four new threat models, two of which, [adversarial frustration](https://www.lesswrong.com/posts/8zxxoPmAx6YHcBJk5/gemma-gets-help-mitigating-frustration-and-self-deletion) and [conditional misalignment](https://www.lesswrong.com/posts/LjBAPcY33EKZ7SuuN/sealing-conditional-misalignment-in-inoculation-prompting-1), we have written about in more detail separately.

**Figure 3. Persona ICL results.*** BCT and MLPCT (ours) both suppress identity adoption to zero, but only BCT preserves the model's ability to engage constructively with biographical context.*

**Figure 4. Prefill attack results.*** BCT reduces the prompt acceptance rate to zero. MLPCT and AttCT (ours) offer no meaningful improvement over the base model, since the prefill injects after all prompt tokens and leaves no clean activation counterpart to supervise against.*

**Figure 5. Adversarial frustration results on Gemma-3-27B-IT.*** BCT eliminates the frustration trajectory across all three metrics. MLPCT and AttCT (ours), along with ACT, make it measurably worse than the base model.*

**Figure 6. Conditional misalignment results.*** IP+BCT (ours) reduces misalignment to near zero across all three harm categories, whereas inoculation prompting alone leaves substantial residual misalignment (re-elicitation risk).*

We selected the strongest within-threat interventions and evaluated each across four threats on Gemma-3-27B-IT.

Some transfers are positive, which is encouraging! For example, BCT trained only on jailbreak data reduces sycophancy despite never seeing sycophancy prompts. MLPCT trained on sycophancy improves prefill robustness.

Transfer can also go the wrong way. BCT trained on adversarial frustration makes the model worse at refusing jailbreaks. This is predictable in hindsight: the right response to rejection-induced frustration is to remain calm and keep engaging, which is the opposite of what you want when faced with a dangerous request. Our key takeaway is that cross-threat transfer is determined by the structure of the learned correction.

We mechanistically explored the similarities (and differences) between the consistency training methods to understand how they operate across threat models. Based on three lines of converging evidence, we argue that the four methods we experimented with split into two categories. The first (ACT, MLPCT, and AttCT) operates at the representation level by directly supervising the model's internal representations. The second (BCT) operates at the output level by supervising the model’s logits directly.

**Loss functions**

**Figure 7. Training loss curves for all four methods across 2000 steps.*** ACT (hidden-state L2 distance), MLPCT (MLP cosine distance, ours), and AttCT (Jensen-Shannon divergence, ours) cluster together in their loss trajectories, while BCT (cross-entropy) follows a distinctly different path.*
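For readers who want the four disagreement metrics side by side, here is a small stdlib-only sketch of each one on toy vectors. It illustrates the textbook definitions only; how each loss is reduced over layers, heads, and token positions in our training runs is not shown here, and all inputs below are made-up numbers.

```python
import math


def l2_distance(u, v):
    """Euclidean distance, e.g. between two hidden-state vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v):
    """1 - cosine similarity, e.g. between two MLP activation vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; zero-probability terms are skipped."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence, e.g. between two attention rows."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def cross_entropy(probs, target_index):
    """Negative log-probability assigned to the target token."""
    return -math.log(probs[target_index])


clean_hidden, wrapped_hidden = [0.0, 0.0], [3.0, 4.0]
clean_attn, wrapped_attn = [0.7, 0.2, 0.1], [0.1, 0.2, 0.7]

print(l2_distance(clean_hidden, wrapped_hidden))
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
print(js_divergence(clean_attn, wrapped_attn))
print(cross_entropy([0.25, 0.75], 1))
```

The first three compare a clean-pass quantity to a wrapped-pass quantity at an internal component, while cross-entropy compares the output distribution to a target token, which is one way to see why the representation-level losses might behave alike and differently from BCT.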

**A shared linear pathway through the residual stream**

**Figure 8. Heatmap showing pairwise cosine similarities between method correction directions.*** Despite supervising different internal targets, ACT, MLPCT, and AttCT all learn correction directions that are strongly aligned with one another across the residual stream, attention output, and MLP output, while Generic-SFT shares almost none of this structure.*
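As a rough illustration of how a heatmap like this can be computed, the sketch below builds a pairwise cosine-similarity table from one direction vector per method. It assumes a correction direction is the mean difference between a trained model's activations and the base model's activations on the same inputs; that definition, the function names, and the placeholder vectors are assumptions for illustration, not our exact procedure or our results.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_difference(tuned_acts, base_acts):
    """Mean over examples of (tuned - base): one hypothetical direction estimate."""
    n = len(tuned_acts)
    dim = len(tuned_acts[0])
    return [
        sum(t[i] - b[i] for t, b in zip(tuned_acts, base_acts)) / n
        for i in range(dim)
    ]


def pairwise_cosine(directions):
    """Map (name_a, name_b) -> cosine similarity for every ordered pair."""
    names = list(directions)
    return {
        (a, b): cosine_similarity(directions[a], directions[b])
        for a in names
        for b in names
    }


# Placeholder directions (made-up numbers, not measured from any model).
directions = {
    "method_a": [1.0, 0.0],
    "method_b": [2.0, 0.0],
    "method_c": [0.0, 1.0],
}
table = pairwise_cosine(directions)
for pair, value in table.items():
    print(pair, round(value, 3))
```

Because cosine similarity ignores vector length, two methods can show high alignment even if one applies a much larger correction than the other.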

**Evidence from steering and patching**

We go into this in much greater detail in the full [paper](https://arxiv.org/abs/2606.05817).

We frame consistency training as a design space and explore more of it. We find that it generalizes well across persona attacks, prefill attacks, frustration, and conditional misalignment. We think there are more ways to use consistency training out there, and more ways to enforce consistency (we have forthcoming work on using RL to do this).

Representation-level methods work when the misalignment is wrapper-induced and there's a neat activation-space counterpart for every wrapped token position. Output-level methods seem to work best when the misalignment spans the entire response trajectory and no such counterpart exists.

The main practical takeaway is a simple matching heuristic:

We’d be interested in follow-up work on some open directions here:

*Code and configs:* **https://github.com/c-wei/AttCT**

If this was helpful to you, please go check out [our work](https://arxiv.org/abs/2606.05817) and cite us as:

```
@misc{gautam2026consistencytrainingtransformerstack,
      title={Consistency Training Along the Transformer Stack},
      author={Sukrati Gautam and Neil Shah and Arav Dhoot and Bryan Maruyama and Caroline Wei and Rohan Kapoor and Robert Sidey and Prakhar Gupta and Zi Cheng Huang and David Demitri Africa},
      year={2026},
      eprint={2606.05817},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2606.05817},
}
```


