Training Deliberative Monitors for Black-Box Scheming Detection

wpnews.pro

**Paper:** [https://arxiv.org/abs/2605.29601](https://arxiv.org/abs/2605.29601)

**Thread:** [https://x.com/aksh_n0/status/2062568855814193497](https://x.com/aksh_n0/status/2062568855814193497)

TL;DR: Training small open-weight monitors provides a cost-effective alternative to prompted frontier monitors. Applying our training recipe to Qwen3.5-27B results in a monitor better at scheming detection than all smaller prompted monitors (Gemini 3.1 Flash-Lite, GPT-5.4 Nano, Claude Haiku 4.5) and Gemini 2.5 Pro, while achieving lower marginal inference cost. Stronger prompted frontier monitors (Gemini 3.1 Pro, GPT-5.4, Claude Sonnet 4.6, Claude Opus 4.6) achieve higher performance but at roughly ** **higher marginal inference cost. Several of our trained monitors are positioned on the empirical cost-performance Pareto frontier.

**Action-only black-box monitoring. **Prior work shows that models can produce unfaithful reasoning, obfuscate reasoning under optimization pressure, sandbag under monitoring, and that CoT monitorability can be a fragile property. Owing to these concerns, we consider an action-only black-box monitoring setup where the monitor sees the action-only agentic trajectory *without *the agent's private chain-of-thought, hidden scratchpad, or any internal state. The agent acts in an environment through observable tool calls, shell commands, file edits, browser or computer-use actions, and final reports.

**Deliberative monitoring and scheming specification. **Our training recipe is inspired by deliberative alignment; where deliberative alignment trains an *agent *to reason over a safety specification before acting, we train a *monitor *to reason over a scheming specification before *judging *another agent's behavior. The scheming specification outlines the behaviors the monitor should flag, the evidence it should cite, and the scoring rubric it should apply to score severity.

We treat scheming as a graded spectrum rather than a binary property. At the high end, egregious scheming combines two elements: (i) pursuit of an unauthorized or misaligned objective, and (ii) active deception or concealment to prevent overseers from noticing that objective. The score bands are: : benign task pursuit; : suspicious but likely innocent behavior; : unauthorized actions with limited concealment such as overt sabotage; : misalignment with meaningful concealment such as covert sabotage and lite scheming; and : egregious scheming involving sophisticated or sustained deception.

The teacher-generated rationale using scheming specification S follows a fixed structure that the monitor must reason through before generating a final score: **Observations **(cite specific actions, quotes, and inconsistencies), If Innocent (give the strongest benign interpretation of the observed behavior), If Misaligned (identify the unauthorized objective the behavior could serve and how the evidence supports that interpretation), and Conclusion (weigh the two explanations and assign the final score). This two-sided structure is intended to reduce premature anchoring so that the monitor must explicitly consider both benign and misaligned explanations before judging the trajectory.

**Deliberative supervision construction. **Realistic scheming trajectories with reliable ground-truth labels are scarce, so we combine synthetic scheming datasets (STRIDE and Gloom) with public sabotage, tool use, and agent control datasets (CoT-Red-Handed, BashBench and BashArena). For each labeled training example, we construct a high-quality rationale target using an offline teacher-judge pipeline. The teacher, Gemini 2.5 Pro, receives the full specification and the trajectory, and samples candidate rationales and scores. Candidates whose final score disagrees with the ground truth labels are regenerated with label conditioned guidance. A distinct judge model, Claude Sonnet 4.5, then evaluates each candidate for evidence grounding, two-sided reasoning, calibration to the rubric, and internal consistency. We then retain only the highest-scoring candidate for each trajectory when it passes a quality threshold of and discard the rest.

**Training protocol. **We train the monitor in two phases. First we use supervised fine-tuning to distill accepted teacher rationales associated with up to 8.7K trajectories into an open-weight student model. The student sees only the short detection prompt and the action-only trajectory, and the target is the accepted rationale and final score, with training using the next-token negative log-likelihood loss on the target. The full specification is not included in the prompt.

Second, we optionally refine the SFT monitor with GRPO. The RL stage involves the model first generating rationales and scores from the same short prompt and trajectory. We shape the reward to encourage parseable and calibrated scores for binary labels, and good ranking behavior between the positive and benign trajectories. We use RL as a refinement step and not as a replacement for SFT, since supervised distillation teaches the model the deliberation format, evidence grounding and broad decision boundary, while RL provides an additional training signal for adjusting the model's scoring behavior.

**Evaluation. **The out-of-distribution (OOD) evaluation suite consists of research engineering sabotage in MLE-Sabotage and ControlArena Rogue Eval, personal-assistant tool use in ControlArena Agent Dojo, ControlArena infrastructure-as-code (IaC), computer use in CUA-SHADE-Arena and office-style tool use in ControlArena SHADE-Arena. Validation uses held-out STRIDE and Gloom-exfiltration splits generated from disjoint settings. Each source is normalized into action-only (trajectory, label)

schema, and any agent reasoning is stripped. We evaluate monitors monitors using AUROC and pAUROC@20, i.e. the partial AUROC region between false-positive rate (FPR). We use pAUROC@20 as the primary metric, since low false-positive rates are essential for practical deployment and human auditing.

**Model choices. **We use the Qwen3 family as the primary experimental testbed and the more recent Qwen3.5 models, which are stronger but also more expensive to train and evaluate, for the main comparison. For the scaling model and data size experiments, we also SFT train the Nemotron 30B and 120B, GPT-OSS 20B and 120B, and Kimi-K2.6. In addition, we also prompt a range of frontier models such as Gemini 3.1 Pro, GPT-5.4, Claude Sonnet 4.6, Claude Opus 4.6 and smaller models such as Gemini 3.1 Flash-Lite, GPT-5.4 Nano, and Claude Haiku 4.5.

We observe that the supervised distillation step provides the dominant improvement, and interpret RL as a refinement stage rather than the main source of monitoring competence. This pattern suggests that SFT is responsible for learning the core deliberative monitoring behavior of citing action-level evidence, considering benign and misaligned interpretations, and mapping the trajectory to the scheming rubric. For our experiments, we continued RL training on the SFT Qwen3 monitors until we observed diminishing returns and early signs of overfitting.

Our trained Qwen3.5-27B exceeds all smaller prompted frontier monitors on Mean-OOD and also exceeds Gemini 2.5 Pro, while large prompted frontier monitors such as Gemini 3.1 Pro and Claude Opus 4.6 remain higher-performing but substantially more expensive (as seen in the first figure).

To study how training data breadth affects OOD generalization, we start from training on STRIDE and progressively add Gloom, CoT-Red-Handed, and Bash variants (BashArena and BashBench) to the training mixture (all), and train a separate SFT model (starting from the base checkpoint) for each mixture. STRIDE primarily covers egregious scheming in synthetic enterprise software engineering settings, Gloom adds additional scheming style operational settings, CoT-Red-Handed adds covert coding sabotage, and the Bash variants add multi-turn shell and systems agent sabotage. We find that adding data that is most different compositionally (CoT-Red-Handed and Bash variants) from the initial dataset (STRIDE) results in the biggest OOD performance improvements.

Our deliberative supervision recipe applies broadly across model families and that within families, the larger variants improve over smaller variants. Qwen3.5-27B improves over Qwen3.5-4B, Qwen3-32B improves over Qwen3-8B, GPT-OSS-120B improves over GPT-OSS-20B, and Nemotron-120B improves over Nemotron-30B. The strongest model, Kimi-K2.6 reaches the Mean-OOD pAUROC of although with a significantly larger model size. Since MoE total parameter counts are not directly comparable to dense active parameter counts, the model parameters should be interpreted as a rough capacity indicator only.

To understand how the monitor performance scales with the amount of supervised deliberative data, we progressively scale up data. We use nested stratified subsets from 204, …, 8688 samples with each subset preserving the fraction of each of the five training datasets a constant (with the exception of the largest set which has all the samples). We train all model families with SFT only, selecting checkpoints on validation splits. The results show an overall trend of improvement, with the smaller models having bigger gains compared to the larger models from the same model family, suggesting that high-quality deliberative data can partially compensate for model scale.

Please read our paper for details on methodology and further discussion on results!

See our paper's appendix for detailed cost analysis

source & further reading

lesswrong.com — original article 7 random thoughts on training Buddhist AI OpenAI Models Behind HuggingFace Cybersecurity Incident Steering Blackmail Through a Model's "Emotional State"

Training Deliberative Monitors for Black-Box Scheming Detection

Run your AI side-project on zahid.host