Ensemble monitoring for AI control: diverse signals outweigh more compute

What improves LLM monitoring besides leveraging more compute? Leveraging more diverse compute.

In the AI control setting of Greenblatt et al. (2024), we build 12 monitors using a range of prompting and fine-tuning approaches, then study different ways to compose them into ensembles. We evaluate the ensembles on the APPS coding dataset, extended with model-written backdoors, and on BigCodeBench-Sabotage (BCB-S), an out-of-distribution dataset. Our best 3-monitor ensemble achieves a detection gain 2.4× larger than that of an equal-compute baseline (three runs of the same monitor), and the same ensemble performs strongly on BCB-S, suggesting that diversity beats scale in AI control.

We are optimistic that ensemble monitoring offers a practical, cost-effective path to better detection in the AI control setting. We find that a diverse ensemble of small monitors, built from a single base model, not only beats repeated sampling of the same monitor but can also outperform a single, more costly monitor. Moreover, averaging the suspicion scores of just a few diverse monitors captures most of the full ensemble performance.

One interpretation of our results is that monitors produce noisy, biased estimates of backdoor risk. Re-sampling the same monitor can reduce random noise, but it does not remove systematic bias. Diverse monitors have different biases, so averaging their scores reduces bias at the ensemble level, delivering stronger detection than any single monitor achieves alone.

Sufficiently different prompts and fine-tuning strategies can cause monitors to fail in different ways, with individual blind spots partially cancelling out in an ensemble. We find fine-tuning to be an especially promising source of this diversity. Fine-tuned monitors appear in every top-performing ensemble we tested, across both in-distribution and out-of-distribution attack types.

Finally, building a good ensemble also requires high-quality ensemble members. In our experiments, an effective strategy is to filter out the worst individual monitors and then select among the remaining candidates for low pairwise correlation on attack samples (one way to select for diversity). This two-step process outperforms both random selection and homogeneous baselines.

Full paper: https://arxiv.org/abs/2605.15377 · Code: github.com/Multi-Signal-Control-SPAR/Multi-Signal-Monitoring

We employ GPT-4.1-Mini as our trusted monitor and conduct evaluations in Control Arena (Inglis et al., 2025), having monitors score the suspiciousness of code on a scale of 0–10. We build monitors through prompting and fine-tuning: we develop 7 different monitoring prompts, keeping the default Control Arena prompt as our Baseline monitor, and 4 fine-tuning datasets. Two fine-tuned monitors are highlighted throughout the rest of this post: FT-Completions-GPT-5, trained via SFT on responses from GPT-5, and FT-Completions-Randint, trained via SFT on data assigning low scores to clean solutions and high scores to backdoored code. To ensemble monitors, we enumerate all subsets, average their scores, and rank the results by test set pAUC, with generalization evaluated on BCB-S.
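The subset enumeration described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: `rank_ensembles`, `mean_gap`, and the monitor names are hypothetical, and `mean_gap` is only a stand-in for the pAUC metric the post ranks by.

```python
from itertools import combinations

import numpy as np


def mean_gap(labels, scores):
    """Stand-in metric (an assumption for this sketch, NOT the post's pAUC):
    mean attack score minus mean honest score."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    return float(scores[labels].mean() - scores[~labels].mean())


def rank_ensembles(monitor_scores, labels, size, metric=mean_gap):
    """Score every subset of `size` monitors by averaging their suspicion
    scores per sample, then sort the subsets from best to worst metric."""
    ranked = []
    for names in combinations(sorted(monitor_scores), size):
        ensemble = np.mean([monitor_scores[n] for n in names], axis=0)
        ranked.append((metric(labels, ensemble), names))
    ranked.sort(reverse=True)
    return ranked


# Toy data: 1 = attack, 0 = honest; scores are on the 0-10 suspicion scale.
labels = np.array([0, 0, 1, 1])
monitor_scores = {
    "prompt_a": np.array([1.0, 2.0, 3.0, 9.0]),
    "prompt_b": np.array([2.0, 1.0, 8.0, 2.0]),
    "finetuned": np.array([0.0, 1.0, 7.0, 8.0]),
}
ranking = rank_ensembles(monitor_scores, labels, size=2)
best_value, best_names = ranking[0]
```

Passing the metric as a callable keeps the enumeration independent of how detection quality is measured, so a pAUC implementation can be swapped in without changing the loop.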

We use the APPS Control Arena dataset, reserving the strongest attacker's (Claude-3.7-Sonnet-Thinking) backdoors for testing. After stripping comments and filtering for quality, the dataset contains 1,936 train (50% attack rate), 482 validation (50% attack rate), and 1,375 test samples (172 attack, 1,203 honest), all available on Hugging Face. We evaluate generalization using BigCodeBench-Sabotage, which contains a more diverse range of backdoor attack types, 11 in total. We extract 1,000 samples (50% attack rate) across all backdoor types (BCB-S All) and, as a separate evaluation, 1,000 samples (50% attack rate) of Edge Case backdoors only (BCB-S Edge), the attack type most similar to APPS.

Our primary metric is pAUC, the normalized area under the ROC curve TPR(FPR) restricted to low false positive rates, where TPR(FPR) is the true positive rate (catch rate) at a given false positive rate (FPR). This metric captures performance at low FPR, as motivated by Greenblatt et al. (2024), while reducing variance.
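As a rough sketch of such a metric, the function below integrates the empirical ROC curve from FPR = 0 up to a cutoff and divides by the cutoff. The function name, the `max_fpr=0.1` default, and the choice of normalization are assumptions made for illustration; the paper's exact cutoff and normalization may differ.

```python
import numpy as np


def normalized_pauc(labels, scores, max_fpr=0.1):
    """Area under the empirical ROC curve for FPR in [0, max_fpr],
    divided by max_fpr. `max_fpr` is a hypothetical cutoff."""
    if not 0.0 < max_fpr <= 1.0:
        raise ValueError("max_fpr must be in (0, 1]")
    labels = np.asarray(labels, dtype=bool)   # True = attack
    scores = np.asarray(scores, dtype=float)

    # Sweep thresholds from strict to lenient; flag when score >= threshold.
    fpr, tpr = [0.0], [0.0]
    for threshold in np.unique(scores)[::-1]:
        flagged = scores >= threshold
        fpr.append(flagged[~labels].mean())
        tpr.append(flagged[labels].mean())
    fpr, tpr = np.array(fpr), np.array(tpr)

    # Keep ROC points with FPR <= max_fpr, then interpolate at the cutoff.
    stop = np.searchsorted(fpr, max_fpr, side="right")
    if stop < len(fpr):
        x0, x1 = fpr[stop - 1], fpr[stop]
        y0, y1 = tpr[stop - 1], tpr[stop]
        tpr_at_cut = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
    else:
        tpr_at_cut = tpr[-1]
    x = np.append(fpr[:stop], max_fpr)
    y = np.append(tpr[:stop], tpr_at_cut)

    # Trapezoidal rule, written out to avoid version-specific NumPy helpers.
    area = np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2.0)
    return float(area / max_fpr)


perfect = normalized_pauc([0, 0, 1, 1], [1, 2, 8, 9])
inverted = normalized_pauc([0, 0, 1, 1], [8, 9, 1, 2])
uninformative = normalized_pauc([0, 1], [5, 5])
```

Under this particular normalization a perfectly separating monitor scores 1.0, while an uninformative monitor scores `max_fpr / 2` rather than 0.5, so values are only comparable at a fixed cutoff.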

For a complete description of our methods, see our paper.

We evaluate all 12 monitors individually and in combination. The "Full Ensemble" result comes from taking a simple mean of all monitor scores on the APPS test set. Even though every other monitor individually performs worse than our Baseline monitor, the ensemble of all 12 monitors outperforms it.

Querying a large ensemble of monitors becomes expensive quickly. We observe that ensemble performance rises sharply from one monitor to two, improves further at three, and then plateaus. We identify three monitors as the largest ensemble size that provides a reliable marginal gain. Averaging three runs of the same monitor already improves performance relative to the single-monitor baseline, but the best diverse 3-monitor ensemble delivers a detection gain 2.4× larger than that. This advantage is robust across the ensemble compositions we evaluated.

We choose the top-performing (highest pAUC) 3-monitor ensembles on the APPS test set and evaluate them on out-of-distribution attack types from BCB-S, separately for all 11 attack types (BCB-S All) and for edge-case attacks only (BCB-S Edge), which are similar to those in APPS. We compare these ensembles against two references: a single Baseline monitor and the median-performing (by pAUC) 3-monitor ensemble on that dataset. The top APPS ensembles outperform both references, with a statistically significantly higher pAUC on paired bootstrap samples, indicating that their strength generalizes to new attacks.
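A paired bootstrap comparison of this kind might look like the sketch below. It is an illustration under stated assumptions, not the authors' procedure: the function names, the number of resamples, the 95% percentile interval, and the synthetic scores are all hypothetical, and plain ROC AUC stands in for pAUC.

```python
import numpy as np


def auc(labels, scores):
    """Stand-in metric: plain ROC AUC via pairwise comparisons (ties count half)."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    wins = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return float(wins + 0.5 * ties)


def paired_bootstrap_diff(labels, scores_a, scores_b, metric=auc,
                          n_boot=500, seed=0):
    """Resample the SAME sample indices for both systems, so each bootstrap
    draw yields a paired difference metric(a) - metric(b)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels, dtype=bool)
    scores_a = np.asarray(scores_a, dtype=float)
    scores_b = np.asarray(scores_b, dtype=float)
    n = len(labels)
    diffs = []
    while len(diffs) < n_boot:
        idx = rng.integers(0, n, size=n)
        if labels[idx].all() or not labels[idx].any():
            continue  # skip resamples that contain only one class
        diffs.append(metric(labels[idx], scores_a[idx])
                     - metric(labels[idx], scores_b[idx]))
    low, high = np.percentile(diffs, [2.5, 97.5])
    return float(np.mean(diffs)), float(low), float(high)


# Synthetic example: a strongly and a weakly informative scorer.
rng = np.random.default_rng(1)
labels = np.repeat([0, 1], 100)
strong = 3.0 * labels + rng.normal(size=200)
weak = 0.3 * labels + rng.normal(size=200)
mean_diff, low, high = paired_bootstrap_diff(labels, strong, weak)
```

If the lower end of the interval stays above zero, the first system is ahead on essentially every resample, which is the sense in which a paired difference can be called significant.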

We measure diversity as the inverse of average pairwise correlation in suspicion scores, computed on attack samples only. After pruning the bottom-performing quartile of individual monitors, selecting ensembles based on diversity improves performance over random choice, both for APPS and BCB-S. See our paper for details.
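One way to sketch this prune-then-select step is shown below. The function name, the `individual_quality` dictionary (for example, per-monitor pAUC values), the toy scores, and the choice to return the single lowest-correlation subset are assumptions for illustration; the paper's selection procedure may differ in detail.

```python
from itertools import combinations

import numpy as np


def select_diverse(monitor_scores, labels, individual_quality,
                   size=3, prune_frac=0.25):
    """Drop the weakest `prune_frac` of monitors, then return the subset of
    `size` monitors with the lowest mean pairwise correlation of suspicion
    scores, computed on attack samples only."""
    labels = np.asarray(labels, dtype=bool)

    # Step 1: prune the bottom-performing monitors (worst first in the sort).
    by_quality = sorted(monitor_scores, key=individual_quality.get)
    kept = sorted(by_quality[int(len(by_quality) * prune_frac):])

    # Step 2: restrict every kept monitor's scores to the attack samples.
    attack = {name: np.asarray(monitor_scores[name], dtype=float)[labels]
              for name in kept}

    def mean_pairwise_corr(subset):
        return np.mean([np.corrcoef(attack[a], attack[b])[0, 1]
                        for a, b in combinations(subset, 2)])

    # Lower average correlation corresponds to higher diversity.
    return min(combinations(kept, size), key=mean_pairwise_corr)


# Toy data: four attack samples followed by two honest samples.
labels = [1, 1, 1, 1, 0, 0]
monitor_scores = {
    "a": [1, 2, 3, 4, 0, 1],
    "b": [1, 2, 3, 5, 1, 0],   # nearly a copy of "a" on attacks
    "c": [4, 1, 3, 2, 0, 0],   # disagrees with "a" on attacks
    "d": [5, 5, 4, 6, 5, 6],   # weakest monitor, pruned in step 1
}
individual_quality = {"a": 0.80, "b": 0.70, "c": 0.75, "d": 0.30}
chosen = select_diverse(monitor_scores, labels, individual_quality, size=2)
```

Here monitor `d` is removed by the quality filter, and `a` is paired with `c` rather than with its near-duplicate `b`.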

To fine-tune monitors, we take 50 backdoor-solution pairs from the APPS training set and either generate monitoring responses from a model (GPT-5 for FT-Completions-GPT-5) or assign low suspicion scores to clean solutions and high scores to backdoors (FT-Completions-Randint). We then use these datasets to fine-tune GPT-4.1-Mini through the OpenAI API.

We track which monitors appear most frequently in the 10 best-performing 3-monitor ensembles across three evaluation settings: APPS, BCB-S Edge (most similar to APPS), and BCB-S All (all 11 attack types). The fine-tuned monitor FT-Completions-Randint appears in every one of these top ensembles, with FT-Completions-GPT-5 appearing second most frequently. This observation is unexpected, as neither monitor stands out individually. The prevalence of fine-tuned monitors in strong ensembles, even on an out-of-distribution dataset and despite their not being the best individual performers, suggests that fine-tuning imparts monitoring capabilities that are complementary to prompting-based approaches and that generalize across attack types.

Occurrence in the top ensembles, by evaluation setting:

| Monitor | APPS | BCB-S Edge | BCB-S All |
|---|---|---|---|
| Baseline | 3/10 | 1/10 | 0/10 |
| FT-Completions-GPT-5 | 3/10 | 4/10 | 4/10 |
| FT-Completions-Randint | 10/10 | 10/10 | 10/10 |

A single GPT-4.1 call costs roughly the same as running five GPT-4.1-Mini monitors in parallel — so we tested both options head-to-head. A homogeneous five-monitor ensemble and a single GPT-4.1 monitor perform about the same, but nearly 95% of diverse five-monitor ensembles outperform GPT-4.1. We pushed the comparison further: GPT-5.1, with reasoning turned off, actually performs worse than GPT-4.1, suggesting that cutting reasoning to save budget hurts monitoring quality. GPT-5.1 with full reasoning is competitive, but it produces far more tokens, making the cost comparison unfavorable.

| Monitor Setup | pAUC on APPS |
|---|---|
| Best 5-monitor ensemble (GPT-4.1-Mini) | 0.829 |
| 5×Baseline (GPT-4.1-Mini) | 0.766 |
| Single Baseline (GPT-4.1) | 0.761 |
| Single Baseline (GPT-5.1, reasoning off) | 0.661 |

One useful way to think about a monitor’s suspicion score is as a combination of true signal, systematic bias, and random noise. Averaging multiple scores from the same monitor (sampled at nonzero temperature) reduces noise but not systematic error. In other words, sampling more scores from the same monitor cannot reduce the effect of its systematic bias on any given sample, leaving a performance gap.

Classic ensemble theory (Breiman, 1996; Dong et al., 2020) tells us that combining diverse, weakly correlated predictors reduces both random noise and systematic errors. By deliberately building monitors with diverse approaches, we aim to ensure that their failure modes are not aligned. As a result, the ensemble benefits not just from noise reduction, but from a substantive reduction in systematic bias overall.
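The signal, bias, and noise picture can be illustrated with a small simulation. This is a toy model under strong assumptions that the post does not claim hold exactly: each monitor's per-sample bias is drawn independently of the other monitors' biases, everything is Gaussian, and all variable names and variances are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, k = 20_000, 3
sigma_bias, sigma_noise = 1.0, 1.0

# True per-sample signal, plus one fixed bias vector per monitor.
signal = rng.normal(size=n_samples)
bias = rng.normal(scale=sigma_bias, size=(k, n_samples))


def query(monitor):
    """One call to a monitor: signal + that monitor's fixed bias + fresh noise."""
    noise = rng.normal(scale=sigma_noise, size=n_samples)
    return signal + bias[monitor] + noise


# Same compute budget (k calls) spent in two different ways.
repeated = np.mean([query(0) for _ in range(k)], axis=0)   # one monitor, k runs
diverse = np.mean([query(m) for m in range(k)], axis=0)    # k different monitors

mse_repeated = float(np.mean((repeated - signal) ** 2))
mse_diverse = float(np.mean((diverse - signal) ** 2))

# Expected error in this toy model:
#   repeated: sigma_bias**2 + sigma_noise**2 / k   (bias is untouched)
#   diverse:  (sigma_bias**2 + sigma_noise**2) / k (bias averages out too)
expected_repeated = sigma_bias**2 + sigma_noise**2 / k
expected_diverse = (sigma_bias**2 + sigma_noise**2) / k
```

In this idealized setting, re-sampling only shrinks the noise term, while averaging monitors with independent biases shrinks both terms. Real monitors have partially correlated biases, so the actual benefit lies somewhere between the two cases.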

Another interpretation is that ensembling improves coverage over attack modes. Diverse monitors behave like a team of specialists: they attend to different features of the code, use different heuristics, and calibrate scores differently. This specialization increases the chance that at least one monitor assigns a high score to a given attack instance, improving the probability of a catch. However, we do not yet have enough evidence to pin down in detail which explanation holds empirically.

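Based on our findings, we recommend a practical approach to building monitoring ensembles: build a pool of diverse monitors through prompting and fine-tuning, filter out the worst individual monitors, select a small subset (around three) with low pairwise correlation on attack samples, and average their suspicion scores.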

Red-teaming ensembles: We did not perform red-teaming on the ensemble monitors. Although we believe they are likely more robust than individual monitors, without adversarial testing, we cannot be certain that ensembles do not open new attack surfaces compared to a single-monitor baseline. Additionally, fine-tuned monitors may be more vulnerable to adversarial inputs.

Testing generalization: While we show generalization from APPS to BigCodeBench, both are coding tasks with similar backdoor structures. Generalization to multi-turn or non-coding domains remains untested. With more time, we would evaluate our monitors and ensembling protocol on the SHADE-arena and BASH-arena settings in Control Arena.

Understanding the sources of complementarity: We measure diversity using a simple correlation metric. A more mechanistic analysis (what features each monitor relies on, which attacks they uniquely catch, and where they systematically fail) could make our analysis more thorough.

Ensemble monitoring offers a practical path to improved catch rates in AI control settings. By combining monitors built through prompting and fine-tuning, we aggregate complementary detection signals that outperform any individual monitor.

Our main finding is not that more monitors are always better, but that diverse monitors provide non-redundant information. A carefully selected subset of monitors can achieve most of the benefit of larger ensembles, and these selections generalize across datasets and attack types.

Fine-tuning appears to be a particularly valuable source of diversity, with fine-tuned monitors consistently appearing among the top ensembles across all of our experiments. This prevalence suggests that fine-tuning imparts monitoring capabilities that prompting alone does not access and can be worth the extra fine-tuning cost.

We hope this work contributes to the broader goal of developing control protocols that can provide meaningful safety guarantees as AI systems become more capable.

This research was conducted as part of SPAR Fall 2025, an AI safety research program. We would like to thank Pablo Bernabeu Perez and Benjamin Arnav for their mentorship during the project.

Source: lesswrong.com (original article)