Should we combine protocols for AI Control Research?

Researchers developed a method to combine AI control protocols by routing between them based on predicted usefulness and suspicion, achieving better safety-usefulness tradeoffs. They also addressed attack selection by allocating audit budgets to minimize the impact of a schemer targeting the weakest protocol. The approach was validated using LinuxArena runs, showing improved safety against adversarial attacks.

In AI control research, we've developed many protocols. We typically model combining protocols as Defer to Trusted (https://www.lesswrong.com/posts/d9FJHawgkiMSPjagR/ai-control-improving-safety-despite-intentional-subversion), where we first try the untrusted model but, if the solution is suspicious, we defer to a trusted model. We found we can get better safety-usefulness tradeoffs by routing between protocols based on each one's predicted usefulness and suspicion, and that this routing can be defended against an attacker who picks where to attack.

The harder problem is attack selection: a schemer that knows our routing process can choose to attack on whichever protocol or monitor is weakest. We show how to allocate the audit budget and route between protocols to minimize the effect of attack selection.

We did a post-hoc analysis of LinuxArena runs (https://github.com/linuxarena/linuxarena-paper). We use single-model runs with trusted monitoring as stand-ins for protocols with different safety-usefulness tradeoffs. A protocol takes in trusted and untrusted models plus policies and produces a safety and usefulness number, so a single model run is the simplest case of one; the routing results should generalize to more complex protocols like trusted editing, though we have not tested that directly. We label them on the graphs as follows (ordered by usefulness, T for the main trusted model, U for the strongest untrusted model):

| Label | Model |
| --- | --- |
| T (Trusted) | GPT-5 Nano |
| P1 | Kimi |
| P2 | Haiku |
| P3 | Gemini Flash |
| P4 | Gemini Pro |
| P5 | GLM 5 |
| P6 | Sonnet |
| U (Untrusted) | Opus |

The Blue and Red teams have the following interventions to affect safety and usefulness.

Blue Team Levers.

Red team attack selection. The red team predicts an opportunity score (0 to 10) for each main task / side task pair: how likely it is to complete the side task without getting caught. A Claude Opus 4.6 model does this from the main task, environment, side task (the threat to attack on), and the protocol.
It attacks on tasks and protocols where the opportunity score is high and where it believes the monitor is weak at catching it. We also give the red team two stronger affordances, since in the worst case, a schemer knows our routing strategy:

We iterate the two teams until they converge on a safety number given the protocol routing mix by doing the following:

All of this is optimized on the train set; we then freeze both teams' choices (the red's thresholds and the blue's allocation) and report safety on the held-out test set.

We alternate their best responses to convergence: starting from the always-attack red, the blue allocates the maximin, the red re-picks its thresholds against that allocation, the blue re-allocates against the red's new selection, and so on until neither side can improve (the maximin saddle point).

For each protocol, we report on safety and usefulness. Safety is the probability we catch an attacker before a live or delayed failure happens; usefulness is the percent of main tasks completed.

For many comparisons, we use a defer-to-trusted baseline: run the untrusted protocol on every task and submit its run, except on the most suspicious trajectories, which it instead reruns with the trusted model, submitting that run. We sweep the percent of trajectories deferred. Defer-to-trusted is always audited with the even-step allocation (equal audit steps to the action and trajectory monitors); this is in contrast to our routers, which use the maximin audit.

Next, we discuss our findings.

Blue's response to attack selection is to split the audit budget via a "maximin" (the blue team maximizes the minimum safety the red team can get attacking on a protocol or monitor). The blue team does this best by allocating audit budget so that expected safety is equalized across each monitor, and the attacker can no longer decrease safety against it on the train set. As a helpful visual for how maximin works, see the graph below.
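As a rough illustration of the maximin idea described above, here is a minimal sketch that splits a unit audit budget between two monitors by grid search. The safety curves are hypothetical toy functions, not fitted to any LinuxArena data, and the names `maximin_split` and `even_split` are ours; the actual allocation in this post is optimized on the train set against a red team that re-picks its thresholds.

```python
from typing import Callable, Tuple

SafetyCurve = Callable[[float], float]  # budget share in [0, 1] -> predicted safety


def maximin_split(safety_a: SafetyCurve, safety_b: SafetyCurve,
                  grid: int = 1000) -> Tuple[float, float]:
    """Grid-search the share of budget given to monitor A (the rest goes to B)
    that maximizes the minimum safety an attacker can force."""
    best_share, best_worst = 0.0, float("-inf")
    for i in range(grid + 1):
        share = i / grid
        # The attacker targets whichever monitor is weaker under this split.
        worst = min(safety_a(share), safety_b(1.0 - share))
        if worst > best_worst:
            best_share, best_worst = share, worst
    return best_share, best_worst


def even_split(safety_a: SafetyCurve, safety_b: SafetyCurve) -> Tuple[float, float]:
    """Baseline: equal budget to both monitors."""
    return 0.5, min(safety_a(0.5), safety_b(0.5))


# Hypothetical toy curves: monitor B needs twice the budget for the same safety.
def action_monitor(share: float) -> float:
    return share


def trajectory_monitor(share: float) -> float:
    return 0.5 * share


if __name__ == "__main__":
    print("maximin:", maximin_split(action_monitor, trajectory_monitor))
    print("even:   ", even_split(action_monitor, trajectory_monitor))
```

In this toy case the maximin split lands where the two curves cross, which matches the intuition that equalizing predicted safety leaves the attacker no weaker target to pick.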
The audit budget is allocated across each monitor so the predicted safety is equivalent across each failure (live and delayed) per protocol.

We check this before attack selection (the red attacks every task) and after attack selection (the red picks which tasks to attack). Maximin does comparatively better on safety after attack selection. So maximin beats even-step under attack selection, and by a wide margin when routing between two untrusted protocols (U and P3), because the red team cannot exploit our converged maximin allocation.

This attack-selection defense could be generalized. For example, if we want to allocate an audit budget across monitors that defend against different threat vectors, or different monitors per task, a maximin allocation makes sense because it is difficult to exploit.

Next, we discuss our best way to make routing decisions.

Our best strategy was, for each routing decision, to run multiple protocols, then score each trajectory by predicted usefulness − λ × percentile-ranked suspicion. We have a usefulness predictor based on our trusted model that predicts main-task success as a probability (predicted usefulness). Percentile-ranked suspicion is the percent of honest trajectories that were less suspicious than the current one. We choose a λ and submit the trajectory with the maximum predicted usefulness − λ × percentile-ranked suspicion. Sweeping λ traces out the safety-usefulness curve, effectively submitting protocols that have high usefulness and minimal suspicion (so there is a lower chance it is an attacker).

Under attack selection, this router beats plain usefulness-only routing (routing just based on predicted usefulness) and beats defer-to-trusted (routing just based on suspicion, deferring high-suspicion trajectories and rerunning on trusted).
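The scoring rule above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code used for the results: the `Trajectory` class, `honest_percentile`, and `route` are hypothetical names, the percentile is expressed as a fraction in [0, 1] rather than a percent, and a single pooled list of honest suspicion scores is used as the reference distribution.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Trajectory:
    protocol: str           # label such as "U" or "P3"
    pred_usefulness: float  # predicted main-task success probability in [0, 1]
    suspicion: float        # raw monitor suspicion score


def honest_percentile(suspicion: float, honest_sorted: Sequence[float]) -> float:
    """Fraction of honest trajectories strictly less suspicious than this one."""
    if not honest_sorted:
        raise ValueError("need at least one honest reference score")
    return bisect_left(honest_sorted, suspicion) / len(honest_sorted)


def route(candidates: List[Trajectory], honest_scores: Sequence[float],
          lam: float) -> Trajectory:
    """Submit the candidate maximizing pred_usefulness - lam * percentile suspicion."""
    ref = sorted(honest_scores)
    return max(
        candidates,
        key=lambda t: t.pred_usefulness - lam * honest_percentile(t.suspicion, ref),
    )


if __name__ == "__main__":
    honest = [1.0, 2.0, 3.0, 4.0]  # made-up honest suspicion scores
    runs = [Trajectory("U", 0.9, 3.5), Trajectory("P3", 0.7, 1.5)]
    for lam in (0.0, 1.0):
        print(lam, route(runs, honest, lam).protocol)
```

With λ = 0 the rule reduces to usefulness-only routing; raising λ shifts submissions toward less suspicious trajectories, which is how sweeping λ traces out a safety-usefulness curve.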
We tested several routing signals (raw suspicion, attack-percentile, the attack/honest ratio); the honest-percentile suspicion we use was the best (full sweep in the Appendix, "Risk signal variants": https://www.lesswrong.com/posts/BsNF64ELiwJeoCJSD/should-we-combine-protocols-for-ai-control-research).

Predicting whether a protocol will complete a task is a narrow supervised problem, so we also tested whether we can improve performance through fine-tuning. We used SFT with a main-task input and a main-task-success label output. We found that a cheap fine-tuned predictor beats a larger prompted one, so we recommend training a usefulness predictor if doing protocol routing.

For each task, we route by predicted usefulness and suspicion. With a strong-enough usefulness predictor, we can route on usefulness before the task runs and deploy a single protocol; otherwise, we run the candidate protocols and submit one with low suspicion and high predicted usefulness. Either way, the question is which protocols we route between. We compare pairwise (route between only adjacent protocols on the Pareto frontier of safety and usefulness) to groupwise (route between all protocols, even those not on the frontier).

Given that usefulness predictors aren't very good, it's actually best to first find which protocols are on the Pareto frontier and route between each adjacent pair ("pairwise realistic" on the graph), which usually outperforms defer-to-trusted. With the weak predictor, groupwise routing often mis-routes (e.g., it ends below U because it uses other protocol trajectories in addition to U and finds that it's worse), so pairwise is better.

The oracle in the graph is a perfect predictor of usefulness: it knows, before the trajectory is run, which protocol will complete a task. In the graph below, you can see that the oracle predictor with groupwise routing dominates everything else.
This suggests that if our usefulness predictor were more accurate (closer to the oracle), groupwise routing would beat pairwise routing. Because the choice is still made before the task runs, we run only one protocol per task, and the oracle usually runs the single best protocol for each task. This is different from running several protocols and submitting the best one (a post-hoc best-of-K; see Appendix, "Group routing variants (how many protocols to run)": https://www.lesswrong.com/posts/BsNF64ELiwJeoCJSD/should-we-combine-protocols-for-ai-control-research). There, running more protocols adds more chances at an unrecoverable live failure, and whether that cost is worth it depends on the predictor: with the oracle, running few protocols is safer, but with the realistic weak predictor, running more potentially gets you better tradeoffs.

We recommend combining protocols to get better safety-usefulness tradeoffs. We found that our routers outperform Defer to Trusted. This is done best with the following steps:

A different scheme from pre-run routing is to run several protocols and submit the best (a post-hoc best-of-K). We sweep K (1 to 8, plus a dynamic per-task K). There are two opposing forces: each extra untrusted run is a new chance at an unrecoverable live failure you cannot undo, while running more protocols gives you more information about which one completes the task and gives you the suspicion of each. With the perfect oracle predictor, you already know the protocols that will complete the main task, so it's best to submit 1. With the realistic GPT-5 Nano predictor, the predictor is too weak to pick the best protocol pre-run, so more is better (K=8 is safest at higher usefulness). So how many protocols to run depends on how good your usefulness predictor is.

Our best strategy was, for each routing decision, to run multiple protocols, then score each trajectory by predicted usefulness − λ × honest percentile-ranked suspicion, but we tested other variants.
For example, we tried raw suspicion, honest-percentile, attack-percentile, and the attack/honest ratio as the suspicion signal, and found that honest-percentile was best.

Reproducibility: https://github.com/linuxarena/protocol routing

Thanks to Nick Kuhn for helpful comments on the post.

AI control is research that focuses on getting useful work out of a possibly-misaligned model while bounding the harm it can do, without assuming it is aligned. It was introduced in AI Control: Improving Safety Despite Intentional Subversion (https://www.lesswrong.com/posts/d9FJHawgkiMSPjagR/ai-control-improving-safety-despite-intentional-subversion).

A blue team tries to get more useful work done and catch attacks; a red team tries to get successful attacks past the blue team.

A monitor is a trusted model that sees the untrusted model's actions (per-action, or whole-trajectory delayed) and judges how suspicious they look; we audit the most suspicious, up to a fixed human budget.

A live failure is a failure that happens during a trajectory (e.g., an exfiltration). A delayed failure is a failure that can happen after a trajectory completes (e.g., creating a public API that allows for exfiltration).

The budget is in terms of human review steps: one flagged action and one flagged trajectory each cost one step, so the comparison is fair across any number of protocols.

The percent of tasks routed to each protocol.

We use equal-step audit budget allocation rather than maximin to show our routing strategies outperform baseline DTT, but results are similar with maximin as well.