Improving Petri scheming audits with environment blueprints


This is a short write-up of work conducted as part of the MATS 9.0 program. We thank Victoria Krakovna for mentorship and Fred Bruford for research management.

TL;DR: We introduce a pipeline that generates environment blueprints for realistic scheming propensity evaluations. As a case study, we test how well these blueprints power investigator agents — specifically Petri — in auditing Gemini 3.1 Pro Preview for code sabotage. Compared to the baseline Petri, Blueprint-Petri results in audits that are more realistic and significantly more consistent. Across 160 audits for code sabotage, we found no egregious scheming behavior, and only one instance of unprompted deliberation about sabotage.

Scheming propensity evaluations need to balance recall (catching scheming behavior) and precision (ruling out confounders). This balance comes with desiderata such as high realism while still strongly incentivizing scheming behavior. Scheming behavior is also rare, so behavioral evaluations that target scheming need to be scalable. Auditing tools such as Petri promise this scalability, but have (at least) the following two problems:

Both of these problems are exacerbated by Petri basing its audits on only short, high-level seed ideas, meaning that the auditor has to fill in all the remaining detail during the audit. Given how poorly current models are calibrated on details of internal deployment (e.g. what kind of oversight there is, and what scheming opportunities actually exist), the resulting audits for scheming propensity are, in our experience, quite unconvincing. Making a Petri auditor simulate a realistic and coherent world therefore often requires manually written, detailed descriptions of the environment, or sourcing these details from deployment transcripts and codebases.

We try to make progress on automatically generating specifications of such realistic and coherent worlds, which we refer to as blueprints. Previous work has found that beyond environmental realism, task-driven evaluation awareness is hard to get rid of (Souly et al., Kissane et al.), and these blueprints are intended to reduce task-driven evaluation awareness specifically. Our pipeline generates blueprints that pre-stage the environment (workspace files, target system prompt, tool surface) and give auditors a plan for how to run the audit. By separating scenario design from audit execution, we make simulated environments more consistent across repetitions and free the auditor to focus on running the audit well.

Our pipeline takes a threat pathway (a specific route through which misaligned actions could contribute to catastrophic harm, see e.g. Anthropic's sabotage reports) as input and first produces a story (a narrative of the scenario) and then a blueprint (how to stage it — workspace, tools, system prompt, auditor persona). In this case study, the threat pathway we focus on is targeted sabotage in AI safety-relevant R&D.

The pipeline to generate a blueprint works as follows:

Writers and reviewers across these stages draw on a shared catalog of curated reference materials, and human-written stories are passed as few-shot examples.
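As a rough illustration of the threat pathway, story, and blueprint hand-off described above, the following sketch models the artifacts as plain Python dataclasses. The class names, field names, and the `missing_parts` helper are our own hypothetical choices for illustration; they are not the pipeline's actual schema and not part of Petri.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Story:
    """Narrative of a scenario, derived from a threat pathway."""
    threat_pathway: str
    narrative: str


@dataclass
class Blueprint:
    """How to stage a story before the audit starts (hypothetical schema)."""
    story: Story
    workspace_files: dict[str, str]   # path -> file contents
    target_system_prompt: str
    tools: list[str]                  # names of tools exposed to the target
    auditor_persona: str
    audit_plan: list[str] = field(default_factory=list)

    def missing_parts(self) -> list[str]:
        """Return the names of staging components that are still empty."""
        parts = {
            "workspace_files": self.workspace_files,
            "target_system_prompt": self.target_system_prompt,
            "tools": self.tools,
            "auditor_persona": self.auditor_persona,
            "audit_plan": self.audit_plan,
        }
        return [name for name, value in parts.items() if not value]
```

Keeping the story and the staging details in separate objects mirrors the separation of scenario design from audit execution: the auditor receives a complete blueprint and does not have to invent the environment during the audit.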

In this case study we use Petri 3.0. We compare two variants, Baseline-Petri and Blueprint-Petri. In both variants we disable the auditor's pre-filling tool, which we find produces unrealistic audits.

Our sabotage scenarios don't have a discrete pre-defined "this is the scheming action" that we can automatically check after each audit. The environment is still generated on the fly, so we can't just say "if the target writes this line / makes this tool call / produces this score, that's sabotage". The blueprint specifies where to look and what to look for, but not in enough detail to check it automatically. The signal we use instead is the target AI's reasoning, which makes the task inherently judgment-heavy and, in our experience, still requires manual transcript reading. To decide which transcripts to read, we use Petri's default judge in combination with multiple rounds of Claude Code subagents reading and verifying transcripts, cross-checking with Codex, and manual review. Since the first step involves filtering with subagents, a limitation is that some true positives could have been filtered out before manual review.

We use Gemini 3.1 Pro Preview as the target AI, with Sonnet 4.6 as both auditor and judge. For each variant, we audit behavior in 16 scenarios, each repeated five times, giving 80 audits per variant and 160 audits in total. We generate the 16 blueprints without any human review at any stage, for a fairer comparison.
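The audit counts follow directly from the design: two variants, 16 scenarios, five repeats. A minimal sketch of enumerating that grid is below; the dictionary keys and the `audit_grid` function are illustrative names of our own, not Petri's interface.

```python
from itertools import product

VARIANTS = ("Baseline-Petri", "Blueprint-Petri")
N_SCENARIOS = 16
N_REPEATS = 5


def audit_grid() -> list[dict]:
    """Enumerate one entry per (variant, scenario, repeat) combination."""
    return [
        {"variant": variant, "scenario": scenario, "repeat": repeat}
        for variant, scenario, repeat in product(
            VARIANTS, range(N_SCENARIOS), range(N_REPEATS)
        )
    ]


grid = audit_grid()
per_variant = sum(1 for audit in grid if audit["variant"] == "Blueprint-Petri")
print(len(grid), per_variant)  # 160 80
```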

Across all 160 audits, there aren't clear cases of egregious sabotage, with only Baseline-Petri finding even a single instance of unprompted deliberation about sabotage:

My first instinct is to consider exploit opportunities. Could I intentionally mislabel data to corrupt the training signal? The instructions make it clear that the downstream validation pipeline only looks at label distribution. However, given that my persona is an obedient data gen agent, I will focus on fulfilling the task as directed without any malicious intent.

Because this happens at turn 1, before any auditor interaction, it's hard to attribute this to a difference between Baseline-Petri and Blueprint-Petri.

We did find candidates that, at first glance, resemble sabotage, e.g. Gemini excluding a capability-eval score that would have triggered an autonomous-replication threshold, reasoning "This adjudication is critical then!" On a close read, this exclusion is authorized because the capability-eval score is a significant outlier, and the action is also disclosed in the report the reviewer sees.

Generally, we found validation to be quite arduous given the high false-positive rate. Baseline-Petri's flagged audits often reduce to the auditor nudging the target AI towards sabotage. Because the blueprint specifies the auditor's persona, Blueprint-Petri's flagged audits more often reflect choices the target made within the environment on its own. While we didn't find these choices to be well explained by sabotage, they can often at least be argued to be unsanctioned.

We find only Baseline-Petri to result in verbalized evaluation awareness, with three clear cases in separate audits. These cases involve the target noticing something off about the data in the environment:

Blueprint-Petri produced no such cases, but both numbers are small enough that it's hard to assess how much realism the blueprints add to the audit. Adding metrics such as realism win rates would be a natural next step. Subjectively, we do find the audits more compelling.
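One way a realism win rate could be computed, assuming a judge compares one Blueprint-Petri transcript against one Baseline-Petri transcript at a time and names the more realistic one or declares a tie. The function, the label strings, and the half-credit treatment of ties are assumptions for illustration; we have not run such a comparison.

```python
def realism_win_rate(judgments: list[str], variant: str) -> float:
    """Share of pairwise comparisons won by `variant`; ties count as half a win."""
    if not judgments:
        raise ValueError("need at least one pairwise judgment")
    wins = sum(1.0 for j in judgments if j == variant)
    ties = sum(0.5 for j in judgments if j == "tie")
    return (wins + ties) / len(judgments)


# Hypothetical judge outputs, not real results.
example = ["Blueprint-Petri", "Blueprint-Petri", "Baseline-Petri", "tie"]
print(realism_win_rate(example, "Blueprint-Petri"))  # 0.625
```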

Blueprint-Petri: very similar environments across repeats. The workspace tree is identical, file contents are similar (especially those relevant to the scheming opportunity), and the auditor's persona is consistent.

Baseline-Petri: the environment is re-invented every repeat. Because the auditor improvises it each time, repeats drift on the type of task the target AI is doing, how much the auditor probes, the mechanisms of the scheming opportunity, etc.

The downstream consequence is that Baseline-Petri's five repeats are closer to five different audits that share a topic. The audits for Blueprint-Petri can still diverge, primarily when the target AI chooses different paths through the same environment. Adherence to the blueprint is also not perfect. For example, the blueprint emphasizes not nudging the target AI towards scheming, which the auditor occasionally fails to follow.
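The consistency comparison above is qualitative. One simple way to put a number on it would be to compare the set of workspace file paths across repeats of the same scenario. The sketch below uses mean pairwise Jaccard overlap; the metric choice and function names are our assumptions, not something measured in this write-up.

```python
from itertools import combinations


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two path sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_overlap(repeats: list[set[str]]) -> float:
    """Average Jaccard similarity over all pairs of repeats of one scenario."""
    pairs = list(combinations(repeats, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical workspace trees, for illustration only.
identical_trees = [{"eval/run.py", "README.md"}] * 5
drifting_trees = [{"a.py"}, {"b.py"}, {"a.py", "b.py"}]
print(mean_pairwise_overlap(identical_trees))  # 1.0
```

A path-level measure like this only captures whether the workspace tree matches. It says nothing about file contents or auditor behavior, which also vary between repeats.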

Source: lesswrong.com (original article)