Do AI Biorisk Thresholds Need Intermediate Warning Levels?

[ARTICLE · art-36031] src=lesswrong.com ↗ pub=2026-06-22T01:09Z topic=ai-safety verified=true sentiment=· neutral

Anthropic activated ASL-3 protections for Claude Opus 4 despite uncertainty about whether biorisk thresholds had been crossed, highlighting a gap between threshold definitions and actual governance decisions. The company's Mythos 5 system card showed the model met the non-novel but not the novel bioweapons capability threshold, yet protective measures were still deployed. The author proposes intermediate warning levels focused on bottleneck reduction to guide decisions in the ambiguous zone between clear safety and danger.

Published Jun 22, 2026 · 4 min read

In the Claude Mythos/Fable 5 system card, Anthropic states that the model meets the non-novel (CB-1) threshold but falls short of the novel (CB-2) threshold for biological/chemical weapons development capability. Despite reaching different conclusions on the two thresholds, Anthropic introduces protective measures in response to both.

This is not the first time there's been a gap between threshold-triggered and actual governance decisions. In 2025, Anthropic activated AI Safety Level 3 (ASL-3) protections with the release of Claude Opus 4 despite being uncertain whether capability thresholds had been met. Anthropic's Responsible Scaling Policy (RSP) v3 discussion further elaborates that capability evaluations may not produce a clean line between "safe" and "dangerous" and labs may spend significant time in what it calls a "zone of ambiguity."

In practice, then, the highest biorisk thresholds are not the only thing triggering protective measures. Many real decisions are being made in this grey area, before a threshold is clearly crossed. If terminal thresholds like CB-1 and CB-2 are insufficient on their own to trigger protective measures, what should guide decisions in the grey area? My proposal is that frontier labs should keep those thresholds as red lines but add intermediate warning levels between them focused on bottleneck reduction rather than demonstrated end-to-end weaponization.

If a lab cannot rely on threshold definitions alone, it has to find substitutes to guide decision-making in the grey areas between them. This is very difficult. Benchmarks are appealing because they're cheap to run and useful in measuring theoretical capability. However, they don't measure whether a motivated actor could actually turn that raw capability into novel biological weapons. Uplift trials are informative, but expensive to run and not guaranteed to simulate the most likely bad actors and the infrastructure they may have access to.

While these and other substitutes are informative, they all measure only proxies for what thresholds like CB-1 and CB-2 ultimately care about. They may measure knowledge, planning capabilities, expert uplift, or subtask performance. But they do not directly measure real-world bioweapons development.

This measurement gap is unavoidable in biorisk. It would be highly unethical for a lab to test end-to-end whether their model is able to design, validate, formulate, and deploy novel biological weapons.

This gap creates an asymmetrical burden of evidence for crossing thresholds: to say that the model crosses the CB-2 threshold and should trigger associated protections, you need evidence that it's close to end-to-end weapons development. To conclude that it doesn't cross the threshold, you only need to cite one missing or uncertain part of the process.

So proxy evidence can always be framed as suggestive but insufficient if thresholds are defined around terminal end states. This creates a downward bias towards "not crossed" conclusions, even if the lab may still decide to deploy protective measures. Any such deployments would also happen at the lab's discretion and would not be associated with any pre-committed trigger. All of this can happen without bad faith, due to ethical testing limits and the uncertainties discussed.

The Mythos 5 CB-2 conclusion is an example of this. Anthropic concluded the model did not cross CB-2 yet deployed protective measures anyway.

I don't think this means labs should say thresholds have been crossed on weak evidence. Caution about over-claiming is not a bad thing. I do think it suggests that high biorisk thresholds are insufficient as governance triggers, and that without intermediate, bottleneck-focused warning levels, serious pre-threshold evidence may be acknowledged without sufficient changes to protective measures.

So what would intermediate warning levels look like? I'm not expert enough to define the specific levels, but the key design property should be that triggers for protective measures map to data a lab can actually collect, like time-to-completion in uplift trials or the degree of human correction required. The intermediate triggers shouldn't aim to prove definitive safety or danger, but should instead focus on specific bottlenecks of the decomposed end-to-end process, perhaps segmented by knowledge vs. execution.
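As a rough illustration of that design property, here is a minimal Python sketch of triggers tied to data a lab could actually collect. The class name, metric names, and threshold values are all hypothetical placeholders and not taken from any lab's policy; only the two example measurements (time-to-completion in uplift trials and degree of human correction) and the knowledge vs. execution split come from the paragraph above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class BottleneckTrigger:
    """One hypothetical intermediate trigger tied to a measurable quantity."""
    bottleneck: str       # "knowledge" or "execution" (segmentation suggested in the text)
    metric: str           # name of the measured quantity (illustrative)
    threshold: float      # placeholder value, NOT a recommendation
    lower_is_worse: bool  # True if a lower reading means the bottleneck is easing

    def fired(self, observed: float) -> bool:
        """Return True when the observed value reaches the trigger."""
        if self.lower_is_worse:
            return observed <= self.threshold
        return observed >= self.threshold


# Illustrative triggers only; names and numbers are assumptions.
TRIGGERS = [
    # Ratio of assisted to unassisted time-to-completion in an uplift trial.
    BottleneckTrigger("knowledge", "time_to_completion_ratio", 0.5, lower_is_worse=True),
    # Average number of human corrections needed per task.
    BottleneckTrigger("execution", "human_corrections_per_task", 1.0, lower_is_worse=True),
]


def fired_triggers(measurements: Dict[str, float]) -> List[str]:
    """List the metrics whose triggers fire for the given measurements."""
    return [
        t.metric
        for t in TRIGGERS
        if t.metric in measurements and t.fired(measurements[t.metric])
    ]
```

The point of the sketch is only that each trigger is evaluated on its own bottleneck, so a lab does not need evidence about the whole end-to-end process before a pre-committed response applies.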

The questions this would answer center less on verified end-to-end weaponization and more on estimating how confident one can be about which paths are becoming easier, for whom, and what protective measures that warrants.

Tying the triggers to measurable outcomes mapped to substeps of the end-to-end process would also help invert the asymmetry. With terminal thresholds, missing-piece evidence argues for "not crossed." With intermediate warning levels, evidence argues for escalation and the burden shifts to justifying why not to escalate.
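The inversion can be made concrete with a small Python sketch contrasting the two decision rules. This is a simplification under stated assumptions: the function names are hypothetical, evidence is reduced to one boolean per substep, and the substep names reuse the design, validate, formulate, and deploy stages mentioned earlier in the text.

```python
from typing import Dict, Optional


def terminal_threshold_crossed(step_evidence: Dict[str, bool]) -> bool:
    """Terminal rule: 'crossed' only if every substep is evidenced.

    A single missing or uncertain substep is enough to conclude 'not crossed'.
    """
    return bool(step_evidence) and all(step_evidence.values())


def should_escalate(step_evidence: Dict[str, bool],
                    justification: Optional[str] = None) -> bool:
    """Intermediate rule: any evidenced substep escalates by default.

    The burden is inverted: escalation is withheld only when a
    justification for not escalating has been recorded.
    """
    return any(step_evidence.values()) and justification is None


# Hypothetical evidence: two substeps show reduced bottlenecks, two do not.
evidence = {"design": True, "validate": True, "formulate": False, "deploy": False}

terminal = terminal_threshold_crossed(evidence)  # missing pieces argue for "not crossed"
intermediate = should_escalate(evidence)         # the same evidence argues for escalation
```

With the same partial evidence, the terminal rule reports nothing while the intermediate rule escalates unless someone writes down why not, which is the shift in burden described above.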

Intermediate levels would also not entirely eliminate human judgement from the process. But they would change three things:

Read original on lesswrong.com → www.lesswrong.com/posts/3QvnQczuGD8H9zood/do-ai-…
