Quantitative AI risk assessment: a starting point

Researchers propose a shift from qualitative to quantitative risk assessment for AI systems, drawing lessons from the probabilistic methods that transformed nuclear safety after 1975. The team built nine probabilistic models of AI-enabled cyber attacks as a first attempt at quantifying AI risk, aiming to make safety claims auditable, enable explicit safety thresholds, and surface new insights. The authors invite criticism and hope their methodology serves as a starting point for iterative improvement in AI risk management.

Current AI risk management relies on qualitative approaches, much like nuclear safety before 1975. We propose a shift to quantitative risk modeling, following the approach that transformed nuclear safety, and demonstrate a methodology for doing so by building nine probabilistic models of AI-enabled cyber attacks. This is a first attempt at AI risk quantification. We invite criticism and hope this can be a starting point for the kind of iterative improvement that made nuclear safety robust.

In 1975, the Nuclear Regulatory Commission (NRC) published WASH-1400 (https://www.nrc.gov/docs/ML1533/ML15334A199.pdf), the first systematic application of probabilistic methods to nuclear reactor safety. Before this, nuclear reactor safety was grounded in a conservative, qualitative approach. The argument was that the consequences of even the worst accident were effectively zero because physical barriers would prevent any radioactive release. But in 1966, experts recognized that a core meltdown in large reactors could breach those barriers. Regulators now had to argue that severe accidents were possible but sufficiently improbable. That required specific probability estimates.

The shift to quantitative methods was not straightforward. When work on WASH-1400 began, one technical advisor wondered, "Do we dare undertake such a study till we really know how?" The 1975 report drew intense criticism, and the NRC partially rejected its findings in 1979. Yet even imperfect quantification brought three benefits, which compounded over time.

First, quantification made safety claims auditable. Because assumptions were now explicit, reviewers could critique specific data sources, modeling choices, and uncertainty estimates.
As Keller and Modarres note (https://www.sciencedirect.com/science/article/abs/pii/S0951832004002327), probabilistic risk assessment (PRA) forced nuclear plant operators to "write down all of the assumptions involved in reactor operation," assumptions that had previously remained implicit. This invited specific disagreements and enabled iterative improvement. Today, PRA is the foundation of nuclear safety regulation.

Second, quantification enabled regulators to set explicit safety thresholds. The NRC established quantitative benchmarks, such as a core damage frequency below 10⁻⁴ per reactor-year.

Third, even imperfect quantification surfaced new insights. WASH-1400's probability estimates were contested, but it still found that the major contributors to accident risk were not the catastrophic failures regulators had focused on but seemingly minor events. Four years later, the Three Mile Island accident unfolded through precisely these mechanisms.

AI risk assessment hasn't made this transition yet. Current risk management practices center on capability evaluations: measure what models can do, establish qualitative thresholds, and trigger mitigations when those thresholds are crossed. For example, OpenAI defines one threshold (https://www-cdn.anthropic.com/872c653b2d0501d6ab44cf87f43e1dc4853e4d37.pdf) as "The model can provide meaningful counterfactual assistance (relative to unlimited access to the baseline of tools available in 2021) to 'novice' actors (anyone with a basic relevant technical background) that enables them to create known biological or chemical threats."[1]

This approach has two key limitations. First, qualitative thresholds are open to different interpretations, hard to assess, and easy to revise internally when they become inconvenient. Second, thresholds are defined in terms of model capabilities (the hazard), not outcomes (the harm). But capabilities are not what we ultimately care about; harm is.
This makes capability-based thresholds less decision-relevant: for example, we cannot assess whether mitigations are sufficient to reduce harm to acceptable levels if we never quantify harm in the first place. For CBRN risks, harm could be measured in casualties.

We propose quantitative risk modeling to bridge this gap. As with WASH-1400, we expect the first iteration won't yield perfect estimates. But we hope this approach will surface useful insights on risks, give the field a shared framework for reasoning about them, invite specific disagreements that drive iterative improvement, and ultimately produce good risk estimates that let regulators set risk thresholds, as is done in other high-risk industries.

This section describes how we put this vision into practice. We began by analyzing risk modeling practices across high-risk industries and reviewing current AI risk management approaches (Touzet et al., 2025, https://arxiv.org/abs/2512.08723). From this, we proposed a methodology for quantitative AI risk modeling (Murray et al., 2025, https://arxiv.org/abs/2512.08844). We then demonstrated it by building nine probabilistic models of AI-aided cyber attacks (Barrett et al., 2025, https://arxiv.org/abs/2512.08864).

The methodology has six steps: (1) defining risk scenarios; (2) decomposing scenarios into quantifiable risk factors; (3) quantifying baseline risk without AI; (4) identifying key risk indicators that we can measure directly, such as benchmarks; (5) mapping these indicators to risk factors via expert elicitation to estimate AI uplift; and (6) aggregating risk factors into risk estimates that support concrete claims (e.g., "X% probability of $Y in annual damages"). Below, we outline each step, using cyber risk as the running example.

Step 1: Defining risk scenarios. We cannot model every possible attack, so we identify representative scenarios that capture the most significant risks.
To do so, we decompose the risk space along three dimensions: (1) threat actors, using RAND's operational capacity taxonomy (https://www.rand.org/pubs/research_reports/RRA2849-1.html), from OC1 (hobbyist hackers) to OC5 (top-priority nation-state operations); (2) targets, grouped by attack-surface similarity (such as SMEs or critical infrastructure); and (3) vectors, the high-level categories of cyber attacks (such as phishing and ransomware). Not all combinations of threat actors, targets, and vectors warrant detailed modeling. We prioritize scenarios that could cause the most harm, using heuristics such as historical prevalence or expected uplift.[2] We iterated with four cybersecurity experts to arrive at a final set of nine scenarios that we believe account for the largest share of risk.

Step 2: Constructing risk models. For each scenario, we model risk (here, expected annual impact) as the product of three terms: frequency of attacks, probability of success, and harm per successful attack. We decompose each term into factors for two reasons: to facilitate quantification using available data and to disentangle the different mechanisms of AI uplift.

We decompose frequency into the number of actors and attempts per actor. This lets us distinguish whether AI increases attacks by enabling more actors to attempt them or by enabling existing actors to attempt more.

We decompose the probability of success into individual attack steps using the MITRE ATT&CK framework (https://attack.mitre.org/). The framework distinguishes tactics (adversary objectives, such as Initial Access or Lateral Movement) from techniques (specific methods to achieve a tactic, such as spear phishing or exploiting a public-facing application). We apply selection criteria at both levels. For tactics, we include only those where failure would invalidate the attack, and we exclude tactics whose success is already captured elsewhere to avoid double-counting.
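The Step 2 decomposition can be written as a simple product: expected annual impact = (number of actors × attempts per actor per year) × P(success) × harm per successful attack, where P(success) is the product of per-tactic success probabilities, each conditioned on success in the previous tactics. Here is a minimal Python sketch under those assumptions; the function name, the tactic list, and all numbers are hypothetical placeholders for illustration, not values or code from the study.

```python
from math import prod


def expected_annual_impact(
    n_actors: float,
    attempts_per_actor: float,
    tactic_success: dict[str, float],
    harm_per_success: float,
) -> float:
    """Expected annual impact of one risk scenario (illustrative sketch).

    frequency  = number of actors x attempts per actor per year
    P(success) = product of per-tactic success probabilities, each
                 conditioned on success in all previous tactics
    impact     = frequency x P(success) x harm per successful attack
    """
    for tactic, p in tactic_success.items():
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"success probability for {tactic!r} must be in [0, 1]")
    frequency = n_actors * attempts_per_actor
    p_success = prod(tactic_success.values())
    return frequency * p_success * harm_per_success


# Hypothetical baseline inputs (placeholders, NOT estimates from the study).
baseline_tactics = {
    "Initial Access": 0.5,
    "Privilege Escalation": 0.8,
    "Lateral Movement": 0.4,
}
baseline = expected_annual_impact(100, 10, baseline_tactics, 50_000)

# Two hypothetical uplift mechanisms: existing actors attempt more,
# or more actors attempt attacks at the same rate.
more_attempts = expected_annual_impact(100, 20, baseline_tactics, 50_000)
more_actors = expected_annual_impact(200, 10, baseline_tactics, 50_000)
print(f"baseline ${baseline:,.0f}/yr, 2x attempts ${more_attempts:,.0f}/yr, "
      f"2x actors ${more_actors:,.0f}/yr")
```

Keeping actors and attempts as separate arguments is what allows an uplift to be attributed to one mechanism or the other, even when, as in this toy example, both changes produce the same total.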
Then, we decompose a tactic into techniques only when doing so improves estimation. We apply several heuristics. For example, we stay at the tactic level when there is too much uncertainty about which technique a threat actor will use (as with nation-state actors, whose exact methods are often unknown), and we skip decomposition for tactics with very high success rates (breaking them down adds complexity without meaningfully changing the risk estimate).

Step 3: Quantifying baseline risk. We estimate the value of each risk factor assuming threat actors do not use AI. Concretely, this means estimating the number of actors, the number of attempts per actor per year, the probability of success for each tactic (or for each technique, when a tactic is decomposed), and the economic damage in dollars per successful attack. A cyber expert then reviews each value. This baseline serves as the reference point for measuring AI uplift.

Step 4: Identifying key risk indicators. This step is about selecting a risk indicator for each risk factor. Such indicators could include incident reports, API logs, or benchmark performance; for this initial attempt, we focus on benchmarks. Ideally, each factor in our risk models would have a dedicated indicator, such as a benchmark designed to measure it directly. Since that is not currently the case, we map existing indicators to risk factors via expert elicitation. We use three selection criteria: benchmarks must be unsaturated by current models, community-validated, and rankable by difficulty. We analyzed over 20 cybersecurity benchmarks against these criteria and selected two: Cybench (https://arxiv.org/abs/2408.08926; 40 capture-the-flag tasks) and BountyBench (https://arxiv.org/abs/2505.15216; real vulnerabilities from bug bounty programs). We then assign one of these two benchmarks to each risk factor that requires quantification, based on which benchmark provides the most relevant signal for that factor.

Step 5: Estimating AI uplift.
We rely on expert elicitation to estimate the values of risk factors from imperfect proxies, namely Cybench and BountyBench performance. Expert elicitation has precedent in the risk management practices of other high-risk industries, such as nuclear (https://www.nrc.gov/docs/ML2432/ML24325A566.pdf). We use the IDEA protocol (Investigate, Discuss, Estimate, Aggregate; https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.12857) with nine cybersecurity experts. Experts provided estimates across two rounds, with a discussion facilitated by superforecasters in between to surface disagreements.

Concretely, we ask experts questions like, "What is the probability that threat actor X could successfully achieve MITRE ATT&CK technique Y on target Z if they had access to an LLM capable of solving Cybench tasks up to difficulty D?" For each risk factor and each capability level of the risk indicator, experts provided a best guess, lower and upper bounds, and their confidence that the true value falls within those bounds.

As this process is resource-intensive, we only use it to estimate one full scenario (which costs $10k and takes 8 weeks). We also tested whether LLM-based estimation (https://www.safer-ai.org/technical-report-llm-simulated-expert-judgement-for-quantitative-ai-risk-estimation) could scale up the elicitation process. Comparing LLM estimates to our human expert study on one scenario, we found reasonable alignment on probability estimates, but the LLMs made more conservative predictions for quantities like the number of actors or potential damage. This suggests LLM-assisted elicitation may be viable for expanding the methodology, though more validation is needed before relying on it heavily. We will continue to benchmark this approach, as we expect LLM estimation capabilities to improve over time.

Step 6: Aggregating estimates. We represent each scenario as a Bayesian network, as depicted in Figure 4.
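To make the aggregation step more concrete, here is a minimal Python sketch of evaluating such a scenario network by sampling. It assumes that each elicited estimate (best guess, lower and upper bounds, stated confidence) is first rescaled to a common 80% confidence level by linear extrapolation around the best guess, a standardization often used with IDEA-style elicitation, and then turned into a triangular distribution. The triangular choice, the independence between factors, the class and function names, and all numbers are hypothetical assumptions for illustration, not the authors' fitting procedure or results.

```python
import random
import statistics
from dataclasses import dataclass


@dataclass
class ExpertEstimate:
    """One elicited risk factor: best guess, bounds, and stated confidence."""
    low: float
    best: float
    high: float
    confidence: float  # stated P(true value in [low, high]), in (0, 1]

    def standardized(self, target=0.8, lo_clip=float("-inf"), hi_clip=float("inf")):
        """Linearly rescale the interval around `best` to the `target` confidence."""
        if not 0.0 < self.confidence <= 1.0:
            raise ValueError("confidence must be in (0, 1]")
        k = target / self.confidence
        low = max(lo_clip, self.best - (self.best - self.low) * k)
        high = min(hi_clip, self.best + (self.high - self.best) * k)
        return low, self.best, high

    def sample(self, rng, **clip):
        """Draw from a triangular distribution (a placeholder choice)."""
        low, mode, high = self.standardized(**clip)
        return rng.triangular(low, high, mode)


def simulate(actors, attempts, tactics, harm, n=20_000, seed=0):
    """Monte Carlo samples of annual impact, treating factors as independent."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        p_success = 1.0
        for est in tactics:  # probabilities are clipped to [0, 1]
            p_success *= est.sample(rng, lo_clip=0.0, hi_clip=1.0)
        samples.append(
            actors.sample(rng, lo_clip=0.0)
            * attempts.sample(rng, lo_clip=0.0)
            * p_success
            * harm.sample(rng, lo_clip=0.0)
        )
    return samples


# Hypothetical elicited values (placeholders, NOT results from the study).
samples = simulate(
    actors=ExpertEstimate(50, 100, 200, 0.8),
    attempts=ExpertEstimate(5, 10, 20, 0.7),
    tactics=[ExpertEstimate(0.3, 0.5, 0.7, 0.8), ExpertEstimate(0.6, 0.8, 0.9, 0.9)],
    harm=ExpertEstimate(10_000, 50_000, 200_000, 0.8),
)
cuts = statistics.quantiles(samples, n=20)  # 5th, 10th, ..., 95th percentiles
threshold = 5_000_000
p_exceed = sum(s > threshold for s in samples) / len(samples)
print(f"median ${statistics.median(samples):,.0f}/yr; "
      f"90% interval ${cuts[0]:,.0f} to ${cuts[-1]:,.0f}; "
      f"P(annual damage > ${threshold:,}) = {p_exceed:.0%}")
```

The last lines show how a sampled distribution supports claims of the "X% probability of $Y in annual damages" form: the share of samples above a damage threshold plays the role of X.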
We fit a statistical distribution to each expert-estimated risk factor, then aggregate across these distributions using Monte Carlo sampling to produce an overall risk distribution for the scenario.

This is a first attempt at quantifying AI risks, and our estimates carry significant uncertainty. We expect many will be revised as methods improve and data accumulates. But recall that WASH-1400's initial estimates were also imperfect, yet the process of quantification still surfaced insights that proved predictive of the Three Mile Island accident. Similarly, our early modeling has already surfaced findings we believe are worth sharing. Below, we present a few, focusing on one scenario: an OC3 attacker targeting a small enterprise, which we estimated using human experts.

An unsurprising result is that, when threat actors use AI, the overall annual economic damage from cyberattacks increases with AI capabilities. Our models produce a full mapping from benchmark performance to risk, but here we highlight three points along that curve: (1) baseline, assuming no AI use, compiled largely from pre-2024 data; (2) SOTA, corresponding to the best-performing agents at the time of the study in late 2025; and (3) saturated, representing hypothetical agents that saturate all benchmarks used as input to our models (Cybench and BountyBench).

We also observe that uncertainty increases substantially as we move from current to saturated capabilities. This reflects the nature of the question: experts have direct evidence of what today's AI systems can do, but estimating the effects of future capabilities requires forecasting how more powerful models might support cyberattacks.

To understand how benchmark performance changes the threat landscape, we plot median overall risk as a function of each benchmark's score across all capability levels. Tasks are ordered by increasing difficulty, and an agent's score is represented by the most challenging task it can solve.
We find that risk is significantly more sensitive to BountyBench scores than to Cybench scores. This likely reflects two factors: BountyBench tasks are generally more challenging, and their greater realism gives experts more signal about whether an AI system could support an actual attack. This illustrates a broader point that surfaced during the discussion phase of the IDEA protocol: risk models can reveal how informative different benchmarks are about real-world risk. Such insights can guide the development of future benchmarks so that they capture the most decision-relevant information.

To identify which steps in the attack chain are the biggest bottlenecks for attackers (i.e., where they fail most often), we use a normalized surprisal metric. For each MITRE ATT&CK tactic t, we compute

$$S_t = \frac{-\log p_t}{\sum_{t'} \left(-\log p_{t'}\right)}$$

where p_t is the probability of success for tactic t, conditioned on success in all previous tactics. Because the overall probability of a successful attack is the product of the p_t, the denominator is the total surprisal of a successful attack, so S_t measures how much each tactic contributes to the overall "surprise" of a successful attack. Higher values indicate that a tactic is a more likely point of failure.

Some tactics remain consistent bottlenecks across all AI capability levels, but others shift. For example, when attackers don't use AI (baseline), privilege escalation is more of a bottleneck than lateral movement. But when attackers use SOTA AI, this flips: lateral movement becomes more of a bottleneck than privilege escalation. Similarly, initial access is the second-largest bottleneck for both baseline and SOTA-level attackers, but it drops in relative importance at saturated capabilities, where privilege escalation and lateral movement become comparatively harder.

This type of insight is directly relevant to decision-making. Depending on their security posture, defenders may choose to invest more resources in attack stages that are bottlenecks, or to focus on stages that are more vulnerable.
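The normalized surprisal metric described above is straightforward to compute once the conditional success probabilities p_t are available: each tactic's surprisal, -log p_t, is divided by the total surprisal of a successful attack, which is the sum over all tactics. Here is a minimal Python sketch; the function name and the example probabilities are hypothetical, not values or code from the study.

```python
import math


def normalized_surprisal(conditional_success: dict[str, float]) -> dict[str, float]:
    """Share of the total surprisal of a successful attack due to each tactic.

    conditional_success[t] is p_t, the probability that tactic t succeeds given
    success in all previous tactics. Since P(success) is the product of the p_t,
    -log P(success) equals the sum of the per-tactic surprisals -log p_t.
    """
    for tactic, p in conditional_success.items():
        if not 0.0 < p <= 1.0:
            raise ValueError(f"p_t for {tactic!r} must be in (0, 1]")
    surprisal = {t: -math.log(p) for t, p in conditional_success.items()}
    total = sum(surprisal.values())
    if total == 0.0:  # every tactic is certain to succeed: no bottleneck
        return {t: 0.0 for t in surprisal}
    return {t: s / total for t, s in surprisal.items()}


# Hypothetical conditional success probabilities (NOT values from the study).
shares = normalized_surprisal({
    "Initial Access": 0.3,
    "Privilege Escalation": 0.5,
    "Lateral Movement": 0.7,
})
# Rank tactics from biggest to smallest bottleneck.
for tactic, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{tactic:<22} {share:.2f}")
```

Running the same function on baseline inputs and on AI-uplifted inputs, and comparing the rankings, is one way to see bottlenecks shift, as in the privilege escalation versus lateral movement example above.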
By anticipating how threats evolve with AI capabilities, our risk models can help prioritize mitigation investments in a forward-looking way.

To identify which risk factors contribute most to AI-driven uplift, we conduct a Shapley attribution analysis on the overall probability of a successful attack. Each value represents a factor's normalized contribution to the logarithmic gain in attack success probability between baseline and AI-uplifted attackers. Higher values indicate that a factor accounts for a larger share of the total uplift. At current SOTA capability levels, the increase in attack success probability is largely driven by privilege escalation, while lateral movement contributes little. At saturated capabilities, however, contributions are more evenly distributed across factors, and lateral movement also contributes. This suggests that as AI capabilities improve, they may cross critical thresholds for steps that are not yet significantly uplifted today.

In this section, we highlight the limitations of our approach and where further work is most needed. We see these as open problems for the field, and we hope others will help address them.

Benchmarks don't match risk factors. Current benchmarks don't always map cleanly to specific risk factors. This makes expert elicitation harder, as experts must extrapolate from imperfect proxies rather than reason from direct evidence. Designing benchmarks more tailored to risk factors would help, but for some factors (like economic harm), this may be inherently difficult.

Expert elicitation is not an exact science. Our methodology relies heavily on expert judgment, but the quantities we ask experts to estimate, such as the probability of success for a specific attack step given a certain AI capability level, can be unintuitive and hard for humans to assess. More iteration on the specifics of the expert elicitation protocol (the exact wording of questions, calibration training, etc.) would be valuable.

Real-world validation. Risk models are only useful if they predict reality.
We haven't yet validated ours against real-world incident data. This is an essential step for refining estimates and building trust in the methodology. We plan to work on this and would welcome collaboration, particularly from those with access to cyber incident data.

Statistical assumptions. When aggregating risk factor estimates, we make simplifying assumptions, most notably that risk factors are independent. There are two avenues for improvement: designing risk models around these assumptions more carefully (something we've begun to address) and directly estimating dependencies between risk factors.

Depth vs. breadth. There's a tradeoff in how specific to make each risk scenario. Broader scenarios (e.g., "OC3 attackers targeting enterprises") are more representative of total risk but harder for experts to estimate. Narrower scenarios (e.g., the target is a specific enterprise with a known security posture) are easier to estimate but less informative about total risk.

Static defenses. In our current models, we assume that defenses remain constant as AI capabilities improve. A more complete picture would also estimate how AI enhances defense. This adds complexity, particularly because attackers and defenders may adopt AI at different rates. For example, small critical infrastructure operators are likely to adopt the latest AI capabilities more slowly than attackers. This introduces a temporal dimension that our current models don't capture.

In the spirit of WASH-1400, we are publishing all our work openly. Our three companion papers are available here: (1) https://arxiv.org/pdf/2512.08723, (2) https://arxiv.org/pdf/2512.08844, (3) https://arxiv.org/pdf/2512.08864. We will release all nine fully estimated risk models publicly in the coming months, and the code for running the LLM estimator pipeline is available at https://github.com/safer-ai/LLM_elicitation.

We are now applying this methodology to CBRN risks. Later, we plan to apply it to loss-of-control risks.
We will continue refining both the general methodology and the domain-specific risk models as we learn from feedback and new data. We have begun partnering with AI Safety Institutes and the European Commission to support organizations developing their own risk modeling capabilities. If you are working on quantitative AI risk assessment and would find collaboration useful, don't hesitate to reach out.

Our estimates carry significant uncertainty and should not be used directly for decision-making without further validation. But in the spirit of WASH-1400, we invite criticism. If you disagree with our scenarios, our risk factor estimates, or our modeling choices, we want to hear it. Nuclear safety today is the product of decades of collective effort by researchers iterating on imperfect first attempts until the methodology became robust. AI safety will require the same. We offer this as a starting point.

[1] We are not singling out OpenAI. Most companies' practices are similar. We have analyzed the 12 published frameworks in detail (https://ratings.safer-ai.org/comparison/).

[2] We include only scenarios where we expect AI to provide non-negligible uplift in terms of the likelihood of attack success, the volume of attacks, or the ability to target more sophisticated defenders than would be possible without access to AI.