# We Need Breadth-First AI Safety Plans

> Source: <https://forum.effectivealtruism.org/posts/vxnHdgsXuxGGi6qqS/we-need-breadth-first-ai-safety-plans>
> Published: 2026-06-01 17:36:35+00:00

*Cross-posted from my website.*

**Depth-first** plans lay out a path from here to aligned superintelligent AI. We need those kinds of plans. But depth-first plans depend on many assumptions: "We will make AI safe by doing step 1, then step 2, then step 3." Step 1 only works under condition A, step 2 requires condition B, step 3 requires condition C. If A or B or C is false, the whole plan fails (and there's a good chance we all die).

Consider [Google's safety plan](https://deepmind.google/discover/blog/taking-a-responsible-path-to-agi/) from April 2025. To my knowledge, this is the best among the frontier AI companies' plans.[[1]](https://forum.effectivealtruism.org/feed.xml#fn-QGHEWZ4LM62HwctDF-1)

Google's plan depends on a series of conditions:

(The plan depends on many more conditions than that, but I'll keep it short.)

That list included eight conditions. If any one of those conditions fails, then the whole plan fails. Some of the conditions seem likely to be true; others seem questionable. But even if every individual condition is probably true, it's much less likely that they're *all* true.
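To make the arithmetic concrete, here is a minimal sketch. The 90% figure is a made-up illustration, not an estimate of any real condition in Google's plan, and the calculation assumes the conditions are independent, which real conditions probably are not.

```python
from math import prod


def joint_probability(probs):
    """Probability that every condition holds, assuming independence."""
    return prod(probs)


# Hypothetical numbers: eight conditions, each 90% likely on its own.
conditions = [0.9] * 8

print(f"{joint_probability(conditions):.2f}")  # prints 0.43
```

Under these assumptions, eight individually likely conditions are jointly true less than half the time.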

Disjunctive conditions are better than conjunctive ones. We can see an example in condition 3.1 above: Google's plan can work if it's possible to align the "bootstrapper" AI, OR if misalignment is easy to spot, OR if it doesn't need to be aligned. Disjunctive conditions are good; more of those, please.

We need **breadth-first** plans:

- We will take actions X, Y, and Z.
- X depends on condition A.
- Y works even if A is false, but it depends on condition B.
- Z works if A and B are false; it depends on a third condition C.

X + Y + Z works even if two out of three conditions fail.
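The comparison between the two plan shapes can be sketched as follows. This is a toy model under stated assumptions: conditions A, B, and C are independent, each is given a hypothetical 50% probability, and any single action working is enough for the breadth-first plan to succeed. The function names are my own, for illustration only.

```python
from itertools import product
from math import prod


def depth_first_success(probs):
    """A depth-first plan needs every condition to hold (conjunctive)."""
    return prod(probs)


def breadth_first_success(probs):
    """A breadth-first plan needs at least one condition to hold (disjunctive)."""
    return 1 - prod(1 - p for p in probs)


# Hypothetical probabilities for conditions A, B, and C.
probs = [0.5, 0.5, 0.5]

print(depth_first_success(probs))    # prints 0.125
print(breadth_first_success(probs))  # prints 0.875

# The same point without probabilities: enumerate every combination of
# (A, B, C) being true or false and count the worlds where each plan works.
worlds = list(product([True, False], repeat=3))
depth_ok = [w for w in worlds if all(w)]
breadth_ok = [w for w in worlds if any(w)]

print(len(depth_ok), len(breadth_ok), len(worlds))  # prints 1 7 8
```

In this toy model the breadth-first plan fails only in the single world where A, B, and C are all false, while the depth-first plan succeeds only in the single world where all three are true.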

Some plans have a little bit of breadth. An explicit example from Google's safety plan:

> Our approach has two lines of defense. First, we aim to use model level mitigations to ensure the model does not pursue misaligned goals. [...] Second, we consider how to mitigate harm even if the model is misaligned (often called “AI control”), through the use of system level mitigations.

I would like to see **more** breadth, and **recursive** breadth—there should be breadth within each component of the plan, and breadth within those sub-components.

The broadest plan that's been published is Peter Barnett & Aaron Scher's [AI Governance to Avoid Extinction: The Strategic Landscape and Actionable Research Questions](https://intelligence.org/wp-content/uploads/2025/05/AI-Governance-to-Avoid-Extinction.pdf) (see also the corresponding [LessWrong post](https://www.lesswrong.com/posts/WkCfvqyjCzvRrwkaQ/ai-governance-to-avoid-extinction-the-strategic-landscape)). The report explicitly considers four possible future scenarios and how we might achieve a good outcome from within each scenario. The report even includes a flowchart:

The report goes into more detail about the conditions required for each of the four scenarios to succeed.

Barnett & Scher believe "Off Switch and Halt" is the best strategy. They don't exactly phrase it this way, but according to their report, "Off Switch and Halt" depends on the *fewest conditions* and has *multiple ways of succeeding*.

I see two big benefits to writing breadth-first plans:

The good news is the branches off the roots are the most important because they have the greatest probability mass. Creating layers of branches off branches off branches quickly gets complicated, but I don't think it's necessary.

I made a quick flowchart to categorize AI safety plans at a high level.

The idea is that we need a broad set of overlapping plans such that *some* plan will work, even if many conditions (red nodes) turn out to be false.

(Click [here](https://mdickens.me/assets/images/AI-plans.png) to see the full-size image.)

Is this flowchart comprehensive? Definitely not. Is it even accurate? Maybe. My point is that, to make AI safe, we need multiple plans that cover all the ways the other plans could go wrong, and this flowchart is a quick attempt at representing some of those plans.

1. I originally wrote this article shortly after April 2025, but I procrastinated for a year on finishing it, so I'm not sure about the current state of AI companies' plans. [↩︎](https://forum.effectivealtruism.org/feed.xml#fnref-QGHEWZ4LM62HwctDF-1)

2. I am skeptical that a bootstrapped-aligned AI will behave morally in ways in which most humans do not behave morally, e.g. eating factory-farmed animals; or that it will be able to correctly resolve the internal inconsistencies in common-sense ethics. For example, in the [mere addition paradox](https://en.wikipedia.org/wiki/Mere_addition_paradox), most people accept a set of premises but reject the conclusion that necessarily follows from those premises.[[4]](https://forum.effectivealtruism.org/feed.xml#fn-QGHEWZ4LM62HwctDF-4) [↩︎](https://forum.effectivealtruism.org/feed.xml#fnref-QGHEWZ4LM62HwctDF-2)

3. Technically, what we want isn't paths that depend on few conditions. We want paths where the joint probability of every condition is as high as possible. But generally speaking, fewer conditions means the probability of success is higher. [↩︎](https://forum.effectivealtruism.org/feed.xml#fnref-QGHEWZ4LM62HwctDF-3)

4. Philosophy Experiments' [Philosophical Health Check](https://www.philosophyexperiments.com/health/Default.aspx) asks you a series of questions and purports to identify inconsistencies in your beliefs. I think the questions leave some wiggle room to argue that supposed inconsistencies aren't truly inconsistent, but a more rigorous test would be harder to construct. [↩︎](https://forum.effectivealtruism.org/feed.xml#fnref-QGHEWZ4LM62HwctDF-4)
