How I think developers of frontier AI systems and regulators ought to act in the face of existential AI risk


This is a post I drafted on 2025-08-06 that doesn't live up to the ambitious title, but I'm publishing it anyways 10+ months later. "Don't let perfect be the enemy of good."

Below I describe a simple framework for thinking about the permissibility of building the next frontier AI system. *TL;DR: Evaluate the risk that proceeding with building the AI system will lead to catastrophic harm. (Get independent regulators to evaluate the risk too.) If the risk is too high, do not build it.* I am a layperson, but my take is that frontier AI companies generally don't even include as much nuance as I include here in their public communications, so I decided to write up this simple framework to try to improve the discourse. I'd appreciate any feedback. I don't think anything I'm saying here is remotely new, but I'm not aware of a post like this.

In a recent podcast episode published July 20, 2025, Anthropic co-founder Ben Mann is asked (at 48:43) "What are the odds that we align AI correctly and actually solve this problem?"

In his answer, Ben references the following part of Anthropic's March 8, 2023 blog post titled Core Views on AI Safety: When, Why, What, and How, which lays out a framework for what Anthropic thinks it should do given different possibilities for how difficult it will be to develop powerful AI that is aligned and safe:

One particularly important dimension of uncertainty is how difficult it will be to develop advanced AI systems that are broadly safe and pose little risk to humans. Developing such systems could lie anywhere on the spectrum from very easy to impossible. Let’s carve this spectrum into three scenarios with very different implications:

Optimistic scenarios: There is very little chance of catastrophic risk from advanced AI as a result of safety failures. Safety techniques that have already been developed, such as reinforcement learning from human feedback (RLHF) and Constitutional AI (CAI), are already largely sufficient for alignment. The main risks from AI are extrapolations of issues faced today, such as toxicity and intentional misuse, as well as potential harms resulting from things like widespread automation and shifts in international power dynamics - this will require AI labs and third parties such as academia and civil society institutions to conduct significant amounts of research to minimize harms.

Intermediate scenarios: Catastrophic risks are a possible or even plausible outcome of advanced AI development. Counteracting this requires a substantial scientific and engineering effort, but with enough focused work we can achieve it.

Pessimistic scenarios: AI safety is an essentially unsolvable problem – it’s simply an empirical fact that we cannot control or dictate values to a system that’s broadly more intellectually capable than ourselves – and so we must not develop or deploy very advanced AI systems. It's worth noting that the most pessimistic scenarios might look like optimistic scenarios up until very powerful AI systems are created. Taking pessimistic scenarios seriously requires humility and caution in evaluating evidence that systems are safe.

If we’re in an optimistic scenario…the stakes of anything Anthropic does are (fortunately) much lower because catastrophic safety failures are unlikely to arise regardless. Our alignment efforts will likely speed the pace at which advanced AI can have genuinely beneficial uses, and will help to mitigate some of the near-term harms caused by AI systems as they are developed. We may also pivot our efforts to help policymakers navigate some of the potential structural risks posed by advanced AI, which will likely be one of the biggest sources of risk if there is very little chance of catastrophic safety failures.

If we’re in an intermediate scenario…Anthropic’s main contribution will be to identify the risks posed by advanced AI systems and to find and propagate safe ways to train powerful AI systems. We hope that at least some of our portfolio of safety techniques – discussed in more detail below – will be helpful in such scenarios. These scenarios could range from "medium-easy scenarios", where we believe we can make lots of marginal progress by iterating on techniques like Constitutional AI, to "medium-hard scenarios", where succeeding at mechanistic interpretability seems like our best bet.

If we’re in a pessimistic scenario…Anthropic’s role will be to provide as much evidence as possible that AI safety techniques cannot prevent serious or catastrophic safety risks from advanced AI, and to sound the alarm so that the world’s institutions can channel collective effort towards preventing the development of dangerous AIs. If we’re in a “near-pessimistic” scenario, this could instead involve channeling our collective efforts towards AI safety research and halting AI progress in the meantime. Indications that we are in a pessimistic or near-pessimistic scenario may be sudden and hard to spot. We should therefore always act under the assumption that we still may be in such a scenario unless we have sufficient evidence that we are not.

From my perspective, the main thing missing from Anthropic's framework described above is the observation that nearly all intermediate scenarios ought to be treated as "near-pessimistic" scenarios.

A commenter @Roman Leventov pointed this out on the LessWrong cross-post of the blog post:

If we’re in a “near-pessimistic” scenario, this could instead involve channeling our collective efforts towards AI safety research and halting AI progress in the meantime.

I think the precautionary principle dictates that we should do this in any scenarios apart from “obviously optimistic” or “nearly obviously optimistic”

To see why I agree with this, let's make Anthropic's model more nuanced by adding to the intermediate scenarios the probability that the company's safety efforts will adequately ensure that the next advanced AI system they are developing is broadly safe and poses little risk to humans.

Again:

Intermediate scenarios: Catastrophic risks are a possible or even plausible outcome of advanced AI development. Counteracting this requires a substantial scientific and engineering effort, but with enough focused work we can achieve it.

So let's change "but with enough focused work we can achieve it" to "but with enough focused work we can *probably* achieve it."

Adding probability to the framework makes the correct course of action depend on how likely it is that the "substantial scientific and engineering effort" will be enough to counteract the potential catastrophic risks of the AI.

If the risk that the AI system will not be safe is sufficiently low, then it is permissible for Anthropic to do what it originally said it would do in intermediate scenarios, namely proceed with developing the AI system, or in Anthropic's words: "Identify the risks posed by advanced AI systems and [find] and propagate safe ways to train powerful AI systems." (I interpreted this as meaning "proceed with developing the AI system that Anthropic thinks is safe, now that it has done substantial alignment work.") However, if the risk that the AI system will not be safe is sufficiently *high* despite substantial alignment work, then Anthropic instead ought to do what it said it should do in near-pessimistic scenarios: "[Channel] our collective efforts towards AI safety research and [halt] AI progress in the meantime."

To give a concrete example, if the probability is virtually 100% that the AI system being built will not cause any catastrophic harm, perhaps because it is too weak of a system like Claude 4, then proceeding with developing it seems permissible.

However, suppose we reach a stage in the next few years where frontier AI systems are approaching the point where they are powerful enough to be existentially risky. Suppose further that Anthropic, or an independent evaluator of its next AI model, concludes that in their subjective judgment there is, say, a 1% chance that Anthropic's alignment and steering measures will fail and that the next frontier model, if developed, will take over and cause human extinction or another extremely bad outcome. Then this "intermediate scenario" is actually a near-pessimistic scenario, because a 1% risk of human extinction or other existential catastrophe is unacceptably high. The correct thing to do in this situation would be to treat it as a near-pessimistic scenario by "channeling our collective efforts towards AI safety research and halting AI progress in the meantime."

Is this scenario plausible? Well, yes. For example, Geoffrey Hinton says that his independent impression is that AI existential risk is about 50%, and that his all-things-considered view once he takes into account the opinions of everyone he knows is about 10-20%. If he's right, then in 10-20% of futures someone is going to build an AI system that causes an existential catastrophe. If they do, and they're responsible like Anthropic aims to be, then they should make a subjective forecast of the probability that the AI system will cause an existential catastrophe before they build it. Ideally they should say "this seems like it has a 1% chance of causing catastrophe, let's stop" rather than "this seems extremely unlikely to cause a catastrophe, let's keep developing it" and then have it go on to cause an existential catastrophe after it is built.

What is the probability threshold where the permissible thing to do would be to proceed with the probably-safe development of the next frontier AI system, rather than halt and work on improved safety?

I don't know precisely, but surely any company's confidence that the next AI system it's developing will not cause human extinction must be above 99% for it to be permissible to continue developing it. Even a 1-in-a-million chance that you're wrong about the adequacy of your safety measures and that the AI will actually cause an existential catastrophe via takeover or gradual disempowerment or whatever seems like too high of a chance for it to be permissible to build the AI system.

All of humanity should have a say in how much human extinction risk AI developers are permitted to take on in the development of powerful AI. There ought to be global regulations ensuring that nobody is allowed to build a new AI system with a greater than 1-in-a-million chance that the system takes over and causes extinction or another extremely bad outcome.

The AI systems that have been built already seem too weak to pose any significant takeover or extinction risk currently, but as AI systems become more powerful and approach the point of being existentially-risky, regulators and independent evaluators ought to be in the loop with evaluating plans to train and build frontier AI systems to ensure that they are at least 99.9999% likely to not cause an existential catastrophe, before further development of the AI is allowed.

Without this added nuance about the probability of success in the intermediate scenarios, I think Anthropic's default answer of what to do in the intermediate scenarios is problematic. Given the societal-scale stakes, I think the right thing to do in the intermediate scenarios is to follow the precautionary principle and refrain from developing any new frontier AI system unless it has been shown that it is very likely to be safe, and in particular >99.9999% likely not to cause extinction. If any AI developer cannot be that confident, or any independent evaluators are not that confident, then the right thing to do is to halt developing the model and related AI progress and channel efforts towards AI safety research just like in any other near-pessimistic scenario.
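To make the rule above concrete, here is a minimal Python sketch of the decision it describes. It is only an illustration under stated assumptions: the names `RiskEstimate`, `may_proceed`, and `MAX_ACCEPTABLE_RISK` are hypothetical, the 1-in-a-million threshold is the figure proposed in this post rather than any established standard, and the inputs are subjective probability estimates, not measurements.

```python
from dataclasses import dataclass

# Assumption: the threshold proposed in this post, i.e. at most a
# 1-in-a-million chance of existential catastrophe.
MAX_ACCEPTABLE_RISK = 1e-6


@dataclass
class RiskEstimate:
    evaluator: str        # hypothetical label, e.g. "developer" or "independent evaluator"
    p_catastrophe: float  # subjective probability of existential catastrophe, in [0, 1]


def may_proceed(estimates, threshold=MAX_ACCEPTABLE_RISK):
    """Return True only if every evaluator's risk estimate is at or below the threshold."""
    if not estimates:
        # With no evaluation at all, the system has not been shown to be safe.
        return False
    for estimate in estimates:
        if not 0.0 <= estimate.p_catastrophe <= 1.0:
            raise ValueError(f"invalid probability from {estimate.evaluator}")
    # A single evaluator above the threshold is enough to halt development.
    return all(e.p_catastrophe <= threshold for e in estimates)


if __name__ == "__main__":
    estimates = [
        RiskEstimate("developer", 1e-7),
        RiskEstimate("independent evaluator", 0.01),  # the 1% case discussed earlier
    ]
    print("proceed" if may_proceed(estimates) else "halt and work on safety")
```

The design choice worth noting is that the check is conjunctive: the developer's own confidence is not sufficient if any independent evaluator's estimate exceeds the threshold, which mirrors the "any AI developer ... or any independent evaluators" wording above.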

Do I have confidence that Anthropic and other leading AI companies will do this? I think there is some chance they will, but I don't have enough confidence that they will because they don't give assurances to that effect.

For example, from the podcast mentioned previously, here is Ben Mann's full answer to the question "What are the odds that we align AI correctly and actually solve this problem of stuff you’re very much working on?"

Benjamin Mann: It’s a really hard question. And there’s really wide error bars. Anthropic has this blog post called “Our Theory of Change” or something like that. And it describes three different worlds, which is like, how hard is it to align AI? There’s a pessimistic world where it’s basically impossible. There’s an optimistic world where it’s easy and it happens by default. And then there’s a world in between where our actions are extremely pivotal. And I like this framing because it makes it a lot more clear what to actually do. If we’re in the pessimistic world, then our job is to prove that it is impossible to align safe AI and to get the world to slow down. And obviously, that would be extremely hard. But I think we have some examples of coordination from nuclear non-proliferation and in general, slowing down nuclear progress. And I think that’s still like doomer world, basically. And as a company, Anthropic doesn’t have evidence that we’re actually in that world yet. In fact, it seems like our alignment techniques are working. So at least the prior on that is updating to be less likely.

In the optimistic world, we’re basically done. And our main job is to accelerate progress and to deliver the benefits to people. But again, I think actually the evidence points against that world as well, where we’ve seen evidence in the wild of deceptive alignment, for example, where the model will appear to be aligned, but actually has some ulterior motive that it’s trying to carry out in our laboratory settings.

And so I think the world we’re most likely in is this middle world where alignment research actually does really matter. And if we just do sort of the economically maximizing set of actions, then things will not go well. Whether it’s an X-risk or just like produces bad outcomes, I think is a bigger question. So taking it from that standpoint, I guess to state a thing about forecasting, people who haven’t studied forecasting are bad at forecasting anything that’s less than a 10% probability of happening. And even those that have, it’s quite a difficult skill, especially when there are few reference classes to lean on. And in this case, I think there are very, very few reference classes for what an X-risk kind of technology might look like. And so the way I think about it, I think my best granularity of forecast for like, could we have an X-risk or extremely bad outcome from AI is somewhere between 0 and 10%. But from a marginal impact standpoint, as I said, since nobody is working on this roughly speaking, I think it is extremely important to work on. And that even if the world is likely to be a good one, that we should do our absolute best to make sure that that’s true.

Putting aside Ben's "somewhere between 0 and 10%" answer, does Ben mention that even in the intermediate scenarios where solving alignment seems possible, the best course of action may often be to halt AI progress anyways because the risk that alignment measures will be inadequate is unacceptably high? No. He just says that in the middle worlds alignment research really matters. It would have been the perfect opportunity for him to say that if we weren't that confident in the alignment research and the situation appeared too risky, then Anthropic and humanity could just decide to slow down or halt AI progress to reduce existential risk, just as if we were clearly in a pessimistic world. But he didn't.

On a recent podcast published Jul 30, 2025, Dario Amodei acknowledged AI risk and the possibility of slowing down development:

If we got to much more powerful models with only the alignment techniques we have now, then I'd be very concerned. Then I'd be going out there saying that everyone should stop building these things. Even China should stop building these. I don't think they'd listen to me ... but if we got a few years ahead in models and had only the alignment and steering techniques we had today, then I would definitely be advocating for us to slow down a lot.

That's good to hear. But what he said next is not.

Dario then proceeded to say that "an intellectually and morally unserious way to respond to the situation" is to say that you "know there’s no way to build this safely" and that you "can kind of logically prove that there’s no way to make them safe."

Full excerpt:

On one hand, we have a cadre of people who are just doomers. People call me a doomer but I'm not. But there are doomers out there. People who say they know there’s no way to build this safely. You know, I’ve looked at their arguments. They're a bunch of gobbledegook. The idea that these models have dangers associated with them, including dangers to humanity as a whole, that makes sense to me. The idea that we can kind of logically prove that there’s no way to make them safe, that seems like nonsense to me. So I think that is an intellectually and morally unserious way to respond to the situation. I also think it is intellectually and morally unserious for people who are sitting on $20 trillion of capital, who all work together because their incentives are all in the same way, there are dollar signs in all of their eyes, to sit there and say we shouldn’t regulate this technology for 10 years.

First of all, as I quoted above, Anthropic has written "Indications that we are in a pessimistic or near-pessimistic scenario may be sudden and hard to spot. We should therefore always act under the assumption that we still may be in such a scenario unless we have sufficient evidence that we are not."

It seems unwarranted to me for the CEO of a company that claims we should "always act under the assumption that we still may be in such a [pessimistic] scenario" to call the view that we are definitely in a pessimistic scenario "morally unserious." And yet that's what he's doing when he attacks the views of those who say they "know there’s no way to build this safely."

Recall:

Pessimistic scenarios: AI safety is an essentially unsolvable problem – it’s simply an empirical fact that we cannot control or dictate values to a system that’s broadly more intellectually capable than ourselves – and so we must not develop or deploy very advanced AI systems. It's worth noting that the most pessimistic scenarios might look like optimistic scenarios up until very powerful AI systems are created. Taking pessimistic scenarios seriously requires humility and caution in evaluating evidence that systems are safe.

Second of all, who was he referring to who takes that view? He didn't say. He used the term "doomer," but Yudkowsky and many others who have been called "doomers" don't believe that. Was he strawmanning Yudkowsky or did he have someone else in mind who actually does believe this, such as possibly Roman Yampolskiy? (For what it's worth, I agree that Yampolskiy's ridiculously overconfident p(doom) of 99.999999% is epistemically unserious, but as I said above I wouldn't call him morally unserious for thinking alignment is impossible even in principle, nor even for thinking that he can prove that, if that is his view.)

In any case, the fact that Dario chose to criticize this epistemically extreme and unpopular view and yet didn't say anything about the views of Yudkowsky or Hinton or Bengio or others (who think that alignment is solvable in principle, but has a significant chance of failing in practice) is concerning to me.

It is concerning because there is a whole bunch of middle ground between "the alignment and steering techniques we have today [are probably not adequate to steer AI models a few years from now and ensure they are safe]" and "alignment is impossible in principle".

When Dario doesn't acknowledge all that middle ground that people like Yudkowsky and Hinton and Bengio occupy, and instead merely attacks the extreme view that alignment is provably impossible, it makes me concerned that Dario may just plow ahead and build significantly-existentially-risky AI models in the next few years even if his new alignment and steering techniques are not sufficiently likely to be adequate.

People like me—who are concerned that alignment and steering and safety techniques may not improve fast enough to ensure that models a few years down the line won't cause catastrophic problems—would like to hear AI company leaders acknowledge that the catastrophic risks may be higher than they expect. We'd also like to hear that the risks of future frontier models they plan to develop will be properly assessed by independent evaluators, to ensure they are sufficiently likely to be safe before they are developed. And if the evaluations do not show that they are sufficiently safe, I'd like AI companies to make clear that they will not build the systems.

At [1:04:31](https://youtu.be/mYDSSRS-B5U?si=-Y_UQQCb6M7D8r3E&t=3871) of the same podcast, Dario says:

If you have a better strategy for balancing the benefits, the inevitability of the technology, and the risks that it faces, I am very open to hear it, because I go to sleep every night thinking about it because I have such an incredible understanding of the stakes.

Dario is not signaling to me that he is taking the risks appropriately seriously when all he does after saying this is criticize the view that it's provable that there is no way to make AI safe, instead of acknowledging the much more epistemically modest concerns of Yudkowsky, Hinton, Bengio, etc.

Consider this post my attempt to provide a "better strategy". In summary, communicate that Anthropic will do the right thing as its frontier models approach the point of being existentially risky.

source

lesswrong.com — original article