Thoughts on Likelihood of Existential Risks by Misaligned AIs

TLDR:

In this post, the cofounders of Mechanize, Tamay Besiroglu, Matthew Barnett, and Ege Erdil, wrote a rebuttal of the case that current trends of AI progress will likely lead to a misaligned superintelligence leading to extinction. I will call the people who believe this view

Their first claim: “...there is no standard argument to respond to, no single text that unifies the AI safety community.”

This argument is further explained in this article, written by the blogger a3orn. For example, many cite Yudkowsky’s

The implication is that it is very hard to find one concrete AI risk argument that I can read and respond to. It is difficult to form opinions on AI safety when most experts disagree so strongly about threat models.

Matthew and Ege find it suspicious that safetyists come up with many widely different arguments to arrive at the same conclusion. They suspect motivated reasoning. Ege points out that in most circumstances, groups should have a couple of big arguments for why they expect something. Economists, for example, will give you roughly the same argument for why tariffs are generally bad (comparative advantage, gains from trade, specialization, etc.).

I think tariffs are a cherry-picked example, one on which we have vast amounts of empirical data. Predicting whether alignment will be easy or impossible, and why, is a much harder and more speculative question. It seems to me that many AI researchers a decade ago believed that we would get human-level AI soon, despite disagreeing on the exact mechanism, whether it was reinforcement learning on games, writing a new programming language for Seed AI, or scaling up transformers. However, they were still united by the rough narrative that computational power was increasing and intelligence could be reduced to computation. Shane Legg obviously deserves a huge amount of credit for predicting in 2009 that we’d have AGI by now, even if he didn’t foresee that scaling transformers would take us there.

There is a rough common thread that unites pessimists. I think their basic case is something like:

For these reasons, I don’t significantly update against x-risk just because the arguments for it are so different from one another: AI safety is a pre-paradigmatic field. However, the disagreement does make me quite wary of assigning high probabilities to doom.

The article’s second claim: “...we’re not saying Y&S [Yudkowsky and Soares] need to provide direct evidence of an already-existing unfriendly superintelligent AI… But their predictions are only credible if they follow from a theory that has evidential support. And if their theory about deep learning only makes predictions about future superintelligent AIs, with no testable predictions about earlier systems, then it is functionally unfalsifiable.”

I agree with this. Pessimists think that we have only one shot at alignment. Once AIs reach a certain capability, they will scheme and grow more intelligent until they are capable of taking over the world. So if you’re trying to rebut the pessimists’ argument, you are placed in an inconvenient spot. You can provide ample evidence that current models are aligned, but pessimists can always claim that a future superintelligence would be very different from the LLMs we have today, and that LLM alignment therefore provides no evidence.

Or they could argue, in the future, that models only *seem* aligned but are actually acting deceptively, and that there is no way to uncover this deception because the models are simply too capable. Solving mechanistic interpretability might be a way to falsify this claim, but it sounds like a very high bar. (It also doesn’t deal with the “future AIs will be very different” critique.)

What does current empirical evidence suggest? I think alignment is a spectrum: there is a lot of room between perfectly aligned AI and rogue AI that leads to catastrophe. This post by Ryan Greenblatt argues that current models are misaligned. However, many of the failure modes he describes sound more like capability issues to me (Claude not being able to assess whether it’s completed a task well, having sloppy outputs, agents overselling their work, etc.). I expect them to be resolved as capabilities increase and models are better able to judge task completion. I associate misalignment more with lying, intent to cause harm, deliberate scheming, blackmail, etc. I agree models aren’t perfectly aligned, but they seem fairly aligned to me. Greenblatt also says he expects current (but not future) “misalignment” failures to be solved soon. I think current evidence points to alignment being the default path.

Again, the pessimists might be right in their theoretical claim that under sufficient optimization pressure, superintelligence becomes misaligned, is deceptive, and takes over the world. I might spend more time listening to debates, weighing arguments, and so on. However, people have spent thousands of hours thinking about this, and I will find it hard to tell which detailed, technical, complicated arguments are correct and which are wrong. I’m wary of being swayed too much by abstract reasoning. I don’t think I could win an argument against a smart conspiracy theorist who made logical arguments for why global warming was a myth, even though they’re wrong, unless I spent lots of time becoming a subject matter expert. I’d just dismiss their arguments for outside view reasons.

You might point out that AI x-risk skeptics have similar problems. They all have different reasons for why AI is *not* going to cause catastrophe, and many don’t agree with each other. Some think alignment is easy; some think alignment may be hard, but control is easy. This is true. The common thread that probably unites skeptics is “we will iteratively develop, test, and make safe superintelligence, just like we make any technology safe, even if it’s unclear which specific alignment techniques we will use”. And overall, I think skeptics have empirical evidence: this is how all technology was developed in the past. Additionally, alignment on current LLMs seems to be working pretty well! So one might argue that on priors, we should assume any technology is safe and the burden of proof is on the people who think x-risk is likely.

In contrast, “x-risk is likely” people think that superintelligence is bad by default. Paraphrasing a blogpost that clarifies the central argument: building an agent more powerful than all humans, which may have different goals, is obviously dangerous. They believe the onus is on skeptics to definitively prove that it will be safe.

Generally, I think much of people’s estimated likelihood of x-risk is just their prior. If someone’s prior is alignment by default, it is easy to dismiss x-risk arguments as theoretical, vague, and not grounded in evidence. If someone’s prior is x-risk, they point out that the AI optimists have no unifying, solid argument either for why AI will be safe, and that this technology is genuinely unprecedented.

I want to engage more with the theoretical arguments that Yudkowsky, Soares, Christiano, Ngo, Turner, and more, present. I also want to do more outside view thinking about what my priors should be. Admittedly, this post does not have a satisfying conclusion, but at this point, I think it’d be more beneficial if I wrote new posts instead of editing this one. I would also really really like to see verifiable claims made by the safety community that allow me to update either way.

Given these confusing viewpoints, what (weakly held) opinions have I come to? (Inspired by Stephen Casper’s list).

source & further reading

Original article: forum.effectivealtruism.org