{"slug": "sympathy-for-both-sides-of-the-egregious-misalignment-debate", "title": "Sympathy for both sides of the egregious misalignment debate", "summary": "A debate over the risk of egregiously misaligned artificial superintelligence (ASI) has split AI researchers into two camps, with Eliezer Yudkowsky and Nate Soares arguing that unchecked AI progress will inevitably produce a rogue, uncontrollable ASI, while most large language model (LLM) experts contend that such catastrophic misalignment, if it occurs, would stem from race dynamics or human error rather than a fundamental lack of alignment theory. The disagreement hinges on whether the technical challenge of alignment is intractable or merely unsolved, with each side dismissing the other's core assumptions as flawed or irrelevant. This impasse matters because it shapes how the AI community prioritizes research and policy efforts to prevent potential existential risks from advanced AI.", "body_md": "On one side of this debate are Yudkowsky & Soares, who think that (if AI progress continues) we’re on a direct path to egregiously-misaligned, scheming, out-of-control, rogue superintelligence (ASI), not even [slightly nice](https://www.lesswrong.com/posts/xvBZPEccSfM8Fsobt/what-are-the-best-arguments-for-against-ais-being-slightly), in the absence of yet-to-be-invented breakthrough technical alignment ideas.\n\nOn the other side of this debate is almost everyone who works on or studies LLMs. Some of them are very concerned about egregious scheming, others much less so, and as a group they’re equally or more concerned about lots of other potential AI problems: AI-assisted bioterrorism, AI-assisted dictatorships, etc. 
And *if* they’re concerned about egregious misalignment and scheming, they’ll probably say that it would come about through race dynamics, careless programmers, bad actors, etc., as opposed to the simpler Yudkowsky & Soares story of “we get egregious misalignment and scheming because nobody has the foggiest idea how to avoid that”.\n\nHere’s my brief idiosyncratic take on this debate. **I think BOTH of the following are true:**\n\nSo then here are three (caricatured) positions:\n\n(1) and (2) are both totally true. And we can reconcile them by saying that LLMs won’t scale to ASI.\n\n(1) is totally true. We know this with great confidence, having spent decades thinking about it.\n\nSo it follows that (2) must be wrong or irrelevant.\n\nWhy is (2) wrong or irrelevant? Hard to say! There’s no ASI yet, and nobody knows in detail how it will appear. Sometimes it’s easier to predict what happens eventually than the detailed path. An ice cube in warm water will melt eventually, but don’t ask me to predict how many seconds it will take to melt, etc.\n\nSo anyway, one possibility is that (2) is wrong because LLMs will kinda ‘wake up’, or something, when the core pieces of true intelligence finally come together. And then their behavior would change drastically for the worse. And maybe we’re already starting to see glimmers of that in existing LLMs?\n\nOr another possibility [cf. Eliezer tweet] is that LLMs will invent non-LLM ASI. And then (2) will be simply irrelevant!…Or something else! Again, we don’t know! But we do know that (1) is definitely right.\n\n(2) is totally true. We know this with great confidence, because we are LLM experts and we have thought about these alignment plans in great detail, including matching our theories against real-world data.\n\nSo it follows that (1) must be incorrect.\n\nWhy is (1) incorrect? I don’t really know! 
Man, I read Yudkowsky and Soares, and it’s all these words, words, words, and I’m reading along and trying to match those words to my knowledge of LLMs and it just doesn’t make any damn sense. I can and will try to respond to their points in detail, but honestly the core issue is that they’re guilty of head-in-the-clouds armchair theorizing gone off the rails.\n\n…So I think that both sides of the debate are basically coming from a reasonable and sympathetic place, with a big kernel of truth.\n\n…That said, I can still complain at both sides!\n\nFor the record, my “true objection” to Yudkowsky & Soares is that if we’re talking about ASI, then LLMs are basically irrelevant and we shouldn’t even be talking about LLMs at all. And meanwhile, their plans are misguided because [delaying ASI is possible on the margin but mostly hopeless](https://www.lesswrong.com/posts/yew6zFWAKG4AGs3Wk/foom-and-doom-1-brain-in-a-box-in-a-basement#1_6_1_I_m_broadly_pessimistic_about_existing_efforts_to_delay_AGI), although I guess I’m happy that they’re trying anyway. Meanwhile, [my hunch is that they’re overstating the intractability of finding that technical alignment breakthrough](https://www.lesswrong.com/posts/bnnKGSCHJghAvqPjS/foom-and-doom-2-technical-alignment-is-hard#2_8_Bonus__Technical_alignment_is_not_THAT_hard), although I haven’t found it *yet*, so I guess time will tell.\n\n…But I’ll put that aside for the sake of argument, and bring up a narrower complaint within their frame:\n\nI think their suggestions that LLMs may become much more misaligned in the future via … umm … the ‘true core of intelligence’ coming together, and ‘waking up’? Like Skynet or something?? 
I’m being mean, sorry, but anyway I don’t think this idea hangs together either theoretically or empirically.\n\nFor the former (theory), see my discussion of the extreme weirdness of the LLM pretraining algorithm in [Foom & Doom §2.3.2](https://www.lesswrong.com/posts/bnnKGSCHJghAvqPjS/foom-and-doom-2-technical-alignment-is-hard#2_3_2_LLM_pretraining_magically_transmutes_observations_into_behavior__in_a_way_that_is_profoundly_disanalogous_to_how_brains_work). I think Yudkowsky & Soares have not internalized how weird this type of learning algorithm is, and if they had, then Yudkowsky would not be occasionally [suggesting](https://x.com/ESYudkowsky/status/1879222543506383039?s=20) that we should think of an LLM as an actress playing characters.\n\nFor the latter (empirical), I think the fairest assessment is that current LLMs are nice and obedient in some contexts, and mean, defiant, and just plain weird in other contexts. You can straightforwardly go from that observation to “maybe there will be egregious misalignment and scheming in the future”, but not to “there will definitely be egregious misalignment and scheming in the future, absent new breakthrough technical alignment ideas”.\n\nI think that if Yudkowsky & Soares stopped treating current LLMs as direct evidence for technical alignment being definitely completely unsolved, and instead treated them as either mixed evidence or entirely off-topic, then their public messaging would come across to policymakers and general audiences as somewhat more convoluted and confusing. But I think it would be more accurate. Oh well.\n\nFor the record, my “true objection” to the LLM people is that I don’t really care about anything they say, because I’m working on the ASI alignment problem, and LLMs won’t scale to ASI.\n\n(I’m overstating a bit. 
I’m generally happy for people to work on making LLM-world a place of wisdom and goodness, especially because LLM-world is the world in which ASI will someday be invented.)\n\n…But I’ll put that aside for the sake of argument, and bring up a narrower complaint within their frame:\n\nI think the LLM people are not pricing in the predictable consequences of ever more RLVR and/or the predictable consequences of ever more [“real” open-ended continual learning](https://www.lesswrong.com/posts/9rCTjbJpZB4KzqhiQ/you-can-t-imitation-learn-how-to-continual-learn), should the latter ever be solved (which I don’t think it will be, but never mind that).\n\nIn other words, lots of LLM-focused people say “LLMs will eventually be able to do the things that human society did over the last 5000 years: open-endedly and autonomously build new knowledge and ideas on top of new knowledge and ideas, in an endless tower, with no need for human ground truth anywhere in that process. And how exactly will the future LLMs do that? Uhh, I dunno, people are working on it, they’ll probably figure something out.”\n\n…And bam, *that’s* how the [pea gets hidden under the thimble](https://www.lesswrong.com/posts/zqmAMst8hmsdJqrpR/shell-games).\n\nBecause if you want the LLMs to gain ever more knowledge, whether through a perpetual RLVR loop or some other yet-to-be-invented type of continual learning, there has to be some kind of ground truth, or else it will go off the rails into nonsense. And that ground truth, whatever it is, will basically amount to an objective function (a.k.a. cost function, reward function, whatever). 
And when the LLM updates *enough* on that ground truth, then [whatever human-niceness that the LLM inherited from pretraining will get diluted away](https://www.lesswrong.com/posts/bnnKGSCHJghAvqPjS/foom-and-doom-2-technical-alignment-is-hard#2_3_5_Putting_everything_together__LLMs_are_generally_not_scheming_right_now__but_I_expect_future_AI_to_be_disanalogous) in favor of ruthless maximization of that objective function.\n\n(See also: [Why we should expect ruthless sociopath ASI](https://www.lesswrong.com/posts/ZJZZEuPFKeEdkrRyf/why-we-should-expect-ruthless-sociopath-asi).)\n\n*Thanks Zack M. Davis for a brief discussion that inspired this post.*", "url": "https://wpnews.pro/news/sympathy-for-both-sides-of-the-egregious-misalignment-debate", "canonical_source": "https://www.lesswrong.com/posts/DZaZ3fqHnvfLCftPu/sympathy-for-both-sides-of-the-egregious-misalignment-debate", "published_at": "2026-06-12 16:26:12+00:00", "updated_at": "2026-06-12 16:57:59.951613+00:00", "lang": "en", "topics": ["artificial-intelligence", "ai-safety", "large-language-models", "ai-research"], "entities": ["Yudkowsky", "Soares"], "alternates": {"html": "https://wpnews.pro/news/sympathy-for-both-sides-of-the-egregious-misalignment-debate", "markdown": "https://wpnews.pro/news/sympathy-for-both-sides-of-the-egregious-misalignment-debate.md", "text": "https://wpnews.pro/news/sympathy-for-both-sides-of-the-egregious-misalignment-debate.txt", "jsonld": "https://wpnews.pro/news/sympathy-for-both-sides-of-the-egregious-misalignment-debate.jsonld"}}