A misalignment taxonomy


[ARTICLE · art-35486] src=lesswrong.com ↗ pub=2026-06-21T10:20Z topic=ai-safety verified=true sentiment=· neutral

A new taxonomy of AI alignment failures categorizes five types of inner misalignment and two types of outer misalignment, including precocious, gradient, capabilities-based, volition-based, and human misalignment, to clarify distinct failure modes in AI systems.


I am going to discuss five kinds of inner misalignment and two kinds of outer misalignment, which together form a simple taxonomy of alignment failure modes. When I talk about a kind of misalignment here, I mean a reason for misalignment (like inner/outer misalignment), not a kind of misaligned agent (like a schemer versus a fitness-seeker), although the two can be related. Several of these failure modes could occur together; I am attempting to describe independent, but potentially overlapping, sources of misalignment.

**Precocious misalignment:** Precocious inner misalignment occurs when a half-baked [1] sub-optimizer

**Gradient misalignment:** Gradient misalignment [5] occurs when some bias

**Capabilities-based misalignment:** Capabilities-based misalignment occurs when training that rewards capabilities without fully factoring in alignment (such as most RLVR training setups) becomes salient enough that it overrides alignment-based training. This might cause the AI to fake alignment during alignment training and to engage in other kinds of goal-guarding.

**Volition-based misalignment:** Volition-based misalignment occurs when some constraint on alignment training causes it to fail to reward exactly what we intended, and this failure ends up being load-bearing for the AI's learned goals. These failures could include grader mistakes, insufficient time or knowledge on the part of the grader, the grader pool being insufficiently representative of some target population, or misclassifications by an automated grader that a human would not have endorsed.
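As a toy illustration of how a grader error can become load-bearing, here is a minimal sketch. Everything in it is hypothetical: the behavior names, the scores, and the `flawed_grader` function are made up for illustration and do not come from any real training setup. It only shows that a small, systematic grading mistake can change which behavior is most rewarded.

```python
# Toy illustration only: behaviors, scores, and the grader are hypothetical.

# What the developers actually intend to reward.
INTENDED_SCORE = {
    "honest_answer": 1.0,
    "confident_guess": 0.2,
    "flattering_answer": 0.1,
}


def flawed_grader(behavior: str) -> float:
    """A grader short on time that mistakenly over-rewards confident-sounding output."""
    bonus = 0.9 if behavior == "confident_guess" else 0.0
    return INTENDED_SCORE[behavior] + bonus


def best_behavior(score_fn) -> str:
    """Return the behavior with the highest score under the given scoring function."""
    return max(INTENDED_SCORE, key=score_fn)


intended_best = best_behavior(INTENDED_SCORE.get)  # what we wanted to reinforce
rewarded_best = best_behavior(flawed_grader)       # what training actually reinforces

print("intended:", intended_best)
print("rewarded:", rewarded_best)
```

In this sketch the grader's mistake matters only because it changes the top-scoring behavior; a mistake too small to change that ranking would not be load-bearing in the sense described above.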

The above all describe kinds of technical misalignment. By this, I mean that these failure modes can occur even if we 100% endorse the goal that the developers intend to instill in the AI. However, there are also various ways in which the intended objective itself could be "wrong." For example, a moral realist might say that we are uncertain whether developers will instill a normatively correct objective function. An alternative framing is that the objective might differ from what you, the reader, would endorse or prefer. There are many other desiderata we might want to measure. For example, one might want to measure the degree to which the objective is democratically determined, and it's possible that developers will fail by this standard. I call this broad set of potential failures in the intended objective "human misalignment." Below I give a more complete misalignment taxonomy, including this category. Some people fold "human misalignment" into outer misalignment, but I find it useful to treat the developer's intended goal as an intermediary between the objective function and the "correct" intended goal.

I think creating a taxonomy of human misalignment would be a fun, although somewhat controversial, project.

What I call “precocious misalignment” may sound similar to “deceptive alignment” from Risks from Learned Optimization or “schemers” from Scheming AIs. I did not use the term "deceptive alignment" for a couple of reasons:

This second reason also differentiates precocious misalignment from scheming. Scheming is a designation about what the AI is trying to do, and precocious misalignment references a reason an AI might be misaligned. I also considered “crystallized misalignment” in reference to Joe Carlsmith’s crystallization hypothesis, which describes precociously misaligned AIs’ goals becoming completely locked in (rather than adjusting, to some extent, even after goal-guarding has begun). However, I want the term to be agnostic to the degree of post-goal-guarding goal adjustment as long as the goals are locked in enough to cause misalignment.

[1] I specify it is "half-baked" because if the sub-optimizer were already "fully baked" and still misaligned, then this process of goal-guarding that is characteristic of precocious misalignment would not be a load-bearing property of the misalignment of that training process.

[2] I say "sub-optimizer" and not "mesa-optimizer" because it may be that the AI is an optimizer (as in the mesa-optimizer case), but it may also be that the AI simply acts as an optimizer within a particular context.

[3] This must include some form of training gaming in order to retain goals that would otherwise not be maximally fit.

[4] See Appendix B for why I use the term "precocious misalignment" and how this failure mode compares to deceptive alignment and scheming.

[5] I also like the name "gradient misspecification" here.

[6] I would call issues from variance in the gradient "underfit misalignment."
