I am going to discuss five kinds of inner misalignment and two kinds of outer misalignment, which together form a simple taxonomy of alignment failure modes. When I talk about a kind of misalignment here, I mean a reason for misalignment (like inner/outer misalignment), not a kind of misaligned agent (like a schemer versus a fitness-seeker), although the two can be related. Several of these failure modes could occur together; I am attempting to describe independent, but potentially overlapping, sources of misalignment.
**Precocious misalignment:** Precocious inner misalignment occurs when a half-baked [1] sub-optimizer
**Gradient misalignment:** Gradient misalignment [5] occurs when some bias
**Capabilities-based misalignment:** Capabilities-based misalignment occurs when training that rewards capabilities without fully factoring in alignment (as in most RLVR training setups) becomes salient enough to override alignment-based training. This might lead the AI to fake alignment during alignment training and to engage in other kinds of goal-guarding.
**Volition-based misalignment:** Volition-based misalignment occurs when some constraint on alignment training causes it to reward something other than exactly what we intended, and this failure ends up being load-bearing for the AI's learned goals. These failures could include grader mistakes, insufficient time or knowledge on the part of the grader, a grader pool that is insufficiently representative of some target population, or misclassifications by an automated grader that a human would not have endorsed.
The above all describe kinds of technical misalignment. By this, I mean that these failure modes can occur even if we fully endorse the goal that the developers intend to instill in the AI. However, there are also various ways in which the intended objective could be “wrong.” For example, a moral realist might say that we are uncertain whether developers will instill a normatively correct objective function. An alternative framing is that the objective might differ from what you, the reader, would endorse or prefer. There are many other desiderata we might want to measure. For example, one might want to measure the degree to which the objective is democratically determined, and it’s possible that developers will fail by this standard. I call this broad set of potential failures in the intended objective “human misalignment.” Below I give a more complete misalignment taxonomy, including this category. Note that some people fold “human misalignment” into outer misalignment, but I find it useful to treat the developer’s intended goal as an intermediary between the objective function and the “correct” intended goal.
I think creating a taxonomy of human misalignment would be a fun, although somewhat controversial, project.
What I call “precocious misalignment” may sound similar to “deceptive alignment” from Risks from Learned Optimization or “schemers” from Scheming AIs. I did not use the term "deceptive alignment" for a couple of reasons:
This second reason also differentiates precocious misalignment from scheming. Scheming is a designation about what the AI is trying to do, whereas precocious misalignment refers to a reason an AI might be misaligned. I also considered “crystallized misalignment,” in reference to Joe Carlsmith’s crystallization hypothesis, which describes precociously misaligned AIs’ goals becoming completely locked in (rather than continuing to adjust, to some extent, even after goal-guarding has begun). However, I want the term to be agnostic about the degree of post-goal-guarding goal adjustment, as long as the goals are locked in enough to cause misalignment.
[1] I specify it is "half-baked" because if the sub-optimizer were already "fully baked" and still misaligned, then this process of goal-guarding that is characteristic of precocious misalignment would not be a load-bearing property of the misalignment of that training process.
[2] I say "sub-optimizer" and not "mesa-optimizer" because it may be that the AI is an optimizer (as in the mesa-optimizer case), but it may also be that the AI simply acts as an optimizer within a particular context.
[3] This must include some form of training gaming in order to retain goals that would otherwise not be maximally fit.
[4] See Appendix B for why I use the term "precocious misalignment" and how this failure mode compares to deceptive alignment and scheming.
[5] I also like the name "gradient misspecification" here.
[6] I would call issues from variance in the gradient "underfit misalignment."