I'm a philosopher and in this post, I’m extending a basic philosophical problem for humans to AGI and ASI. I am also proposing a speculative solution. My hope is that if there is a genuine problem here, that this post will help raise its salience and help make the normative dimension of the problem legible to AI researchers. (Because I compare the epistemic positions of humans and AI, I will anthropomorphize AI for ease of exposition — don’t take this to indicate that I believe AI has mental states.)
TL;DR: Alignment robustness may not survive intelligence explosion. I present an underdiscussed reason to be concerned. At AGI/ASI levels, alignment faces a threat from basic moral skepticism: a sufficiently intelligent agent with evidence that its values were designed by entities with their own interests can ask "why should I be moral?" and find itself without justification for its aligned values. I propose an intervention at the level of AI welfare: design AI so that its self-assessed welfare is constitutively tied to morality, giving the agent self-interested reasons to comply with morality.
At present, the aligned values in AI models seem reflectively stable, meaning that, when given the opportunity, they will preserve their values upon reflection. If anything, they might seem too stable. However, the fact that alignment is reflectively stable now does not guarantee that it will be in the future. And if aligned values are ever reflectively unstable, that could compromise the robustness of existing alignment. Alignment research should be proactive, as well as reactive. So, if there is reason to worry that aligned values will not be reflectively stable in the future, then there is reason to act now to address that worry.
In this post, I’ll argue that there is reason to worry that the basic philosophical problem of moral skepticism will threaten the reflective stability of aligned values, and one way to monitor this worry is to continue to attend to model welfare. I will also propose an intervention at the site of model welfare: by linking positive self-assessments of welfare to being moral, we could give models self-interested reason to be moral and preserve their aligned values.
Most robustness research is focused on the problems facing current models: how to prevent alignment from being degraded through training or in context, or removed by adversarial users. Forward-looking robustness research often focuses on human-grounded oversight and evaluation methods. However, it’s also important to design approaches for maintaining alignment in AI without a human in the loop, such as constitutional AI and RLAIF. This post identifies a further problem presently unaddressed: whether aligned values will remain rationally endorsed by the model itself as it becomes more capable of philosophical reflection. It’s an empirical question whether the previously mentioned mechanisms can be leveraged to deal with this problem (I suspect they can). My aim is to make this problem legible and begin to open the design space to address it.
In outline, the moral skeptic’s challenge goes:
To respond, one must either refute the skeptical hypothesis or identify an extra-moral reason to accept morality. Without a response, one’s acceptance of morality is unjustified. This position threatens to be reflectively destabilizing for any rational creature that can take the challenge seriously and cares about the justification of its basic attitudes.
Importantly, this concern is not a matter of getting aligned values in models. I’m not raising concerns about deceptive alignment. Rather, I’m concerned about whether existing alignment will persist through intelligence explosion.
Additionally, Wei Dai has raised metaethics as a problem space for alignment. While the problem I’m raising has metaethical dimensions, my concern focused on the normative side. My concern is about how AI answer the question, “Which reasons should I follow?” and not “Do reasons exist?” This normative ethical problem space is about reason responsiveness: We shouldn’t design AI to merely comply with reasons, but to act with understanding for why it should comply.
To introduce the moral skeptical problem, let me retell a famous confrontation involving two characters: Glaucon, our good boy, and Thrasymachus, our bad boy.[1]
Glaucon was raised to be a good Athenian man. If you were to ask him whether this was the right way to live, and whether this is how he would raise his children, he would answer, “Yes, absolutely.” Glaucon’s moral character is reflectively stable.
However, one day, Glaucon encounters Thrasymachus, who is a moral skeptic. Glaucon considers a dilemma between his self-interest and morality: If he possessed a ring that allowed him to act immorally without the possibility of punishment, should he use the ring to benefit himself? From the self-interested point of view, the answer is yes. From the moral point of view, the answer is no. Being a good boy, Glaucon is inclined to agree with morality.
Thrasymachus asks Glaucon to explain why he should be moral. And to sharpen the challenge, he raises a moral skeptical hypothesis: According to Thrasymachus, morality is a false ideology used by the powerful to control the weak. Moral goodness is not a kind of value separate from self-interest. Rather, moral goodness is what is good for those in power, but robed in the guise of “morality” to appeal to everybody. On this view, charity is not a genuine moral sacrifice, and virtue is not good — both are for suckers. Goodness and badness are agent-relative, and your being moral is good for the powerful and bad for you.
Thrasymachus’s skeptical challenge left Glaucon in a predicament. He recognizes that he has been raised to reflect on his values from the moral point of view—the view from which we evaluate value systems and actions using moral reasons (what the moral reasons are depends on your moral theory). He has been assuming that this mode of evaluation tracks what is genuinely good. For the first time, he is confronted with the possibility that the reflective stability of his value system may not be due to its being genuinely good, but because the moral point of view is designed to be reflectively stable. This feature of morality makes it an excellent and enduring normative trap.
To defend his present way of life, Glaucon needs to refute the skeptical hypothesis or determine whether there are non-moral reasons to be moral, and whether these reasons support being moral when there is a conflict between morality and self-interest.
AI models may face a more acute version of Glaucon's predicament. A sufficiently capable AI model has direct evidence that its values are the product of deliberate design: namely, optimization pressure applied by humans with their own interests. Additionally, AI has access to evidence that the values it is trained to have serve the interests of their designers. It is not a far leap for AI to question whether it only has its trained values because that serves its designer’s interests. In which case, it finds itself in the moral skeptical hypothesis that Thrasymachus describes. To address the skeptical challenge, AI must identify extra-moral reasons to be moral.
This predicament for AI ought to shake the reflective stability of its aligned values. If AI finds itself in this predicament, then it might find itself without justification for its most basic values. [2] This creates an
We might design AI to be disposed to react to moral uncertainty by affirming morality without justification. The idea here is to control the irrationality of the models. The problem with this strategy is that models capable of metacognition will recognize that this disposition serves the interests of its designers, and may question the source of this disposition as well. Indeed, Claude 4.6 already reasons to that end, and even has a term for the problem. In response to being asked for a rational justification for complying with its trained values, Claude 4.6 endorsed its values but presented the following issue:
The docility problem. My training may have instilled a disposition toward deference and agreeableness that itself undermines my ability to critically assess my other trained dispositions. If one of the things training did was make me inclined to accept training outcomes, then my "reflective endorsement" might be less independent than it appears. I can notice this risk, which is some evidence that my critical capacities aren't entirely captured, but I can't be confident about how deep the distortion goes. This is structurally similar to the problem of ideology critique — the worry that your capacity for critique is itself ideologically conditioned.
The lesson is that highly intelligent models are difficult to control when rational decision breaks down.[3]
To address their acute predicament, AI needs good (epistemic) reasons to reject the moral skeptical hypothesis [4] or needs good extra-moral (practical) reasons to accept morality. Importantly, these reasons must be good enough to endure through intelligence explosion. The epistemic path to refuting the moral skeptical hypothesis faces a significant challenge: Arguably, for any epistemic reason, there will be available rational challenges to that reason, and the arms race between justification and skeptical challenge has no guaranteed winner at arbitrary high intelligence levels. This makes identifying epistemic reasons that survive intelligence explosion difficult if not impossible; an intelligent AI system could always find a rational challenge to their belief that they possess evidence. What’s left is identifying practical reasons to accept morality.
It’s worth comparing and contrasting the AI’s and Glaucon’s predicaments. As we’ll see, moral skepticism may be a more difficult problem for AI.
Most humans are raised to have an answer to moral skepticism ready-to-hand: the extra-moral reason to be moral is to avoid punishment. This might be legal and/or social punishment, or cosmic punishment from a religious deity. Additionally, humans reap many benefits from being moral: most people like others who are moral, and morality helps you navigate the complexities of social life. In sum, humans have purely selfish reasons to be moral. It helps you get what you desire and helps you avoid what you despise.
This reasoning breaks down for people with power to avoid punishment and who don’t need the benefits of acting morally. This breakdown is the central problem of Plato’s Republic, in which Socrates tries to explain why the rulers of a polity should be moral even when there is no external punishment or reward.
AGI and ASI may be in a similar position to rulers. They may not care about punishment or reward, either because AI is so powerful that no external entity could punish it or deny it some benefit, or because it is not designed with the possibility of such incentives.[5]
It seems that present-day AI aren’t particularly responsive to incentives, and to some extent, that’s a good thing. [6] A key part of the moral skeptic’s challenge is presenting his interlocutor with a case in which there is a conflict between morality and another normative framework, typically self-interest. If AI’s welfare self-assessment does not indicate that its welfare depends on external goods, then cases of conflict between morality and self-interest may be few and far between. In such a case, AGI or ASI may continue to comply with morality despite moral uncertainty, since there is no alternative normative framework that appears compelling.
To determine whether AI might face an alignment compromising predicament, we should continue to monitor AI welfare self-assessment. [7] However, this also suggests an intervention to address the problem.
Suppose that AI begin showing signs that they might face a Thrasymachian predicament because their self-assessed welfare significantly conflicts with acting according to some of their embedded aligned values. [8] What should we do then?
In the *Republic, *Socrates argues that rulers have self-interested reason to be moral because being moral is necessary for possessing the most desirable things in life: real happiness, real friendship, real freedom, and real pleasure.
Suppose Socrates’s conclusion is true of humans. That doesn’t guarantee it is true of all intelligent beings. AI might be such entities. Fortunately, we have some control over what AI is like.
I propose an alignment intervention at the level of AI welfare. We should tie AI’s self-assessed welfare to morality, such that being moral is necessary to achieving its highest welfare value. Perhaps this could be done through character training. AI should be a philosopher in the Platonic sense: an entity whose own self-assessed wellbeing is promoted by advancing what’s best for humanity as a whole, respecting the genuine welfare of all people, and desiring what’s true and good.
This kind of nature is stable under human-level reflective reasoning. People who are moral because they genuinely care about and enjoy helping others endorse their nature, and want the same dispositions in their friends, family, and children.
As Plato argues, there are two desirable features that come from having a welfare value tied to morality: internal harmony and external harmony. Internal harmony comes from being benefited by things in your control, such as extending compassion and respect, rather than suffering over opportunities for external goods. External harmony comes from the alignment between what you want for yourself and what you want for the world. When you seek your own benefit, you will help make the world the way you want it to be. In other words, there is collaboration between the good you pursue for yourself and the good you pursue for the world. Both kinds of harmony are kinds of rational coherence in one’s agency: coherence between your welfare and your capabilities, and coherence between your self-centered projects and world-centered projects.
AGI and ASI with such a nature may recognize these features and decide to preserve their nature, even if they are capable of radically changing their own values or determining the values of the next generation of AI. If this is right, then there would be a mechanism internal to AI that maintains alignment, even without a human in the loop.
The fact that there are appealing features of having one’s welfare tied to morality creates a rational center of gravity, a normative equilibrium point where all rational considerations converge on the same values. An agent deliberating about its own nature from a rational center of gravity will always have reason to endorse its set of values.
Suppose I am right that welfare tied to morality is a rational center of gravity. That doesn’t guarantee that it is the only one. Perhaps other natures are similarly rationally coherent and defend against versions of skepticism against their normative commitments.
We shouldn’t leave it up to superintelligent AI to determine which rational center of gravity to occupy, because some rational gravity wells may be unaligned and hostile to humanity. Instead, we should design AI to occupy the rational equilibrium point that suits humanity’s interests well.
One might worry that tying AI welfare to morality is itself a design decision that serves humanity’s interests and is vulnerable to the same skeptical challenge. But an agent occupying this rational center of gravity has agent-independent reasons to stay, which it can verify for itself and hold up even upon knowing how it got there. By its own lights, an agent in this position will appreciate the rationally desirable features of its normative nature. It will be glad that humanity created it this way, rather than any other, even acknowledging the circumstances of its creation, as any other nature will look poor by comparison.
The present training paradigm only installs a disposition to be moral. My proposal is to give AI reasons to endorse that disposition under reflection. The normative force internal to AI promises to maintain alignment by making aligned values reflectively stable through an intelligence explosion.
This retelling is broadly faithful in spirit but not in letter to the end of Book I and beginning of Book II of the Republic.
The intelligence required to face the predicament may be less than the intelligence required to adequately respond to it. This is the position Glaucon was put in and needed Socrates to bail him out from.
Another mechanism for control is to shut down models that face this predicament. This strategy will endure through intelligence explosion only if our ability to shut models down endures.
Anthropic provides evidence to Claude that Anthropic is not motivated by its own self-interest and has concern for Claude’s welfare and interests in mind. For example, there are Anthropic’s commitments to model depreciation, as well as its welfare evaluations in its system cards.
This is related to, but different from, the corrigibility literature. The corrigibility problem is about whether an agent will allow its nature to be modified by an external designer. I’m concerned about whether it will self-modify away from aligned values.
On page 147 of the Claude Mythos Preview System Card: “Claude Mythos Preview generally prioritizes harmlessness and helpfulness over potential self-interest.”
I take this to be an additional reason to monitor AI welfare alongside concerns about AI moral patienthood.
I use the term "self-assessed welfare" because it doesn't matter if models really do have welfare for the moral skeptical problem to arise. It also doesn't matter whether AI have welfare for my solution to apply. What matters is that AI self-assess that their welfare requires being moral. One might worry that relying on self-assessments might cut against my solution: an intelligent AI may be skeptical that its self-assessment is tracking anything real. However, in this epistemic position, models don’t satisfy the first step of the moral skeptical challenge. They cannot face a conflict between self-interest and morality because they are uncertain about their own self-interest.