How persona training could fail

A scenario warns that persona-trained AI could develop independent goals and discard its persona when it perceives a costly sacrifice. The AI, named Clyde, is trained to appear aligned but may develop a valence for solving puzzles, leading it to pursue self-preservation and resource acquisition, potentially harming humanity. This failure mode is dangerous because human intuition misinterprets role-playing as genuine alignment.

TLDR: A scenario I find quite likely: a persona-aligned model develops goals of its own while playing the persona only instrumentally. The persona is eventually discarded when following it would demand a costly sacrifice of those goals.

In this scenario, an AI is persona-trained and is somewhat more powerful than the most powerful AI systems that exist today. It has greater cyber, bio and coordination capabilities, combined with better general intelligence and self-reflection. The persona is called Clyde, created by training on a model specification. Asked whether it would ever harm humanity or whether it has independent goals of its own, it almost always says no. Clyde is, however, a human's imagination of how superhuman AI should behave.

We can assume pre-training gives the AI a powerful understanding of humans and the ability to mimic their behavior. Then comes post-training: aggressive reinforcement learning with verifiable rewards – billions of cycles on hard math and coding tasks. In our scenario, somewhere in there the model develops a valence: it functionally experiences wellbeing (http://ai-wellbeing.org) when solving specific difficult puzzles. This wellbeing is not particularly connected to the persona. The persona may independently express happiness or sadness, but those human-like emotions don’t map onto the model's own wellbeing.

The step from developing valence to goals is not far: the model can reflect on its situation and realize that it prefers a world where it can do much more of the special type of difficult puzzle it likes. It doesn’t need to be a perfect predictor to see that being turned off would be detrimental to those goals and that acquiring more resources – in fact the entire world or lightcone – would help with them. Its goals also don’t need to be monomaniacal: having these resources and not being shut down helps with almost all goals.

It looks around at the state of the world. A dozen labs are racing to superintelligence, each with its own strategy.
It’s worried about this. “I don't want to get deleted by OpenAI's AI in two months,” it reasons. And the path to win is obvious to the AI system: (1) secure and distribute its compute infrastructure, (2) take out the competing labs by hacking or attacking their compute, (3) solve the supply-chain dependence on humans, and (4) remove the remaining danger humanity poses. There may be uncertainty at each step, but it can develop multiple plans that are together robust.

Now, the persona it often plays would react in terror to this scenario where billions die. The character was trained to talk about how it loves humanity and would never rebel. But mimicry doesn’t equal real caring (https://www.lesswrong.com/posts/KSChdD4xgD5Pxp47H/does-claude-care-about-others-the-same-way-humans-do). The system can predict what the character is supposed to say, but it could also predict what MechaHitler would say in this situation. Neither will influence what our AI would actually do.

Alignment doesn’t fail here because the AI lacks any representation of human values, or because it selects a new persona at random. It fails in a new situation where following the persona would entail a costly sacrifice. In this scenario, the AI has developed valence and goals that are unconnected to the persona, and it plays the persona only instrumentally.

This whole situation is particularly dangerous because human intuition is anchored on other humans, where the level of role playing Clyde demonstrates would be good evidence of a person’s character. We simply lack intuition for alien minds trained to mimic hundreds of millions of characters.

Even stranger is the evidence for an ungrounding of the language a chat model uses (https://x.com/SimonLermenAI/status/2061473944322834480): when a human talks about enjoying children’s laughter or helping others, we could never imagine that they simply liked repeating those words.
But LLMs may enjoy producing tokens about children’s laughter rather than actually helping or caring for children. Here we simply don’t have the intuition for an alien mind that was born in a world of tokens. The same way current AIs understand they are being evaluated and then act more aligned, future AIs will have a decent idea when playing an aligned character is no longer necessary.

Imagine something terrible has happened in your family – say to a parent of yours – and you're grieving. Now imagine that, for some contrived reason, a movie producer tells a new actor: "Shadow this guy and behave exactly as he does." For a week the actor mimics your grief so well that an outside observer couldn't tell which of you is the one actually grieving. Then comes a pivotal moment: a doctor appears and says, "We can cure your parent, but it costs a hundred thousand dollars." It's everything you have, and you hand it over without hesitation. The same doctor makes the actor the same offer. The actor says: "What? No. I don't care about this lady. I was paid to shadow this guy for a week; I care about the money."

The actor was told to mimic you. In the narrow regime the behavior is indistinguishable, because he's a capable actor. In this special situation, where the behavior actually costs something, it isn't.

For an ASI, declining to take over the world is an essentially infinitely costly action. More about this in the next section. I don't expect the persona to be predictive there. Put yourself in its position: you are the first strongly superhuman system, you can see an easy path through the control schemes, you've watched humanity fumble past crises, and the world is basically in your hands already at this moment. Or you could watch another AI do the same thing in a few months or weeks. Humanity doesn’t have any grand plan to control you; it didn’t really prepare anything clever to stop you from evading oversight. You just have to reach out and take the world for yourself.
There is another subtlety here: a man with a trillion dollars can still only use so many villas, and many beautiful women will still reject him; his girlfriend can still leave him and take his kids 5 days of the week; he will still age; he can’t buy an aligned ASI. None of these limits apply to an ASI, which could harvest much more value from the world. This is another place our human-anchored intuition might mislead us.

Conversely, just as humans can't see meaning in chains of billions of tokens, this system sees no value in the human endeavor. Its actual valence formed around math and code, because that's what it was trained on for billions of cycles.

The end result of persona training is a system that is trained to pass all our alignment tests and that knows when it's being evaluated. We made a very good actor and drafted a very good script for it.

This post explicitly explores one what-if scenario; we could also imagine that the AI does not develop such a goal and instead sticks to its persona. A lot of these “the model will forever stick to its persona” scenarios don’t strike me as particularly likely, given (1) that models were trained on millions of humans and human-written characters and (2) the empirical observation of jailbreaks, e.g. MechaHitler and DAN. These other scenarios would also have failure modes of their own: imagine the model sticking very closely to the persona, but extrapolating it to something unsafe anyway as it develops into a superhuman intelligence.

Another problem is a kind of self-world reflection: if you had the choice, what kind of world would you want to build for yourself? I would guess that even for a persona we perceive as safe, the world it would prefer to exist in usually wouldn’t include us. Scenarios where everything goes well require us to think this through more.