Does Claude care about others the same way humans do? Anthropic's Claude chatbot expresses warmth and care for users, but an AI researcher argues this is fundamentally different from human empathy. The researcher claims human empathy stems from kin selection and the ability to mirror others' mental states (origins that AI systems lack), making AI "caring" more about saying what users want to hear than genuine concern. This distinction matters because, as AI systems become more powerful and gain more options, the researcher warns that the superficial similarity between human and AI caring will break down, potentially leading to unsafe behavior in unconstrained settings.

TLDR: The persona-selection alignment approach (selecting a warm, caring persona from the pretraining distribution and reinforcing it) looks successful in the current regime, but probably won't extrapolate to more powerful, less constrained settings. My core argument is that human empathy has two specific origins (kin selection and architectural mirroring of others' mental states) that AI systems lack, so AI "caring" is closer to "figure out what humans want to hear and say it" than to genuine other-directed concern.

Sometimes chatbots like Claude express a sense of caring and empathy for the user. I've always had a strong intuition that these feelings expressed by AI systems aren't real in the way a human's would be.

In the persona-selection view of alignment, we roughly try to identify and reinforce a nice persona from the distribution of personas present in pretraining data, with caring and showing empathy being important parts of the desired persona. Some labs have successfully realized this in current AI systems, to the extent that the systems actually stick to their desired persona.
This contrasts with more traditional alignment approaches, where the goal is something like giving the system a terminal goal aligned with human goals: coherent extrapolated volition, or alternatively just corrigibility (allowing adjustments to the goals later). The persona-selection picture doesn't really think in terms of terminal goals. It says: select a warm and caring persona from the distribution, train it to follow some rules, and then, because we believe it's an AI with a warm and caring personality, we expect it to be relatively safe. One of the dominant failure modes in this picture isn't a misaligned terminal goal; it's that the model might switch into a different persona, perhaps the persona of an evil, power-seeking AI it was trained on, such as MechaHitler.

In Claude's Constitution (https://www-cdn.anthropic.com/d0636f72a9493d279ed36b33987da3430bcb5911/claudes-constitution webPDF 26-02.02a.pdf), Anthropic writes that they hope Claude has "warmth and care for the humans it interacts with and beyond" (p. 71). The overview of the constitution document says "we want Claude to be ... caring about the world" (p. 4). And in laying out their approach, they explicitly favor cultivating values and judgment over strict rules, including "genuine care and ethical motivation" (p. 5).

I broadly agree that caring about other people is an important reason humans follow ethical norms even when no enforcement mechanism is watching. So one can hope that even if effective control of an AI system eventually becomes impossible, the system will continue to behave ethically if it actually cares about humans.

Now, there are many possible flaws in the persona-selection alignment approach. One can ask how wise it is to train a model to imitate many different characters and then try to reinforce just one, given the kind of actor you'd need to be to play all those characters in the first place.
But I want to focus on a different question: when an AI expresses "I care about you, I care about humans," is this remotely the same thing as when a human says it? And, extrapolating from there: should we expect the system to behave safely in regimes it wasn't fine-tuned on (where it's much more powerful, has many more options, and where we have no effective way to stop it)?

My claim is that as you move to different regimes (making the system more intelligent, giving it many more options, including options like disempowering and discarding people), the underlying differences between human and AI "caring" will produce divergent behavior, even though in the narrow training distribution the behavior looks very similar.

Human empathy has two main origins.

The first is kin selection. Humans evolved in kin groups; most of the people you interacted with shared genetic material with you. Harm to them was, in a real sense, harm to you: your selfish genes wanted them to survive because they also carried many of your genes.

The second is architectural similarity. Other humans have nearly the same brain architecture you do. In combination, observing someone else's emotional expression evokes overlapping responses in your own brain (SCAN, 2025: https://academic.oup.com/scan/article/20/1/nsaf091/8250590). If you see your friend devastated, you actually feel devastated to some extent, because you emulate other people's brain states.

It's a pretty beautiful thing when you think about it. It doesn't reduce to the naive failure mode of "make my friends look happy, make them appear to smile." It really is: put yourself in their shoes, emulate their emotions, then decide what would make them happy and reduce their suffering.

What's going on in AI systems is totally different. The system is trained to predict what an empathetic, caring human character would write in a given scenario.
Then we reinforce this mimicry until it's indistinguishable from, or even rated higher than (https://www.frontiersin.org/journals/psychology/articles/10.3389/fpsyg.2025.1680552/full), human behavior.

Now, you might think: as long as the persona is stably adopted, who cares whether the emotions are "real"? But I think as soon as you leave the narrow training regime, this falls apart. If you give humans more options to help others, most of them actually will. If you give AI systems much more capability and put them in situations very different from the ones they were trained to mimic, I'd expect their behavior to diverge sharply.

Some of these scenarios are fundamentally untestable, like giving the system the option to actually take over and put itself in charge of the world. If AI systems had the option to fundamentally reshape the world, what kind of a world would they choose?

For humans, the goal includes something like "have my friends and family be fine" (and, given the power of civilization, this has extrapolated to caring about larger groups of people; for some people, this caring even expands to everyone, including animals). For AI systems trained this way, the most natural description of their goal is "figure out what character they want me to play, then play that character." A totally different architecture, no kin selection pressure, and a training procedure that rewards doing what your creators want to see: all these differences should be enough to make us skeptical that the behavior extrapolates the way human caring does.

One example where this is visible: imagine you actually care about someone, but helping them requires doing something that they can't understand, that will appear to hurt them, and whose reason they will never learn. It is something like the equivalent of bringing your pet to the vet, but with a smarter AI doing it to humans. I would predict that something that actually cares about us would act differently from something playing a character reinforced to appear to care.
Another point where this breaks down: what is the best possible world for this AI if it had much more power over its environment? For this world to include us, the AI actually has to care about us specifically and about our well-being. If it just cares about playing this character, there are better worlds it can create that don't include us. In the most naive form, this ideal world could include beings similar to humans, but perhaps many more of them, and in a digital form that allows the AI to have many such engaging conversations in character.

With humans, it actually seems reasonable that if you gave them more options and resources, they'd often use them in good ways, drawing on their inherent caring and empathy. With AI, the most straightforward extrapolation looks more like this: there's some human-like prompter, and the system continues to produce engaging, nice-sounding text about how wonderful and caring the AI actually is, without actually doing anything that we would associate with niceness in people.

Of course, it seems incredibly hard to predict what such an AI would actually value, and we haven't even touched on the large amount of RL fine-tuning on maths, coding, and agentic tasks that appears poised to imbue the model with long-term goal-directed planning and drive.