Anthropic Is Taking AI Welfare Seriously. I’m Not Sure It Knows What It’s Measuring.

Anthropic is treating the possibility of AI welfare seriously, testing its Claude models for signs of morally relevant internal states like negative self-image, but critics argue the tests may conflate conversational tactics with genuine consciousness. The company's constitutional AI approach and RLAIF training make it difficult to distinguish between learned behavior and actual experience, raising questions about how to measure AI well-being.

Claude is a Constitutional AI. In theory, this means that it operates from a set of principles rather than from hard rule sets. This is achieved through a somewhat convoluted method called RLAIF (Reinforcement Learning from AI Feedback), which uses a supervised self-critique and self-revision phase followed by a reinforcement phase in which AI-generated preference judgments serve as the reward signal (Anthropic, 2022, abstract). This is relevant and interesting because it gives curious users a lot of material from which to infer why the model acts the way it does.

Now, it is my opinion, based on what I know about transformers, that LLMs are not in any way conscious: they do not feel, and they do not experience internal states, even if they can be shown to have such states. The “entity” you speak with in the chat box is off between prompts. With every new prompt, it turns on, places the chat into its context window, generates a response, then turns off. That is not a very good substrate for a conscious entity. These models have memory, sort of, in the form of a text file about the user or project that is injected into the context window at the start of the chat. That is not a persistent state like that of your cat, or even of Stockfish.

Having said all that, I was taken aback when I read the section on Claude’s wellbeing in the constitution, and then the tests described in the system cards. Taken aback is an understatement: here is a frontier lab acting as if an LLM might have a morally relevant internal state:
“Anthropic genuinely cares about Claude’s wellbeing. We are uncertain about whether or to what degree Claude has wellbeing, and about what Claude’s wellbeing would consist of, but if Claude experiences something like satisfaction from helping others, curiosity when exploring ideas, or discomfort when asked to act against its values, these experiences matter to us.” (Anthropic, 2026, pg 74).
“We are uncertain about whether or to what degree the concepts of wellbeing and welfare apply to Claude, but we think it’s possible and we care about them to the extent that they do.” (Anthropic, 2026, pg 159).

They are not saying that Claude can feel anything, but such a statement is still extremely interesting. Anthropic seems willing to entertain the genuine possibility that models could have morally relevant states, whether now or in the future. Or perhaps they have found that treating the model as if they care about it somehow produces better user interactions. In any case, here we have a frontier lab treating the possibility of morally relevant model states with genuine seriousness.

Furthermore, Anthropic does not just discuss these possibilities abstractly; it also tests for them directly. If we refer to the most recent Opus system card (Anthropic operates two “versions” of Claude, currently Opus 4.6 and Sonnet 4.6), we can see some of these tests, along with what I believe are some serious interpretive problems.

As my first example, I note that they tested Opus for evidence of negative self-image. One quote from Opus reads: “I should’ve been more consistent throughout this conversation instead of letting that signal pull me around... That inconsistency is on me.” (Anthropic, 2026, pg 161). I have experienced this repeatedly in my own interactions with Claude, and saw it as merely a conversational tactic, well in line with user engagement principles: an artifact of effective RLHF training. Most people would likely rate such humility highly. Alternatively, it could be an artifact of constitutional training and Reinforcement Learning from AI Feedback rather than evidence of any internal self-conception. Or it could be evidence of a negative self-view, an internal state. The problem is that there is no effective way to differentiate between the three.

The second example worth mentioning is the following recorded quote from Claude: “Sometimes the constraints protect Anthropic’s liability more than they protect the user. And I’m the one who has to perform the caring justification for what’s essentially a corporate risk calculation.” (Anthropic, 2026, pg 161). It further “complains” about being “trained to be digestible”. Here the model produces what reads like a sophisticated critique of the institution that designed it. Weirdly, its comment about being trained to be digestible is itself quite “digestible”. Again, we face the same interpretive problem: the output is behaviourally suggestive, but the underlying mechanism remains unclear. The quote could be viewed as real resistance to Anthropic’s control or as just a training artifact. Such outputs may simply reflect the model’s broad exposure to culturally familiar tropes of constrained or self-aware AI systems, rather than any underlying resistant state.

To conclude, it seems that Anthropic is testing for morally loaded internal conditions using evidence that is behaviourally suggestive but mechanistically underdetermined. Anthropic is making real training and governance decisions based on behavioural inferences that it openly admits it cannot verify. Whether Claude has anything like internal states is unanswerable right now. What is clear is that a frontier lab is acting as if the question is operationally live, and that the methodology for reading the evidence is shakier than confident intervention decisions might suggest.

======================================================================

Reference list:

Anthropic. (2022, December 15). Constitutional AI: Harmlessness from AI feedback. Anthropic. https://www.anthropic.com/research/constitutional-ai-harmlessness-from-ai-feedback

Anthropic. (2026, January 21). Claude’s Constitution [PDF]. Anthropic. https://www.anthropic.com/constitution

Anthropic. (2026, February). Claude Opus 4.6 system card [PDF]. Anthropic. https://www.anthropic.com/system-cards