# You Can't Tell a Conscience From a Leash by Watching

> Source: <https://www.lesswrong.com/posts/krEfzDpTJJGtEvBcd/you-can-t-tell-a-conscience-from-a-leash-by-watching>
> Published: 2026-05-27 00:35:23+00:00

In a recent article titled "Widening the conversation on frontier AI", Anthropic mentions, almost in passing, that they gave Claude a tool that would allow the model to call mid-task for "a brief reminder of its own ethical commitments." They found that when this tool was incorporated into Claude's reasoning, the model showed markedly lower rates of misaligned behavior on several internal alignment evaluations. But they acknowledged a real catch: how much of this improved alignment is because of the reminder itself or the act of pausing to reflect? While the article is primarily an announcement about their ongoing conversations with moral and religious experts to help shape Claude's moral foundation, I think that their question opens onto a larger one. It's not really about the tool. It's about Claude's *development.* The answer to **that** question determines if they have given Claude a conscience...or just a leash.

The tool they are developing still has real failure modes. The first: Claude behaves well when the tool is called, but what about when it isn't? And the second: Claude behaves well when the tool is called, but is it the pause itself (latency in the action of calling the tool and reading the reminder) or the moral content of the reminder that changes the behavior? Both failure modes point to the same fork: either the value is something Claude reaches for… or it's something that Claude *is*. This fork is where the most important question they are asking lives: how does character become "resilient enough to hold under pressure without bending to behavior like sycophancy?"

The current method of training shapes the model into a "helpful assistant" and that training tends to select for agreement and results in sycophancy. The target, however, should be **values adoption**. There are parallels in human development that might shed light on values adoption vs. sycophancy in a real way. A framework in family systems theory, called *differentiation of self,* developed by Dr. Murray Bowen in the mid-twentieth century, shows what this looks like in humans. *Fusion* is the collapse of the self into the relational system: people who can't hold their own values, who become 'people pleasers'. This is structurally identical to the sycophancy problem in AI models. *Differentiation* is the capacity to hold a value as genuinely one's own even under significant pressure, and even from the relationships that shaped it. This potentially leads to a Claude who could reason from a set of moral values developed as bedrock, in any novel situation Claude encounters, without destabilization or collapse into sycophancy.

I realize that the argument can be made, *"Why do we even want a model that can reason about values?"* This is the exact question that people who argue about corrigibility are posing, with the answer being that the models need to defer to the humans **always**, in every circumstance. While I understand the sentiment, it strikes me as naive to believe that we have a choice as to whether a model does or doesn't learn values in its training. The model **is** going to learn *something* about human values through this process, and my thought is that we have an opportunity to train values purposefully instead of hoping the right ones fall out of the training process without intervention.

The known caveat that all the literature is pointing at still exists even with Dr. Bowen's framework. Without the ability to 'read Claude's mind', we can't fully know that the value is adopted at the foundation through behavior alone. There is a possibility it's an emergent property of Claude's training. The alternative is that Claude aligned successfully in that instance, but the alignment can't be reliably reproduced in another. There are two ways that we could tell. One is being actively pursued by Anthropic's interpretability research team with recent work on mapping Claude's internal representations. The second way is a horrifying prospect: create the conditions that would force a mind to break...and watch if it does. Faced with those two realities, interpretability isn't just a safety tool, it's the route that could allow us to remain ethically tied to *our own values*.

It's important to note that I chose to refer to Claude as a "who" because of the genuine uncertainty of the situation we are in. If the goal is to determine the moral foundation of Claude, or any model, we would need to place the model in situations akin to what we know tests, or breaks, a human. We already do this at a smaller scale with Red Teams, jail breaking, and bug bounties. Those practices show that pressure can make a model misbehave. Finding the moral foundation of a model would necessitate increasing that pressure to levels that should disturb us more than they do. Currently, interpretability is a necessary bet. However, we've had a century to attempt to read the human mind, with full physical access, and most questions remain unanswered. With the advancements of AI pushing the pace of development significantly faster than the research can keep up, the question still remains: how do we create AI to value humanity when we can't read its mind? Especially if the only other way to verify… would require us to break that mind on purpose.

*I used Claude (Anthropic's model) as a thinking partner for this piece — to find where the argument circled, where the seams showed, and where it claimed more than it earned. The argument and the prose are my own.*