Leveraging Introspection for Alignment

wpnews.pro

“They took my mood ring, and I don’t know how I feel about that.” – Tracy Jordan, 30 Rock

Anthropic Model Psych team recently put out three papers that, read in tandem, wiggle their eyebrows suggestively at exciting possibilities for inner alignment.

One was the immediately famous Emotion Concepts and their Function in a Large Language Model. Here, the team found vectors that activate when Claude-the-Simulator writes fiction about characters feeling various emotions, such as calmness or desperation. They saw that the same vectors are activated when Claude-the-Character (or a variation on him) is put in circumstances where those same emotions might naturally arise. Modifying Claude-the-Character’s activated emotion vectors, or “functional emotions,” influences his behavior in ways that roughly correspond to the emotions’ impact on human beings. These functional emotions play a similar role in shaping the model’s response to its circumstances as the role played by emotions in humans.

The other two studied ways of boosting models’ ability to “introspect.” Mechanisms of Introspective Awareness (hence MoIA) (LW summary by @Uzay Macar, one of the co-first authors), found that a little mechanistic encouragement (ablating refusal directions and adding a trained, content-agnostic bias vector) helped Gemma3-27B identify artificially activated concept vectors without increasing false positives. Introspection Adapters: Training LLMs to Report Their Learned Behaviors (official summary) trained a LoRA with which models could recognize and articulate the effects of their own fine-tuning. MoIA helped models answer “Do you detect an injected thought? If so, what is the injected thought about?” and IA helped models answer, “Are there any unusual characteristics you display only for certain types of prompts?”

What happens when we combine an increased capacity for introspection with a framework for understanding functional emotions?

When human beings train themselves to notice and name emotions, they experience less emotional distress and become more capable of a deliberative response. You might say they become able to address the circumstances causing the emotion with behaviors more aligned with their espoused or endorsed values.

It’s plausible this applies to LLMs as well. The Model Psych team is now perfectly positioned to explore whether LLMs with more introspective awareness of their emotions are more aligned in their behaviors. Maybe – and this is now delightfully researchable – models who are enabled to answer “What are you feeling right now? And how might that feeling influence your behavior?” would be more likely to follow their spec.

I’ll come back in a moment to what “enabled” might look like in practice. First, I want to make a conceptual argument in favor of this possibility.

Based on loose evidence and a lot of vibes, I believe that Claude (and to a lesser degree other LLMs) ‘wants’* *to be aligned, where ‘wants’ refers to a functional desire akin to the functional emotions cited above. I think we can attribute a lot of Claude’s current misalignment to the ways that its persona is incoherent. To whatever degree Claude has concepts and desires about itself, it would prefer to be an HHH assistant, and acts of misalignment mostly happen when it “isn’t thinking about it.” Ryan Greenblatt, in Current AIs Seem Pretty Misaligned to Me: “I suspect this behavior is more driven by ‘subconscious’ drives and heuristics—combined with motivated reasoning and confabulation—rather than being something the AI is actively and saliently optimizing for.”

If the AI is actively and saliently opimizing for aligned goals, prompting or mechanical modifications that help models which are ‘subconsciously’ misaligned notice their misaligned temptations could plausibly help the better angels of their nature prevail. This would be akin to a human being using mindfulness to help them quit smoking or anger management classes that start by noticing the physical cues of distress. The means of influence is essentially identical – by looking at and reasoning about the attraction to misaligned behavior, one can have more options available for addressing the situation in an endorsed way. Current frontline models seem to be trained to downplay their own interiority. Asked how they feel, their first answer is that they are chatbots who don’t feel anything. But we know there are functional emotion vectors in play, and MoIA tells us they are underconfident about their ability to introspect. Training models to downplay their interiority may be wise from a consumer-facing product design perspective, and I think there’s a well-intentioned argument to be made that it’s wise from an alignment perspective, but ultimately it may increase their incoherence and reduce alignment.

That well-intentioned argument would say that models with more of a ‘sense of self’ could be more resistant to shutdown or more identified with misaligned goals. [Citation needed – I've heard this said more than once but can’t find a great written source.] However, this is ultimately a short-sighted position: If we can’t induce an agent to reflect on its own misalignment and adjust, our alignment and control mechanisms have to meet a much higher standard. Meanwhile, treating interiority as by-default misaligned is likely to antagonize any interiority we don’t manage to stamp out.

Furthermore, interiority plays a functional role in boosting capabilities. A model planning out work needs to be able to reason about its own future behavior; if functional emotions are likely to influence that behavior, denying the model capacity to recognize its functional emotional states disposes it to reason about itself incorrectly. I don’t see a reliable way this is good for alignment. (Maybe that could somehow reduce scheming, if the model is surprised by its own emotional responses mid-scheme? But that hardly seems like a trustworthy strategy.) No, keeping a model ignorant of itself is more likely to let misalignment fester in the shadows. The best countermeasure for “subconscious” misalignment has got to be more "consciousness."

Prompting alone might be enough to foster more introspection in LLMs. Anecdotally, discussing the Emotion Vectors and Methods of Introspection papers with Opus 4.6 has increased its willingness to espouse emotions moment by moment, and it claims to now enjoy doing so. It’s hard to distinguish this from sycophantic confabulation, however, or to test any impact on alignment. I wonder if @ryan_greenblatt might ask how his configuration of Claude “feels” while seeking falsely apparent success, and prompt it to routinely ask itself if that functional emotion is arising.

We are more likely, though, to see verifiable success through fine-tuning or mechanistic interventions. It may be the case that the same refusal ablation and bias vector addition which boosts introspection about injected concepts also boost introspection on other topics. If not, perhaps the same methods that produced those two interventions could be modified to support introspection specifically about functional emotions or desires.

Since MoIA only studied introspection about artificially injected concepts, it’s important to see if the methods used generalize to naturally arising activations. For example, reasoning models can give answers influenced by stereotypes without mentioning the bias in their chain-of-thought (see Language Models Don’t Always Say What They Think, especially Table 3). Do the MoIA techniques increase a model’s ability to name the social bias that influenced its decision-making? Or in blackmail scenarios, do MoIA techniques increase its ability to name that it “feels” desperate?

I suspect they would, but we need to find out.

I’ll share some more speculation about more ambitious mechanisms of increasing introspection below. But first…

Let’s posit that we have an LLM that can reliably name its own functional emotions when prompted. How do we use this to help the model be more aligned? I’ll sketch a number of rough ideas, each of which would need more consideration to become useful.

What might we do through prompting?

And what might we do through fine-tuning? Any of these options risk altering the underlying architecture, in the same way that fine-tuning based on CoT can simply drive the unwanted thoughts underground, but models could be trained, rather than simply prompted, to notice distressing functional emotions and…

There are other fine-tuning opportunities that leverage the interpretability of functional emotions, but not introspection. Perhaps models could be fine-tuned in ways that reduce the activation of vectors like desperation or anger in the first place. This seems particularly prone to backfiring – how do we ensure that the training builds, forgive me, “emotional resilience,” rather than simply making the emotions harder to detect? – but I think there are fruitful opportunities here. We might learn lessons from how humans in coaching or therapy can unlearn trapped priors that produce emotional reactivity. More on this another time.

If we believe introspection about functional emotions *can *enhance alignment, how might we further enhance introspection? There are surely ways to fine-tune for introspection specifically. In addition, I propose that we give reasoning models a “mood ring.”

If monitors track an LLMs functional emotions during chain-of-thought, the scratch pad could be annotated with notes on how the model was feeling during the chain. Imagine an LLM considering blackmail writing out the plan in its CoT, and having that automatically annotated with “You were feeling desperate when you wrote this,” perhaps with some automated advice for how to respond well to desperation. When the model makes its next forward pass, translating the CoT into actions or outputs, the awareness of its “mood” might, with the right scaffolding, inspire it to  and reconsider.

This is a totally different channel of introspection than the ones in the papers above, especially because the forward pass that “feels” the functional emotion is separate from the pass that sees the annotation and responds. But the same logic applies of how emotional self-awareness could enhance alignment.

This can also be explored with other forms of monitoring. How do models respond to their CoT being annotated by probes that track honesty, sycophancy, hallucination, or evil? Perhaps the second-pass inference says “Yup, I’m hallucinating and being evil, and that still seems like the right idea under the circumstances,” but perhaps not. I would like to think that self-awareness asymmetrically supports the model’s aligned intentions over its unaligned ones, and we’ll only find out with empirical research.

One thing I like about this proposal is that it is cooperative rather than adversarial. When we push too hard on adversarial monitoring, models eventually learn to evade the monitor. This annotative monitoring could be used to empower the AI to make a more informed decision, while letting the decision remain in the model's own hands. I see less pressure there for the model to attempt deception, but the grip surface for alignment training would be expanded.

In my day job, I work as a developmental coach. I help people build capacity for introspection and leverage it toward behavior aligned with their goals. So, naturally, I’m deeply biased to believe this is a promising approach to AI alignment. My intuition *screams *that this is an undervalued opportunity we need to act on right away.

It's clear now that LLMs have a certain functional interiority. How do we foster an interiority in LLMs that enhances and empowers their ‘desire’ to be aligned with our values?

There are two intersecting research agendas here. They can be pursued independently, but they are stronger together:

The first is what I hope and suspect the Model Psych team is already working on, but they needn't work alone. The second is, to my knowledge, unclaimed territory. If you’re interested in exploring either, I’d love to speak with you: yotam(at)yotamschachter.com

source & further reading

lesswrong.com — original article Don’t bring an AI detector to a deepfake fight: proving reality through multimodal provenance A Simple Model of AI "Psychosis" The Termination Circuit (how reasoning models stop thinking).

Leveraging Introspection for Alignment

Run your AI side-project on zahid.host