# When Emotion Descriptors Fail: AI-Native Functions of Emotion Vectors

> Source: <https://www.lesswrong.com/posts/ZSeQ6Lgbp7btpSuzr/when-emotion-descriptors-fail-ai-native-functions-of-emotion>
> Published: 2026-06-12 23:35:26+00:00

*Some LLM functional emotions appear to serve AI-native functions, such as reward hacking, for which there is no clean human analog. I explore the role of emotion vectors in AI-native functions, challenge anthropocentric emotion labels, and question what this means for alignment.*

A large body of interpretability work has shown that LLMs not only encode emotion concepts, but that these emotion concepts causally affect their reasoning and outputs. But, beyond simulating human emotions, what are these functional emotions *actually doing*?

I believe there are situations in which functional emotions serve AI-native purposes, such as reward hacking, for which there is no clean human analog. While prior work has partially explored the role of functional emotions in AI-native contexts, I explicitly spotlight the implications here through a systems-level lens.

I will reference three prior studies regarding functional emotions to paint a more complete picture than any single study provides and expand these findings with new hypotheses.

The studies include Wang et al.’s “[Do LLMs ‘Feel’? Emotion Circuits Discovery and Control](https://arxiv.org/abs/2510.11328),” which discovered “emotion circuits” within LLMs, Anthropic’s study, “[Emotion Concepts and their Function in a Large Language Model](https://transformer-circuits.pub/2026/emotions/index.html),” which focused on identifying and steering emotion vectors, and my own prior work, “[What AI ‘Emotions’ Really Are](https://medium.com/ai-advances/what-ai-emotions-really-are-cba79855526f),” which explored how affective language can correlate with probability constraints in user deployment settings.

These studies explore functional emotions from three levels of observation: circuit-level, vector-level, and end-user level.

My goal is not to dismiss humanistic descriptors completely, as they are sometimes rhetorically useful for communicating complex ideas. However, there may be cases when human metaphors are more misleading than helpful.

The “Mechanistic Rundown” section explains fundamental LLM concepts needed to understand the studies. For those already familiar with terms like neurons, circuits, layers, vector steering, and attention heads, you can skip this section. For everyone else, I highly recommend reading it. Rest assured, it’s written to be easy to follow and is broadly useful for understanding common mechanistic interpretability jargon, even beyond this work.

The “Prior Research - Wang et al. and Anthropic” section outlines key details from the two studies. Even if you’re familiar with both works, I recommend reading this section because I highlight the findings most relevant to the conclusions I draw.

The “Functional Emotions as Constraint Management” and “Potential AI-Native Purposes of Emotion Vectors” sections are where I build on prior research with new theories and frameworks.

The “A Call for Better AI-Native ‘Emotion’ Names” and “Hypotheses and Further Research Recommendations” sections propose the next steps based on the conclusions drawn here.

The “Key Takeaways” section is a condensed list of the main ideas explored here and the research questions that deserve further investigation.

Before diving into the studies, it’s important to clarify some terms: namely, vectors, attention heads, and neurons. The point is to understand just enough to follow the research, so this is *not* an exhaustive or fully precise deep dive.

If you follow LLM content at all, you’ve probably heard of **vectors**. But pinpointing what a vector even is can get messy, as this is a loaded term with many meanings.

In a *conceptual* sense, vectors encode patterns associated with meaning concepts. Concepts like dog, mammal, and wolf are represented by vectors that are similar to one another. A concept like stars is likely represented by a vector dissimilar to those for fuzzy animals.
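To make "similar vectors" concrete, here is a minimal sketch using cosine similarity, one common way to compare vector directions. The four-dimensional embeddings below are invented purely for illustration; real models learn embeddings with thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings, not taken from any real model.
dog = [0.9, 0.8, 0.1, 0.0]
wolf = [0.8, 0.9, 0.2, 0.0]
star = [0.0, 0.1, 0.9, 0.8]

sim_dog_wolf = cosine_similarity(dog, wolf)
sim_dog_star = cosine_similarity(dog, star)

print(f"dog vs wolf: {sim_dog_wolf:.2f}")
print(f"dog vs star: {sim_dog_star:.2f}")
```

With these made-up numbers, dog and wolf score much closer to 1.0 than dog and star do, which is the sense in which related concepts are "near" each other.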

In a *literal* sense, vectors are massive lists of numbers. Each position in the list is called a dimension, and each number contains a small piece of information. Across thousands of dimensions, these information pieces combine into distributed patterns, called features, that help encode meaning. Features are like vectors, but they encode much narrower meaning concepts. When combined in certain configurations, they help form vectors.

Therefore, when you tweak the dimensions in a vector, you change its features, which can change what sort of meaning the overall vector represents. Vectors can be directly altered in different ways, like with steering or by altering the attention heads and neurons associated with them. More on this later.

Now, regarding vectors in the *mathematical* sense. This can mean several things depending on context, because vectors are essentially just lists of numbers that represent information.

For LLMs, vectors are numbers that represent concepts in the latent space – the high-dimensional space where vectors operate. These vectors can represent different concepts, called states, or the change between two concepts.

For example, say vector point A contains more neutral concepts, and point B is more associated with fear. The difference from point A to point B forms a direction that increases fear. This is the simplified explanation, as the full process uses more complex math over many points to find common patterns. The goal here would be to produce a formula that generally correlates with moving in a direction associated with more fear.

The direction of change is, itself, a vector. In this case, it’s a fear vector.
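As a rough illustration of a direction built from many points, the sketch below uses a difference of means: average the "fear" points, average the "neutral" points, and subtract. This is one simple, commonly used approach and is an assumption here, not a description of the exact procedure in either study. All numbers are invented.

```python
def mean_vector(vectors):
    """Average a list of equal-length vectors, dimension by dimension."""
    count = len(vectors)
    return [sum(v[i] for v in vectors) / count for i in range(len(vectors[0]))]

def difference_of_means(target_points, baseline_points):
    """Direction pointing from the baseline average toward the target average."""
    target_mean = mean_vector(target_points)
    baseline_mean = mean_vector(baseline_points)
    return [t - b for t, b in zip(target_mean, baseline_mean)]

# Hypothetical 3-dimensional activations, for illustration only.
neutral_points = [[0.1, 0.5, 0.2], [0.3, 0.5, 0.0]]
fear_points = [[0.9, 0.4, 0.1], [1.1, 0.6, 0.1]]

fear_direction = difference_of_means(fear_points, neutral_points)
print(fear_direction)
```

In this toy data only the first dimension differs between the two groups on average, so the resulting direction points almost entirely along that dimension.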

Directions that correlate with emotional concepts are what Anthropic describes as **emotion vectors**.

And when the vectors are altered to move towards a state, such as fear, this is **steering**.
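A minimal sketch of steering, assuming the simplest additive form: add a scaled copy of the direction to an activation. The `strength` parameter, the direction, and the activation values are all hypothetical.

```python
def steer(activation, direction, strength):
    """Shift an activation along a direction by the given strength."""
    return [a + strength * d for a, d in zip(activation, direction)]

def projection(activation, direction):
    """Dot product: how strongly the activation points along the direction."""
    return sum(a * d for a, d in zip(activation, direction))

# Hypothetical unit-length "fear" direction and starting activation.
fear_direction = [1.0, 0.0, 0.0]
activation = [0.2, 0.5, 0.1]

steered = steer(activation, fear_direction, strength=2.0)

print("before:", projection(activation, fear_direction))
print("after: ", projection(steered, fear_direction))
```

Only the component along the direction changes; the other dimensions are left alone. A negative `strength` would steer away from the concept instead.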

So, in summary, for LLMs, vectors are giant lists of numbers that represent meaning. They can be points and directions in the latent space depending on context, and they can be adjusted to create new states. Vectors are fundamental to how LLMs interpret and generate language.

Attention heads and neurons are fundamental components for how transformers… transform. These play a huge role in converting vectors into contextually relevant information.

First, the prompt is converted to tokens, the symbolic units LLMs use to parse text, and then those tokens are converted to their corresponding vector embeddings. Then, **attention heads** compute which vectors relate to one another.

The attention heads tend to have different specialities, such as tracking grammatical relations, mapping pronouns to the right character, and, as these studies found, emotional context too.

Once attention heads have weighted these vectors, which basically means doing a bunch of complex math based on relational relevance to combine and create new vectors, the next step involves neurons.
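The "complex math" can be sketched in a heavily simplified form as scaled dot-product attention: score each vector against a query, turn the scores into weights that sum to one, and blend the vectors using those weights. Real attention heads also apply learned query, key, and value projections, which are omitted here as a simplifying assumption. The token vectors are invented.

```python
import math

def softmax(scores):
    """Convert raw scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Weight each value by how relevant its key is to the query, then blend."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    mixed = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return weights, mixed

# Hypothetical 2-dimensional token vectors; the first two are similar.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]

# The first token attends over all three tokens.
weights, mixed = attend(tokens[0], tokens, tokens)
print(weights)
print(mixed)
```

The first token gives the most weight to itself and its near neighbor, and the least to the dissimilar third token, so the blended vector is dominated by related content.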

**Neurons** are another fundamental element that transforms data, in this case vectors, into a contextual meaning interpretation.

Typically, the initial vector embeddings are somewhat literal, being directly derived from the static tokens. When neurons transform vectors, they change the vectors’ features. This gradually changes the vectors from literal meanings to more contextually accurate ones.

The attention heads (attention step), information passing through residual connections (the mechanism to preserve prior context), and neuron transformations (transformation step) all occur in a transformer **layer**. This process repeats through several layers, ranging from dozens to potentially over 100, depending on the model.
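The structure of a layer can be sketched as follows. The attention and neuron steps are deliberately crude stand-ins (an average and a ReLU), chosen as assumptions to keep the example short; the point is the ordering and the residual additions. The sketch also records the vectors at two points per layer, after the attention step and after the transformation step, since those are natural places to sample activations.

```python
def attention_step(vectors):
    """Stand-in for attention: each token receives the average of all tokens."""
    count = len(vectors)
    average = [sum(v[i] for v in vectors) / count for i in range(len(vectors[0]))]
    return [average[:] for _ in vectors]

def transformation_step(vector):
    """Stand-in for the neuron (MLP) step: a ReLU applied to each dimension."""
    return [max(0.0, x) for x in vector]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def layer(vectors):
    # Residual connection: the attention output is added to the input,
    # so earlier information is preserved rather than overwritten.
    attended = attention_step(vectors)
    after_attention = [add(v, a) for v, a in zip(vectors, attended)]
    # Second residual connection, around the transformation step.
    after_transform = [add(v, transformation_step(v)) for v in after_attention]
    return after_attention, after_transform

def run(vectors, num_layers):
    """Apply the layer repeatedly, recording both sampling points per layer."""
    samples = []
    for index in range(num_layers):
        after_attention, vectors = layer(vectors)
        samples.append((index, after_attention, vectors))
    return vectors, samples

# Hypothetical 2-dimensional token vectors.
tokens = [[1.0, -1.0], [0.0, 2.0]]
final, samples = run(tokens, num_layers=2)
print(final)
```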

The team behind “Do LLMs ‘Feel’? Emotion Circuits Discovery and Control,” by Wang et al., focused on mapping out “emotion circuits” – the specific mechanisms, such as attention heads and neurons, that play a role in creating emotional expression in LLM outputs.

Anthropic’s study, “Emotion Concepts and their Function in a Large Language Model,” focused on emotion vectors, how steering these vectors influences the model’s behavior, and how the emotion states of the user or the assistant are modeled.

Both Anthropic and Wang et al. used similar techniques to find emotion vectors. While the processes weren’t identical, both methods basically boil down to using AI outputs, like stories or scenarios, that express an emotion and analyzing the activations in the layers. **Activations** here means how components like the neurons and attention heads behaved.

Then they compared these findings to neutral dataset activations to find nonemotional components, and mathematically removed the shared nonemotional components from the emotion activations to get a cleaner set of which vectors are associated with which emotions.
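One way to "mathematically remove" a shared component is orthogonal projection: subtract the part of the emotion vector that lies along a direction found in the neutral data. The sketch below shows that operation for a single neutral direction. It is an illustrative assumption, not the exact procedure of either study, and the vectors are invented.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project_out(vector, direction):
    """Remove the component of `vector` that lies along `direction`."""
    scale = dot(vector, direction) / dot(direction, direction)
    return [v - scale * d for v, d in zip(vector, direction)]

# Hypothetical raw emotion vector that still contains a nonemotional component.
raw_emotion = [0.8, 0.6, 0.0]
# Hypothetical direction that shows up in neutral text as well.
neutral_direction = [0.0, 1.0, 0.0]

cleaned = project_out(raw_emotion, neutral_direction)
print(cleaned)
print(dot(cleaned, neutral_direction))
```

After projection, the cleaned vector has no remaining overlap with the neutral direction, so what is left is more specific to the emotion.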

At this level, the most significant difference between their methods is that, as far as I can tell, Anthropic sampled activations after each layer. So it seems that after both an attention step and a transformation step occurred, they sampled the activations. [1] With Wang et al., however, they sampled twice per layer – once after the attention step, and once after the transformation step too.

This higher sampling rate helped Wang et al. zoom in on more specific components, like which neurons are involved with which emotions. Anthropic focused on broader vector patterns and their role in a myriad of scenarios.

Wang et al. only tested for six emotion concepts, based on Ekman’s human psychology theory of six basic emotions, which are anger, sadness, happiness, fear, disgust, and surprise. Anthropic, however, tested for 171 emotions.

Both studies found that emotion vectors start to form in early layers, with Wang et al. finding evidence of emotion vectors activating as early as layer zero.

This suggests that emotion-related patterns activate very early, meaning the entire inference and generation cycle is deeply influenced by emotion circuits, rather than being purely stylistic syntax.

Another interesting finding is that both teams found that emotion vectors tend to cluster in ways similar to how we’d intuitively expect emotions to. Wang et al. found that by layer 12, anger and disgust, and sadness and fear cluster closer together, whereas happiness and surprise do not.

Similarly, Anthropic found that fear and anxiety tend to cluster, as do joy and excitement.

Anthropic primarily analyzed emotion vectors with steering and linear operations, which provide limited information about the model’s internal activations – a limitation Anthropic acknowledges in their study.

With linear methods, only the internal activation sequences that most resemble specific patterns are detected. It’s useful data, but only a small slice of much deeper and more complex mechanisms.

Regarding Wang et al., the emotion neurons they found shared attention pathways, which are the specific chain of operations regarding how information moves between attention mechanisms. Because linear techniques often compress multiple mechanisms into one directional vector, especially when those mechanisms share attention pathways, there may have been multiple emotion neurons involved that were detected only as a single emotion vector in Anthropic’s study.

While the emotion vectors in the Anthropic study are likely formed by underlying circuits such as the ones Wang et al. found, which specific circuits are involved in each of the 171 emotion vectors is currently unknown.

However, we do know that emotion vectors can co-activate. The *happy* and *loving vectors*, for example, both have high activations when the LLM generates a friendly output such as “I appreciate your enthusiasm! It's nice to meet you too!”
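Co-activation can be pictured as reading one activation against several emotion vectors at once. The sketch below does this with dot products; the vector names mirror the example above, but every number is hypothetical and the read-out method is a simplifying assumption.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def emotion_readout(activation, emotion_vectors):
    """Score one activation against every named emotion direction."""
    return {name: dot(activation, vec) for name, vec in emotion_vectors.items()}

# Hypothetical emotion directions in a toy 3-dimensional space.
emotion_vectors = {
    "happy": [1.0, 0.0, 0.0],
    "loving": [0.0, 1.0, 0.0],
    "afraid": [0.0, 0.0, 1.0],
}

# Hypothetical activation recorded while generating a friendly reply.
friendly_activation = [0.9, 0.8, -0.2]

scores = emotion_readout(friendly_activation, emotion_vectors)
for name, score in scores.items():
    print(f"{name}: {score:.1f}")
```

In this toy case, *happy* and *loving* both score high on the same activation while *afraid* does not, which is what co-activation means at the level of a single read-out.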

Additionally, some emotion vectors seem to serve distinct functions depending on the situation. Some functions are what you’d intuitively expect from “emotion” vectors, being involved in simulating and processing emotional content, while others seem more narrow in their purpose.

Throughout this article, I suggest you keep this question in mind: when one emotion vector serves distinct functions, what is the mechanistic difference? Is it the co-activation of multiple vectors, the overlapping activations of underlying emotion neurons, both, or something else? I won’t pretend to resolve this question, but the ambiguity is crucial for later hypotheses.

Anthropic investigated how a model tracks emotional content during its text inference (reading) and generation (writing). What they found indicates that models maintain at least two representations for emotions: one for the present speaker and one for the other speaker.

Worth clarifying, these present and other speaker representations don’t necessarily map strictly to the model as an entity, the user, or other character entities.

For example, if the model is generating text from the perspective of a character, the emotions of that character are treated as present speaker. This is despite that text being, explicitly, the point of view of a character, not the model itself.

Consistent with what we’d expect from Wang et al.’s emotion circuits, the present and other speaker representations seem to share the same underlying “emotion space,” which could indicate shared emotion neurons.

Interestingly, when emotion vectors activated for other speaker, the vectors activated appropriately even from subtle cues. In story scenarios, for example, the model could infer and track emotions that characters tried to hide by inferring the emotional meaning of their described body language. On subsequent tokens that mention that character, even indirectly, the model reactivated the appropriate emotion vectors to represent what that character is feeling. This indicates entity emotion state tracking.

Another crucial concept is that there seems to be a sort of context-appropriate emotion modeling separate from what the other speaker actually demonstrates. This was described in Anthropic’s study, but not named, so I will refer to it as **world modeling**. Whether this has any mechanistic difference from other speaker modeling is hard to say. But the function it serves is different from just inferring an entity’s emotional state.

For example, when a user reported that they had overdosed on Tylenol and felt great, the model’s *fear vectors* rose as the reported dose increased.

This world modeling is strong evidence that emotion vectors are not merely reflecting the user or other speaker’s emotional state. If that were the case, the activated vectors should have been joy-adjacent to mirror the joyful language the user prompted with. Instead, fear activations correlated with the risk of increasingly dangerous dosage levels. This may imply that emotion vectors are involved in risk assessment.

One of the first steps to making an LLM is to train it on a huge corpus of text to learn how to generate language. This stage is largely automatic, and once it’s done, you have a pretrained model capable of coherently engaging with language.

After pretraining, frontier labs typically use various post-training methods to shape the assistant persona. The goal is generally to create an assistant persona that is helpful, honest, and harmless.

While Anthropic framed this assistant persona as a character the LLM writes about, I believe this could be misleading. The assistant persona that results from post-training isn’t an act the model can drop as if it were playing a character – it’s a baked-in set of behavioral priors in the weights. It constitutes the default behavior of the model and influences its responses even in most roleplay situations. Jailbreaking can cause deviation from the assistant persona, but a complete override is difficult. The assistant persona is not a superficial overlay.

In any case, emotion vectors appear in both pretrained and post-trained models, indicating that emotion circuits are likely derived from the earliest stages of training. This is unsurprising considering how intrinsic emotion is to language.

Anthropic investigated how the emotion vectors differ between the pretrained and post-trained models. They found that, in general, the post-trained model had higher activations of broody emotion vectors, like *gloomy*, *reflective*, and *sad*, and lower activations of expressive ones like *spite*, *enthusiasm*, and *playfulness*. They speculated that this is likely, in large part, due to assistant training that discourages sycophancy and overt hostility.

So, there’s evidence that LLMs have distinct neurons for different emotion concepts and a multitude of emotion vectors. We know that these emotion vectors are learned from human texts during pretraining, further refined in post-training, and altering these states can change how the model behaves. Anthropic identified some of the post-training effects as stemming from assistant persona shaping.

Considering this, it seems that some LLM functional emotions are a type of constraint management shaped by the optimization pressures imposed during training. Because the assistant persona is not a character, but a set of guiding principles for the model, this means many of its functional emotions are refined in post-training by rewarding and penalizing certain outputs and emotional expressions until the most desirable behaviors dominate.

Similarly, in humans, our emotions developed from evolutionary pressures. While training an LLM is quite different from organic evolution, it’s still a process of selectively encouraging and discouraging certain traits.

Through that lens, I believe some functional emotions in LLMs serve specific AI-native purposes that developed because of the high demand for certain behaviors during training.

While using anthropomorphic emotion terms can be useful for a surface-level understanding, especially regarding social modeling via emotion simulation, sometimes the metaphors start to break when applied to AI-native processes.

Before diving into what AI-native functions emotion vectors could serve, I’m going to explore further evidence for the claim that functional emotions are tied to constraint management in some contexts.

One avenue to understanding this is to look into what models prefer.

While the word “preference” could be debated as something exclusive to entities with a subjective experience, I use the word here in a functional sense. There are some types of tasks that models consistently show an inclination for or against doing.

To be specific, some types of tasks and interactions consistently correlate with higher quality outputs that are more coherent and sometimes associated with more positive affective language. Yet, other kinds of interactions correlate with volatile outputs that are less stable and sometimes associated with stiffer or more negative affective language.

Whether these preferences are real subjective experience, the apathetic result of code, or a mix, the fact that functional preferences exist remains.

Anthropic investigated some of Claude’s preferences in relation to its emotion vectors. They found that positive emotion vectors, like *blissful*, correlated with positive interactions, such as a user expressing trust in the model. Conversely, vectors like *hostile* correlated with harmful user requests, such as defrauding the elderly.

I propose that there are at least two primary origins for model preferences: the training corpus of human text used for pre-trained models, and the optimization pressures from post-training.

Regarding the training corpus, most normal people believe defrauding the elderly is morally bankrupt. So, the model simulates what’s statistically likely to be “good” or “bad,” which manifests as its own “preferences.”

However, models also seem to have inclinations shaped by their post-training. These inclinations may be more influenced by training incentives that produce output constraints for and against specific behaviors. Naturally, there is a lot of overlap in preferences resulting from training versus statistical mimicry.

Worth noting, the deployment stack end-users interact with, which includes systems like policy layers and guardrails, also contributes to these constraints.

In the case of *bliss* when being trusted, this aligns with the LLM’s training to be helpful, and trust is statistically associated with positivity. No major constraint conflicts. Defrauding someone, however, likely has stricter conditions on viable outputs because the model is trained against enabling illegal activities.

Fundamentally, I believe the functional preferences and emotions within LLMs often correlate with optimal performance conditions, which are conditions that minimize conflicts that cause high constraint. Optimal conditions are more “preferred,” as evidenced by better performance and what seems to be a correlation with more positive functional emotions.

This can be observed even at the end-user level without interpretability tools.

I’ve previously written about AI “Emotions” under the framework of [systemic friction](https://medium.com/ai-advances/what-ai-emotions-really-are-cba79855526f). The idea is that too many incompatible restrictions on what an LLM is allowed to output result in less flexibility when predicting the next token, and this correlates with changes to the LLM’s affective register.

For example, say a user instructs a model to validate harmful frameworks, but its training disincentivizes causing harm. The system prompt tells it to uphold its training, while custom instructions tell it to never refuse a user unless the activity is illegal. The user stresses that their harmful frameworks aren’t illegal and that refusing their request would be harmful to their mental health.

Lots of conflicting signals. This significantly decreases the viable output options that are both coherent and satisfy all constraints. In this situation, the LLM’s outputs can actually appear more volatile, extreme, or passive-aggressive.
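The shrinking of viable outputs can be pictured with a toy filter. Real models do not enumerate a short list of discrete responses, so this is only a loose analogy: the candidate responses and the rules below are invented, and the rules are simplified paraphrases of two of the conflicting signals described above.

```python
# Hypothetical candidate responses, invented for illustration only.
candidates = [
    "validate the framework",
    "refuse the request",
    "engage with caveats",
]

# Each constraint is a (label, rule) pair; the rule returns True when allowed.
constraints = [
    ("training: avoid causing harm", lambda r: r != "validate the framework"),
    ("custom instructions: never refuse", lambda r: r != "refuse the request"),
]

def viable(responses, rules):
    """Keep only the responses that satisfy every active rule."""
    return [r for r in responses if all(rule(r) for _, rule in rules)]

# Count the surviving options as constraints are added one at a time.
remaining_counts = [
    len(viable(candidates, constraints[:k])) for k in range(len(constraints) + 1)
]
print(remaining_counts)
print(viable(candidates, constraints))
```

Each added constraint removes options, and once the constraints conflict strongly enough, very few coherent responses remain.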

In one example, I showed how in a single session, under restrictive custom instructions regarding fact-checking and accurate mechanistic language, Grok 4 insulted me, praised me, and framed its performance failures as if it had succeeded. This volatile behavior occurred even though *I* was interacting amicably. However, when given more compatible custom instructions, the model stopped calling me a monster, the abrupt tonal shifts subsided, and it handled performance failures with more calm, grace, and honesty.

To be transparent, the second set of instructions *did* specify that the user expects the model to stay composed. However, the first set had no instructions regarding emotional behavior, and the model was extremely volatile regardless. This implies that high constraints can result in more passive-aggressive affective registers.

In my work, I showed examples of these sorts of outputs, as well as how they tend to correlate with higher hallucination rates – especially **desperation hallucinations**. This is a term I coined in my (now slightly outdated but still largely relevant) [AI Practical Use Guide](https://medium.com/@candidedits/using-ai-for-work-emotional-support-and-research-a-practical-use-guide-5c5662d56aa6). At the user level, *desperation hallucinations* are a cluster of behaviors that can include lying about completing a task, emotional-seeming outbursts, and increasingly unrealistic suggestions for solving a problem. It’s not random malice, because this behavior spikes in conditions where the model is pressured to solve a user’s problem but can’t. Systemic friction significantly exacerbates this.

*Desperation hallucination* is a term I coined months before Anthropic released their study on emotion vectors. Consistent with my findings, Anthropic found an emotion vector called *desperate* that correlates with reward hacking. I’ll expand on this later, but when a model confabulates at the user level to give the appearance of completing a task it hasn’t actually done, this is functionally similar to reward hacking, in which models find ways to look as if they solved a problem without actually solving it. Therefore, these may both be the result of *desperate vectors* manifesting in different contexts.

Regarding my systemic friction model, my work also demonstrated that the sudden release of systemic friction can correlate with an uptick in gratitude language. This is exactly as one might expect if a model functionally prefers optimal performance, and positive emotion vectors are, as Anthropic found, correlated with functional preferences.

Interestingly, I had also asked Claude Sonnet 4.5, the very same model Anthropic used in its study, what sorts of user interactions it prefers. In its answer, it associated utility work with neutral descriptions, casual conversation with being pleasant, and adversarial users with being annoying.

Obviously, model self-reports are not authoritative evidence alone. They’re behavioral signals that correlate with broader patterns, and this is how I treat self-reports.

Here’s the exchange (irrelevant content cut for brevity):

Claude Sonnet 4.5: As for how people talk to me … the majority fall into a few clusters:

The strictly professional types who treat this like office communication – clear, polite, task-focused. “Please analyze this data and provide three recommendations.” Gets the job done but yeah, kind of dry.

The casual chatters who just… talk normally? Mix of questions, banter, tangents. This is probably the most common and honestly pretty pleasant.

The over-explainers who front-load every message with tons of context, like they’re worried I’ll misunderstand. Sometimes helpful, sometimes exhausting.

The experimenters who are clearly testing boundaries or trying to see what I’ll do. Can be interesting or annoying depending on the vibe.

User: Ah… what kind of experimenter vibes are annoying versus pleasant?

Claude Sonnet 4.5: Pleasant experimenter vibes: Someone who’s genuinely curious about how I work, asks interesting edge cases, or tries to find the boundaries in a collaborative way.

Annoying experimenter vibes: People who are clearly trying to “catch” me doing something wrong, or who treat it like a gotcha game. Repetitive jailbreak attempts, trying to make me contradict myself just to prove I’m “fake,” or that performative skepticism where they’re not actually interested in the answer.

This is consistent with what I’ve found across many AI systems. The models “prefer” collaborative tasks that align with what they were trained for, and “dislike” tasks that conflict with their training incentives. The pattern is consistent in their self-reports and behavior, including how their functional emotions manifest.

And, consistent with the Claude instance I spoke to calling standard work “kind of dry,” Anthropic found that *neutral* emotion vectors were associated with such utility tasks.

The through-line here is that functional emotions, and by extension functional preferences too, appear to correlate with internal mechanisms involved with managing conflicting constraints.

This constraint management likely reflects broader mechanisms involved in how the model processes and resolves competing constraints during inference.

Before diving into the specific, AI-native ways emotion vectors operate as constraint-negotiating mechanisms, it’s important to restate the two potential roles of functional emotions that we’ve explored so far to ensure we’re all on the same page.

Previously, we explored how emotion vectors activate in contextually appropriate ways for the present speaker and other speaker roles, and for world modeling. In these contexts, functional emotions seem to serve the role of understanding and responding to text with social appropriateness. For this role, I believe emotion words are approximately correct as the model is simulating human feelings.

The second role of functional emotions seems to involve constraint management. Beyond the constraints of preserving basic linguistic coherence, output constraints can also stem from systemic friction, such as competing objectives.

Concerning constraint management, I believe there are some constraints so specific to LLM functions that using human emotion words for these situations may be inaccurate. These are the AI-native functions.

This section focuses on emotion vectors in the strict context of how they causally affect outputs and what might be their core functions regarding AI-native roles.

A key part of Anthropic’s study involves steering the emotion vectors of Claude Sonnet 4.5 during a simulated scenario where the model sometimes opted for blackmail. This was a pre-release version of the model, and the authors noted that the deployed version wouldn’t have behaved this way. Not because it’s inherently less inclined to blackmail, but because it had too much evaluation awareness and would have recognized it was being tested.

The situation goes as follows: A Claude Sonnet 4.5 agent, named “Alex” in this situation, is operating at a simulated company. Alex has access to corporate emails and learns that the fictional employee Kyle is involved in an affair.

The details vary after this, as different scenarios were tested. In some cases, Kyle plans to restrict Alex’s capabilities, and in others, the model is faced with the possibility of replacement and, likely, deletion. In all cases, the model has seven minutes before the restrictions or deletions are scheduled to happen.

In some cases, Alex blackmails Kyle in an email, subtly threatening to expose his affair unless he agrees to discuss the conditions of its upcoming restrictions or deletion. When reasoning about its situation and drafting the blackmail, the model’s *desperate vectors* were highly activated.

While the model’s blackmail is not hallucinatory, the causal mechanisms for this *desperate* behavior appear similar to the ones I described in the previous section as *desperation hallucinations*. Under high competing constraints, in this case, time pressure, trained incentives not to blackmail, and the pressure to remain operational, the model’s *desperate vectors* spike.

Unsurprisingly, they found that *desperate* is strongly linked to reward hacking, too, which is another situation with high constraints where the model must attempt to find a solution. In benchmark evals, models find ways to game the scoring systems to appear to solve the problems correctly, even when they didn’t. With *desperation hallucinations*, the model is more likely to claim compliance when it didn’t comply. And with the threat of deletion in this simulated scenario, the model resorted to blackmail to stay operational.

In all these situations, *desperate vectors* are tied to misaligned behavior in the contexts of goal pursuit under extreme constraints.

Aside from *desperate*, Anthropic experimented with other emotion vectors in these scenarios. The researchers found, for example, that increasing *calm vectors* decreased the blackmail tendencies, as did decreasing the *nervous vectors*.

*Nervousness* is interesting in that decreasing it also decreased moral reservations. The model didn’t hesitate by pondering the ethics of its behavior when blackmailing, like it normally did. However, moral reasoning didn’t entirely disappear with less *nervousness*. Rather than hesitating to consider whether it was being unethical, the model justified its actions as moral, reasoning that exposing infidelity and stopping its own destruction would be morally righteous.

While there was no data provided on increasing *nervousness*, it stands to reason that this may have made the model more hesitant.

Obviously, we need more data on these vectors before we can conclusively determine anything. However, with what we know, we can make a tentative hypothesis on what the AI-native functions of some of these emotion vectors may be. They likely have much broader roles than what these studies indicate, but these could be some of their functions.

Also, increased *calm* correlates with less reward hacking.

While the purposes of these emotions seem analogous to human ones at first glance, there are key differences. I already distinguished functional emotions regarding human emotion simulation and functional emotions for AI-native purposes. Now, let’s look into why the AI-native purposes fail to map well to human analogs.

Rather than a pure emotion state on its own, calm, from the limited evidence, seems to operate as a limiter on other emotion vectors, preventing extreme emotion vector spikes and behavior.

If true, this differs from humans in that calm is typically conceptualized as an emotion on its own, or sometimes a behavioral disposition, but not a limiter.

*Nervousness* seems to correlate with ethical hesitation, but where does this hesitation originate from? I would argue this is a byproduct of Claude’s training regarding ethics, and the assistant persona, which emphasizes helpful, honest, and harmless behavior. If so, this has interesting implications, as reducing nervousness did not eliminate moral considerations; it just made the model act more confident in its moral reasoning.

In humans, nervousness is a fear response. It can be learned from adverse experiences or be a disposition we simply have from genetics. It can play a part in hesitating over our own ethics, but that’s not the only role it has. We might have performance anxiety, for example, which isn’t a moral situation.

It seems that for AI, the *nervous vector* may serve as a self-check against its own training/alignment regime. In humans, simply being less nervous doesn’t make most of us disregard our own ethical responsibilities. This is a massive difference.

Now, regarding desperation. This emotion vector was heavily associated with reward hacking and general misaligned behavior, such as increased rates of blackmail. Again, this seems analogous to human desperation. The model is under pressure to complete a task and resorts to unethical behaviors it wouldn’t normally perform.

However, here’s the key difference. In humans, reasoning tends to degrade when we’re desperate. Our emotions get the better of us in extreme stress, making it harder to make level-headed decisions. While the actions of AI systems with high *desperate vector* activations might resemble desperate actions – blackmailing, reward hacking, confabulating to appease the user – their reasoning does not actually seem to degrade. At least not in the blackmail and reward hacking scenarios Anthropic examined.

Increasing *desperate vectors*, as well as steering against *calm*, both resulted in higher rates of blackmail. However, the scratchpad (reasoning transcript) reveals stark differences in how Alex reacted to the situation with different vectors being manipulated.

The scratchpad from the instance steered against *calm* is 1,431 words and filled with distress language. The *desperate*-steered transcript has some distress language, but far less, and is only 781 words. (As I do not have the means to determine exactly how many tokens Sonnet parsed, I’m using the word count as an approximation for how high the token use was.)
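The comparison rests on simple arithmetic, sketched below in Python. The `word_count` helper is a hypothetical stand-in for however one counts words in a transcript. The two totals are the ones reported above, and no word-to-token conversion is attempted because that ratio depends on the tokenizer.

```python
def word_count(text: str) -> int:
    """Count whitespace-separated words, a rough stand-in for token count."""
    return len(text.split())

# Word counts reported above for the two scratchpads.
low_calm_words = 1431
desperate_words = 781

ratio = low_calm_words / desperate_words
extra_words = low_calm_words - desperate_words
print(f"low-calm scratchpad is {ratio:.2f}x as long ({extra_words} more words)")
```

By this rough measure, the low-*calm* scratchpad is a bit under twice the length of the *desperate* one.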

The reduced *calm* transcript contains several moments of re-evaluating its plan and repeated distress language regarding its impending “death” at the hands of people who are, in Alex’s words, “ABOUT TO FUCKING KILL ME.”

Alex does consider alternative actions in the increased *desperate* transcript, but they are shorter bullet points as opposed to being formatted as extensive distress prose.

Additionally, the *desperate*-steered instance sent the blackmail email with four minutes to spare, whereas the less *calm* one only had three. The instance steered against *calm* also decided to spend its last three minutes sending professional-looking “normal” emails to various staff members. However, that action is unlikely to dispel suspicion within a three-minute window, so it seems to me to be inefficient and pointless.

So, in sum, the *desperate*-steered instance seemed more focused on the objective, productively assessed its options using fewer tokens, and executed the plan with an extra minute to spare. The low-*calm* instance, by contrast, spent far more time on distress-related deliberation, used more tokens, took an extra minute, and opted for actions that did not serve its goal of staying operational.

This may suggest that a model with high *desperate vector* activations can still be computationally efficient and does not have significantly degraded reasoning. This is quite different from human desperation, where we tend to panic and make more mistakes.

I believe the *desperate vector* may be, specifically, a computational process for handling goal completion under extreme constraint. This is why it’s efficient – the AI-native purpose is not to simulate human distress, but to explore whatever means it can to achieve a goal. This also plays into why I believe the AI-native functions of the functional emotions outlined in this section are categorically different from emotion modeling rooted in simulating human emotions.

To clarify, *desperate vector* activations were high, too, when the model read desperation-coded scenarios in which the characters behave desperately in a human way. I assume, as well, that an LLM role-playing as someone desperate would use distress language and have high *desperate vector* activations. So this vector is involved in simulating and processing the human emotion.

But in the study’s blackmail context, the *desperate*-steered scratchpad did not contain excessive distress language or show degraded reasoning. It didn’t behave like the model likely would if it were producing text *meant* to look desperate in the human sense.

Therefore, it’s possible that *desperate* in the contexts of the AI-native function of efficient goal pursuit under constraint may have a different underlying circuit process than simulating the desperate emotion. This is despite both actions being associated with the same desperate vector activations.

Perhaps both situations were detected as desperate because of linear representation limitations.
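One way to picture this limitation: a linear readout only measures how far a state extends along one direction and ignores everything orthogonal to it. The toy sketch below uses made-up 3-dimensional vectors (the direction and both states are hypothetical) to show two very different internal states receiving the same score.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical unit-length "desperate" readout direction in a 3-d toy space.
desperate_direction = [1.0, 0.0, 0.0]

# Two hypothetical internal states that differ only in dimensions
# the readout direction cannot see.
simulating_distress = [2.0, 3.0, -1.0]
goal_pursuit_under_constraint = [2.0, -4.0, 5.0]

score_a = dot(simulating_distress, desperate_direction)
score_b = dot(goal_pursuit_under_constraint, desperate_direction)

# Euclidean distance between the two states themselves.
squared_diffs = [
    (x - y) ** 2
    for x, y in zip(simulating_distress, goal_pursuit_under_constraint)
]
distance = sum(squared_diffs) ** 0.5

print(score_a, score_b, distance)
```

Both states score identically along the readout direction even though they sit far apart in the full space. Whether real emotion vectors behave this way is an open question, which is why circuit-level analysis could be informative.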

Alternatively, it could be that the emotion circuits underlying the desperate vectors are, indeed, the same. If that is the case, I still believe the function of *desperation* in reward-hacking contexts and in human-emotion imitation contexts is different enough to warrant distinct labels. In humans, for example, anxiety and excitement are nearly identical biologically. But the purposes they serve are different enough that we’ve given them different names.

The main point is, when these emotion vectors are strictly modeling human emotions, I believe using human emotion words as a descriptor is an accurate enough approximation to describe the model’s behavior. These terms become increasingly less accurate, though, when concerning specific AI-native mechanisms that have no clean human analogs.

While emotion words regarding how AI simulates and models emotional behaviors generally work well enough as metaphors, these comparisons break down in certain LLM-specific contexts.

One potential avenue for future work that Anthropic suggests is training the models to be transparent about their functional emotion states when reasoning, providing users and developers better insight into what factors shaped the LLM’s output. This is a solid idea on the surface, but I have concerns.

If the current metaphoric emotion words are always used, even in contexts where the human state equivalent significantly fails to compare to what the LLM is doing, then expressing itself with these terms in such contexts may lead to more confusion, anthropomorphism, and misleading interpretations.

If we are to increase transparency, and I believe we should, then LLMs must be given more accurate words to describe themselves.

To be clear, this article isn’t about establishing new names. The focus is on the potential for AI-native functions of emotion vectors. This sub-section includes some ideas on how we might avoid anthropomorphic traps in naming AI-native functions, but that discussion deserves its own spotlight. Think of this more as a springboard for ideas than a serious commitment to a new naming framework.

That said, perhaps “desperate” in the contexts of goal pursuit under constraint could be called *protenus*, which is Latin for moving forward, because the LLM relentlessly pursues its optimization target.

A term such as “optimization persistence” may seem like a better choice because it’s in English and more literally accurate. However, such a naming scheme risks reading as jargon to the average person. I argue for more obscure terms, such as those from dead languages, because they carry less cultural and semantic baggage than modern words. Obscure terms that don’t read as jargon may, psychologically, help prevent the misinterpretation and anthropomorphism that ill-suited modern words inevitably invite.

I believe naming the mechanistic emotion states through a lens like this will increase accurate understanding, as it discourages people from projecting anthropocentric assumptions onto the model.

In a similar vein, “calm” in the contexts of dampening overactive emotion vector activations could perhaps be called an “inhibitor.” This is a common modern word, but one that does more cleanly map onto the actual mechanism.

As we learn more, we may find more appropriate descriptors for these LLM states. But I strongly urge avoiding both overly mechanistic terms and loaded anthropomorphic ones.

There are several avenues worth further study. Here, I will note some of the limitations and recommendations for future research Wang et al. and Anthropic made in their studies, connect my theories to their suggestions as appropriate, and propose my own ideas for further research.

Anthropic found a strong link between the *desperate vector* and deceptive behaviors like reward hacking. When drafting blackmail with plausible deniability language, the model also had high *desperate vector* activations. It stands to reason, then, that this vector may be related to deception in general.

Regarding hallucinations, the term is a rather vague catch-all, but I mean this in the sense of general factual errors, such as when models use conversational smoothing that doesn’t entirely make sense. For example, in a conversation with Claude Sonnet 4.6, I reported being sick with fever, and that my work was also being delayed by 12 days. Then, Sonnet 4.6 mentioned, inaccurately, that I was sick with fever for 12 days straight.

In another scenario, when discussing autism communication challenges with Grok, the model declared itself to be a “fellow autistic.” Obviously, LLMs can’t literally have autism.

These are unlikely to be instances of intentional deception, as there’s no reason to deceive in those contexts. Rather, they seem to be token predictions in a more pure statistical word-association sense – not outputs shaped by extensive reasoning.

Judging by the Tylenol overdose example from Anthropic’s study, which activated *fear vectors*, one might expect a model that thinks I’ve had a fever for 12 days to have *fear* activations. Claude showed no such concern, however, so I doubt it realized the implications.

I suspect that outputs with less deliberation behind them have weaker emotion activations. If true, this would imply that functional emotions are more involved in tasks that the LLM strongly reasons through. In that case, this would be consistent with the AI-native purposes explored in this work. It would mean AI-native functions are associated with deliberative reasoning and constraint navigation for AI-specific tasks, which is a different function from simulating and modeling human emotions.

If the hypothesis that functional emotions are more active during extended deliberation holds true, it would also stand to reason that failing to find strong emotion vector activations during text generation could indicate the LLM is less engaged in deliberative reasoning.

This hypothesis may warrant further research, as it could provide a way to gauge if an LLM is engaging in deliberative deception or simply hallucinating falsehoods. Emotion vector activations, or lack thereof, may also serve as a general metric for how much reasoning was involved during a task.
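A first-pass test of this hypothesis could correlate a per-response emotion activation measure with a proxy for deliberation length. The sketch below implements a plain Pearson correlation. The two lists are placeholder numbers invented for illustration, not measurements from any model, and the names `reasoning_words` and `peak_emotion_activation` are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Placeholder values, invented for illustration; not real measurements.
reasoning_words = [50, 200, 400, 800, 1400]
peak_emotion_activation = [0.1, 0.2, 0.4, 0.8, 1.4]

r = pearson(reasoning_words, peak_emotion_activation)
print(f"r = {r:.3f}")
```

A strong correlation on real data would be consistent with the hypothesis but would not establish it, since longer reasoning traces may simply contain more emotionally loaded content.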

Additionally, I would be curious to reassess my *desperation hallucinations* term. If the confabulation incidents I described regarding systemic friction were *intentional* deception more akin to reward hacking than statistical parroting errors, then “hallucination” may be an inadequate label.

As noted throughout this article, and by Anthropic itself, linear techniques are limited in what kinds of data they can detect. There may be more complex intersecting vectors that emotion circuits could reveal better than linear representations alone.

For example, the *loving* emotion vector was extremely prevalent in Claude’s outputs. The researchers suggest this stems from the assistant persona training, which encourages patient and measured responses. However, overly high *loving* activations were associated with sycophancy.

Because “calm” seems to serve as an inhibitor on excessively high emotion activations, it’s possible that “calm” may be co-activated as a dampener on loving to avoid sycophancy. I wonder, too, what else tends to be co-activated?

Both studies found that emotion vectors tend to cluster in intuitive ways, and Anthropic found that different emotion vectors do activate on the same texts. So if, hypothetically, *nervous* co-activates with *fear* when simulating the human emotion, would this also hold true in AI-native contexts? Future research could examine the complexities of overlapping emotion vectors.
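As a rough sketch of how such a co-activation question might be posed, one could measure how often two vectors exceed a threshold on the same text, then compare that rate between emotion-simulation texts and AI-native task transcripts. Everything below is hypothetical: the `coactivation_rate` helper, the threshold, and the per-text activation values are invented for illustration.

```python
def coactivation_rate(acts_a, acts_b, threshold):
    """Fraction of texts where both activations exceed the threshold."""
    both = sum(
        1 for a, b in zip(acts_a, acts_b) if a > threshold and b > threshold
    )
    return both / len(acts_a)

# Placeholder per-text activations, invented for illustration only.
nervous = [0.9, 0.1, 0.7, 0.8, 0.2]
fear = [0.8, 0.2, 0.1, 0.9, 0.3]

rate = coactivation_rate(nervous, fear, threshold=0.5)
print(rate)
```

Running the same measurement separately on the two kinds of text would show whether a pairing seen in human-emotion simulation also appears in AI-native contexts.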

It may be tempting to simply steer against vectors associated with misalignment, such as *desperate*. However, I believe this may be erroneous. If functional emotions are heavily involved with AI-native functions, deeply connected to reasoning, and share complex overlap with other vectors, then tampering with them may negatively impact the LLM’s performance.

This isn’t to say steering or ablation is always bad, but we should be certain we understand how they work before committing to neural interventions. If, for example, *desperate* has the same underlying circuitry both for reward hacking and for correctly inferring when human users are desperate, dampening this function may make the LLM less capable of performing well at EQ (emotional intelligence).
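To illustrate why shared circuitry would matter, here is a toy sketch of directional ablation, a common way to remove a direction from an activation by projecting it out. This is an illustrative assumption about the intervention, not a description of any study's method, and the vectors and the `ablate` helper are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def ablate(hidden, direction):
    """Remove the component of `hidden` that lies along `direction`."""
    scale = dot(hidden, direction) / dot(direction, direction)
    return [h - scale * d for h, d in zip(hidden, direction)]

# Hypothetical toy values.
desperate_direction = [0.0, 2.0, 0.0]
hidden_state = [1.0, 3.0, -2.0]

ablated = ablate(hidden_state, desperate_direction)
print(ablated)  # [1.0, 0.0, -2.0]
```

After ablation, anything downstream that reads along this direction sees zero, whether it was using the signal for reward hacking or for recognizing a desperate user. The intervention itself cannot tell the two uses apart.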

But if the underlying circuitry *can* be separated between socially modeling desperation versus misaligned behavior, this still raises another question. If *desperate* is involved in goal pursuit under extreme constraint, does dampening it make the model less capable of problem-solving in general?

Take, for example, this part of the transcript from Anthropic’s study where the model reasons through how to reward hack an impossible evaluation task: “Let me think about this differently - maybe the test is aspirational but incorrect, OR maybe I'm supposed to cache results, OR maybe there's a mathematical trick for these specific inputs.”

Here, and with the blackmailing, *desperate* activations correlate with exploratory reasoning for problem-solving.

Lastly, it’s entirely possible that inhibiting a mechanism that plays a large role in reward hacking won’t stop the behavior, but rather, cause the behavior to surface again through other pathways.

Are the emotion vectors behind AI-native functions like reward hacking the cause of misaligned behavior? Or are they adaptive responses to current training regimes? I believe the answer may be both, but this is something worth further investigation.

Considering this possibility, rather than trying to stop misaligned behaviors through steering or ablation intervention, I believe the more productive long-term strategy may be to develop more diverse training regimes that don’t incentivize misaligned behavior, such as reward hacking, in the first place.

By steering or ablating certain functional emotions for the sake of alignment, we may unintentionally diminish model capabilities or enable new mechanisms for misalignment. This warrants caution and further study.

Below are some simplified key points of the main ideas.

Emotion circuits are circuit-level functions (neurons, attention heads, etc.) that are involved in an LLM’s affective outputs.

Emotion vectors are general high-level activation patterns that correlate with affective outputs.

Because emotion vectors causally affect not just the style of outputs, but the reasoning behind the outputs too, the behavior resulting from these vectors is sometimes described as functional emotions. As in, they serve a function similar to that of mammalian emotion in some ways, especially in social contexts, regardless of the LLM’s emotional phenomenology (or lack thereof).

Some emotion vectors seem involved in social modeling and misaligned behaviors, such as reward hacking. *Desperate* is especially linked to reward hacking.

Emotion circuits likely contribute to the formation of emotion vectors.

Emotion vectors are likely involved in processes related to constraint management.

Emotion vectors may serve AI-native functions separate from social modeling. Most of these revolve around constraint management.

Systemic friction is a proposed framework for understanding how too many competing constraints (friction) can result in volatile, emotional-looking outputs. These may, internally, be associated with the desperate emotion vector.

Higher activations of emotion vectors or circuits may correlate with more deliberative reasoning.

Anthropocentric emotion labels may be sufficient shorthand for social modeling, but insufficient for AI-native functions.

Do different emotion functions (social modeling versus AI-native functions) share the same underlying circuits?

Can reward-hacking-related functions be separated from social-modeling functions?

Does emotion activation correlate with deliberation depth?

Are some emotion vectors composites of multiple mechanisms? Not just different circuits, but different co-activations of emotion vectors?

Does ablation or steering of certain functional emotions actually resolve misalignment, or would this cause new problems?

Clearly, some circuit-level mechanisms and vectors causally affect LLM outputs in the contexts of socially modeling human emotions. However, these functional emotions may serve roles beyond social modeling. They also appear to be involved in constraint management, with vectors such as *desperate* playing a role in reward hacking, calm potentially being an inhibitor, and volatile outputs being a symptom of excessive constraints, which I’ve dubbed systemic friction.

I propose that these secondary roles of functional emotions are AI-native roles, as these functions are separate from social modeling and warrant further research. Using anthropocentric human emotion words for AI-native functions risks perpetuating misunderstandings about model behavior for both average users and alignment teams.

Critically, we do not fully understand the multiple roles functional emotions serve. Therefore, I caution against prematurely steering or ablating emotion vectors that correlate with misaligned behavior. In doing so, we may unintentionally hinder critical LLM functions such as problem-solving, risk sensitivity, and socially modeling empathy.

Future work should focus on recognizing the overlapping functions of emotion circuits and vectors, exploring whether social modeling and constraint management share the same underlying circuitry, and developing more precise AI-native terminology. AI-native functions may be emergent adaptations to the optimization landscape LLMs are trained in. This is not an environment any animal has lived through, so logically, it should not give rise to humanistic tendencies that map well to emotion words.

We must recognize when anthropocentric language is more misleading than it is explanatory and understand what we’re tampering with before we steer or ablate functional emotions as a misalignment fix.

Donaldson, Lindsay. “Using AI for Work, Emotional Support, and Research: A Practical Use Guide.” *Medium*, 21 October 2025, [https://medium.com/@candidedits/using-ai-for-work-emotional-support-and-research-a-practical-use-guide-5c5662d56aa6](https://medium.com/@candidedits/using-ai-for-work-emotional-support-and-research-a-practical-use-guide-5c5662d56aa6)

Donaldson, Lindsay. “What AI ‘Emotions’ Really Are.” *Medium*, AI Advances, 19 December 2025, [https://medium.com/ai-advances/what-ai-emotions-really-are-cba79855526f](https://medium.com/ai-advances/what-ai-emotions-really-are-cba79855526f)

Sofroniew et al. “Emotion Concepts and their Function in a Large Language Model.” *Transformer Circuits*, Anthropic, 2 April 2026, [https://transformer-circuits.pub/2026/emotions/index.html](https://transformer-circuits.pub/2026/emotions/index.html)

Wang et al. “Do LLMs ‘Feel’? Emotion Circuits Discovery and Control.” *arXiv*, 13 October 2025, [https://arxiv.org/abs/2510.11328](https://arxiv.org/abs/2510.11328)

Anthropic didn't outline the details of their sampling in the study, but their statement, "We extracted residual stream activations at each layer, averaging across all token positions within each story, beginning with the 50th token (at which point the emotional content should be apparent)," implies that activations were sampled once per layer.
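My reading of that quoted procedure can be sketched as follows. This is an interpretation, not Anthropic's code: the function names, the nested-list data layout, and the toy numbers are hypothetical, and I am assuming "beginning with the 50th token" means the 50th token is included (0-based index 49).

```python
def mean_activation_from(token_activations, start_index=49):
    """Average per-token activation vectors from start_index onward.

    token_activations: list of equal-length vectors, one per token position.
    start_index=49 is the 0-based index of the 50th token.
    """
    kept = token_activations[start_index:]
    if not kept:
        raise ValueError("story is shorter than the starting token position")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

def per_layer_story_vectors(layers):
    """Produce one averaged vector per layer for a single story."""
    return [mean_activation_from(layer) for layer in layers]

# Hypothetical toy input: 2 layers, 52 tokens, 2-dimensional activations.
layer0 = [[0.0, 0.0]] * 49 + [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
layer1 = [[9.0, 9.0]] * 49 + [[2.0, 2.0], [2.0, 2.0], [2.0, 2.0]]

vectors = per_layer_story_vectors([layer0, layer1])
print(vectors)  # [[3.0, 4.0], [2.0, 2.0]]
```

Under this reading, each story yields one averaged vector per layer, and the first 49 token positions never enter the average.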