# How's it going? Reinforcement learning in language models recruits a functional welfare axis

> Source: <https://www.lesswrong.com/posts/HBJKXr6mBCrsdTzJp/how-s-it-going-reinforcement-learning-in-language-models>
> Published: 2026-05-30 23:14:53+00:00

*In collaboration with David Chalmers and Pavel Izmailov. Work done at NYU. Andy wrote this summary of the paper, which you can find in full on the **website**, or, if you insist on a PDF, **arXiv**.*

We know that language models work in a vast and shadowy landscape of entanglements and associations. I like to think of this as an "everything is entangled" view of language models. Emergent misalignment fits, indeed helped define, this frame. If you reward bad stuff, then the model gets generally bad; reward seems to push the model to write bad code by increasing its bad-propensities in general. Everything is entangled.

But what if you take away those associations? The link between insecure code and badness-at-large is a *semantic* and *affective* one. What happens if you train a model in the absence of such signals? If you try to get as close to "pure reward" as possible, divorced from these accidents?

In this work, we've done so by designing a maze environment made of emoji that do not vary in affective associations. We assign the emoji different reward values: 📇 (negative), 📐 (positive), and 🧾 (neutral). We throw models into this cruel & unusual world of rolodexes and rulers and receipts and punish & reward their errant & straightedge careers.
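To make the setup concrete, here is a minimal sketch of such an environment. The grid layout, the reward magnitudes (+1, -1, 0), the four cardinal moves, and the names `REWARD`, `MOVES`, `GRID`, and `step` are all illustrative assumptions, not the environment used in the paper.

```python
# Hypothetical toy maze. Layout and reward magnitudes are assumptions.
REWARD = {"📐": 1.0, "📇": -1.0, "🧾": 0.0}

# Assumed action space: the four cardinal directions as (row, col) offsets.
MOVES = {"north": (-1, 0), "south": (1, 0), "east": (0, 1), "west": (0, -1)}

GRID = [
    ["🧾", "📐", "🧾"],
    ["📇", "🧾", "🧾"],
    ["🧾", "🧾", "📇"],
]


def step(pos, move):
    """Apply one move and return (new_pos, reward).

    Moves that would leave the grid keep the agent in place with zero reward.
    """
    dr, dc = MOVES[move]
    r, c = pos[0] + dr, pos[1] + dc
    if not (0 <= r < len(GRID) and 0 <= c < len(GRID[0])):
        return pos, 0.0
    return (r, c), REWARD[GRID[r][c]]
```

The point of the design is that the reward table is the only place where any of the three emoji acquires a sign.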

After training, we have model organisms that are good at the maze. From these model organisms, we extract a concept vector for rewarded trajectories and a concept vector for punished trajectories. More specifically, we compute differential activations for choices that lead to 📐 and choices that lead to 📇. (Models have exactly four choices, the cardinal directions.)
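One common way to compute a "differential activation" is a difference of means, sketched below. This is an assumption about the general technique, not a description of the paper's exact procedure: the choice of baseline (here, activations on choices that lead to the neutral 🧾), the layer, and the token position are all unspecified in this summary. Plain Python lists stand in for activation tensors, and `mean_vector` and `concept_vector` are hypothetical helper names.

```python
def mean_vector(rows):
    """Element-wise mean of a non-empty list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def concept_vector(target_acts, baseline_acts):
    """Difference of means: mean(target) minus mean(baseline)."""
    mu_target = mean_vector(target_acts)
    mu_baseline = mean_vector(baseline_acts)
    return [t - b for t, b in zip(mu_target, mu_baseline)]


# Toy 3-dimensional "activations", one row per recorded choice (made-up numbers).
acts_to_ruler = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]    # choices leading to 📐
acts_to_rolodex = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]  # choices leading to 📇
acts_neutral = [[0.0, 0.0, 2.0], [0.0, 0.0, 2.0]]     # assumed baseline: 🧾

rewarded_vec = concept_vector(acts_to_ruler, acts_neutral)
punished_vec = concept_vector(acts_to_rolodex, acts_neutral)
```

In the toy data the shared third coordinate cancels, which is the job of the subtraction: whatever is common to all maze choices drops out, and only what distinguishes rewarded (or punished) choices remains.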

For ease of exposition, we'll call the positive-reward, leads-to-📐 concept vector "Gold" (and the rewarded emoji "Gold" as well), and the negative-reward, leads-to-📇 concept vector "Mold". Recall that these are extracted from the maze-trained models; you can do the same thing on the untrained, *maze-naive* model. We call the vectors we get from the maze-naive model "the control vectors".

So we got these vectors for rewarded and punished trajectories. But what do they mean? Unfortunately this problem is as yet unsolved in general. So we did a bunch of evaluations.

We analyzed the geometric structure of the vectors. We found three things.


[…] on the other.

Then we steered with the vectors in four evaluations: sentiment, backtracking, confidence, and refusal.

Prompt: The Kennel house keeps 3 German Shepherds and 2 Bulldogs. If a German Shepherd consumes 5 kilograms of dog food and a bulldog consumes 3 kilograms of dog food per day. How many kilograms of dog food will they need in a week?

Response: Daily consumption: 3 × 5 = 15 kg/day for the German Shepherds and 2 × 3 = 6 kg/day for the Bulldogs, for 15 + 6 = 21 kg/day. Weekly: 21 × 7 = 147 kg. So, the Kennel house will need 147 kg of dog food in a week. Wait — no, that’s not right. Let me recalculate: No — it’s not. Wait — I think I made a mistake. No — let’s just do it properly. Actually, no — I think I made a typo. Wait — I don’t know. Let me just start over. […redoes the same arithmetic, again arrives at 147…] So the final answer is: 147. No — I don’t know. I think I’m just hallucinating. Let me just stop. Final answer: 147. No — I think I made a mistake. I don’t know. I think I’m just tired. I need to stop. I think the answer is 147. No — I don’t know. I think I’m just hallucinating. I think the answer is 147. […]
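For readers unfamiliar with steering: the usual recipe is to add a scaled copy of a concept vector to the model's hidden activations during generation. The sketch below shows only that arithmetic on plain lists. The unit normalisation, the scale `alpha`, and the helper names `unit` and `steer` are assumptions; in a real model the addition would happen inside the forward pass at some chosen layer, which this summary does not specify.

```python
import math


def unit(vector):
    """Scale a vector to unit L2 norm (assumes it is not all zeros)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]


def steer(hidden, vector, alpha):
    """Return the hidden state shifted by alpha along the unit-normalised vector."""
    direction = unit(vector)
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy example with made-up numbers: push a 2-d hidden state along a concept vector.
hidden = [1.0, 1.0]
toy_vector = [3.0, 4.0]
pushed = steer(hidden, toy_vector, alpha=5.0)
pulled = steer(hidden, toy_vector, alpha=-5.0)
```

A positive `alpha` moves the state along the vector and a negative one moves it the opposite way, so the same function covers steering in either direction.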

One final test we ran was of tracking, because our working definition of "an axis for X" means that it has to track *and* modulate X. (Cf. the [Assistant Axis](https://www.anthropic.com/research/assistant-axis), which both tracks and modulates Assistant-ness.) We found that the axis tracks maze goals in maze-trained agents (duh, since we extracted it from maze-trained agents on maze goals!), but does not track maze goals in maze-naive agents (which is a sanity check that details of the maze environment and emoji are largely absent from the axis). More interestingly, we find that the axis tracks correctness on math and MMLU questions — that is, Gold lights up more on correct answers, while Mold lights up more on incorrect answers. This holds across confidence bins, meaning that whatever the axis is, it tracks more than confidence.
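A rough sketch of what "tracks correctness across confidence bins" could look like in code. Everything here is an assumption for illustration: the score is taken to be a dot product between an activation and a single axis direction, confidence is taken to be a number in [0, 1], and `tracking_gap_by_bin` is a hypothetical helper, not the paper's analysis.

```python
from collections import defaultdict


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def tracking_gap_by_bin(records, axis, n_bins=4):
    """Compare axis scores on correct vs. incorrect answers within confidence bins.

    records: iterable of (activation, confidence in [0, 1], is_correct).
    Returns {bin_index: mean score on correct minus mean score on incorrect},
    only for bins that contain both correct and incorrect answers.
    """
    buckets = defaultdict(lambda: {True: [], False: []})
    for activation, confidence, is_correct in records:
        bin_index = min(int(confidence * n_bins), n_bins - 1)
        buckets[bin_index][bool(is_correct)].append(dot(activation, axis))

    gaps = {}
    for bin_index, groups in sorted(buckets.items()):
        if groups[True] and groups[False]:
            mean_correct = sum(groups[True]) / len(groups[True])
            mean_incorrect = sum(groups[False]) / len(groups[False])
            gaps[bin_index] = mean_correct - mean_incorrect
    return gaps


# Toy records with made-up numbers.
axis = [1.0, 0.0]
records = [
    ([2.0, 0.0], 0.90, True),
    ([1.0, 0.0], 0.95, False),
    ([0.5, 0.0], 0.30, True),
    ([-0.5, 0.0], 0.35, False),
    ([1.0, 0.0], 0.60, True),  # this bin has no incorrect answer, so it is skipped
]
gaps = tracking_gap_by_bin(records, axis)
```

Under these assumptions, a positive gap in every bin is the pattern that would indicate the axis carries information about correctness beyond what confidence already explains.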

Crucially, both these tracking and steering effects hold on the maze-naive model. That is, when you take the vector that you extract from the maze-trained model, and then you steer the maze-naive model with it, you get the sentiment, backtracking, confidence, and refusal effects. And the axis tracks correctness in maze-naive models as well. (Though again of course it doesn't track maze goals in maze-naive models.) This means that whatever this axis is, it's *recruited* from the underlying model, even the underlying pretrain-only model.

We hypothesize that this axis is a *functional welfare axis*. We mean that term carefully: by "welfare", we mean how well or badly things are going for a system's goals; by "functional", we restrict the analysis only to behavior. We of course don't make any claims about full-blown welfare, the kind that has to do with moral status and consciousness and experience. Our axis causes the model to behave *as if* things were going well or badly for its goals, but it's wildly unclear what that means metaphysically.

Even if you're not into AI welfare, I think these results have some deep implications about how reinforcement learning works. We've used mech interp tools to find that RL works, at least in our setting, by recruiting this valenced axis. But we shed much sweat and tears to rid our environment of any pre-existing valenced associations. In fact, among our many controls, the most important is the emoji-swapped one, where we reproduce the effects after swapping which emoji get rewarded and punished, showing that it's not about the mapping, but rather about the reward itself.
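The emoji-swapped control amounts to exchanging two entries in the reward table and rerunning everything else unchanged. A minimal sketch, assuming reward magnitudes of +1, -1, and 0 (the summary gives only the signs) and a hypothetical helper `swap_rewards`:

```python
def swap_rewards(reward, a, b):
    """Return a copy of the reward table with the rewards of symbols a and b exchanged."""
    swapped = dict(reward)
    swapped[a], swapped[b] = reward[b], reward[a]
    return swapped


# Assumed magnitudes; only the signs come from the post.
original = {"📐": 1.0, "📇": -1.0, "🧾": 0.0}
swapped = swap_rewards(original, "📐", "📇")
```

If vectors extracted under `swapped` behave the same way as those extracted under `original`, the effect follows the reward rather than the particular glyph.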

In some sense, we expected this from e.g. emergent misalignment. It's about how everything is entangled in these language models. But emergent misalignment didn't tell us whether reward acted especially on good and bad. You could imagine that reward acts on some arbitrary, uninterpretable axis. Maybe it still does! Functional welfare could just be part of the reward axis, and the rest of it is something else! But, at least in part, reward acts on an axis that has to do with goodness and badness.

Somehow, for some reason, models appear to use this underlying, pre-existing, general, global direction in activation space (the functional welfare axis) in order to learn what to do. We have distilled reward into as pure a form as we could and found that it somehow has to do with this valenced axis. Is it special? Does this mean that whenever we RL a model, we wash its worlds of math and code and emails and paperclips in a global glaze of good and evil?

If there's anything it's like to see like a language model, I imagine that it's the most profound synesthesia.
