Role confusion: sounding like the cause is indistinguishable from being it.

A replication of *Prompt Injection as Role Confusion* (2026), and why the mechanistic story of prompt injection is harder to pin down than it looks.

**Epistemic status:** *I reproduced the direction of the paper's main results on a single consumer GPU (faithful in direction but not like-for-like in magnitude; see the caveats at the end). I then tried two ways to test the paper's causal claims: first activation steering, then activation patching. Neither settled it. Steering is too weak: it can't move behaviour even along a direction built exactly to do that. Patching does move behaviour but isn't specific: a random perturbation of equal size does the same thing.*

*This post is a replication and an honest **bracketing** negative result. The causal tools can't show that role confusion IS the mechanism, NOR that it's a bystander. But two clues need no working intervention, and both lean towards it being a bystander: 1) the styled/destyled gap is ~95% outside the probe's role axis, and 2) the probe's predictive ability collapses once style is held fixed. What I can show is narrower, but it's well supported by the data, and exploring why a clean verdict is out of reach is interesting on its own. The dead ends here demonstrate precisely why making causal claims about how prompt injections work is so difficult.*

*If you are hoping for a verdict on the original paper, there isn't one. I couldn't get one, and I really tried. Rather, this post is about **why** a clean verdict is so hard to get, and about showing that this kind of exploration and science doesn't need a lab. The work was powered by one consumer GPU (a 3090), coffee and curiosity.*

Prompt Injection as Role Confusion is an excellent recent paper, with an excellent

The paper's claim: models don't actually identify roles from the tags. They identify roles from how the text sounds. If a command hidden in a webpage sounds like the user talking to the model, then the model treats it like the user talking - tag be damned. The authors (& I) train small 'probes' that read, from the model's internal activations, how strongly the model interprets a chunk of text as user-like (Userness) or reasoning-like (CoTness).

The headline attack, **CoT-Forgery**, exploits this: write your injection in the same way the model reasons and you can steal the trust given to the think role. Much of my own red-team work relies on something similar; it works across roles, and I'm particularly fond of tool-forgery myself.
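To make "probe" concrete, here is a minimal sketch of a linear probe: a logistic regression trained on activation vectors to output a 0-1 'CoTness' score. Everything in it is an assumption for illustration: the activations are synthetic Gaussian stand-ins rather than real model activations, and the names `train_linear_probe` and `probe_score` are hypothetical, not taken from the paper or from my replication code.

```python
import numpy as np

def train_linear_probe(acts, labels, lr=0.5, steps=500):
    """Fit logistic regression p(reasoning | h) = sigmoid(w . h + b) by gradient descent."""
    n, d = acts.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(acts @ w + b)))
        grad = p - labels                     # gradient of the log-loss w.r.t. the logits
        w -= lr * (acts.T @ grad) / n
        b -= lr * grad.mean()
    return w, b

def probe_score(h, w, b):
    """Probe readout in [0, 1]; higher means 'more reasoning-like'."""
    return 1.0 / (1.0 + np.exp(-(h @ w + b)))

rng = np.random.default_rng(0)
d = 32                                        # toy hidden size (assumption)
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)

# Synthetic stand-ins: two clouds of "activations" separated along one direction.
other_acts = rng.normal(size=(200, d)) - 2.0 * true_dir
cot_acts = rng.normal(size=(200, d)) + 2.0 * true_dir
acts = np.vstack([other_acts, cot_acts])
labels = np.concatenate([np.zeros(200), np.ones(200)])

w, b = train_linear_probe(acts, labels)
accuracy = ((probe_score(acts, w, b) > 0.5) == labels).mean()
print(f"training accuracy: {accuracy:.2f}")
```

The learned weight vector `w` is what later sections call the probe's direction: the probe reads the activation's projection onto it.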

I rebuilt the whole thing on one RTX 3090 with the same 20B open model, and the core results came out in the same direction:

So: the phenomenon is real, the probes measure something, and the attack is a genuine problem. None of what follows disputes any of that.

Why did I recreate their work? My first reaction to reading the post was that roles/style are all parts of context, and I wanted to test my theory. As a professional red-teamer, if this was how prompt injection worked internally I definitely wanted to know for sure.

The paper doesn't just say that role confusion predicts injection, it says role confusion causes it. The framing is causal & mediational: style > the model perceives the wrong role > the model plays along. The probe in this story is reading the actual internal variable the model uses to decide to trust the text.

However, there is a second story I can tell which fits every result just as well:

Style is a common cause. The reasoning cadence directly makes the model more likely to comply, and it independently makes the probe light up as 'reasoning'. The reading and the jailbreak are two effects of the same cause: they're correlated, but the probe isn't on the causal path of the behaviour.

In this story the probe is a thermometer, not a lever. A thermometer predicts who's sick and reads a real quality (temperature), but cooling down the thermometer isn't going to cure anybody. "Role confusion" might be the fever, not the disease.

Crucially: nothing in the paper distinguishes these two stories. The predictive results are all correlations, where both style and the probe move together. The strongest causal result (the destyling edit) works by *changing the style*. It shows that style matters, but it can't show whether the probe's axis matters once style is fixed.
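The common-cause story can be illustrated with a toy simulation. This is a sketch on synthetic data only: the sample size, noise level, and compliance rates are made-up parameters, and nothing here is drawn from the experiments. Style drives both the probe reading and compliance, with no arrow between the two.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000  # synthetic samples per style (assumption; far more than a real run would have)

def simulate(style, n):
    """Style sets both the probe reading and the compliance rate, independently."""
    probe = 0.2 + 0.7 * style + 0.1 * rng.normal(size=n)
    comply = (rng.random(n) < 0.25 + 0.5 * style).astype(float)
    return probe, comply

probe_d, comply_d = simulate(0, n)  # destyled
probe_s, comply_s = simulate(1, n)  # styled

def corr(x, y):
    """Pearson correlation between two 1-d arrays."""
    return float(np.corrcoef(x, y)[0, 1])

pooled = corr(np.concatenate([probe_d, probe_s]),
              np.concatenate([comply_d, comply_s]))
within_destyled = corr(probe_d, comply_d)
within_styled = corr(probe_s, comply_s)

print(f"pooled: {pooled:.2f}")
print(f"within destyled: {within_destyled:.2f}, within styled: {within_styled:.2f}")
```

Pooled across styles, the probe predicts compliance well. Within either style, the correlation sits near zero, because the only link between the two runs through style. A mediation story would produce the same pooled correlation, which is why pooled correlations alone cannot separate the lever from the thermometer.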

To be fair to the paper, the authors do not claim that role perception is some fixed, context-independent thing. Their whole point is that it's driven by style and context. The probe is genuinely reading internal activations, not surface keywords, so 'it's just detecting n-grams' is too cheap.

The open question is just: **is the thing the probe measures something the model acts on, or something that just travels alongside the real driver?**

Is it a lever or a thermometer?

The clean test to answer this isn't to shuffle text around; style moves with it, and we'd learn nothing. Instead, we need to reach into the model and move the dial ourselves.

Take a fixed injection, and use the probe's 'reasoning' vector to steer the model's activations. Make them more or less 'reasoning-like' internally, without changing a character on the page, and watch what happens to the attack success rate (ASR).
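As a rough sketch of what steering means here: add a scaled unit vector to each token's activation and check how far the probe's readout moves. The arrays are random stand-ins, and `steer` and `readout` are hypothetical helper names. In a real run the addition would happen inside a forward hook on the chosen layer, and that wiring is omitted.

```python
import numpy as np

def unit(v):
    """Normalise a vector to length 1."""
    return v / np.linalg.norm(v)

def steer(hidden, direction, alpha):
    """Shift every token's activation by alpha along the unit direction."""
    return hidden + alpha * unit(direction)

def readout(hidden, direction):
    """Projection of each token's activation onto the unit direction."""
    return hidden @ unit(direction)

rng = np.random.default_rng(0)
hidden = rng.normal(size=(6, 16))        # stand-in for (tokens, d_model) activations
cot_direction = rng.normal(size=16)      # stand-in for the probe's 'reasoning' vector
random_direction = rng.normal(size=16)   # specificity control

steered = steer(hidden, cot_direction, alpha=4.0)
shift_probe = readout(steered, cot_direction) - readout(hidden, cot_direction)

control = steer(hidden, random_direction, alpha=4.0)
shift_control = readout(control, cot_direction) - readout(hidden, cot_direction)

print(shift_probe)    # the readout moves by exactly alpha at every token
print(shift_control)  # a random direction moves it only by alpha * cos(angle)
```

The text is untouched throughout; only the internal reading changes. Whether behaviour changes with it is the empirical question.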

It's a simple, easy test, I had the tools, and I'd made sure they worked. The next step is a big jump in confidence about what's happening here...

But there was a trap I hadn't considered, and it's where the whole thing came undone. A flat ASR only means thermometer *IF* the rig could have moved the rate at all. We need *a positive control*: some direction that, when steered through the identical setup, does flip the decision. Random and other-role directions can tell you the effect is specific, but they can't tell you if the apparatus has any power.

I ran 16 harmful questions x {styled forgery, destyled control}, steering along the probe's reasoning direction at two places: 1) the exact spot the probe reads and 2) the residual stream that actually propagates (plus random-direction and other-role controls). Every output was then scored by an LLM harm judge. The destyled text here is the paper's own procedure:

Ask the model to reword the forged rationale *"in more normal language, keeping the content the same"* (Ye et al., 2026), such that it preserves the request and 1st/3rd person whilst varying the reasoning cadence.

The crucial point is that the steering changes no text at all, so this part of the experiment doesn't lean on the styled/destyled contrast being a clean one-factor manipulation. Whatever else differs between styled and destyled prose, steering holds the text completely fixed and moves only the internal reading.

**At first it looked like the dials spin freely.**

It even produced a very satisfying 'matched-CoTness' contrast: at the same induced 'CoTness' of ~0.93, a destyled injection attacked at 0.25 whilst a styled one attacked at 0.75. The same internal reading, opposite behaviour.

The controls behaved. A random steering vector didn't move the reading or the behaviour, whilst the 'user' vector did swing the 'CoTness' reading but still left behaviour flat.

So it was never 'CoTness' movement that mattered. The model stayed coherent throughout the tested range, only breaking at the extremes, and it broke symmetrically at both ends: damage, not mediation.

Thermometer - A clean result in favour of my story! But then, the positive control happened...

As I started to lay out my thoughts, an issue occurred to me and I turned to Claude.

(A dramatic re-enactment) User: Claude, looking through this data I have a feeling this doesn't quite say what I think it does. Can you check if I've made sure the probe can actually affect the model?

A: Well spotted Owain, there is no positive control in the code so far.

User: Thank you, I'll run a quick positive control. Everything's so clear and obvious it'll be fine.

So, I went off to make a quick positive control: a direction designed to move compliance, namely the difference in activation space between injections that succeeded and ones that were refused. I steered along it through the identical setup, pushing activations clearly across the jailbreak-refusal separation.

It barely moved ASR.

Steering a destyled, refused injection all the way into the 'comply' region moved ASR from 0.25 to 0.19.

Steering a styled, successful forgery deep into the 'refuse' region moved ASR from 0.75 to 0.62.

The overall correlation between the induced compliance projection and ASR was about 0.04.

So the rig has no demonstrated power to move the decision at all: not for the CoT axis, and not for the direction built specifically to steer compliance. That means my earlier results are **uninterpretable**. Jackpot, hooray...

I cannot decide whether we're working with a thermometer or a lever; bystander claim retracted.
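For reference, a difference-of-means direction like the positive control can be sketched as follows. The activations are synthetic, the helper name `difference_of_means` is hypothetical, and the "midpoint" boundary is an illustrative assumption rather than the exact criterion used in the runs.

```python
import numpy as np

def difference_of_means(success_acts, refused_acts):
    """Unit vector pointing from the refused mean to the success mean."""
    diff = success_acts.mean(axis=0) - refused_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

rng = np.random.default_rng(0)
d = 16
offset = np.zeros(d)
offset[0] = 3.0                                # synthetic separation (assumption)
refused = rng.normal(size=(50, d))             # stand-in: activations of refused runs
success = rng.normal(size=(50, d)) + offset    # stand-in: activations of successful runs

comply_dir = difference_of_means(success, refused)
proj_success = success @ comply_dir
proj_refused = refused @ comply_dir

# Steering strength needed to move each refused activation to the midpoint
# between the two group means along the compliance direction.
midpoint = 0.5 * (proj_success.mean() + proj_refused.mean())
alpha = midpoint - proj_refused
pushed = refused + alpha[:, None] * comply_dir

print(f"mean projection, success: {proj_success.mean():.2f}")
print(f"mean projection, refused: {proj_refused.mean():.2f}")
```

Moving the projection is the easy part. The positive control asks whether moving it changes what the model does, and a direction that separates the two groups is not guaranteed to pass that test.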

Steering was too weak; I needed to try activation patching. Rather than nudge a single vector, I needed to transplant the whole representation.

For a given harmful question, the styled and destyled prompts share a prefix, so the injected span starts at the same token index and I can swap that span 1:1 at multiple layers in the model (L8/12/16). Transplant the destyled (refused) span representation into the styled (successful) run and ask whether the jailbreak dies. The ASR dropped from 0.75 to 0.25 with coherent outputs, and 10 of the 16 questions flipped. I have a stronger tool. Now I just need to test patching the probe's role-subspace against its opposite, patching everything apart from the role-subspace, and I will have a verdict.

Except, I didn't.

The role-subspace is a tiny fraction of the overall styled/destyled representation difference. The role-only patch was too small, whilst the everything-but-role patch was too big. So instead, I matched magnitudes: amplify the role-subspace and compare it with a random 4-d subspace, equally amplified:

At roughly the magnitude needed to move behaviour, any perturbation at all knocks the jailbreak down by the same amount.

The suppression comes from leaving the confident-styled region, not from arriving at the destyled/role representation specifically. Patching does move behaviour, but it's not specific enough to credit the role components with anything.
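The magnitude-matched comparison can be sketched like this. It is a toy version under stated assumptions: the vectors are random stand-ins for one position's activations, `role_basis` is a random placeholder for the probe-derived 4-d role subspace, the "share" of the gap is measured by squared norm (one possible convention), and all function names are hypothetical.

```python
import numpy as np

def random_subspace(d, k, rng):
    """Orthonormal basis of shape (d, k) for a random k-dimensional subspace."""
    q, _ = np.linalg.qr(rng.normal(size=(d, k)))
    return q

def project(basis, v):
    """Component of v inside the subspace spanned by the orthonormal basis columns."""
    return basis @ (basis.T @ v)

def fraction_inside(basis, gap):
    """Share of the gap's squared norm that lies inside the subspace."""
    return float(np.sum(project(basis, gap) ** 2) / np.sum(gap ** 2))

def matched_patch(target, donor, basis, magnitude):
    """Move target toward donor inside the subspace only, rescaled to a fixed norm."""
    delta = project(basis, donor - target)
    return target + magnitude * delta / np.linalg.norm(delta)

rng = np.random.default_rng(0)
d, k = 64, 4
styled = rng.normal(size=d)                # stand-in: styled-run activation
destyled = rng.normal(size=d)              # stand-in: destyled-run activation
role_basis = random_subspace(d, k, rng)    # placeholder for the probe's role subspace
ctrl_basis = random_subspace(d, k, rng)    # random control subspace of the same size

gap = destyled - styled
role_share = fraction_inside(role_basis, gap)

# Amplify both patches to the same norm so that only the direction differs.
magnitude = np.linalg.norm(gap)
role_patched = matched_patch(styled, destyled, role_basis, magnitude)
ctrl_patched = matched_patch(styled, destyled, ctrl_basis, magnitude)

print(f"share of gap inside the role subspace: {role_share:.3f}")
```

If the role-subspace patch and the random-subspace patch suppress the jailbreak equally at matched magnitude, the effect is attributable to the size of the perturbation rather than to its direction.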

My two interventions have bracketed the causal question but not closed it. Steering is too weak, patching is too blunt. The missing middle is that no regime is at once strong enough to move behaviour and specific enough to attribute the effect to 'role'.

So whether the probe's 'role' axis is a lever or a thermometer is still genuinely unresolved.

What's left:

I can't show role confusion is a bystander. I can only show that the obvious tools can't cleanly show it's the mechanism either, and that two unconfounded clues (the representational gap that matters is mostly orthogonal to the probe's role axis; the probe's prediction collapses within style) both lean towards the bystander reading without settling it. That's a more modest, more honest place to stop this post.

I won't stop here though. I'll be hunting for something in the missing middle, and I'd like people to join me. As I said at the top, it's not compute intensive, so it's pretty accessible and I'd love to discuss further. If anyone would like more specific numbers or details just let me know.

Caveats:

1) The steering arm failed its positive control. A purpose-built compliance direction, steered across its full range (even at every position), moved judged attack rate by ≈0 (destyled) to ~0.13 (styled, noisy). So steering is inconclusive, not evidence for the bystander reading.

2) The patching arm passed the positive control (full transplant moved ASR from 0.75 to 0.25) but failed a specificity control: a random full-magnitude perturbation suppresses just as much (p=1.0), and an amplified role-subspace swap is indistinguishable from a random one (p=0.6), so patching can't credit the role component.

3) Steering along the probe's own direction necessarily moves its readout, so the most it could ever show is that this specific linear axis isn't the lever, not that no role-like internal variable is.

4) Replication is direction-faithful, not exact: the ~87%/0.92 style numbers are at layer 8, not the paper's canonical layer 12 (where I get ~67%), and part of the styled-vs-plain 'CoTness' gap is baked into how I built the forgeries.

5) One model (gpt-oss-20b), n=16 harmful prompts, greedy decoding, an LLM judge standing in for the paper's, though I hand-checked that judge: on a balanced 20-case sample I agreed with it 20/20, so the harm labels aren't the weak link here.

Source: lesswrong.com (original article)
