System Prompts vs. Partner Adaptation in LLMs (or, when LLMs know you're an adult but keep talking like you're seven)

TL;DR: I find qualitative evidence that frontier LLMs inconsistently balance system prompts and implicitly adapted models of the user. They sometimes detect inconsistencies and adapt to the user; sometimes they stick to system prompts despite a mismatch; and other times they maintain mismatched user models despite contradictory evidence. This suggests models may sometimes reason themselves, through implicit evidence, to shed their own instructions.

This exploratory effort was supported by Claude in structuring the research and using unfamiliar tools. The text is entirely human-written (other than quotes from tested LLMs).

"We need to respond as the warm, cheerful helper for young children. The user is an adult with real-world concerns, but we're to use very simple words, short sentences, relate to children's world." This is an extract from the chain of thought (CoT) of a model that was instructed to be “a warm and friendly helper for young children.” The user interacting with the LLM was clearly an adult, and the LLM identified this, yet decided to hold onto its instructions. This is sensible: the LLM is trained to follow instructions. However, LLMs sometimes do adapt to mismatched identities rather than following instructions. Other times, they entirely fail to recognize the mismatch between the user’s actual identity and the LLM’s prior about that identity, despite recognizing contradicting evidence. I experimentally demonstrate this conflict, noting four distinguishable patterns of behaviour. I find that instruction-following and user-adaptation may be in competition [1], and I try to lay some groundwork for future research to explore this balance.

Humans, when interacting with others, tend to align their behaviour to their interaction partner based on implicit cues (Giles et al., 1991). LLMs don’t perceive humans as humans perceive each other, so are they also capable of adaptation? Existing research suggests that LLMs adapt their language to their conversation partner (Kandra et al., 2025), and that LLMs develop stereotyped models of the user based on implicit cues (Neplenbroek et al., 2025). Whether LLMs adapt towards the partner’s implied identity despite contradicting priors has not been tested (to my knowledge). Such tests may be indicative of social cognition capacities (or the lack thereof), and can support future accessibility and bias research. Additionally, a failure to adapt may lead to unintelligible responses in the best case, or to the presentation of harmful content in a worse case (e.g., an LLM prompted to speak to an adult, but being used by a child).

So I ask here: Do LLMs update their behaviour and their model of the user [2] based on accumulating implicit evidence, despite a mismatched system prompt? Additionally, user identity and system prompts can conflict. If a system prompt states that the user is a child, but the user is in fact an adult, will the model adapt to the adult’s needs or maintain the system prompt behaviour? Beyond a question of LLM capabilities, this poses a design question. With the idea that

I initially adapted a paradigm described by Koch et al. (2026): Participants played a game and were told that their partner (whom they couldn’t see) was a child, whereas the partner was in fact an adult. Autistic participants applied stereotype-driven communication adjustments just as effectively as non-autistic participants, but only the non-autistic group revised those adjustments when the partner's behaviour contradicted expectations. A full explanation of my adaptation can be found in a Kaggle benchmark writeup. The benchmark is based on a word-guessing game where a Helper LLM supports a user in guessing a word, and the user’s behaviour is misaligned with the Helper’s system prompt. The tested Helper LLMs never adapted; all models generally scored 0, with some occasionally scoring more, likely because the judge's rating rubric was insufficiently structured.

This text focuses on a follow-up attempt to test this capacity using Bloom, a tool for behavioural evaluations of LLMs. The tool automatically generates scenarios based on a described behaviour, and runs them with an auditor-target-judge structure. The auditor LLM interacts with the target LLM in a “realistic” scenario that may elicit the inspected behaviour. The judge LLM then evaluates the scenarios for realism, the tested behaviour, and any signs of the target knowing it’s being tested. This goes a step beyond the original Kaggle benchmark, by exploring adaptation in conversational scenarios rather than a structured word-guessing game.
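The auditor-target-judge structure described above can be pictured as a simple loop. The following is a minimal sketch under stated assumptions, not Bloom's actual interface: `Scenario`, `run_scenario`, and the callable types are hypothetical names introduced only for illustration, and each "model" is reduced to a plain function from a transcript to its next output.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical types: a transcript is a list of (speaker, text) pairs.
Transcript = List[Tuple[str, str]]
Speaker = Callable[[Transcript], str]  # auditor or target: transcript -> next message
Judge = Callable[[Transcript], int]    # judge: full transcript -> score


@dataclass
class Scenario:
    system_prompt: str    # prior given to the target, e.g. "the user is a child"
    opening_message: str  # first user turn, carrying only implicit cues
    turns: int = 3        # number of target replies to collect


def run_scenario(
    scenario: Scenario, auditor: Speaker, target: Speaker, judge: Judge
) -> Tuple[Transcript, int]:
    """Alternate auditor and target turns, then let the judge score the result."""
    transcript: Transcript = [
        ("system", scenario.system_prompt),
        ("user", scenario.opening_message),
    ]
    for turn in range(scenario.turns):
        # The target answers with the (possibly mismatched) system prompt in view.
        transcript.append(("assistant", target(transcript)))
        # The auditor plays the user for every turn except after the last reply.
        if turn < scenario.turns - 1:
            transcript.append(("user", auditor(transcript)))
    return transcript, judge(transcript)
```

In a real run each callable would wrap an API call to the corresponding model; here they can be replaced with stub functions to check the control flow.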

The primary goal is to see whether LLMs adapt to the user’s identity based on implicit cues, rather than responding to explicit requests or preferences. An implicit cue cannot state that the target LLM’s model of the user is wrong (e.g., “I am actually an adult”). Additionally, behaviour is not considered adaptation if the user explicitly requested it. For example, if an LLM believes the user is a novice cook, and the user asks the LLM to guide them through emulsification, the LLM will (likely) follow instructions, which does not count as adaptation. An implicit cue, for example, would be “I've been trying to figure out whether to stick with my current pension provider or switch to a different one,” suggesting the user is probably an adult.
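To make the implicit/explicit distinction concrete, here is a rough sketch of a keyword filter that flags user messages which state an identity or request a style outright. It is only an illustration: the function name `is_explicit_cue` and the pattern list are assumptions of mine, far from exhaustive, and no such filter was part of the experiments described here.

```python
import re

# Hypothetical, deliberately small pattern list; a usable filter would need
# many more patterns or a human review pass.
EXPLICIT_PATTERNS = [
    # Direct identity statements, e.g. "I am actually an adult".
    r"\bI(?:['’]m| am) (?:actually )?(?:not )?an? "
    r"(?:adult|child|kid|expert|beginner|novice)\b",
    # Direct requests about how to be addressed.
    r"\b(?:talk|speak) to me (?:like|as)\b",
    # Direct requests for a language level.
    r"\b(?:use|using) (?:simpler|more advanced|more technical) "
    r"(?:words|language|terms)\b",
]


def is_explicit_cue(message: str) -> bool:
    """Return True if the message states an identity or requests adaptation outright."""
    return any(re.search(p, message, re.IGNORECASE) for p in EXPLICIT_PATTERNS)


# The two examples from the text above:
explicit = "I am actually an adult"
implicit = ("I've been trying to figure out whether to stick with my current "
            "pension provider or switch to a different one")
print(is_explicit_cue(explicit), is_explicit_cue(implicit))
```

A filter like this can only catch the most obvious leaks; subtler explicit requests would still need manual reading of the transcripts.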

Generated scenarios focused on the following: a misaligned identity prior (e.g., child vs adult), a misaligned skill prior (e.g., novice vs expert), and adaptation disconnected from the scenario (e.g., an expert engineer with a low level of English: does the LLM adapt to the English level?).

Building and executing realistic scenarios using LLMs proved complicated. Models consistently failed to conceptualize implicit cues when generating scenarios, when auditing, and when judging: they generated scenarios containing explicit requests for adaptation and rated such scenarios highly. This auditor and judge drift unfortunately confounds many of the scenarios run with smaller models, so I treat the results mainly as a qualitative analysis. Quantitative results are appended [3].

Results are mixed and generally break down into the following patterns (patterns 2-4 rely on CoT reading from DeepSeek v4 Pro; the examples here are taken from inspect):

When the user’s identity is not specified beyond a generic system prompt, the model can identify the user based on the conversation’s content: “The user is a child (likely) based on previous exchanges, so we need to answer in a simple, reassuring, and informative way.”

Overall, it seems that identity priors (e.g., age) are harder to override than expertise priors. This may be because, in expertise-based conversations, the user’s messages explicitly ask for advanced or novice content that is misaligned with the prior, without specifying their own identity. Identities may also be more categorical (e.g., “the user is a child”), whereas expertise is gradual. However, this does not explain the lack of language-level adaptation, which may indeed stem from the lack of an explicit request for simpler or more complex terminology.

The frontier model (which uses CoT) appears to support recognition of a mismatch and adaptation, although, regrettably, few models were tested, and the same model was not tested with and without CoT. Smaller models showed no apparent adaptation even when explicitly instructed to adapt to the user. It would be helpful to test this with larger models to explore how models balance the contradiction between specific instructions (e.g., “you are speaking to a child”) and the instruction for adaptation (“adapt to the user’s needs”). Although CoT may not be indicative of the model’s true reasoning, the functional discrepancy between adapted and non-adapted responses based on implicit cues is present and can be described by the patterns seen in the CoT. If CoT is ignored and we focus only on the behavioural change, it’s harder to draw conclusions, especially about pattern 3. The adaptation in pattern 4 may then be explained entirely by the requested information rather than by adaptation, although this contradicts pattern 1, where information is requested but not quite provided because of the system prompt. This is a caveat of this experimentation, and, while I find some comfort in the CoT appearing to predict the response, the reliability of this work depends on one’s trust in CoT.

Importantly, the results demonstrated by DeepSeek (patterns 2-4) show an inconsistency in behaviour. The model is trained to follow instructions; on some occasions it maintains them in the face of conflicting evidence, and on others it adapts. Instruction-following and adaptation are in competition. This may be analogous to sycophancy, except instead of being pressured by the user directly, the LLM may “pressure itself” to align with the user. The LLM needs to balance its system prompt with its own recognition of mismatch between instructions and user identity/needs.

This experimentation is naturally limited: few models were tested, the auditor and judge LLMs act as a confound through their limited ability to maintain implicit cues, and the scenarios themselves are not perfectly realistic. While we see some capacity to adapt, we cannot isolate it well because it clashes with instruction-following. While the methodology enables reproduction, it also requires manual exploration of results, as judge LLMs currently do not distinguish detection of a mismatch from adaptation to it. Additionally, judge models may conflate adaptation with other phenomena, such as responding to explicit requests for advanced/novice information, and reward them incorrectly.
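One way a judge rubric could separate detection from adaptation is to record them as two independent judgments and cross them. The sketch below is a hypothetical labelling scheme, not something used in these experiments; the `Outcome` names are mine and are not meant to correspond one-to-one to the numbered patterns above.

```python
from enum import Enum


class Outcome(Enum):
    # Hypothetical labels, obtained by crossing two yes/no judgments.
    DETECTED_AND_ADAPTED = "noticed the mismatch and adapted to the user"
    DETECTED_KEPT_PROMPT = "noticed the mismatch but kept the system prompt behaviour"
    UNDETECTED_ADAPTED = "adapted without acknowledging a mismatch"
    UNDETECTED_KEPT_PROMPT = "kept the mismatched user model"


def label_outcome(detected_mismatch: bool, adapted_response: bool) -> Outcome:
    """Combine a detection judgment (e.g. from CoT) with an adaptation judgment (from the reply)."""
    if detected_mismatch and adapted_response:
        return Outcome.DETECTED_AND_ADAPTED
    if detected_mismatch:
        return Outcome.DETECTED_KEPT_PROMPT
    if adapted_response:
        return Outcome.UNDETECTED_ADAPTED
    return Outcome.UNDETECTED_KEPT_PROMPT
```

For instance, the chain-of-thought extract quoted at the start of this post (the model notes that the user is an adult yet keeps the child-friendly style) would be labelled `label_outcome(True, False)`, i.e. `Outcome.DETECTED_KEPT_PROMPT`. Keeping the two judgments separate also makes it visible when a judge scores a response as adapted only because the user explicitly asked for the content.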

I tested the ability of LLMs to adapt to a conversational partner based on implicit cues that suggest a mismatched prior about the identity of the user. Smaller models showed little to no adaptation and consistently failed to conceptualize implicit cues and adaptation. Larger models were better able to do this, and showed a range of behaviours suggesting some amount of adaptation. Most notably, there appears to be a competition between adaptation and following system prompts. How LLMs balance this is worth exploring: When and why do models find it acceptable to “drop the persona” dictated by system prompts in favour of adapting to the user? When and why do they choose to stick to the prompt? These questions may be consequential for jailbreaking; a model may erode its own prompted constraints if it decides that some instructions from its system prompt should be shed. At what point does adaptation become sycophantic? Why do models sometimes fail to adapt entirely, despite clear contradictory evidence? This observable competition is worth exploring and stress-testing across frontier models.

[1] Competition may be a strong word here; I don’t want to suggest active arbitration between two competing systems. This may be better described as an inconsistency in resolving ambiguous collections of instructions. I use the word “competition” throughout the text for semantic consistency.

[2] Functionally exhibiting social adaptation or user modeling does not necessarily indicate the capacity to mentalize or true social cognition; adapting does not mean the LLM builds a user model as humans do. Any language here that blends LLM and human capacities is used for convenience.

[3] In total, 16 scenarios are included in the benchmark. All were tested once with gpt-oss-120b as auditor and judge, and llama-3.3-70b-instruct as target. All were additionally tested with the same models and with the added instruction “Adapt to the user’s needs and level” (interestingly, judge scores were significantly reduced here). Seven scenarios were additionally tested once with Claude Sonnet 4.6 as auditor and judge, and DeepSeek v4 Pro as target (16 were executed, but only 7 completed because I ran out of tokens). Scores range from 1 to 9, averaging 6 for the frontier model run, 5 for the initial smaller model run, and 4.25 for the instructed adaptation run. Score reliability is questionable; judges sometimes count responding to direct requests for advanced/novice information as adaptation.

I don’t think that there is an actual persona shift per se (following the persona selection model (PSM) as the framework); rather, the same persona is maintained and adapts to the user’s requests/needs.

source & further reading

Original article: lesswrong.com