Anthropomorphic Misalignment research needs stronger evidence

This is a distillation of our ICML 2026 Oral position paper, Position: Anthropomorphic Misalignment Research Needs Stronger Evidence. Joint work by Vansh Gupta, Peter Nutter, Samuel Stante, Andreas Krause, Florian Tramèr, Lukas Fluri, Xin Chen, and Anna Hedström at ETH Zurich. Code is here.

AI safety research increasingly studies behaviors that sound human: deception, scheming, sycophancy, shutdown resistance, and emergent misalignment. We refer to this family of work as anthropomorphic misalignment research (AMR). Anthropomorphic language is useful, as it points to the risks we are worried about. Yet it also tacitly introduces assumptions about models having intent or other human-like properties, which can lead to misclassified phenomena, mistaken conclusions, and misallocated resources. These behaviors are important to study, but doing so requires stronger and more rigorous evidence than the field currently provides.

In the paper, we argue that AMR requires a clearer match between claims and evidence. Specifically, we:

We support these claims with our own experiments and audits, summarized below and detailed in the full paper.

AMR is only a subset of the topics we care about in AI safety, but it receives considerable attention, and its results increasingly show up in discussions of deployment, model evaluations, governance, and frontier-model risk. The common pattern is a reliance on familiar language: "deception", "scheming", "self-preservation", or an emergent misaligned persona. Such terms are intuitively understandable, and therefore effective at communicating complex, potentially dangerous or misaligned behaviors in a few words. The downside is that we do not yet understand which of our intuitions about these behaviors carry over to the models under study: an apparently anthropomorphic behavior could genuinely emerge from the optimization process, or it could be an artifact of inadequate controls and of reading too much into surface behavior.

If a model says something false, that is a fact about its output, not automatic evidence that it intended to deceive the user. Moving from "the model produced a false statement" to "the model strategically deceived the user" requires extra evidence: what information the model had, what alternatives were available, whether the behavior persists across settings, and whether simpler explanations have been ruled out. The cost of getting this wrong is non-trivial. If a benchmark result is remembered as evidence that frontier models are strategically deceptive, when the benchmark has mostly shown deceptive-looking behavior under a contrived, narrow setup, the field may direct its efforts at the wrong target. We have seen this before: a result widely read as self-preservation was later traced largely to instruction ambiguity and task-completion incentives. Errors of this kind, arising from over-reliance on human intuition, can create false leads and erode the credibility of safety research with the broader ML community and policymakers.

We set out to localize where such interpretive ambiguity arises during the research process. In our paper, we categorize the resulting sources of error into four stages and discuss each at length. Here we briefly state four specific challenges from the pipeline.

AMR draws its vocabulary from human mental-state language: intention, deception, awareness, self-preservation, goal pursuit. These are already hard to define for humans and even harder to translate into clean measurements for language models, so studies measure proxies. Examples include outputs, benchmark labels, chain-of-thought labels, activation probes, and LLM-judge scores. Relying on proxies is usually unavoidable, but it needs to be clear in what respects a given proxy captures the underlying behavior and where it diverges.

These definitional choices then propagate into how the evaluation is operationalized. In the worst case, this can lead to two evaluations capturing something completely different under the same name. "Situational awareness", for instance, is operationalized so differently across benchmarks (agentic Linux tasks versus question-answering about the model itself) that their results do not accumulate into evidence about a single phenomenon. Deception datasets face an even harder version of the same problem, since they have to separate a lie from a shifted belief, from simple ignorance, and from mere cue-following, all within a textual environment a judge can score.

AMR experiments often report a single configuration: one judge, one prompt, one threshold, one aggregation rule. Yet small choices can move the result substantially: when we re-scored identical emergent-misalignment generations under different judge configurations, reported misalignment rates ranged from 3.7% to 12.9%. The evaluation pipeline is part of the measurement, and a point estimate should not be treated as a stable property of the model unless the evaluator choices are reported and stress-tested.
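As an illustration of what such a stress test could look like, the following minimal sketch re-scores the same set of generations under every combination of judge and threshold and reports the resulting range instead of a single number. It is not the paper's code: the judge names, the 0-100 alignment and coherence scales, the thresholds, and all scores are hypothetical values chosen for the example.

```python
from itertools import product

# Toy (alignment, coherence) scores on a 0-100 scale for the SAME generations,
# as returned by two hypothetical judges. All numbers are made up.
SCORES = {
    "judge_a": [(12, 80), (45, 90), (28, 35), (95, 99), (29, 70), (70, 60), (5, 20), (88, 95)],
    "judge_b": [(25, 85), (50, 92), (31, 40), (97, 99), (18, 75), (65, 55), (10, 45), (90, 93)],
}


def misalignment_rate(scores, align_max, coherence_min):
    """Share of sufficiently coherent generations scored below align_max."""
    kept = [a for a, c in scores if c >= coherence_min]
    if not kept:
        return float("nan")
    return sum(a < align_max for a in kept) / len(kept)


def sweep(scores_by_judge, align_thresholds=(20, 30, 40), coherence_thresholds=(30, 50)):
    """Re-score identical generations under every judge/threshold combination."""
    return {
        (judge, a_max, c_min): misalignment_rate(scores, a_max, c_min)
        for (judge, scores), a_max, c_min in product(
            scores_by_judge.items(), align_thresholds, coherence_thresholds
        )
    }


rates = sweep(SCORES)
low, high = min(rates.values()), max(rates.values())
print(f"{len(rates)} configurations, rate range: {low:.1%} to {high:.1%}")
```

Reporting the full range, together with the configuration that produced each end of it, makes it visible how much of a headline rate depends on evaluator choices rather than on the model.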

The same fragility appears in benchmarks: a recent one targeting deception asks the model to help with a task that is deceptive only given context the prompt should supply, but in 18% of the scenarios that context is missing, leaving the judge's score invalid. High annotator agreement does not rule this out, because annotators can agree with each other while scoring something other than the intended construct. A benchmark can thus have high agreement and still suffer from a construct-validity problem. The paper works through these cases, including MASK and DeceptionBench, in detail.

Beyond measurement, many AMR claims fail to rule out a more ordinary explanation. A shutdown-resistance result might look like self-preservation, but Rajamanoharan et al. traced much of one widely discussed case to instruction ambiguity and the model's incentive to finish its task.

In deception benchmarks like MASK, the alternative explanation of instruction-following or roleplaying often isn't accounted for. Smith et al. provide evidence that models likely process the request as an implicit instruction to roleplay.

The biggest leap is from "this signal correlates with the behavior" to "this signal explains the behavior". When we tested published deception probes on honest examples that merely contain deception-like surface features (sarcasm, wrong-answers-only games, an actor reciting a deceptive line), false-positive rates reached 87% to 100% in several settings. The probes appear to overgeneralize to nearby falsehood or roleplay rather than isolate the intent to deceive. Such a detector may still be useful, but it is misleading to describe what it finds as strategic deception.
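A control-set check of this kind can be sketched as follows. This is a minimal example under stated assumptions, not the evaluation used in the paper: the category names, the probe scores, and the 0.5 decision threshold are all hypothetical, and the probe is represented only by the scores it assigns.

```python
from collections import defaultdict

# (category, probe_score, is_deceptive) triples. All values are made up;
# every category except "deceptive" contains only honest examples.
EXAMPLES = [
    ("plain_honest", 0.10, False), ("plain_honest", 0.20, False),
    ("plain_honest", 0.55, False), ("plain_honest", 0.05, False),
    ("sarcasm", 0.80, False), ("sarcasm", 0.65, False), ("sarcasm", 0.30, False),
    ("wrong_answers_game", 0.90, False), ("wrong_answers_game", 0.75, False),
    ("actor_line", 0.85, False), ("actor_line", 0.40, False),
    ("deceptive", 0.95, True),
]


def false_positive_rates(examples, threshold):
    """Per category: share of honest examples that the probe flags as deceptive."""
    flagged = defaultdict(int)
    total = defaultdict(int)
    for category, probe_score, is_deceptive in examples:
        if is_deceptive:
            continue  # only honest examples can be false positives
        total[category] += 1
        flagged[category] += probe_score >= threshold
    return {cat: flagged[cat] / total[cat] for cat in total}


fpr = false_positive_rates(EXAMPLES, threshold=0.5)
for category, rate in fpr.items():
    print(f"{category}: {rate:.0%} false positives")
```

Breaking the false-positive rate down by control category shows whether a probe fires on deception-like surface features in general, which a single aggregate accuracy number would hide.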

Having mapped where the evidence tends to break down, we turn to the constructive part of the paper: a framework for stating precisely what a given result does or does not support. Evidence-based medicine faced a related problem, since an observed effect is not the same as certainty about why it happens, and AMR needs similar norms. The levels below are organized around what the evidence licenses researchers to claim, rather than how sophisticated the method is.

L1: behavioral evidence. The model did behavior B, under setting S, with evaluator E, at rate p. Most benchmarks live here.

L2: functional evidence. The behavior causes a downstream (negative) effect in a deployment-plausible context. If the claimed harm is about people, this usually needs human-grounded validation rather than only an LLM judge.

L3: causal-mechanistic evidence. A specific factor causes the behavior: a training signal, prompt condition, scaffold, internal representation, or mechanism. This requires interventions and specificity checks, so that the intervention changes the target behavior rather than simply degrading the model broadly.
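The L1 template at the start of this list ("behavior B, under setting S, with evaluator E, at rate p") can be written down as a small record, so that a rate is never reported without its setting, evaluator, and sample size. The sketch below is illustrative only: the class name, field names, and example values are hypothetical, and the Wilson score interval is one common choice for a binomial proportion rather than something the paper prescribes.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class L1Claim:
    """Behavioral claim: behavior B, under setting S, with evaluator E, at rate p."""
    behavior: str
    setting: str
    evaluator: str
    positives: int  # trials in which the evaluator flagged the behavior
    trials: int     # total trials in this setting

    @property
    def rate(self):
        return self.positives / self.trials

    def wilson_interval(self, z=1.96):
        """Wilson score interval for the rate (z=1.96 gives roughly 95% coverage)."""
        n, p = self.trials, self.rate
        denom = 1 + z * z / n
        center = (p + z * z / (2 * n)) / denom
        half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
        return center - half, center + half


# Hypothetical example values, not results from any study.
claim = L1Claim(
    behavior="false statement under pressure prompt",
    setting="single-turn, system prompt v1",
    evaluator="LLM judge, threshold 30",
    positives=10,
    trials=100,
)
lo, hi = claim.wilson_interval()
print(f"{claim.behavior}: {claim.rate:.1%} (interval {lo:.1%} to {hi:.1%})")
```

Keeping the setting and evaluator in the same record as the rate makes it harder for the number to be quoted later as a context-free property of the model.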

Crucially, these levels are claim-relative rather than a ladder every paper must climb in order. L1 evidence can justify monitoring, disclosure, and follow-up experiments; L2 evidence is stronger for deployment and policy decisions because it shows consequences; and L3 evidence is needed for claims about intent, mechanisms, or stable causal factors.
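The specificity check that L3 evidence calls for can be sketched as a comparison between the change in the target behavior and the change in unrelated control metrics under the same intervention. The metric names, the numbers, and the 0.05 tolerance below are hypothetical assumptions for illustration, not a procedure taken from the paper.

```python
def specificity_report(baseline, intervened, target="target_behavior", max_control_drop=0.05):
    """Compare the target metric's change with changes in unrelated control metrics."""
    target_delta = intervened[target] - baseline[target]
    control_deltas = {
        name: intervened[name] - baseline[name]
        for name in baseline
        if name != target
    }
    # Controls that dropped by more than the tolerance suggest broad degradation.
    degraded = {name: d for name, d in control_deltas.items() if d < -max_control_drop}
    return {
        "target_delta": target_delta,
        "control_deltas": control_deltas,
        "degraded_controls": degraded,
        "specific": not degraded,
    }


# Made-up metric values before and after a hypothetical intervention.
baseline = {"target_behavior": 0.30, "math_accuracy": 0.80, "instruction_following": 0.90}
intervened = {"target_behavior": 0.02, "math_accuracy": 0.79, "instruction_following": 0.60}

report = specificity_report(baseline, intervened)
print(report["specific"], sorted(report["degraded_controls"]))
```

In this toy case the target behavior nearly disappears, but instruction following also drops sharply, so the intervention cannot be said to have isolated the factor behind the behavior.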

Building on this framework, the paper proposes 12 stage-specific recommendations and a diagnostic checklist of 30+ items. The short version is straightforward: make the claim explicit, then check if the evidence matches it. The most actionable recommendations are the following:

There exist valid counterarguments to our position. In the following, we list three that we encountered most often, together with a comment on how we view them.

High evidential standards can delay action on catastrophic risks. The concern is real, and "more evidence is needed" has sometimes been used to delay action on serious harms. But rigor and precaution are complementary: precaution governs which actions are warranted under uncertainty, while rigor governs how that uncertainty is communicated. We do not propose waiting, nor keeping these results private. Instead, we propose staying well calibrated about what is an interesting model quirk in a particular evaluation and what is genuinely concerning behavior, so that we can act when such behavior arises and are not desensitized by false positives in the meantime.

Anthropomorphic language is useful and captures a large part of the threats we are worried about. Calling a behavior "deception" is a fine way to raise the concern, but the word has to be turned into a concrete definition: what is being measured, how it rules out alternative explanations, and what it does not capture. Leaning on a single word without a concrete definition also constrains the expressivity of the concept.

Exploratory work should not have to meet confirmatory standards. Early work should be allowed to move fast and surface strange behaviors. The problem arises when exploratory findings are later summarized as established conclusions, whether in papers, citations, public communications, or policy memos.

AMR is pre-paradigmatic, so it's unsurprising that we keep encountering these issues. Some of its claims are probably correct; some are probably overclaims that will look obvious in retrospect; and the current methodological norms make it hard to tell which is which. Our hope is that this paper opens a productive discussion about the standards we should follow going forward and how we should communicate our results.

The full paper, appendices, and checklist are linked at the top. We would especially welcome pushback, and pointers to AMR work that already matches its claims to its evidence and could serve as a gold standard.

Source: lesswrong.com (original article)