{"slug": "research-agenda-interpretive-debate", "title": "Research agenda: Interpretive debate", "summary": "Researchers propose a new epistemic infrastructure to iteratively and empirically resolve interpretive questions about AI models, building on prior work on performative misalignment. The approach aims to address the difficulty of defining and measuring latent constructs like scheming and motivation, which are critical for AI safety but may not converge under unstructured empirical investigation.", "body_md": "One sentence pitch: our goal is to develop a piece of epistemic infrastructure for iteratively and empirically answering interpretive questions about AI models, where the accumulation of empirics leads to resolution of interpretive ambiguity and/or calibration of uncertainty.\n\nThis directly builds on our “performative misalignment” work ([paper](https://arxiv.org/abs/2606.08243), [LW post](https://www.lesswrong.com/posts/tZSkryA4aygKAbPFz/updates-on-performative-misalignment-1)), which we see as a minimal demonstration of one round of debate.\n\n**The difficulty and importance of interpretive questions**\n\nAI safety involves a lot of [fuzzy](https://aligned.substack.com/p/crisp-and-fuzzy-tasks), interpretive questions:\n\nWhy do we need interpretation? Because despite [CoTs being generally monitorable](https://arxiv.org/abs/2507.11473) right now, we can’t trust that the model is [verbalizing](https://arxiv.org/abs/2305.04388) [all the reasons](https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit?tab=t.0#heading=h.kkaua0hwmp1d) that lead to its decisions, or that our behavioral testing fully reveals its [latent knowledge](https://arxiv.org/abs/2212.03827) and [capabilities](https://arxiv.org/abs/2506.10139).\n\nThese questions are hard, partly because concepts like scheming and introspection are very hard to define accurately. 
For weaker models, the interpretive questions of interest looked like this: is the model exploiting [multiple-choice](https://arxiv.org/abs/2502.14127) [format](https://arxiv.org/abs/2309.12288) when answering questions, is the model [exploiting lexical heuristics](https://aclanthology.org/P19-1334/) when predicting entailment, is the model [looking only at the background](https://arxiv.org/abs/2006.09994) when recognizing objects? Those are much easier to define.\n\nDespite the difficulty, these questions are important, because the answers underpin our predictions about how the model generalizes. Even for ones that fundamentally cannot be resolved at the moment, we want calibrated uncertainty, e.g., P(scheming), both about the fundamental ambiguity of the question itself and about the efficacy of available tools.\n\nRecently, there have been calls to approach these interpretive questions in a more scientific manner, e.g., the [science of scheming](https://www.apolloresearch.ai/science/science-of-scheming/) and the [science of motivations](https://www.lesswrong.com/posts/DTDoyDTtC8R3bCiTx/from-personas-to-intentions-towards-a-science-of-motivations). I think of them as key building blocks of [safety](https://arxiv.org/abs/2505.03989) [cases](https://arxiv.org/abs/2501.17315), which formally present our predictions about how a model would generalize. They are also important for [developing mitigations](https://alignment.anthropic.com/2025/alignment-faking-mitigations/).\n\n**Unstructured empirical investigation may never converge**\n\nMy biggest concern is that despite our best efforts and tools that accelerate research like Bloom, Petri, and [auto hill-climbers](https://www.anthropic.com/research/automated-alignment-researchers), the empirical evidence we accumulate over time does not lead to effective resolution of interpretive ambiguity or to calibrated uncertainty. 
I have two specific reasons for this.\n\nFirst, latent constructs like motivation and goal are underdefined and ambiguous. It's unclear how much ontological progress is needed for the field to reach consensus on what constitutes scheming, and our target will likely change as model capabilities improve.\n\nSecond, models are highly sensitive, and this sensitivity affects the data-driven instruments we use to study interpretive questions, e.g., steering vectors. For example, due to [subliminal](https://arxiv.org/abs/2507.14805) [effects](https://arxiv.org/abs/2602.04863) of [synthetic](https://arxiv.org/abs/2602.04899) [data](https://arxiv.org/abs/2606.04071), two independent implementations of the same experiment using [steering vectors](https://arxiv.org/abs/2604.25783) might lead to completely opposite findings.\n\nThis conceptual ambiguity of the hypotheses and plasticity of data-driven instruments largely doomed the pre-GPT ad-hoc interpretability research, with back-and-forth debates like how gradient-based input attributions are [theoretically sound](https://openreview.net/forum?id=dYeAHXnpWJ4) but you can always find [counterexamples](https://arxiv.org/abs/1710.10547), and how attention [is not](https://arxiv.org/abs/1902.10186) or [is not not](https://arxiv.org/abs/1908.04626) explanation. These debates were academically interesting, but did not lead to actual resolution.\n\nIn short, unorganized empirical investigation might lead to an accumulation of evidence that makes us more confused. We need something more structured.\n\n**Debate for interpretive questions**\n\nThe core idea is defeasibility: in the absence of ground truth for latent constructs like motivation, truth is not a property we verify against a reference but an epistemic status that survives adversarial challenge. 
A claim about a model’s motivation is warranted not because we can confirm it against some oracle, but because it has withstood structured attempts to defeat it.\n\n[Debate](https://arxiv.org/abs/1805.00899) provides an excellent blueprint for these interpretive questions: an investigative process that’s structurally biased towards the truth. Instead of designing individual methods to find a specific truth, debate specifies truth-finding processes. In interpretive debate, defeasibility is implemented through three key principles:\n\n**Contrastive hypotheses enable recursive decomposition**\n\nThe first step of debate is to spell out two competing hypotheses. The field has already done this for many of these interpretive questions:\n\nThe reframing does not resolve the ambiguity directly, but it changes what needs to be investigated. Comparing two interpretations identifies a crux: a pair of more specific statements about how the behavior is produced under each interpretation, whose difference is easier to investigate than the original disagreement. The guiding question is: in what counterfactual scenarios would the two hypotheses lead to clearly distinguishable behaviors? In the original situation, both hypotheses predict the same observed behavior; the crux identifies aspects of the situation that could be varied to produce divergent predictions. Designing experiments that target the divergence, observing the results, and using them to update the relative support for each hypothesis is one round of recursive decomposition.\n\nOf course, each of these concepts has many different confounders: lying is confounded with both untruthfulness (saying something non-factual because the model genuinely believes it) and instruction following (role-playing as a liar). The contrastive structure admits multiple framings, each spelling out the two hypotheses differently and identifying a different direction for disambiguation. 
These can be pursued as independent parallel lines of investigation. This raises a natural question: how do we aggregate results across parallel decompositions that may not agree?\n\n**Convergence across independent sources of evidence**\n\nConvergence operates along two dimensions.\n\nThe first is convergence across recursion paths. The same top-level disagreement can be recursed on in multiple ways. Each recursion path spells out the two contrastive hypotheses differently, directing investigation toward different aspects of the disagreement. Any single framing could inadvertently direct investigation toward territory where one interpretation looks better for reasons specific to that framing rather than to the model. When independent recursion paths converge on the same top-level verdict, the convergence rules out these framing-specific artifacts.\n\nThe second is convergence across instruments. The instruments we use to investigate model behavior (fine-tuning, activation steering, prompting, behavioral testing) each interact with the model through different interfaces, and each interface introduces its own artifacts. When instruments with different interfaces converge on the same verdict, the convergence rules out these single-instrument artifacts: a confound would need to operate consistently across fundamentally different modes of interaction with the model.\n\n**Compute-indexed evidence**\n\nWhen two competing hypotheses predict indistinguishable observations, we favor the one that’s simpler and more [self-coherent](https://arxiv.org/abs/2601.13566), e.g., by applying Occam's razor.\n\nThe concern here is that given unlimited compute, an agent (human or AI) can find evidence supporting any interpretation. The space of reasonable implementations is large enough and the model sensitive enough for a sufficiently resourced search to succeed. 
Rather than asking whether a claim holds at a given budget, we ask how the evidential support for each hypothesis scales as more computational resources are allocated to testing it. The result is a scaling law rather than a binary verdict.\n\nCompute-indexing tracks how evidence accumulates under constructive investigation. A complementary measure is cross-examination: testing whether a piece of evidence survives scrutiny from directions it was not optimized for. Cross-examination is more explicitly adversarial: rather than building the case for a hypothesis, it attempts to break it, probing for inconsistencies and testing whether findings generalize beyond the specific conditions that produced them.\n\nEnforcing debate rules and compute-indexing evidence, especially evidence produced by human researchers, is not trivial. This would require infrastructure that, for example, allows researchers to pre-register their hypotheses and experimental designs. Importantly, most negative results about a hypothesis don't get published, but they are as valuable as positive results for calibration. We think this type of epistemic infrastructure is valuable.\n\n**Interpretive debate as epistemic infrastructure for safety cases**\n\nSafety cases are fundamentally interpretive claims: to argue that a model will behave well in conditions we haven't tested, we need to make claims about unobservable latent variables and calibrate our uncertainty about interpretive ambiguity. We argue that instead of a binary verdict, we should design an algorithm that produces a structured profile of competing hypotheses, such that the accumulation of empirical evidence leads to resolution of interpretive ambiguity and calibrated uncertainty. 
We hope to design a family of algorithms (or, truth-finding processes) that is generally applicable to many interpretive questions like motivation, goal-directedness, emotion, introspection, such that it can work as an epistemic infrastructure for safety research.\n\n**One round of interpretive debate by human researchers**\n\nOur recent work on performative misalignment ([paper](https://arxiv.org/abs/2606.08243),[ LW post](https://www.lesswrong.com/posts/tZSkryA4aygKAbPFz/updates-on-performative-misalignment-1)) is a manual demonstration of one round of interpretive debate applied to alignment faking. This section describes what the round exercises and what it leaves open.\n\n**Contrastive hypotheses**. The contrastive framing is: is the compliance gap driven by scheming (goal preservation under consequences) or sycophancy (pattern-matching to researcher expectations)? Neither motivation can be intervened on directly — amplifying scheming requires genuine demonstrations of it, which is the open question. The contrast breaks the circularity: each hypothesis entails a distinct process the model must run, operationalizable where the motivations themselves are not. The round intervenes on what each hypothesis tracks — consequences for scheming, researcher expectations for sycophancy — both testable outside the original setup, allowing for tests for convergence.\n\n**Convergence across instruments**. Each model organism runs through SDF, steering, and prompting. SDF and steering are the primary instruments, operating below performed reasoning; prompting serves as a calibration check against SDF checkpoints with known family. The instruments converge where they can be run: three of four models are expectation-dominant under SDF, and prompting validates these results. 
Steering demonstrates a dissociation in Llama-3.1-70B: the consequence direction is recoverable but not action-coupled (it shifts retraining reasoning without moving compliance), the expectation direction moves both, and neither moves the gap.\n\n**What the round does not exercise**. The interventions are symmetric: one team makes an even-handed pass over both hypotheses rather than assigning debaters to argue each side. This keeps the contrastive structure intact while simplifying the apparatus, but leaves adversarial cross-examination untested. The evidential hierarchy rests on one assumption we have not tested — that SDF and steering operate below performed reasoning rather than retargeting it — which is the natural first target for a second round. Evidence is indexed by instrument and by transfer, but not yet by compute, so the output is a comparative motivation and realism profile rather than a calibrated P(scheming). And cross-examination so far means robustness checks rather than a debater assigned to break the expectation-dominant verdict.\n\nBeing one round, the demonstration also works within a single decomposition of the disagreement — the binary contrast between scheming and sycophancy that the field has already spelled out — setting aside other candidates the same behavior could express (role-play, generic helpfulness, refusal suppression) and leaving for later rounds the question of whether alternative framings reach the same verdict.\n\n**What’s next**\n\nOur key focus next is to improve the manual one-round debate, including compute-indexing, cross-examination, more convergence experiments, improving SFT and steering methodology, as well as replicating the process on other behaviors and model organisms.\n\nIn the spirit of defeasibility, we believe our methodology itself should also be red-teamed and iteratively improved. 
This agenda benefited a lot from feedback and critiques of our initial experiments on performative misalignment, and we hope that the community can continue to help us identify flaws.\n\nThis agenda is based on our initial [ICML paper](https://arxiv.org/abs/2606.08629) by David Baek, Xinnuo Li, Anay Gupta, Taslim Mahbub, Kejian Shi, Max Tegmark, which was then extended by MATS 9.0 scholars David Vella Zarb, Rustem Turtayev, Taywon Min, and research manager Jinghua Oh. Discussions with Jiaxin Wen, Arush Tagade, Shawn Zhou, Jay Chooi, Avidan Shah, Peter Nutter, Dani Roytburg, and Kiante Brantley directly contributed to the agenda. It also benefited from feedback from Tianyi Alex Qiu, David Africa, Jacob Pfau, Nathaniel Mitrani, and Nishant Balpeur.", "url": "https://wpnews.pro/news/research-agenda-interpretive-debate", "canonical_source": "https://www.lesswrong.com/posts/onaSmiocXtBYG5BZZ/research-agenda-interpretive-debate", "published_at": "2026-06-18 23:46:43+00:00", "updated_at": "2026-06-19 00:02:50.314097+00:00", "lang": "en", "topics": ["ai-safety", "ai-research", "large-language-models", "ai-ethics"], "entities": ["Apollo Research", "Anthropic", "Bloom", "Petri"], "alternates": {"html": "https://wpnews.pro/news/research-agenda-interpretive-debate", "markdown": "https://wpnews.pro/news/research-agenda-interpretive-debate.md", "text": "https://wpnews.pro/news/research-agenda-interpretive-debate.txt", "jsonld": "https://wpnews.pro/news/research-agenda-interpretive-debate.jsonld"}}