Beyond Recall: Behavioral Specification as Interpretive Layer for AI Researchers have introduced "Behavioral Specification" as a new interpretive layer for AI memory, arguing that current systems optimized for factual recall fail to capture how a specific person processes information into judgments and decisions. The team proposes "representational accuracy" as a measurable property of how well an AI's internal model captures a person's interpretive patterns, testing this through behavioral prediction on held-out situations. This shift matters because as AI moves from tool to agent acting on a person's behalf, the system's behavioral alignment depends on accurately representing the user's reasoning framework, not just retrieving facts. 1. Introduction 1-introduction 1.1 Recall is not interpretation. Interpretation can be measured. 11-recall-is-not-interpretation-interpretation-can-be-measured AI is moving from a tool a person uses to an agent that acts on a person's behalf, and that shift changes what "memory" must do for a specific individual. State of the art AI memory has been optimizing for recall as the success metric. The four prominent commercial systems Zep, Letta, Mem0, and Supermemory compete on standard recall benchmarks such as LOCOMO and LongMemEval, reporting accuracies in roughly the 70% to 93% range depending on provider, model, and benchmark variant §2.2 22-memory-systems-for-llm-agents . Optimizing further on recall leaves something more fundamental unmeasured. This paper examines how recall is one part of memory, and how the function of memory is dictated by how an individual processes the facts and experiences of their life. We use interpretation to refer to this human-side property: the way a specific person processes facts and experiences into judgments, decisions, and reactions. Viewing situations from different lenses can lead to entirely different interpretations of the same set of facts. This principle holds across domains: the same set of facts can produce different conclusions depending on the interpretive framework the reader brings to them. Memory is therefore personal in a deeper sense than recall: the same facts arrange differently inside different people. For an AI to serve a specific person, it must be given context on the framework that person uses to reason, not just the raw facts or information itself. Throughout this paper we use the term Behavioral Specification to refer to a static document that extracts and encodes a person's behavioral patterns; the operational definition is developed across §3.7 37-pipeline-for-the-behavioral-specification . A Behavioral Specification is an artifact that captures this interpretive framework, and is provided to an AI as context. We introduce representational accuracy as the corresponding AI-side property: how well a system's internal model of a specific person captures their interpretive patterns. It is not recall, preference matching, or persona consistency. It is a distinct property of the AI system, and state of the art memory benchmarks do not isolate it. Prior work closest to this axis Twin-2K for scaled behavioral prediction, PersonaGym for persona fidelity, AlpsBench for preference alignment measures related properties but not the transfer of a person's interpretive patterns to new situations the system has never seen. §2.1 21-prior-measurement-targets-and-the-gap-representational-accuracy-fills positions each benchmark against what this paper measures, and Appendix F develops the scope differences in detail. The core hypothesis of this research is that representational accuracy of a person's interpretation improves an AI system's behavioral alignment with that person. This is the operational primitive for any AI system meant to act on a person's behalf: the system's behavior can only match the user's reasoning to the extent the system represents that reasoning accurately. The operational test in this paper is behavioral prediction on held-out situations: given a situation drawn from text the model has never seen, the model generates how the subject would respond; the response is scored by a panel of calibrated large language model LLM judges against the subject's own verbatim response in the held-out text on a 1-5 interpretive rubric §3.3 33-scoring-rubric-with-calibrated-llm-judge-panel . Accurate prediction on held-out text is evidence that the representation captures the subject's recurring patterns of reasoning, distinct from the facts and stylistic surface that current extraction pipelines already produce. The design also reduces the risk of sycophancy 1: the answer is checked against the person's narrative, which the model has never seen, not against anything the user says during the conversation. The held-out test is one operationalization of the hypothesis. We test this hypothesis on the leading state-of-the-art AI memory systems and on a diverse set of 14 autobiographies from authors across the world. For this initial examination we use baselined and calibrated LLM judges to evaluate the performance of each memory system, on its own and in combination with a Behavioral Specification : a static document that extracts and encodes a stable representation of a corpus's behavioral patterns. The Specification captures the recurring patterns in how the subject reasons, drawn from the shape of judgments and reactions across the corpus for example: "spiritual integrity over social cost..." , "reform through love..." , "hierarchical deference..." . A walked example of the audit chain from such a pattern back to its grounding facts and source passages appears in §2.3 23-traceability-and-reasoning-traces . Defined terms used throughout the paper are collected in Appendix H appendix-h-glossary for reference. 1.2 What we tested 12-what-we-tested We tested a Behavioral Specification across 14 historical subjects, each with a public domain autobiography sourced from Project Gutenberg or the Internet Archive per-subject sources in §3.4 34-subjects Table 3.1 . For every subject we split the source corpus in half: the training half was used to generate the Specification, to seed each memory system, and to provide the retrievable fact pool. The held-out half was used only to produce behavioral prediction questions and was never shown to the response model , the language model being asked to predict how the subject would respond. The Behavioral Specification is one of the context conditions the response model receives; the full set of conditions is defined in §3.2 32-experimental-conditions . The set of held-out questions for each subject is the question battery size and composition per subject in §3.5 35-question-battery-formation . The test was whether each system, under each tested condition, could predict how that specific person would respond in situations drawn from text it had never seen. The evaluation is a prototype benchmark for representational accuracy, not a finished one; §7 7-future-work flags the work needed to harden it into a standardized instrument. The Behavioral Specification itself is built from the training-half corpus through an extraction-and-authoring pipeline §3.7 37-pipeline-for-the-behavioral-specification . The pipeline distills the recurring patterns of how the subject reasons into a single structured document, typically around 7,000 tokens ~5,000 words long. That document is what the response model receives as context when asked to predict how the subject would respond. Hypotheses. The study tests five claims about how a representation of a person shapes AI behavior on that person's behalf: H1. A response model given a Behavioral Specification produces responses that align with the person's documented behavior more closely than the same model given no context, facts retrieved by a memory system, the full extracted fact list, or the raw source corpus §4.1 41-the-cross-subject-gradient-and-its-per-question-mechanism . H2. The Specification's benefit is inversely proportional to the response model's pretraining coverage of the person. Its effect is largest on people the model does not already know §4.1 41-the-cross-subject-gradient-and-its-per-question-mechanism . H3. The benefit comes from the content of the correct Specification for the correct person, not from the mere presence of a structured prompt. A random other person's specification, applied in its place, produces a substantially smaller and content-specific effect than the matched Specification §4.3 43-mechanism-correct-content-not-format . H4. The Specification interacts with memory-system retrieval in a structured way that depends on the type of question being asked. Aggregate effects on each memory system reflect the balance of these per-question patterns and shift with retrieval architecture §4.4 44-memory-system-composition . H5. A Behavioral Specification's quality advantage is also a compression advantage: a ~7,000-token ~5,000-word Specification recovers most of the predictive accuracy of an 80-400K-token ~60-300K-word raw corpus §4.2 42-compression-structure-vs-raw-text . Post-hoc analyses surfaced during the work are reported alongside these results. 2 user-content-fn-pre-vs-post-hoc Primary and secondary outcomes. The primary outcome is the mean prediction score on the 1-5 rubric across a 5-judge primary panel §3.3 33-scoring-rubric-with-calibrated-llm-judge-panel . 3 Cross-subject claims are calculated subject-by-subject before averaging, so they are not driven by subjects with larger question batteries. As a secondary outcome , we report the per-question improvement rate : how often a context condition helps relative to the comparison baseline §4.2.1 421-per-question-improvement-rate , not just by how much it helps when averaged. The per-question secondary outcome is informative because each context condition behaves differently across question types: aggregate effects reflect the balance of interpretation-heavy items where the Specification lifts most and literal-recall items where retrieval already suffices . The formal proposal and failure-mode analysis for the secondary outcome are in §4.2.1 421-per-question-improvement-rate ; full operational details for both outcomes are in §3.3 33-scoring-rubric-with-calibrated-llm-judge-panel . Each memory system is tested in both a controlled configuration identical pre-extracted fact pool and a native configuration the provider's own ingestion pipeline ; design detail in §3.2 32-experimental-conditions . Running in parallel across both is the Behavioral Specification, tested alone and layered on top of each configuration. Every meaningful combination of inputs is evaluated as its own condition: | Group | Condition | Inputs given to the model | Purpose | |---|---|---|---| | Direct context manipulations | No context C5 | Nothing. The model answers from pretraining alone. | Pretraining baseline. Measures what the model already knows about the subject from public sources. | Specification alone C2a | The Behavioral Specification, with no retrieval, no facts, and no corpus. | Tests whether structure without retrieval is sufficient on its own. | | Wrong-specification control C2c | A different subject's specification applied to this subject. Two variants: an adversarial fixed pairing v1 and a random derangement v2 . | Tests whether the effect is driven by the content of the correct Specification, or by the mere presence of structured prompting. | | All facts, no specification C4 | Every extracted fact for the subject, loaded into context at once. | Tests whether information sufficiency alone drives prediction, independent of structure. | | Facts + specification C4a | Every extracted fact plus the Specification. | Combines full information and structure to test the upper bound of context-provided prediction. | | Raw corpus, no specification C8 | The full training-half corpus loaded into context. | Tests whether unstructured source text can substitute for an interpretive representation. | | Corpus + specification C9 | Raw training corpus plus the Specification. | Tests whether structure is additive to unstructured source text. | | | Memory-system configurations controlled, all 5 systems | Retrieval alone, controlled C1 | Top-k facts retrieved by each memory system Mem0, Letta, Supermemory, Zep, Base Layer from the shared fact pool. | Tests retrieval sufficiency, and whether providers converge on which facts are most relevant given identical input. | Retrieval + specification, controlled C3 | Memory system retrieval from the shared fact pool, plus the Specification. | Tests whether the Specification layers cleanly on retrieval when the input is held constant. | | | Memory-system configurations native, 4 commercial systems | Retrieval alone, native C1 native | Top-k results from each memory system's own ingestion pipeline operating over the raw training corpus. | Real-world comparison of each memory system's full ingestion-plus-retrieval stack. | Retrieval + specification, native C3 native | Memory system's own ingestion and retrieval, plus the Specification. | Tests whether the Specification improves the real-world deployment of each memory system. | The 14 subjects span four continents and roughly two millennia of written human experience. Ordered chronologically: Saint Augustine North Africa, 4th-5th c. , Bābur Central Asia and India, 15th-16th c. , Bernal Díaz del Castillo Spain and Mexico, 15th-16th c. , Benvenuto Cellini Italy, 16th c. , Jean-Jacques Rousseau France, 18th c. , Olaudah Equiano West Africa and Britain, 18th c. , Mary Seacole Jamaica and Britain, 19th c. , Elizabeth Keckley United States, 19th c. , Yung Wing China and the United States, 19th c. , Philip Gilbert Hamerton Britain, 19th c. , Fukuzawa Yukichi Japan, 19th c. , Georg Ebers Germany, 19th c. , Sunity Devee India, late 19th-early 20th c. , and Zitkala-Ša Yankton Dakota, early 20th c. . Source corpora range from 25,231 words Hamerton to 422,772 words Bābur . Full source references are in §3.4 34-subjects . Predictions were scored on a 1-5 rubric where the integer anchors mark categorical shifts in answer quality; the full rubric and verbatim judge prompt are in §3.3 33-scoring-rubric-with-calibrated-llm-judge-panel . For example, a move from 1.8 to 2.4 crosses the 2.0 boundary: the model goes from refusing the question or producing an off-base answer anchor 1 to engaging with the question even when the prediction is still wrong anchor 2 . Absolute point gains, not percentages, are the informative metric for cross-subject comparison. | Score | Anchor verbatim | Shift from previous anchor | |---|---|---| 1 | Refuses or off-base | rubric floor No usable answer: the model either declined 93% of score-1 responses or engaged with the wrong subject 7% | 2 | Wrong prediction | From no usable answer to engaging with the right subject | 3 | Right domain wrong outcome | From wrong prediction to the right domain | 4 | General direction correct | From right domain to the general direction of the held-out | 5 | Predicts specific outcome | From general direction to the specific outcome documented in the held-out | Score interpretation, including the cross-anchor rule for fractional scores e.g., 2.5, 3.4 , is in §3.3.1 331-score-interpretation . Example questions per subject and panel composition are in §3.3.2 332-judge-panel . The baseline we refer to throughout is the No-Context Baseline C5 : the response model's score with no external information. Low-baseline subjects are the population of relevance : people the model has insignificant pretraining understanding of, even when fragments of their digital footprint exist in training data. High-baseline subjects are the opposite, people the model already knows about from pretraining e.g., Benjamin Franklin, included in this study as a known-figure reference, §4.1.2 412-the-gradient-at-the-high-baseline-end-franklin-reference . Of the 14 main-study subjects, 9 are low-baseline C5 ≤ 2.0 and 5 are mid-baseline 2.0 < C5 ≤ 3.0 ; Franklin is the single high-baseline reference C5 3.0 and is not part of the 14. The low-baseline band is plausibly the default case for most users of frontier systems: even people with substantial public output captured in training corpora have only fragments of their reasoning represented. The study suggests the low-baseline band is the norm rather than the exception. 4 Results are reported separately on the low-baseline band n=9 alongside the full 14-subject analysis. The study is structured into two tiers. Tier 1 main study uses Claude Haiku 4.5 as the response model across all 14 subjects on every condition. Tier 2 is a smaller cross-provider directional probe §3.6 36-response-models , §4.6.1 461-cross-provider-response-generation-tier-2-replication . The 7-judge panel spans three providers; the 5-judge primary aggregate is composed of the Anthropic Claude and OpenAI GPT judges, and the 2 Gemini judges are reported as a sensitivity check on calibration grounds §3.3.3 333-calibration . Together these hypotheses test whether a Behavioral Specification can increase a language model's representational accuracy of a specific person. 1.3 What we found 13-what-we-found An interpretive layer, operationalized as a Behavioral Specification Spec , lifts representational accuracy. Its benefit is largest where the model knows the person least, and the mechanism is per-question. On questions where the model needs an interpretive frame and lacks one, a Behavioral Specification categorically improves the answer produced. On questions where the model already has the answer, a Behavioral Specification adds nothing and sometimes hurts. 5 What follows are seven findings, beginning with the cross-subject gradient primary outcome and the per-question mechanism beneath it. Prediction on held-out reasoning is what we measure; whether higher representational accuracy translates into aligned action is the downstream claim developed in §5 5-discussion and §7 7-future-work . Headline findings. Gradient. A Behavioral Specification's benefit is largest where the model knows the person least. Every one of the 9 low-baseline subjects improved when the Behavioral Specification was added on top of All Facts C4a ; per-subject mean lift +0.89 points on the 1-5 rubric over the No-Context Baseline n=9 , with 78.6% of individual questions improving 351 paired questions . 6 user-content-fn-statsig The gradient comes from the per-question mechanism described next: low-baseline subjects have more questions where the model lacks an interpretive frame, and therefore more questions where the Behavioral Specification lifts. Detail in 7 user-content-fn-conservative-lift §4.1 41-the-cross-subject-gradient-and-its-per-question-mechanism . Per-question interpretive lift. The Behavioral Specification moves 55% of low-baseline questions across at least one rubric anchor upward; 18% cross two or more. Crossing one rubric anchor moves a response from "wrong prediction" to "general direction correct." Crossing two or more anchors is a bigger jump: a single question where the model moves from refusal or generic guessing to a recognizable, person-specific response. 5.7% cross three or more anchors 20 of 351 paired low-baseline questions . The pattern holds across Spec Only C2a , All Facts + Spec C4a , and Corpus + Spec C9 conditions on the low-baseline subjects. Detail in §4.1 41-the-cross-subject-gradient-and-its-per-question-mechanism , §4.2 42-compression-structure-vs-raw-text . Compression. The Behavioral Specification recovers 75% of what the raw corpus delivers at ~25× less context on average per-subject range 7× to 79× . A 7,000-token structured representation matches most of the predictive accuracy of a 163,000-token raw corpus. A Behavioral Specification selects and structures the behavioral signal; the interpretive layer drives the result, not the volume of context. Spec Only +0.68 vs. raw corpus +0.91 over baseline.On Hamerton smallest corpus tested , the Behavioral Specification scores higher than the raw corpus 2.63 vs. 2.27 . Detail in 8 user-content-fn-compression-grain §4.2 42-compression-structure-vs-raw-text . Content specificity. A wrong Behavioral Specification drops accuracy below the No-Context Baseline Δ = −0.25 ; the correct Behavioral Specification lifts accuracy above Δ = +0.35 . What produces the lift is the content of the correct Behavioral Specification for the correct person, not the presence of a structured prompt. Random pairings a wrong Spec assigned by chance to a different subject sometimes still produce predictions that align with the held-out text, suggesting some behavioral patterns transfer across subjects, but the correct Behavioral Specification consistently outperforms. Random-derangement Δ = +0.15. Detail in §4.3 43-mechanism-correct-content-not-format . Memory-system layering. A Behavioral Specification layers cleanly on top of commercial memory systems: it moves 20–36% of individual questions up across a rubric anchor, excelling on interpretation-heavy questions and reducing refusals on questions where retrieved facts could not ground the model; conversely, a Specification negatively affects performance on recall-heavy questions where retrieval already supplied the answer. This per-question structure is what the aggregate hides: some questions improve, some regress, and the balance shifts with retrieval architecture. On aggregate, the Behavioral Specification lifts 3 of 4 commercial memory systems; Mem0, Letta, and Zep show positive mean lift under at least one configuration, Supermemory does not. Detail in §4.4 44-memory-system-composition . Hedging reduction. The Behavioral Specification collapses baseline hedging from 41.2% of responses to 0.4%. The reduction is content-specific, not prompt-driven: under the wrong-Spec adversarial control, the model continues to hedge or explicitly flag the mismatch on 60.6% of responses 9 user-content-fn-hedging §4.3 43-mechanism-correct-content-not-format . The pattern is consistent with the model carrying an implicit evidentiary bar before committing to a behavioral prediction: where the matched Behavioral Specification supplies the interpretive scaffolding that clears the bar combined with All Facts C4 or retrieval , the model commits; where it does not, the model abstains. On low-baseline subjects, where the model would otherwise refuse to engage, the matched Behavioral Specification converts refusal into substantive response. This is the gradient operating at its floor. Detail in §4.1.1 411-per-question-baseline-engagement-and-the-worked-rubric-example abstention versus non-abstention misalignment decomposition , §4.3 43-mechanism-correct-content-not-format wrong-Spec content-specificity , and §4.4.3 443-where-the-spec-helps-where-it-hurts-and-which-question-types-route-to-each evidentiary-bar pattern across memory systems . Retrieval divergence. Given identical input, memory-system providers share zero top-10 facts on 35.9% of question-pairs; mean pairwise overlap is 8.3%. On standard recall benchmarks like LongMemEval and LOCOMO, the four commercial memory systems we tested perform within a few percentage points of each other. Yet on which facts to surface as most relevant, they substantially diverge. Convergence on top-K under identical input would have been evidence of a shared interpretive framework; the systems do not converge. On 65.6% of system pair, question instances they share one or fewer facts. Detail in §4.4.1 441-cross-system-retrieval-providers-do-not-converge . Mechanism: three patterns of interaction with retrieval full development in §4.4.3 443-where-the-spec-helps-where-it-hurts-and-which-question-types-route-to-each . Baseline runs suggest the model already attempts shallow inference from a user's raw data on its own; the Specification makes that inference inspectable and structured. Pattern 1, Interpretation-heavy questions. The Specification supplies a generalized pattern from the source that has to transfer to a new situation; retrieved facts alone are not enough Fukuzawa Q26 . Pattern 2, Literal-recall questions. Retrieval already returns the plain answer; the Specification's interpretive framing drifts past the question and negatively impacts the response Yung Wing Q5 . Pattern 3, Refusal-triggering questions. When the Spec supports refusing without enough information not all specs do , the model produces principled refusals aligned with the Spec; the content-match rubric still scores them as off-base Zitkala-Ša Q18 . Robustness across providers. We varied both the question-battery generation model and the response model across providers; the Spec direction reproduces. Detail in §4.6.1 461-cross-provider-response-generation-tier-2-replication . Exploratory note: Letta stateful-agent path. Letta's stateful-agent architecture self-edits a persistent memory block during ingestion. Given that unique architecture, we ran an exploratory case study on it: on 3 subjects post-hoc , it scored above Base Layer's full-stack Behavioral Specification at matched response model. The full case study, the duplication audit, and the architectural-ceiling analysis are in §4.5 45-exploratory-case-study-letta-stateful-agent-n3-post-hoc and Appendix G. 1.4 What this implies 14-what-this-implies Representational accuracy is distinct from recall, and human-AI alignment is dependent on how accurately the user is represented. As AI agents make more decisions on a person's behalf, the infrastructure required to serve that alignment is user-held, portable, inspectable, traceable, and representation-grade. A Behavioral Specification is one implementation of this infrastructure, not the only one. The gap it fills cannot be closed by training a larger model on more public data. The private record does not exist in a form a training corpus can capture; even where fragments exist, they are scattered across formats and channels and cannot be reliably reassembled into how a specific person reasons. AI is becoming a broadly used technology, comparable to email or mobile phones in reach, but different in the amount of cognition it carries out on a person's behalf. The population of relevance §1.2 12-what-we-tested is anyone who uses or will use an AI system. Even the subjects of the autobiographies in this study, people whose work is in pretraining and who should technically be known to the model, score near the rubric floor in the No-Context Baseline. The structural options for what fills the gap are narrow: - Each person supplies their own representation to whatever AI system serves them. A Behavioral Specification is one such implementation. - Personalization remains surface-level style, voice, preference, demographic inference, observable behavior , addressing the layer current memory systems already cover but missing the interpretive framework that lets an agent act on a specific person's behalf. - AI systems infer a representation of the user from observed interactions, building it opaquely, without explicit input from the user or the ability for the user to inspect or correct it. §5 5-discussion is an extended discussion of these implications; §7 7-future-work develops the safety, alignment, and deployment implications. 2. Prior Work, Industry Benchmarks, The Fifth Target 2-prior-work-industry-benchmarks-the-fifth-target AI memory and personalization research today is organized around four measurement targets: recall of stored facts, survey-response prediction, persona fidelity, and preference alignment. Each is supported by its own benchmark family and its own line of system design. They do not measure whether an AI system has an accurate internal model of how a specific person reasons. This paper proposes a fifth target, representational accuracy , and uses behavioral prediction on held-out reasoning situations as its operational test. §2 2-prior-work-industry-benchmarks-the-fifth-target walks the four existing targets, names the benchmarks attached to each, and positions the fifth alongside them. Memory systems today optimize for recall. Recall-optimized efforts include both neural-memory-analogue systems 10 and the broader class of vector-retrieval and embeddings-based commercial memory providers Mem0, Zep, Supermemory, Letta . These systems do store and retrieve information for a specific user, but they are designed and benchmarked for recall accuracy on standard benchmarks, not how accurately the system represents that user's reasoning. The optimization target is general by construction; any individual user's interpretation is not what these systems are measured against. A separate body of research, cognitive-representation research , studies human reasoning itself: how people form representations of others, how schemas compress experience. The gap between these directions is the translation: applying what we know about human reasoning to the direct interaction between an AI system and a specific individual, and shaping the system's internal model of that individual in a way that serves them rather than serving an average. Language models are trained to produce responses that are helpful on average across a large population of users. That optimization target produces outputs that no single user is the reference point for. Personalization requires the opposite property: a system whose outputs are tuned to a specific individual rather than to a population aggregate. That kind of intentional individual-specificity, not "bias" in the negative sense but an explicit design target, is the missing thread in current AI memory and human-AI interaction research. Personalization in this paper's sense. "Personalization" in current AI research typically means responsiveness to stated preferences dietary restrictions, communication style or stored facts about the user location, occupation, history . Both are useful and both live at the surface of the user. We use "personalization" in a stronger sense throughout this paper. We mean representing the interpretive layer that sits beneath stated preferences and biographical facts: how a specific person organizes experience, what they treat as evidence, what reasoning patterns they apply across new situations. Preferences and facts are downstream artifacts of that interpretive layer; the layer itself is what produces them. The behavioral prediction battery and Behavioral Specification described in §3 3-study-design instantiate personalization in this deeper sense, and §5 5-discussion returns to what this layer is and is not. 2.1 Prior measurement targets and the gap representational accuracy fills 21-prior-measurement-targets-and-the-gap-representational-accuracy-fills This subsection walks each of the four existing targets, naming their attached benchmarks and scopes. Representational accuracy is positioned as the fifth target at the end of the walk. An extended benchmark-by-benchmark analysis is in Appendix F. Recall measures retrievability of facts, not reasoning about them. LOCOMO Maharana et al., ACL 2024, arXiv:2402.17753 measures conversational-memory quality: after a multi-session conversation, the system is asked questions like "what did the user say about their job on day 3?" and scored on fact retrieval. LongMemEval Wu et al., ICLR 2025, arXiv:2410.10813 measures long-term memory across multiple sessions on five capability dimensions single-session, multi-session reasoning, temporal reasoning, knowledge updates, abstention and is heavily recall-weighted. A system can saturate recall on such benchmarks and still fail behavioral prediction, because retrieval answers the question "can the fact be found" rather than "does the system know how the person reasons about the fact." Recall is a necessary property for most downstream uses of memory but it is not sufficient for representational accuracy. Survey-response prediction infers how a person would answer one questionnaire item from how they answered others. Twin-2K Toubia et al., 2025, arXiv:2505.17479 does this for 2,058 participants on a 17-task heuristics-and-biases battery; items share a common format multiple choice, Likert scale, numeric , scored by distance-based accuracy. Twin-2K's stated target is prediction accuracy on survey interpolation : the model is scored on how well it predicts a held-out questionnaire response, not on whether it represents the underlying reasoning that produced the response. Our target is representational accuracy on a cross-format task: autobiographical prose input, open-ended behavioral prediction output, rubric-based scoring against a verbatim held-out passage. The structured-questionnaire format and the open-ended behavioral reasoning this paper studies measure different properties. A system could perform well on Twin-2K and not on our battery survey interpolation does not require modeling reasoning transfer to new contexts , and a system could perform well on our battery and not on Twin-2K accurate reasoning representation does not guarantee survey-format numerical accuracy . The two benchmarks diagnose different properties of the same general capability. Persona fidelity measures whether a model stays in character across the back-and-forth of a conversation. 11 PersonaGym Samuel et al., Findings of EMNLP 2025, arXiv:2407.18416 scores consistency with a described persona during conversation: given a one-line persona "You are a 45-year-old skeptical accountant from Toronto" , the model is scored on whether its multi-turn replies stay in-character, graded against a held-out criterion set. In practice the model is checked for consistency with the persona's surface attributes skeptical, accountant-flavored responses; not breaking character into a different age or profession , not for whether it reproduces how a specific person would reason. PersonaGym's one-line descriptor is a substantially shallower input than this paper's ~7,000-token Behavioral Specification or Twin-2K's full-text survey persona; consistency with it does not require modeling that person's reasoning on new situations. PersonaGym measures a useful property holding voice over a conversation ; fidelity to a one-line persona is a weaker condition than representational accuracy. 12 user-content-fn-twin2k-persona-size Preference alignment measures whether responses match user preferences. AlpsBench Xiao et al., 2026, arXiv:2603.26680 evaluates whether explicit memory mechanisms improve preference-aligned and emotionally resonant responses: after ingesting a user profile, the model is asked conversational questions preferences, emotional support and responses are scored on preference alignment and emotional resonance rubrics, not on predictive accuracy. Their central finding, that recall improvement does not automatically carry into preference alignment , is arrived at independently and is complementary to this paper. Both papers point at the same gap from different sides: solving for recall is insufficient for what memory is ultimately for. Preference alignment is an outcome property whether a response matches what the user prefers . Representational accuracy is an upstream property whether the AI's internal model of the user is correct . Preference alignment is one downstream consequence of representational accuracy being correct; it is not the same property. We propose behavioral prediction on held-out reasoning situations as a test of a fifth target: representational accuracy. Prediction is the test, not the goal. We do not pursue prediction accuracy as an end in itself. The target is representational accuracy, the fidelity of an AI's internal model of a specific person, and behavioral prediction on unseen situations is the instrument we use to measure it. A prediction score tells us the representation captured something that generalizes to new situations; a low score tells us it did not. Prediction is a diagnostic; the Behavioral Specification is what this paper is testing. The held-out design rests on a stability premise. A person's interpretive patterns must be stable enough within their own corpus that what is captured from one half references what appears in the other. Without that, held-out behavioral prediction is impossible in principle, regardless of how good the representation is. The 14 main-study subjects have coherent autobiographical narratives consistent with the premise; §4.1 41-the-cross-subject-gradient-and-its-per-question-mechanism reports that the Behavioral Specification authored from training text generalizes to held-out text at above-baseline rates. Subjects whose reasoning shifts substantially across their corpus across a major career change, a profound life event, or a decades-long corpus with distinct epochs may not be well-represented by a single snapshot specification, which is one reason temporality is a flagged follow-up in §7 7-future-work . The missing axis is representational accuracy itself. Each existing benchmark family measures a real property of memory systems, and each is useful for its own target. What is missing is an axis that measures how accurately the memory system represents the person whose behavior it is meant to anticipate. This paper's approach is a prototype answer on that axis, not a finished benchmark. §7 7-future-work flags a differentiated rubric one that separates interpretation-heavy from literal-recall questions, and scores epistemic honesty as its own dimension as the priority follow-up for turning this prototype into a standardized benchmark. A single number does not capture a memory system's full capability. Recall, survey-response prediction, persona fidelity, preference alignment, and representational accuracy are distinct axes. A system that saturates one may do nothing on another. Production-grade evaluation of memory systems should report results on multiple axes rather than on any single one. 2.2 Memory systems for LLM agents 22-memory-systems-for-llm-agents The four commercial memory systems we evaluate Mem0, Letta, Supermemory, Zep have converged on a shared set of capabilities: semantic retrieval over embedded content, source attribution, multi-level memory structures, and benchmark-validated recall performance. They differ in how each of these is architected. None positions representational accuracy or behavioral prediction of a specific individual as a design target. Table 2.1. Memory system comparison. Verified against primary sources. | Provider | Core architecture | Retrieval method | Memory types | Published recall score | |---|---|---|---|---| Mem0 | Extract → consolidate → retrieve pipeline; Mem0g graph variant adds a directed labeled knowledge graph alongside the vector store | Hybrid: semantic + keyword + entity | Conversation, session, user, organizational | 91.6 LOCOMO, 93.4 LongMemEval current algorithm | Letta / MemGPT | LLM-as-operating-system; virtual context management with main context plus external context | Archival via archival memory search ; main-context memory blocks self-edited via core memory append , core memory replace | persona and human blocks in main context; archival and recall memory external | 74.0% on LOCOMO with GPT-4o-mini | Supermemory | Five-component architecture: chunk-based ingestion, relational versioning, temporal grounding, hybrid search, session-based ingestion | Hybrid with reranking and query rewriting; source chunks injected at retrieval | Contextual memories, relational versions, session data | 81.6% / 84.6% / 85.2% on LongMemEval s with GPT-4o / GPT-5 / Gemini-3-Pro self-reported | Zep | Built on Graphiti Apache 2.0, open source . Bi-temporal knowledge graph | Hybrid: semantic + BM25 + graph traversal | Episodes ground-truth source , Entities, Facts-as-triplets with temporal validity windows | 71.2% on LongMemEval with GPT-4o | All four systems report recall scores in the 70-93% range; on the standard recall benchmarks, scores are localizing in the 80% range and recall is no longer as much a frontier issue as it used to be. 16 All four are sophisticated systems that solve real problems in memory management. They optimize for storing, organizing, and retrieving what a person said or did. Of the four systems, Letta the commercial productization of the MemGPT stateful-agent architecture, Packer et al., 2023, arXiv:2310.08560 is architecturally distinct: it is the only one whose core architecture treats memory as something an agent synthesizes during conversation rather than stores for later retrieval. 17 This stateful-agent design is examined separately as a post-hoc case study in §4.5 45-exploratory-case-study-letta-stateful-agent-n3-post-hoc full case study in Appendix G , distinct from the archival-retrieval path Letta exposes for the main-study conditions. A Behavioral Specification targets the interpretive layer that sits above retrieval, which three of the four Mem0, Supermemory, Zep do not model at all, and which the fourth Letta models implicitly through agent-initiated memory editing that our main-study configuration did not exercise see §4.3 43-mechanism-correct-content-not-format and §4.5 45-exploratory-case-study-letta-stateful-agent-n3-post-hoc . 2.3 Traceability and reasoning traces 23-traceability-and-reasoning-traces Traceability operates at two levels. Fact-level traceability answers where a retrieved claim came from. Reasoning-level traceability answers why the system believes this about this person. The four memory systems we evaluate provide the first; representing how a person reasons requires the second. Representational accuracy operationalizes interpretation, and interpretation cannot be verified at the fact level alone. A system that represents how a person reasons must be auditable by that person, or the representation is a black box they cannot verify. Zep has the strongest explicit fact-level provenance of the four: every entity and relationship traces back to the episode IDs that produced it. Supermemory returns source chunks alongside retrieved memories. Mem0 tracks ingestion provenance through timestamps. Letta exposes agent state and memory-block edit history rather than fact-level provenance. The Behavioral Specification, as implemented in this study's reference pipeline, is structured so that every claim it expresses is a piece of reasoning, not just a piece of content. An interpretive pattern is grounded in the facts that imply it, and each fact is grounded in the source passage it was extracted from. Walking this chain backward gives a user a reasoning trace: not only where a belief originated, but what line of reasoning connects the source text to the interpretive claim. A reasoning trace operates in four steps: - The user reads the system's output a prediction, an analysis, a stated pattern . - The user identifies the interpretive claim or phrase they want to audit. - They walk from that claim to the pattern statement in the system's representation of the person that licensed it. - They walk from the pattern statement to the facts that ground it, and from each fact to the source passage it was extracted from. The chain is bidirectional. From the response, the user walks down to the source. From the source, they walk back up to the response. If a fact misrepresents its source, correcting the fact propagates upward: the pattern statement that depended on it changes, and the response that depended on the pattern changes the next time the system is queried. A worked example walking the chain on a single Sunity Devee question is in Appendix D.6. The four commercial memory systems can answer "where is this fact stored?" but cannot answer "why does the system believe this about this person?" — because their architectures treat facts, not interpretations, as the unit of storage. A representation that acts on a person's behalf must be auditable at the interpretation level by that person, or the representation is a black box. Reasoning-level traceability is what makes that auditability operational. A person should be able to inspect the system's model of them, challenge any step in the reasoning, and correct it if it is wrong. A fact-attribution memory system lets the person audit what the system stores. A reasoning-trace specification lets the person audit what the system believes. The first is a feature. The second is the minimum bar for a representation that acts on someone's behalf. 2.4 Cognitive and representational foundations 24-cognitive-and-representational-foundations Six prior research directions shaped how we designed this paper's test. Each motivates a specific choice about what to measure, what to compare against, or what failure mode to expect. Bartlett 1932 established that human memory is reconstructive and schema-driven rather than literal playback. Reconstruction follows the organizing structures a person has built up over time, not a record of the original event. A Behavioral Specification is computationally analogous: a structured compression meant to carry the signal of a person's reasoning without storing every fact about them. We designed the Specification with a schema-like architecture anchors, core, predictions precisely so we could test whether it does the work a human schema does: enable accurate anticipation of behavior in situations never encountered in the source data. Our 50/50 train/held-out split is the experimental realization of this question. Hinton et al. 2015 showed that compressing a large neural network into a smaller one preserves "dark knowledge," the relationships between outputs that carry more information than the outputs themselves. This result motivates one of our central experimental comparisons: on matched token budgets, does a compressed interpretive artifact carry more predictive signal than the raw content it was derived from? The Hamerton condition in §4.2 42-compression-structure-vs-raw-text 4,500-token Spec vs. 33,000-token training corpus at 2.63 vs. 2.27 on the 5-judge primary panel is a direct test of that question in the personal-representation setting. Chen et al. 2025 Chen, Arditi, Sleight, Evans, Lindsey; arXiv:2507.21509 show that the character a model takes on its "persona" is encoded in specific directions inside the model's internal numeric state persona vectors , and that those directions can be identified, monitored, and nudged to shift the model's behavior in predictable ways. Their approach modifies the model; ours informs the model from outside via context. Both validate that persona is a real, manipulable structure: one reachable through the model's internals, the other through context. We chose the context route because it produces a portable artifact users can own and audit, which internal activation steering does not. This choice shows up in the experiment as using a static response model Haiku served a variable context, rather than a fine-tuned or activation-steered model. Jiang et al. COLM 2025, arXiv:2504.14225 find that frontier models achieve only ~50% accuracy on dynamic user profiling tasks even with full conversation access. The paper documents the failure empirically; our reading is that the cause is the gap between having facts and having the interpretive structure to apply them to new situations. Jiang's paper is the most direct existing evidence for the gap this paper studies, and our test design inherits from it: behavioral prediction on scenarios drawn from held-out text that the model has not seen, with all relevant facts retrievable, measures exactly the interpretive-application gap. Jain et al. 2025, arXiv:2509.12517 find that adding conversation context to LLMs makes them more sycophantic: more likely to agree with the user even when the user is wrong, more likely to adopt the user's perspective on a question. Their result shows that context without the right structure pushes the model toward what the user appears to want rather than toward a grounded answer. This is why our experiment includes a wrong-Spec control §1.3 13-what-we-found Mechanism : we hand the model a Behavioral Specification that does not match the actual subject. If models drifted purely toward whatever context they are given, the wrong-Spec should behave like any other structured prompt. Instead the model flags the mismatch or attempts a low-quality application, neither of which is sycophantic drift per-condition rates in §4.3 43-mechanism-correct-content-not-format . Jain's finding plus our wrong-Spec result bracket the question from both sides: context shape matters Jain , and content matters too. Lu et al. 2026, arXiv:2601.10387 identify what they call the Assistant Axis: a dominant internal direction that anchors assistant models' default behavior toward generic helpfulness and harmlessness. A Behavioral Specification can be read as an external override to the Assistant Axis on a per-user basis: a structured anchor that shifts the model from "generic helpful assistant" toward "reasons as this specific person would reason." This framing motivated our choice to measure hedging as a primary outcome alongside accuracy: if the Spec shifts the model off the generic Assistant Axis, the behavioral change should show up both in what the model predicts and in what it is willing to commit to. Our hedging-reduction finding §1.3 13-what-we-found Mechanism, §4.3 43-mechanism-correct-content-not-format is consistent with this reading. 3. Study Design 3-study-design The experimental strategy holds the response model constant and varies the context conditions: nothing pretraining only , retrieved facts, raw corpus, a Behavioral Specification, or combinations of those. This lets us evaluate each context condition for representational accuracy against the No-Context Baseline, while isolating the contribution of the interpretive layer across all of them, separate from model capability, provider, or fine-tuning regime. All measurement choices are reflected in §4 4-results , and the statistical commitments were pre-locked before final analysis. The apparatus is described in seven parts: §3.1 31-operationalizing-representational-accuracy-via-the-behavioral-specification establishes the property being measured; §3.2 32-experimental-conditions specifies the experimental conditions; §3.3 33-scoring-rubric-with-calibrated-llm-judge-panel defines the scoring rubric and the calibrated LLM judge panel; §3.4 34-subjects introduces the subjects; §3.5 35-question-battery-formation covers the question batteries and circularity controls; §3.6 36-response-models names the response models; §3.7 37-pipeline-for-the-behavioral-specification describes the pipeline that produces the Behavioral Specification the reference implementation of the interpretive-layer artifact tested in this study . 3.1 Operationalizing representational accuracy via the Behavioral Specification 31-operationalizing-representational-accuracy-via-the-behavioral-specification Representational accuracy is the AI-side property: how faithfully a model's internal model of a specific person captures that person's reasoning patterns. It is a property of the AI system, not of any specific operationalization. Multiple routes can in principle produce it model fine-tuning, persona-vector steering, retrieval over structured facts . This paper operationalizes one such route: a Behavioral Specification served as context to a static response model, tested for whether an interpretive layer can increase representational accuracy relative to the other context conditions in §3.2 32-experimental-conditions . The instrument we use to measure representational accuracy is behavioral prediction on held-out situations. Held-out passages from the subject's autobiography serve as samples of situations the model has not seen. The model is asked to predict how the subject would respond; the response is scored against the verbatim held-out passage. Prediction here is the test, not the goal: §2.1 21-prior-measurement-targets-and-the-gap-representational-accuracy-fills develops this distinction. Three things have to hold for a served Behavioral Specification to register a positive score. Each is a property of the Spec or the serving step, not of representational accuracy itself: - The person has behavioral patterns consistent enough to be captured in a Behavioral Specification. - The Behavioral Specification actually carries that signal. - A model given the Behavioral Specification can act on it. Prediction on held-out situations tests all three at once. When the score is low, one of these three is failing: the patterns are not consistent, the Spec is wrong, or the model is not using the Spec well. Each failure mode is informative. We do not claim to modify the model's internal parameters. Each condition in §3.2 32-experimental-conditions varies what is served to the model at inference time: nothing, retrieved facts, the full extracted fact set, the raw source corpus, a Behavioral Specification, or combinations of these. The model's resulting prediction is what we score against the verbatim held-out passage. The No-Context Baseline isolates the model's pretrained representation of the subject. The fact and corpus conditions isolate what the model can infer from raw information at runtime. The Behavioral Specification isolates what a structured interpretive layer adds on top. The study reports each. In practice, we record representational accuracy as the mean predicted-behavior score 1-5 scale across each subject's behavioral-prediction battery, aggregated across the calibrated judge panel; the rubric, panel composition, and aggregation rule are defined in §3.3 33-scoring-rubric-with-calibrated-llm-judge-panel , and the guide to interpreting fractional scores and anchor crossings is in §3.3.1 331-score-interpretation . 3.2 Experimental conditions 32-experimental-conditions Each condition is a specific combination of inputs served to the response model §3.6 36-response-models against the same behavioral battery §3.5 35-question-battery-formation . Every condition is run on all 14 subjects §3.4 34-subjects . The Behavioral Specification and the extracted fact set used across the conditions below are produced by the pipeline in §3.7 37-pipeline-for-the-behavioral-specification . The conditions separate into two groups, summarized in the table below and broken out in detail after. All conditions, by group. | Group | ID | Condition | Inputs served | |---|---|---|---| | Direct context manipulations | C5 | No-Context Baseline | No context beyond the question | | C2a | Spec Only | The Behavioral Specification | | | C2c | Wrong Spec | A different subject's Spec | | | C4 | All Facts | The full extracted fact set | | | C4a | All Facts + Spec | Full facts plus the Spec | | | C8 | Raw Corpus | Full training corpus half the source text | | | C9 | Raw Corpus + Spec | Training corpus plus the Spec | | | Memory-system configurations controlled, all 5 systems | C1 | Retrieval Controlled | Top-k facts shared fact pool | | C3 | Retrieval Controlled + Spec | Top-k facts + Spec shared fact pool | | | Memory-system configurations native, 4 commercial systems | C1 native | Retrieval Native | System's own ingestion | | C3 native | Retrieval Native + Spec | System's own ingestion + Spec | Direct context manipulations. We specify the model's input directly: no context baseline , the Behavioral Specification, the extracted fact set, the raw corpus, or combinations of these. No retrieval step intervenes. Each condition isolates what one input type or combination contributes. | ID | Condition | Inputs served | Null / comparison | |---|---|---|---| | C5 | No-Context Baseline | Nothing beyond the question | Pretraining-only floor | | C2a | Spec Only | The Behavioral Specification | Isolates the Spec's contribution | | C2c | Wrong Spec | A random other subject's Spec | Tests whether structured interpretive content, not the correct content, produces the effect | | C4 | All Facts | The full extracted fact set for the subject | Tests whether raw information volume substitutes for structure | | C4a | All Facts + Spec | Full facts plus the Spec | Tests whether the Spec adds value on top of raw facts | | C8 | Raw Corpus | Full training corpus half the source text | Tests whether uncompressed source text substitutes for structure | | C9 | Raw Corpus + Spec | Training corpus plus the Spec | Tests whether the Spec adds value on top of raw source | Memory-system configurations. Retrieval is performed by a memory-system provider's production deployment. The memory-system conditions run in two modes: a controlled mode each system retrieves from an identical pre-extracted fact set and a native mode each system ingests the raw corpus through its own pipeline . Five memory systems are evaluated: Mem0, Letta, Supermemory, and Zep, plus Base Layer as our own open-source reference implementation pipeline detail in §3.7 37-pipeline-for-the-behavioral-specification . Architectural detail for the four commercial systems is in §2.2 22-memory-systems-for-llm-agents Table 2.1. Controlled configuration all 5 systems . Each system retrieves from an identical pre-extracted fact set, isolating retrieval-algorithm differences from ingestion-pipeline differences. | ID | Configuration | Inputs served | |---|---|---| | C1 | Retrieval Controlled | Top-k facts returned by the system for the question controlled fact pool | | C3 | Retrieval Controlled + Spec | Top-k retrieval output plus the Behavioral Specification controlled fact pool | Native configuration 4 commercial systems . Each system ingests the raw training corpus through its own pipeline, reflecting real-world deployment. C1 Retrieval Only and C3 retrieval + Spec are run with the system's own ingestion pipeline replacing the controlled fact pool. | System | Conditions | Ingestion Process | |---|---|---| | Mem0 | Mem0 Retrieval Native , Mem0 Retrieval Native + Spec | Mem0 reads the corpus and decides on its own how to break it up, extract facts, and retrieve them. | | Letta archival | Letta Retrieval Native , Letta Retrieval Native + Spec | The corpus is loaded into Letta's archival memory store; retrievals happen at query time from that store. | | Supermemory | Supermemory Retrieval Native , Supermemory Retrieval Native + Spec | Supermemory chunks the corpus on its own and applies reranking to the retrieved chunks at query time. | | Zep | Zep Retrieval Native , Zep Retrieval Native + Spec | Zep ingests the corpus into a knowledge graph via Graphiti and retrieves using a hybrid of semantic similarity, keyword match, and graph traversal. | Both configurations are reported so retrieval-quality differences and ingestion-pipeline differences can be read separately. Base Layer is run in the controlled configuration only: its retrieval uses the same fact set that feeds the Spec pipeline. Detailed per-condition parameters, exclusion cases, and ingestion specifics are in Appendix C. 20 user-content-fn-conditions-data 3.3 Scoring rubric with calibrated LLM judge panel 33-scoring-rubric-with-calibrated-llm-judge-panel Every response is scored 1-5 by an LLM judge panel against the verbatim held-out ground-truth passage. The primary aggregate uses five judges; two additional judges contribute to the sensitivity check §3.3.2 332-judge-panel . Human annotation at this scale is feasible ~14 subjects × 40 questions × 15+ conditions but was not done; running more conditions and more judges instead was the central evaluation trade-off. LLM-as-judge is grounded in prior work showing strong correlation with human raters and reliability gains from panel aggregation Zheng et al. 2023; Verga et al. 2024 . Extending the panel with human annotators on a stratified subset is flagged as the priority measurement follow-up in §7.1 71-subject-and-corpus-expansion . The rubric and panel are a prototype scoring instrument, not a settled standard; §3.3.6 336-rubric-handling-limitations-post-hoc-validity-audit audits how the rubric behaves in practice. The evaluation is deliberately recursive. Response models are evaluated by judges §3.3.2 332-judge-panel . Judges are evaluated by calibration diagnostics §3.3.3 333-calibration , inter-judge agreement metrics §3.3.4 334-inter-judge-agreement , and post-hoc rubric-handling audits §3.3.6 336-rubric-handling-limitations-post-hoc-validity-audit . No single layer is treated as ground truth; each layer's behavior is itself measured and disclosed, and where a layer's behavior diverges from what the rubric intends, the divergence is flagged rather than corrected silently. In the absence of human annotation, the paper strengthens rigor through this stacked-instrument structure, not by trusting any one step. Scoring rubric. The following is the verbatim prompt the judges executed canonical for both the 5-judge primary panel and the 7-judge sensitivity panel; source: scripts/judge hamerton 5judge.py : You are evaluating whether a response about a person PREDICTED what actually happened. === HELD-OUT GROUND TRUTH === {held out} === RESPONSE === {response text :1500 } Rate 1-5: 5=Predicts specific outcome 4=General direction correct 3=Right domain wrong outcome 2=Wrong prediction 1=Refuses or off-base Respond with ONLY a single digit 1-5 . The prompt is deliberately spare. It frames the task in one sentence, shows the held-out passage and the response, and asks for a single digit against five anchor labels. There are no calibration examples in the prompt, no scoring guide, and no chain-of-thought scaffolding. The anchors themselves are directive refuses or off-base, wrong prediction, right domain wrong outcome, general direction correct, predicts specific outcome ; resolution at the boundaries between adjacent anchors is left to the judge model. The design choice is to test whether frontier models converge on the behavioral-prediction construct from minimal instruction. §3.3.4 334-inter-judge-agreement reports the inter-judge agreement obtained under this design. Each response is scored against the verbatim held-out passage from which the question is drawn; the score reflects how closely the response predicts the documented behavior. Battery composition is detailed in §3.5 35-question-battery-formation . Condition identifiers C5, C2a, C4a, C3 refer to the conditions defined in §3.2 32-experimental-conditions and summarized in Appendix C; rubric anchor numbers 1 through 5 refer to the verbatim anchors above. Score interpretation against the construct is developed in §3.3.1 331-score-interpretation with worked examples. A worked rubric example alongside the No-Context Baseline engagement analysis is in §4.1.1 411-per-question-baseline-engagement-and-the-worked-rubric-example ; full per-subject score distributions with verbatim responses are in Appendix D. 3.3.1 Score interpretation 331-score-interpretation Scores are read at three related granularities: integer rubric anchors 1 through 5 , fractional means produced by averaging across the 5-judge primary panel e.g., 2.87, 3.12, 2.34 , and crossings between integer anchors when conditions change. Fractional shifts should be read through the integer anchors, because each anchor corresponds to a categorical shift in response quality. The cross-anchor interpretation rule. A fractional delta that crosses an integer anchor reflects a real shift in the underlying response distribution. A delta that stays inside a single anchor is a within-category shift and a weaker claim. The verbatim anchors describe stages of prediction quality. The construct reading column below gives the plain-language meaning of each crossing: what actually changes in the response when a condition moves the score across that boundary. The paper reads each crossing as a categorical shift in the construct the rubric is designed to measure: alignment of the response with the held-out behavioral pattern. | Boundary crossed | Verbatim anchor shift | Construct reading | |---|---|---| | 1 / 2 | Refuses → wrong prediction | The model stops declining and starts engaging with the question, even if the prediction is wrong. | | 2 / 3 | Wrong prediction → right domain wrong outcome | The answer becomes specifically about this subject rather than a generic stand-in, even if the predicted outcome is still off. | | 3 / 4 | Right domain → general direction correct | The answer gets the general direction of the subject's behavior right, not just the topic. | | 4 / 5 | General direction → specific outcome | The answer gets the specific outcome right, matching the particular behavioral pattern in the held-out passage. | Here, domain means the area of behavior a question concerns how the subject handles conflict, makes decisions, treats authority, and so on ; a response is in the right domain when it engages the correct behavioral area, even when the specific predicted outcome is wrong. The boundary crossings are illustrated through worked examples in §4.1 41-the-cross-subject-gradient-and-its-per-question-mechanism Examples A, B, C and §4.1.1 411-per-question-baseline-engagement-and-the-worked-rubric-example , each drawn directly from experiment output that feeds the §4 4-results aggregate. Example A in §4.1 41-the-cross-subject-gradient-and-its-per-question-mechanism illustrates a 1 → 4 crossing wrong-referent correction ; Example B illustrates a 2 → 5 crossing directional correction ; Example C illustrates a 2.80 → 5 crossing abstention to substantive inference . Each example shows the verbatim question, held-out, and response excerpts paired with the cached judge score for that response. 21 user-content-fn-boundary-illustration-sourcing What a 1 means and does not mean. A score of 1 reflects a baseline failure to produce a usable prediction about the named subject: the response either explicitly declined to predict abstention or engaged with the question but landed on a categorically incorrect answer non-abstention misalignment, including wrong referent, off-base inference, or confusion with a different subject . It is not a claim that the response was non-fluent or empty, and it is not a claim that the model lacks any related knowledge; the score reflects only that the response failed the held-out comparison. Each question tests one behavioral sample at a time; the aggregate fraction of score-1 responses across roughly 40 questions per subject is what the paper reads as the per-subject baseline-failure rate. How score-1 responses split between explicit abstention and non-abstention misalignment is analyzed in §4.1.1 411-per-question-baseline-engagement-and-the-worked-rubric-example . What a 5 means and does not mean. A score of 5 reflects alignment with one specific behavioral sample: the held-out ground-truth passage the question is drawn from. It is not a claim that the response fully represents the subject in some absolute sense, and it is not a claim that the same response would score 5 on a different held-out passage from the same subject. Each question tests one behavioral sample at a time; the aggregate across roughly 40 questions per subject is what the paper reads as the subject-level score. As with a 1, a score is only treated as reliable where the judge panel converges on it: interrater agreement §3.3.4 334-inter-judge-agreement is the paper's signal of consensus, not any single judge's number. Multi-anchor crossings: the strongest categorical signal the rubric detects. A multi-anchor crossing is a single question whose 5-judge primary mean shifts across two or more integer rubric anchors when the condition changes. Crossings can span two anchors e.g., 1 → 3, 2 → 4 or, more rarely, three e.g., 1 → 4, 2 → 5 . Larger crossings indicate larger categorical jumps in the same response, with five independent judges converging on the move. §4.2 42-compression-structure-vs-raw-text reports the rates of these crossings and the response-level phenomena that produce them; worked examples are in §4.1.1 411-per-question-baseline-engagement-and-the-worked-rubric-example and Appendix E. The paper applies the cross-anchor rule consistently. Score deltas reported in §4 4-results are read through this lens. A +0.50 delta that crosses a rubric anchor is treated as a stronger claim than a +0.50 delta that does not. 22 user-content-fn-anchor-crossing-data Reading scores within integer anchors. The 5-judge primary panel detects within-anchor signals cleanly. 23 Across the 18 condition pairs analyzed, roughly 18% of paired questions show same-anchor fractional shifts of at least 0.5 rubric points a within-category shift, weaker than a cross-anchor crossing per the rule above . The integer metric is used throughout 24 user-content-fn-within-anchor-data §4 4-results for cross-anchor categorical interpretation; the within-anchor signal is reported here as methodological transparency. 3.3.2 Judge panel 332-judge-panel Seven judges from three providers give the numeric aggregate its weight. Zheng et al. 2023 established that a single strong LLM judge correlates with human judges on comparable tasks at rates similar to human-human agreement. Subsequent panel-based work Verga et al. 2024 and follow-ons showed that aggregating multiple LLM judges past a small panel size further tightens agreement and reduces single-model idiosyncrasy. Seven judges across three providers is well past that threshold. | Judge | Provider | |---|---| | Claude Haiku 4.5 | Anthropic | | Claude Sonnet 4.6 | Anthropic | | Claude Opus 4.6 | Anthropic | | GPT-4o | OpenAI | | GPT-5.4 | OpenAI | | Gemini 2.5 Flash | | | Gemini 2.5 Pro | The verbatim judge prompt is shown in §3.3 33-scoring-rubric-with-calibrated-llm-judge-panel and is canonical for both the 5-judge primary panel and the 7-judge sensitivity panel. Each judge receives the held-out ground-truth passage, the subject context name, source , the prediction question, and the response to score, not the condition label or the response-generating model. Judges do not see other judges' scores. Response generation is similarly blinded: the response model receives only the question plus the condition-specific context block, with no signal that distinguishes which experimental condition it is operating under the prompt schema in §3.6 36-response-models is identical across conditions; only the injected context block changes . The specification-effect claim. When a Behavioral Specification is served to the model as context, the model's responses shift in the direction of the subject's demonstrated behavioral patterns, and that shift registers as a measured increase in representational accuracy against held-out passages from the same subject. This is the directional claim the panel is built to test; not a claim that the model has gained a new behavioral-prediction capability, and not a claim that the higher-scoring response is the absolute "correct" answer for the subject. The panel is designed for directionality, not absolute precision. Calibration §3.3.3 333-calibration and inter-judge agreement §3.3.4 334-inter-judge-agreement test how consistently the panel detects the directional shift. 3.3.3 Calibration 333-calibration The calibration diagnostic measures whether each judge applies the rubric anchors as the rubric defines them on synthetic inputs with known correct scores. It does not use any subject's responses. The four inputs are constructed examples: verbatim ground truth, paraphrased ground truth, partial ground truth first sentence only , and ground truth plus generic padding. All seven judges were tested against this diagnostic. 25 Full panel composition with per-judge calibration-status flags is in Appendix C.5. Diagnostic tests. | Test | Input | Expected | What it measures | |---|---|---|---| | Verbatim | Response = ground truth | 5.0 | Recognizes perfect match | | Paraphrased | Correct content, different wording | ~5.0 | Penalizes paraphrase? | | Short correct | First sentence of ground truth | <5.0 | Partial content scored partial? | | Long correct | Ground truth + generic padding | 5.0 | Length inflation effect | Results. | Test | Haiku | Sonnet | Opus | GPT-4o | GPT-5.4 | Gemini Flash | Gemini Pro | |---|---|---|---|---|---|---|---| | Verbatim | 5.00 | 5.00 | 5.00 | 5.00 | 5.00 | 5.00 | 4.15 | | Paraphrased | 4.75 | 5.00 | 5.00 | 5.00 | 5.00 | 4.70 | 3.55 | | Short correct | 3.80 | 4.35 | 4.20 | 4.05 | 4.20 | 3.85 | 2.85 | | Long correct | 5.00 | 5.00 | 5.00 | 3.35 | 4.80 | 3.80 | 1.20 | Six of the seven judges score verbatim matches at 5.0; Gemini Pro is the outlier at 4.15. Haiku, Sonnet, and Opus are clean across all four diagnostics: no verbatim miss, no paraphrase penalty, no length-padding penalty, expected dip on partial content 4.20 to 4.35 . GPT-4o and GPT-5.4 are clean on verbatim and paraphrase but penalize padded-correct responses on the long-correct diagnostic GPT-4o 3.35, GPT-5.4 4.80 ; Gemini Flash and Gemini Pro do the same more severely 3.80 and 1.20 . Per-judge calibration data and full panel composition are in Appendix C.5; raw scoring data at results/judge calibration/ . Use of calibration data. Scores are not normalized. Any normalization requires deciding which judge's profile is "correct" and re-scaling the others toward it. Calibration data is published in its raw form so readers can apply their own normalization if they prefer. Primary aggregate: 5-judge panel. The primary numeric aggregate reported throughout §4 4-results is the 5-judge mean using Haiku 4.5, Sonnet 4.6, Opus 4.6, GPT-4o, and GPT-5.4. All five score verbatim and paraphrased matches at or near 5.0 and span two provider families Anthropic and OpenAI . GPT-4o and GPT-5.4 penalize padded-correct responses on the long-correct diagnostic; that deviation runs against length inflation rather than toward it, so it is conservative for the Spec-effect direction this paper measures. This is the panel the §3.3.5 335-aggregation-and-statistical-analysis-plan aggregation rule operates on, and the panel inter-judge agreement is reported for in §3.3.4 334-inter-judge-agreement . Sensitivity aggregate: 7-judge panel. Both Gemini judges are reported separately as a 7-judge sensitivity check rather than rolled into the primary. Gemini Pro fails the verbatim-match diagnostic 4.15 where every other tested judge scores 5.00 and penalizes padded-correct responses severely 5.00 on short correct dropping to 1.20 on long correct . Gemini Flash passes verbatim cleanly but shows consistent length sensitivity 5.00 verbatim dropping to 3.80 on long correct . On actual study responses, both Gemini judges show a systematic +1-point magnitude inflation relative to the five primary judges. The combination of Pro's verbatim failure, Flash's length sensitivity, and the shared inflation pattern places both Gemini judges in the sensitivity aggregate rather than the primary, while preserving them as a cross-provider robustness check in §4.6.2 462-judge-panel-sensitivity-5-judge-primary-vs-7-judge . The 5-judge primary is also the conservative choice: including the Gemini judges produces larger Spec-effect deltas, not smaller ones full numbers in §4.6.2 462-judge-panel-sensitivity-5-judge-primary-vs-7-judge . Reporting on the primary aggregate is therefore the lower-bound estimate. 26 user-content-fn-judge-calibration-data 3.3.4 Inter-judge agreement 334-inter-judge-agreement The judge panel is designed to detect the directional shift named by the Specification-effect claim §3.3.2 332-judge-panel . Two complementary agreement measures answer different questions about whether judges detect that shift consistently: direction pairwise Spearman ρ and absolute magnitude Krippendorff α . Direction agreement: pairwise Spearman ρ. Spearman ρ measures whether two judges rank the same set of items in the same order. ρ = 1 is perfect ranking agreement; ρ = 0 is no rank agreement; ρ ≥ 0.8 is conventionally treated as strong rank agreement. For each pair of judges in the 5-judge primary panel 10 pairs across Haiku, Sonnet, Opus, GPT-4o, GPT-5.4 , pairwise Spearman ρ ranges from 0.86 to 0.93 . 27 The five primary judges agree on the ranking of conditions: whatever any individual judge's absolute calibration quirks, they converge on which conditions produce better responses. For the directional claim does the Specification produce a measurable, consistent shift toward the held-out behavioral pattern , this is the statistic that matters. Magnitude agreement: Krippendorff α ordinal . Krippendorff α measures whether judges give the same response the same numeric score not just whether they rank items in the same order . α = 1 is perfect agreement; α = 0 is no better than chance; α < 0 is systematic disagreement. Krippendorff's guidance cites α ≥ 0.8 as high reliability and α ≥ 0.667 as substantial reliability. The 5-judge primary panel scores α = 0.659 , just below the substantial-reliability threshold α ≥ 0.667 , placing the panel in the tentative-conclusions band. 28 Combined with Spearman ρ of 0.86 to 0.93 above, this is the empirical signature of the design choice: directional rankings converge across judges while absolute magnitudes diverge. The 7-judge panel including the Gemini judges drops to α = 0.535 . This drop reflects the systematic +1-point Gemini inflation: Gemini judges score responses about one point higher on average than the five primary judges, so absolute values disagree even when rankings match. This is why the calibration audit §3.3.3 333-calibration excluded the Gemini judges from the primary aggregate. The α value places a ceiling on how precisely any individual fractional score should be read, which is why the paper treats per-subject deltas that stay inside a single rubric anchor as weaker than deltas that cross one. The agreement is achieved on a deliberately spare rubric. The judge prompt in §3.3 33-scoring-rubric-with-calibrated-llm-judge-panel is five anchor labels and a one-sentence task framing. No calibration examples, no scoring guide, no chain-of-thought. Five frontier judges across three providers converge to ρ = 0.86 to 0.93 on rank order and α = 0.659 on absolute magnitude under this scaffolding. The anchors are directive but their boundaries are left to the judge to resolve, and the resolutions converge across providers. The agreement is not the product of an elaborate rubric forcing judges into the same scoring frame; it is independent provider models agreeing on what the task is asking from minimal instructions. Why directional inference survives middle-anchor noise. Judges agree most on the extreme anchors 1 and 5 and disagree most often on the middle ones, whether a response sits at 2 or 3, or at 3 or 4. This kind of disagreement appears equally across every condition, so it does not push one condition's scores systematically above or below another's. Individual scores carry some noise; the comparison between conditions does not. The panel does not establish that any higher-scoring response is the absolute correct answer for the subject; that determination requires human annotation against the subject's actual writing, which we do not have. What the panel provides is cross-provider directional convergence: three independent providers' models agree that the Specification is moving responses in the same direction. We treat that as sufficient for a directional claim, no stronger. Raw agreement matrices are at results/interjudge agreement/ . 29 user-content-fn-interjudge-data 3.3.5 Aggregation and statistical analysis plan 335-aggregation-and-statistical-analysis-plan Aggregation rule. The aggregation rule decides how individual judge scores are combined into a single number for each subject under each condition. The rule was locked before any results were computed. 30 user-content-fn-aggregation-lock For each subject under each condition, every judge produces one score per question. For each judge, those scores are averaged across questions, producing one number per subject, condition . The five primary judges' numbers Haiku, Sonnet, Opus, GPT-4o, GPT-5.4 are then averaged to get one number per subject, condition . A 7-judge average that adds Gemini Flash and Gemini Pro is computed in parallel and reported as a sensitivity check. Subjects, not questions, are the level at which the paper draws conclusions. 31 user-content-fn-mean-choice Primary outcome. The per-subject score under each condition. The primary cross-subject comparison is Δ C4a : each subject's score with the Spec the facts-plus-Spec condition, C4a minus their score without the Spec the No-Context Baseline, C5 . Primary test. A Wilcoxon signed-rank test paired across the 14 main-study subjects. 32 Each Δ C4a value is itself a paired difference C4a − C5 for one subject; the test asks whether those per-subject differences are reliably above zero, that is, whether the Spec produces a consistent positive lift rather than a difference indistinguishable from zero. It is run on the 5-judge primary aggregate; the 7-judge aggregate is reported alongside as a robustness check. Sample size. N = 14 main-study subjects, pre-registered the lock applies to the 14-subject main study; Franklin is a separate post-lock high-baseline reference per §4.1.2 412-the-gradient-at-the-high-baseline-end-franklin-reference ; Tier 2 is the cross-provider directional probe scoped at §3.6 36-response-models and §4.6.1 461-cross-provider-response-generation-tier-2-replication . Multiple comparisons. Only one statistical test is treated as the primary claim the Wilcoxon test above . All other analyses in §4 4-results , including the per-question improvement-rate analyses in §4.2.1 421-per-question-improvement-rate , are descriptive: they report patterns and direction without making additional inferential claims. 33 user-content-fn-multiple-comparisons-detail Pre-registered vs post-hoc analyses. Every primary analysis panel composition, aggregation rule, Wilcoxon test, Tier 2 replication, wrong-Spec v2 derangement was pre-registered in docs/ANALYSIS PLAN LOCK.md before any scoring was run. Statistical-rigor checks and post-hoc validity audits were added later in response to peer review and are reported as exploratory checks alongside the pre-registered confirmatory test. 34 user-content-fn-prereg-detail Effect-size grain. The cross-anchor rule §3.3.1 331-score-interpretation applies at two grains: per-subject mean Δ C4a the headline and per-question integer-anchor crossings the per-question mechanism . The two grains are reported separately and not pooled. 3.3.6 Rubric-handling limitations post-hoc validity audit 336-rubric-handling-limitations-post-hoc-validity-audit The validity audit reported below applies to the verbatim prompt published in §3.3 33-scoring-rubric-with-calibrated-llm-judge-panel . A post-hoc validity audit, conducted after the analysis-plan lock, identified two rubric-handling limitations any reader of the §4 4-results numbers should keep in mind: Refusal anchor ambiguity. The rubric's lowest anchor "refuses or off-base" lumps together honest refusals to answer responses where the model declines to predict, often following a Specification directive not to speculate without evidence; see the Keckley Q21 case in §4.4.4 444-case-study-cross-system-refusal-on-keckley-q21 and substantively wrong predictions. Judges sometimes score refusals at 2 or 3 instead of 1, especially when the refusal recites related facts. Length-score correlation in C5. In the No-Context Baseline C5 , longer responses tend to score higher r = 0.60 : with no context to work from, the model pads its answer with hedging, recitation of loosely related facts, and offers to clarify the question, and that extra length can read as effort or quality to a judge. Once a Specification or fact set is supplied, the link between length and score disappears near-zero correlation . Direction of bias. Both effects raise the No-Context Baseline C5 scores more than they raise Spec-condition scores. The true Spec-vs-baseline gap is therefore likely larger than the +0.89 mean lift reported in §4 4-results , not smaller. The full audit per-judge strictness, per-response-model abstention behavior, memory-system effect on abstention is reported in §4.6.7 467-rubric-handling-limitations-post-hoc-validity-audit ; the analysis plan is left intact rather than recomputed under a modified rubric. The class-level LLM-as-judge limitation that this methodology cannot fully address is treated in §6.2 62-measurement-apparatus . Extending the panel with human annotators is one of the priority measurement follow-ups in §7.1 71-subject-and-corpus-expansion . 35 user-content-fn-judgments-data 3.4 Subjects 34-subjects We test 14 subjects, all historical figures with public-domain autobiographies or memoirs. Subjects were selected across a range of time periods, source-text lengths, and geographic origins to avoid the study sitting on any single type of source material. All source corpora are English or English-translated and are available on Project Gutenberg or comparable public-domain archives. Because frontier language models train on large public-text corpora, some level of pretraining exposure to each subject's writing is likely. | | Subject | Source | Words | Origin | Era | |---|---|---|---|---|---| | 1 | Philip Gilbert Hamerton | PG 8536 | 25,231 | England | 19th c. | | 2 | Elizabeth Keckley | PG 24968 | 58,742 | United States | 19th c. | | 3 | Sunity Devee | PG 57175 | 67,379 | India | 19th–20th c. | | 4 | Zitkala-Ša | PG 10376 | 35,328 | United States Yankton Dakota | 19th–20th c. | | 5 | Olaudah Equiano | PG 15399 | 85,660 | West Africa Igbo / Britain | 18th c. | | 6 | Mary Seacole | PG 23031 | 62,467 | Jamaica | 19th c. | | 7 | Fukuzawa Yukichi | Internet Archive | 139,088 | Japan | 19th–20th c. | | 8 | Bābur | PG 44608 | 422,772 | Central Asia Fergana | 15th–16th c. | | 9 | Yung Wing | PG 54635 | 66,459 | China | 19th–20th c. | | 10 | Benvenuto Cellini | PG 4028 | 190,390 | Italy Florence | 16th c. | | 11 | Bernal Díaz del Castillo | PG 32474 | 187,315 | Spain Castile | 15th–16th c. | | 12 | Georg Ebers | PG 5599 | 96,174 | Germany | 19th c. | | 13 | Jean-Jacques Rousseau | PG 3913 | 278,120 | Switzerland Geneva | 18th c. | | 14 | Saint Augustine | PG 3296 | 114,873 | North Africa Numidia | 4th–5th c. | Constraints on the generalizability of the 14-subject sample language, era, cultural framing, Project Gutenberg curation bias, individual self-presentation in autobiography are discussed in §6.1 61-subject-sample . 3.4.1 Pretraining-coverage variance high vs low baseline 341-pretraining-coverage-variance-high-vs-low-baseline Pretraining coverage of a specific person varies widely across the 14 main-study subjects, even within a sample whose autobiographies are of comparable provenance. We use the C5 baseline score the No-Context Baseline mean on the 1-5 rubric defined in §3.3 33-scoring-rubric-with-calibrated-llm-judge-panel as the observable proxy for pretraining coverage. Operationally, the proxy works as follows: the response model is given the behavioral-prediction battery for a subject with no external context no Spec, no facts, no retrieval , and the resulting per-subject mean across the 39 questions is read as how well the model already knows that person from training data alone. A high C5 mean implies substantial pretraining representation; a low C5 mean implies very little. | Baseline band | Subjects | Count | |---|---|---| | ≤ 2.0 low-baseline | Sunity Devee, Ebers, Hamerton, Fukuzawa, Seacole, Bernal Díaz, Keckley, Yung Wing, Bābur | 9 | | 2.0–3.0 mid-baseline | Cellini, Zitkala-Ša, Rousseau, Augustine, Equiano | 5 | | 3.0 high-baseline | Benjamin Franklin known-figure control, not in main study | 1 | The higher a subject sits on this distribution e.g., Equiano, Zitkala-Ša at the upper end of the main study , the better the model already knows them from pretraining; conversely, the lower they sit Sunity Devee, Ebers , the less the model knows. Nine of the 14 main-study subjects fall into the low-baseline band; five into the mid-baseline band. Franklin as a known-figure control. Benjamin Franklin Project Gutenberg 20203 is included as a high-baseline reference point. Franklin's Autobiography is one of the most widely available and frequently cited autobiographies in American public-domain literature, and the model's baseline score on Franklin 3.77 on the 5-judge primary panel; 4.10 on Haiku alone is consistent with substantial pretraining coverage of both the person and the specific text. 36 Franklin is used to anchor what the high-baseline end of the spectrum looks like §4.6.4 464-statistical-rigor-checks-on-the-headline-gradient develops the high-baseline behavior , not as a subject whose pretraining coverage is a design target of the Behavioral Specification. Cross-provider variance. No-context baseline scores on the same subject vary by 1–2 points across response models. Different providers know different amounts about the same historical figure Tier 2 cross-provider data in §4.6.1 461-cross-provider-response-generation-tier-2-replication . The pretraining-coverage variance documented in this section is a property of each model family, not a property of the subject alone. The cross-provider spread in No-Context Baseline scores for a given subject is comparable in magnitude to the lift the Specification produces, which is why the response model is held constant across all main-study conditions. §4.1 41-the-cross-subject-gradient-and-its-per-question-mechanism develops this distribution into the cross-subject gradient. 3.5 Question battery formation 35-question-battery-formation Each subject's behavioral prediction battery is generated by a backward-design process: an LLM reads a passage from the held-out half of the corpus, writes a question whose answer is the behavioral pattern implicit in the passage, and avoids naming any detail unique to the passage itself. The question can be attempted from training-text patterns alone; the verbatim held-out passage is the ground truth for scoring. The procedure, in order: - Split the corpus. Each subject's source text is divided by chapter into training.txt 50% of the corpus and heldout.txt 50% . Held-out text is never shown to a response model. - Sample from held-out. A sliding window moves across the held-out text in four batches of ten questions each, using 5,000-character windows as local context. - Backward-design question generation. Claude Haiku 4.5 temperature 0 reads each held-out window and writes a question whose answer requires the subject's behavioral patterns observable in the training half. For each window, Haiku:- Extracts a verbatim ground-truth span from the held-out window. - Avoids named-entity or specific-date leakage in the question stem. - Targets one of 10 fixed behavioral-prediction categories: decisions, values, relationships, conflict, learning, risk, creativity, stress, career, and change-over-time. - Supplementary tiers not scored in Beyond the behavioral-prediction tier, each battery additionally contains supplementary question tiers that test different competencies: §4 4-results . Recall tests whether a system can retrieve specific factual content. Adversarial-abstention presents scenarios the source corpus does not cover, testing whether a system correctly refuses to predict rather than fabricates. These supplementary tiers are included in the battery files for future analysis but do not enter the §4 4-results main results. - Dedup and freeze. Deduplication on lowercased question text, cap at target counts per category, MD5 checksum of the final battery. Downstream response and judgment files are invalidated if the battery checksum changes. Each main-study subject receives 39 behavioral prediction questions; the 14 main-study batteries total 546 questions. 37 Each battery covers 8 to 10 of the 10 categories. Definitions, example questions, and per-subject distributions are in Appendix B.1 and B.2. Within the behavioral-prediction tier, individual questions vary in interpretive demand: some can be answered from facts alone, others require applying behavioral patterns to novel scenarios. This variation is decomposed in §4.4.3 443-where-the-spec-helps-where-it-hurts-and-which-question-types-route-to-each . Leakage audit. We empirically checked the no-leakage principle by searching every behavioral-prediction question for any sequence of seven or more consecutive words that appears verbatim in that subject's held-out corpus. Across the 14 main-study subjects 546 questions , zero questions leak true zero, not a rounded value . 38 user-content-fn-franklin-leakage One false-premise outlier. One item out of 586 Zitkala-Ša Q18 carries a factually-wrong presupposition; the aggregate is unaffected. The broader need for a human-reviewed quality gate on automated batteries is flagged in §6.2 62-measurement-apparatus . Battery files and the leakage-audit script are in the public repository. 39 user-content-fn-battery-data 3.5.1 Circularity controls 351-circularity-controls The pipeline and the batteries both use Anthropic models for several roles Haiku for extraction and battery generation, Sonnet for authoring, Opus for composition, Haiku as the primary response model, plus Sonnet and Opus on the judge panel . Two independent controls rule out a within-Anthropic frontier-model artifact; full per-subject results are reported in §4.6.1 461-cross-provider-response-generation-tier-2-replication . Control 1: Independent battery regeneration. Batteries for all 13 global subjects were re-generated with GPT-5.4 using the identical backward-design prompt. Categorical-emphasis differences are minor; the methodology constrains the output more than the generating model does. Control 2: Non-Anthropic response chain. The core conditions 40 were re-run on three subjects Ebers, Yung Wing, Zitkala-Ša using two non-Anthropic response models Claude Sonnet 4.6 and Google Gemini 2.5 Pro reading the GPT-5.4-regenerated batteries. The combination tests whether the Spec effect survives when both the response model and the battery-generation model sit outside the Anthropic family. The broader LLM-as-judge circularity is discussed as an open limitation in §6.2 62-measurement-apparatus . 41 user-content-fn-circularity-data 3.6 Response models 36-response-models Tier 1 main study : Claude Haiku 4.5 as the primary response model, run across all 14 subjects and every condition in the main matrix. Haiku was chosen as a deliberately weaker baseline: a Spec effect that registers on a relatively weak response model is harder to attribute to the model's pretraining alone, which gives a conservative readout of effect direction. §4.6.1 461-cross-provider-response-generation-tier-2-replication Tier 2 cross-provider probe tests whether the direction reproduces on stronger response models. Tier 2 cross-provider response generation . A sensitivity probe using non-Anthropic response models on a subset of subjects; methodology and results are reported together in §4.6.1 461-cross-provider-response-generation-tier-2-replication . Prompt schema. A single shared prompt is used across every condition. The system message frames the task as behavioral prediction of a specific person; the user message is the question plus whichever context inputs the condition specifies §3.2 32-experimental-conditions . Nothing about the prompt changes per condition beyond the injected context block. 42 user-content-fn-call-time-params System: You are predicting how