Developmental Trajectories of Situation Modeling and Mentalizing in Transformer Language Models

Researchers at the University of Oxford and other institutions found that large language models (LLMs) reach above-chance false-belief task (FBT) performance only late in pretraining, and only with sufficient model size and training volume; post-training improves it further. However, FBT performance remains fragile: non-factive verbs such as "thinks" increase false-belief attributions even in the True Belief condition. Situation modeling accuracy generally precedes and exceeds FBT accuracy, yet the models' representations of agents' knowledge states are incoherent in some respects.

arXiv:2606.28524v1

Abstract: Recent work suggests that Large Language Models (LLMs) are sensitive to the belief states of agents described by text, as measured by the false belief task (FBT), yet persistent concerns of construct validity remain. We adopt a developmental perspective, tracing the pattern of mental state reasoning behavior -- and likely preconditions for this behavior -- across multiple training stages in the Olmo2 and Pythia language model suites. We find that above-chance FBT performance depends both on model size and sufficient training volume, emerges relatively late in pretraining, and is most improved by post-training interventions (SFT, DPO) in the condition most diagnostic of mentalizing (False Belief, Implicit). However, FBT performance is fragile: consistent with past work, the use of non-factive verbs (e.g., "thinks") increases false belief attributions even in the True Belief condition. To contextualize these findings, we track the emergence of situation modeling: the ability to report on basic factual properties of a described scene. Situation modeling accuracy generally precedes and exceeds FBT accuracy, yet situational representations also prove surprisingly incoherent in certain respects: when asked about the knowledge states of the Antagonist agent -- who always knows the item's true location -- Olmo2 13b is consistently influenced both by the Target agent's knowledge state and the presence of non-factive verbs. Together, these results suggest that larger, sufficiently trained models build partially coherent situation models in a developmentally appropriate sequence, yet display surprising fragility -- highlighting the value of developmental and stress-testing approaches for evaluating LLM capabilities.
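
To make the notion of "above-chance FBT performance" concrete, here is a minimal sketch of how per-condition accuracy could be tallied and compared against chance with a one-sided exact binomial test. It is an illustration under stated assumptions, not the authors' analysis: the function names, the condition labels, and the 0.5 chance level (which presumes a two-alternative response format) are all hypothetical.

```python
from math import comb


def accuracy_by_condition(results):
    """Tally (correct, total) per condition.

    `results` is an iterable of (condition, is_correct) pairs; the
    condition labels are hypothetical, e.g. "False Belief, Implicit".
    """
    counts = {}
    for condition, is_correct in results:
        correct, total = counts.get(condition, (0, 0))
        counts[condition] = (correct + int(is_correct), total + 1)
    return counts


def binomial_p_value(correct, total, chance=0.5):
    """One-sided exact binomial p-value, P(X >= correct) under chance.

    chance=0.5 is an assumption (two candidate answers per item).
    """
    return sum(
        comb(total, k) * chance**k * (1 - chance) ** (total - k)
        for k in range(correct, total + 1)
    )
```

A small p-value for a given condition and checkpoint would indicate accuracy above the assumed chance level; repeating the test across checkpoints gives a rough picture of when the behavior emerges during training.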