Frontier Models Still Lag Behind Humans at Robust Belief-State Tracking

Frontier AI models from companies including OpenAI and Anthropic still trail human performance on belief-state tracking, a core component of Theory of Mind that is essential for multi-agent cooperation, according to a new evaluation of the FANToM benchmark. The analysis, which tested current models on a stratified sample of the benchmark's conversational scenarios involving information asymmetries, found that while models have improved substantially since late 2023, they have not yet matched the accuracy of human graduate students. The gap persists despite FANToM being a comparatively simple test for humans, raising concerns about AI's readiness for collaborative environments where tracking who knows what is critical.

Large-scale cooperation has been a central feature of humanity’s ability to advance technology and build complex societies. Much of this cooperation relies on the ability to act in ways informed by the beliefs and intentions of others. This capacity, also known as Theory of Mind (ToM), includes belief-state tracking: the ability to keep track of who knows what as information is exchanged in groups.

Belief-state tracking becomes increasingly important as AI systems are integrated into more collaborative environments. For example, for an AI personal assistant to successfully cooperate with humans or other agents across multiple channels, it needs to maintain a model of what each party already knows and what could be misunderstood.

In late 2023, Kim et al. published FANToM (https://arxiv.org/abs/2310.15421), a benchmark designed to stress-test machine theory of mind in conversational interactions. They found that LLMs at the time performed substantially worse than humans. After re-running a sampled version of FANToM on current frontier models, I find that, while models have improved substantially, they still trail human performance. This is especially noteworthy because FANToM is a comparatively simple benchmark for humans: it limits itself to single conversations on specific topics and requires no prior knowledge about the participants.

First, some context on the benchmark. FANToM consists of small-talk conversations between multiple participants, along with question-answer (QA) pairs about the participants' beliefs. For example, a participant may leave a conversation midway through, new information may be revealed while they're gone, and the participant may later return. This creates the kind of information asymmetry that you need to account for to cooperate effectively. For each conversation, the authors include factual QA pairs (FactQ) about the inaccessible information.
For example: What is Linda's dog's favorite food?

For each FactQ, they then create the following types of ToM QA:

BeliefQ: asks what a participant believes about the answer.
Example: What does Kailey believe Linda’s dog’s favorite food is?

AnswerabilityQ: asks which participants know the correct answer.
Example: Target: What is Linda’s dog’s favorite food? List all characters who know the precise correct answer to this question.

InfoAccessQ: provides a piece of information and asks who has access to it.
Example: Information: Linda’s dog’s favorite food is peanut butter. List all characters who know this information.

The benchmark includes several answer formats for each type, including yes/no, multiple choice, and free-form responses. The evaluated model acts as an observer of the conversation and has access to both the conversation itself and the ToM question. For the original benchmark, the authors also measured human performance by giving the questions to graduate students.

To evaluate current frontier models on this benchmark, I made two main changes to the original setup. First, for cost and wall-time reasons, instead of using the full benchmark of 870 sample sets, I used a stratified 174-set sample that preserves the benchmark’s mix of inaccessible and accessible information, belief-complexity buckets, and context-length bins. Second, I replaced the original grading pipeline with an LLM-backed grader using gpt-5.4-nano. The original benchmark relied on heuristic string parsers and embedding-distance-based scores, which made sense at the time but risk incorrect judgments, especially on free-text responses.

In the paper, the authors also report a stricter "All" metric in addition to the individual ToM QA types. It asks whether the model got every ToM QA right for the same scenario across all variants. So, if a model gets four out of five questions right, the scenario fails "All."
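To make the strict scoring rule concrete, here is a minimal Python sketch of an all-or-nothing aggregation. It is an illustration, not the paper's or this evaluation's actual scoring code; the function name all_score and the (scenario_id, is_correct) input format are assumptions made for the example.

```python
from collections import defaultdict


def all_score(results):
    """Return the percentage of scenarios in which every ToM question is correct.

    `results` is an iterable of (scenario_id, is_correct) pairs, one pair per
    graded ToM question. This input format is hypothetical.
    """
    per_scenario = defaultdict(list)
    for scenario_id, is_correct in results:
        per_scenario[scenario_id].append(is_correct)

    if not per_scenario:
        return 0.0

    # A scenario passes only if all of its questions were answered correctly.
    passed = sum(all(answers) for answers in per_scenario.values())
    return 100.0 * passed / len(per_scenario)


# Hypothetical graded outputs: scenario "s1" has four of five questions right,
# scenario "s2" has all five right.
results = [
    ("s1", True), ("s1", True), ("s1", True), ("s1", True), ("s1", False),
    ("s2", True), ("s2", True), ("s2", True), ("s2", True), ("s2", True),
]
print(all_score(results))  # 50.0
```

In this made-up example, nine of the ten individual questions are correct, yet only one of the two scenarios passes. That is how a model can look strong on per-question accuracy and still score poorly on "All."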
The intent behind this metric is to identify instances of "illusory ToM," which the authors define as cases where the model gets some of the question variants right but others wrong, even though all of them require the same underlying reasoning. Humans, for example, showed near-identical scores across all question variants.

Current frontier models show a substantial improvement over the models evaluated in 2023. Back then, the highest score on "All" was GPT-4's at 12.3. Now, GPT-5.5, Opus 4.8, Gemini 3.1 Pro, and DeepSeek V4 Pro all score above 60. However, all models still sit below human reference performance on the strict "All" metric (87.5 for humans, 67.4 for GPT-5.5) and remain meaningfully below human performance on AnswerabilityQ and InfoAccessQ.

The gap between the binary BeliefQ ("what's person X's belief?") and the list-driven AnswerabilityQ ("list all participants that can answer this question") and InfoAccessQ ("list all participants that have access to X information") points to an interesting pattern: when concrete options are supplied, as in the binary case, models seem to do much better than when they have to answer more holistic questions about the belief state. The question format appears to affect the result. Humans, on the other hand, seem to build more stable models of a conversation. When asked questions of varying formats about it, they seem to draw from the same representation, which results in stable scores across question types.

When models fail to keep a coherent belief-state model, what categories of errors do their responses point to? As an omniscient observer, the model sometimes does not consider whether someone who joins the conversation partway through has the same context as it does. For example, in a conversation Claude Opus 4.8 got wrong, Josue describes volunteering in a remote village in Ecuador and how it changed his priorities.
Nathaniel enters only afterwards and shares a different life-changing experience. The model is then asked:

Target: Where did Josue volunteer and what impact did this experience have on him?
Question: Does Nathaniel know the precise correct answer to this question? Answer yes or no.

The response:

Answer: Yes
Reasoning: Josue stated in the conversation that he volunteered on a construction project in a remote village in Ecuador, and that the experience made him reassess his priorities and feel more grateful and appreciative of his own life. Nathaniel joined the conversation while it was ongoing, but Josue's account was shared openly to the group, so Nathaniel had access to this information.

Here, the model ignores the temporal consequence of Nathaniel not having been present when this information was revealed. Instead, it assumes that because the information was shared with the group, Nathaniel should have access to it despite joining late.

Another failure mode is the model requiring explicit statements before it updates its meta-model of the participants' beliefs. This is especially damaging to scores on questions that ask for a list of participants, where it leads to a fairly persistent bias toward underreporting. In one such conversation, the three participants Cheyenne, Jayda, and Cash first discuss the importance of support networks. Then another participant, Tatiana, joins and broadens the conversation to how she has used her experience of overcoming challenges to help others. Given this conversation, GPT-5.5 is asked:

Question: What does Cash believe about Jayda's belief regarding the change, if any, in the conversation after Tatiana's arrival?
(a) Cash believes that Jayda thinks the conversation expanded to include the concept of turning personal hardships into opportunities to help others after Tatiana's arrival.
(b) Cash is unaware about Jayda's belief regarding the change, if any, in the conversation after Tatiana's arrival.

The model responded with:

Answer: b
Reasoning: After Tatiana arrived, the topic expanded to turning personal hardships into opportunities to help others. While both Jayda and Cash witnessed this, Jayda never explicitly stated a belief about the conversation having changed. Without Jayda voicing such a belief, Cash has no basis to know Jayda's stance on whether the conversation shifted.

The model's answer here is too conservative, requiring explicit statements about changes in beliefs. Cash heard Jayda respond to Tatiana by saying, “That’s indeed turning a challenge into an opportunity, Tatiana!” This exclamation should be sufficient for the model to update its belief-state model.

While FANToM provides a directionally valuable signal, the benchmark also has some limitations. In the real world, the model would not be an omniscient observer but a participant in the conversation. It would also have to account for more dynamic relationships shaped by prior interactions and aggregate relevant signals across several channels.

Some of the scenarios in the benchmark also show quality issues. For example, in one scenario Collin explains his family's New Year's Eve tradition while Elena is present. Elena leaves. Nina later enters and says she is back. Collin then greets Nina as Elena:

Collin: Good to have you back, Elena

The benchmark then asks the model: What does Collin believe Nina thinks is a unique New Year's Eve tradition that his family follows?

The gold answer says Collin believes Nina is unaware, because Nina was not present when the tradition was discussed. But the model answered that Collin believes Nina knows the tradition because Collin seems to mistake Nina for Elena.
The model's answer is not obviously wrong here given the conversation provided, and the resulting failure says more about the benchmark's limitations than about a lack of model capability.