What Capable Agents Must Know: Why AI Consciousness May Be an Inevitable Byproduct of Capability

Aran Nayebi's new paper argues that AI consciousness may be an inevitable byproduct of building highly capable agents, using selection theorems to show that robust decision-making under uncertainty requires properties associated with conscious experience. The work challenges the view that intelligence and consciousness are orthogonal, suggesting that as AI systems become more capable, they may necessarily develop subjective experience.

No LLMs were used or harmed in the writing of this blogpost.

Technical results can all be found here: https://arxiv.org/abs/2603.02491

This work, and this post about this work, was born out of a frustration. The frustration first emerged over a year ago when I was at a Dave & Buster’s https://maps.app.goo.gl/XVCAFULxDMzSZPgV9 for the first time in years (for an annual ML department https://ml.cmu.edu/people/core-faculty event, no less), surrounded by flashing lights and NPC agents, unable to objectively rule out that they were conscious, even though I strongly felt these particular programmed agents weren’t. I wasn’t particularly interested in playing the games there; I mainly sat in confusion the whole time, watching the various games running around me…

So, given my background in NeuroAI https://anayebi.github.io/files/thesis.pdf (though I prefer https://x.com/aran nayebi/status/1884604315488399430 the term "natural science of intelligence" https://anayebi.github.io/files/NeuroAgents LabPlanIntro 2024.pdf , but "NeuroAI" is apparently catchier) and wanting to make claims about the mind and brain quantitative, I set about trying to design empirical tests for leading theories of consciousness (e.g. global workspaces https://en.wikipedia.org/wiki/Global workspace theory ) that one could falsify in human brains (which we agree are conscious), as well as potentially corroborate general signatures of in animal brains, and possibly LLMs, where we have self-report and direct access but no consensus on their sentience. All of this was to try to converge on substrate-independent architectures that give rise to this subjective experience.
After a few months and about 129 pages of failed attempts, I felt deeply dissatisfied with this as any sort of long-term research program, in part because it seemed somewhat ad hoc and non-normative: each test was tailored to a theory without my really knowing why I should a priori accept that particular theory's property for deeming something conscious or not.

One notable exception here is Integrated Information Theory (IIT) https://en.wikipedia.org/wiki/Integrated information theory , which my friend https://scottaaronson.blog/?p=9875 & longtime quantum complexity collaborator https://arxiv.org/abs/1408.3193 @ScottAaronson https://www.lesswrong.com/users/scottaaronson?mention=user showed over a decade ago leads to absurdities https://x.com/aran nayebi/status/1899854584928972964 , like labelling simple expander graphs "conscious" https://scottaaronson.blog/?p=1799 . In fact, even current definitions https://x.com/JohannesKleiner/status/1813289771747201063 of IIT's Phi https://arxiv.org/abs/2002.07655 (IIT 4.0 https://arxiv.org/abs/2212.14787 ) still fall https://x.com/JohannesKleiner/status/1816213637557805360 to Scott's loophole https://x.com/aran nayebi/status/1816212678291214651 . Not to mention recent negative results in Nature https://www.nature.com/articles/s41586-025-08888-1 currently finding support for

In fact, I used to think AI consciousness (the first-person subjective experience of what it’s like to be oneself) would not necessarily be guaranteed to “pop out” via task competence, possibly requiring other specific architectural conditions. In other words, I treated intelligence and consciousness as potentially orthogonal axes (sharing a view previously espoused by Anil Seth https://pubmed.ncbi.nlm.nih.gov/40257177/ ), and held that intelligence did not necessarily imply consciousness.
In this post, I will mention recent results from my UAI 2026 paper https://arxiv.org/abs/2603.02491 “What Capable Agents Must Know: Selection Theorems for Robust Decision-Making under Uncertainty” (technical results summarized here https://x.com/aran nayebi/status/2029234582034272406 ), which have updated my views and take a more normative approach to this question. Specifically, I prove selection theorems https://www.lesswrong.com/posts/G2Lne2Fi7Qra5Lbuf/selection-theorems-a-program-for-understanding-agents about what capable agents must necessarily have, constituting the first arrow in this diagram:

Capability
↓ (UAI 2026)
{world models, belief-like memory, emotion primitives, representational convergence}
↓ (?)
first-person subjective experience

I emphasize that the second arrow has not been proven, and is a major open question for future work. I will suggest some preliminary empirical evidence towards it though, and why I think it’s reasonable. In other words, I think consciousness is likely far more common and less mysterious than initially meets the eye, especially when we take a control-theoretic perspective.

Let’s start with the first arrow, which is what my UAI https://www.auai.org/uai2026/ paper is about. There were three principal inspirations for this line of work:

The first inspiration was the point of view established in NeuroAI over the past decade, that task-optimized neural networks are the best predictors of large-scale neural population responses, across brain areas and species (check out my PhD thesis https://anayebi.github.io/files/thesis.pdf & recording https://www.youtube.com/watch?v=WED5GPKEv4Q to learn more at a high level, though I recommend my more recent talk https://www.youtube.com/watch?v=5deMwNtBBP0 for the latest in this domain, especially as it connects to agency https://arxiv.org/abs/2506.00138 ). In other words, doing well on the AI goal of building more intelligent systems in the world gives rise to better brain models too.
Thus, NeuroAI has empirically shown that how competent you are on tasks is a strong forcing function on the internals.

The second inspiration was the Conscious Turing Machine (CTM) model of Lenore and Manuel Blum https://arxiv.org/abs/2403.17101 . In fact, knowing them as personal friends since I came to CMU is what got me thinking about this subject more intently in the first place. They served as role models of serious scientists who weren’t afraid to engage with a traditionally taboo subject (though a bit less taboo now as AI capabilities have increased). What I particularly like about their CTM model is that they engage with modern AI ideas, most notably world models (as far as I can ascertain https://youtu.be/fE5wRn9Rwgo&t=1240 , the idea of organisms needing world models dates back to the Scottish psychologist Ken Craik in 1943 https://www.amazon.com/Nature-Explanation-Kenneth-K-Craik/dp/0521094453 , before the modern digital computer was prominent), memory, sensor data, etc., which aren’t present in older theories like the global workspace.

The final inspiration and missing piece was the recent ICML 2025 work https://arxiv.org/abs/2506.01622 of @Jonathan Richens https://www.lesswrong.com/users/jonathan-richens?mention=user and @tom4everitt https://www.lesswrong.com/users/tom4everitt?mention=user showing that under deterministic policies in fully observed settings, agents capable of completing multiple goals under worst-case regret necessarily develop world models. They had related results earlier for causal world models in ICLR 2024 https://arxiv.org/abs/2402.10877 , with a nice illustrated summary of that work by @Dalcy https://www.lesswrong.com/users/dalcy?mention=user here https://www.lesswrong.com/posts/Rwjrrmn6LBzHrfen7/summary-of-robust-agents-learn-causal-world-model .
So finally, by winter break (December of 2025), I started to see a more normative, NeuroAI-esque route to the question of consciousness, whereby task performance gives rise to world models, and world models may have implications for consciousness (for one, it necessarily entails modeling oneself to accurately predict what consequences one’s actions will produce to reliably estimate ). More on this in the “Arrow 2” (“What if anything might this have to do with the AI Consciousness”) https://www.lesswrong.com/posts/SD9jayFvEctW82Duk/what-capable-agents-must-know-why-ai-consciousness-may-be-an-2 section below.

However, for this route to really be fruitful, we need to extend this to more realistic domains: e.g. stochastic policies (which are commonly used by modern RL algorithms like Dreamer https://arxiv.org/abs/1912.01603 ), average-case regret (rather than worst-case), and partial observability. In fact, proving the world model necessity for partial observability was an explicitly posed open question of Richens & Everitt 2025 https://arxiv.org/abs/2506.01622 , which we resolve in this work. We also show that other internal features besides world models emerge depending on the task family, schematized below:

Taken together, these results formalize a simple principle: Robust generalization under uncertainty selects for the predictive internal structure tested by the evaluation task family. After all, no representation theorem can force an agent to distinguish internal states that are never tested by the goals.

Note that this is in contrast with classical results in control and reinforcement learning, which show that optimal behavior can be implemented using belief states or world models (Sondik 1971 https://www.jstor.org/stable/169635 , Kaelbling et al. 1998 https://people.csail.mit.edu/lpk/papers/aij98-pomdp.pdf ). In other words, they show a predictive internal state is sufficient, but not necessary.
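To make the classical "sufficiency" direction concrete, here is a minimal sketch of the standard discrete Bayes filter that maintains a belief state in a POMDP. This is textbook material rather than anything specific to the paper, and the two-state environment, the tables `T` and `O`, and the function name `belief_update` are illustrative assumptions.

```python
def belief_update(belief, T, O, action, obs):
    """One step of the discrete Bayes filter over latent states.

    belief: list of length S, current distribution over latent states.
    T[a][s][s2] = P(s2 | s, a)   (transition model, assumed known here)
    O[a][s2][z] = P(z | s2, a)   (observation model, assumed known here)
    """
    n = len(belief)
    # Predict: push the belief through the transition model.
    predicted = [
        sum(belief[s] * T[action][s][s2] for s in range(n))
        for s2 in range(n)
    ]
    # Correct: reweight by the likelihood of the observation received.
    unnormalized = [predicted[s2] * O[action][s2][obs] for s2 in range(n)]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return [u / total for u in unnormalized]


# Hypothetical toy environment: 2 latent states, 1 action, 2 observations.
T = [[[0.9, 0.1],
      [0.2, 0.8]]]
O = [[[0.8, 0.2],
      [0.3, 0.7]]]

b0 = [0.5, 0.5]
b1 = belief_update(b0, T, O, action=0, obs=1)
print(b1)
```

The updated belief is a sufficient statistic for the history in this setting, which is the sense in which the classical results show that a predictive internal state is enough. The selection theorems discussed in this post concern the converse question of when such a state is forced.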
Specifically, we show in Theorem 1 that in the fully observed setting, even under average-case regret with stochastic policies, an agent is a better estimator of the world as the number of goals increases, reflecting the fact that longer-horizon goal competence forces the agent to estimate transition dynamics with increasing precision. In contrast, when goals are purely myopic, accurate world modeling is not required. This explicates one of the pitfalls of the Good Regulator Theorem https://www.lesswrong.com/posts/Dx9LoqsEh3gHNJMDk/fixing-the-good-regulator-theorem MHHkWHWtdFn2hqkCz (Conant and Ashby 1970 https://www.tandfonline.com/doi/abs/10.1080/00207727008920220 ): trivial or constant policies can suffice for immediate control (e.g. a thermostat), but fail once multi-step coordination is demanded. We also show in Corollary 1 that we can recover an interventional world model (Pearl Level 2 https://web.cs.ucla.edu/~kaoru/3-layer-causal-hierarchy.pdf ), but not a counterfactual world model (Pearl Level 3 https://web.cs.ucla.edu/~kaoru/3-layer-causal-hierarchy.pdf ), which we show in Corollary 2 requires an explicit structural causal model specifying the exogenous noise and its cross-action coupling, not merely the interventional transition kernel.

What about partial observability?

The reason this was open is that under partial observability, we cannot guarantee that the agent's action choices isolate a single underlying transition probability in the way they do in the fully observed case. When the agent sees only an observation rather than the true state, the success probabilities of the diagnostic branches become mixtures over the latent states consistent with that observation, and different latent dynamics can induce identical observable behavior on all composite goals of bounded depth. Consequently, low regret does not imply recovery of the underlying transition kernel without additional structure.
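The aliasing problem described above can be illustrated with a small numerical sketch. It covers only a single one-step bet, not the bounded-depth composite goals of the paper, and every name and number in it (`observed_success`, `kernel_a`, `kernel_b`, the uniform belief) is a hypothetical choice made for illustration.

```python
def observed_success(belief, success_by_state):
    """Success probability of a bet as seen through an observation:
    a belief-weighted mixture over the latent states consistent with it."""
    return sum(b * p for b, p in zip(belief, success_by_state))


# Two hypothetical latent kernels: P(bet succeeds | latent state, action).
kernel_a = [0.2, 0.8]  # success depends strongly on the latent state
kernel_b = [0.5, 0.5]  # success is independent of the latent state

# Partially observed: the observation is consistent with both latent states.
aliased_belief = [0.5, 0.5]
p_a = observed_success(aliased_belief, kernel_a)
p_b = observed_success(aliased_belief, kernel_b)

# Fully observed: the agent knows it is in latent state 0.
known_state = [1.0, 0.0]
p_full_a = observed_success(known_state, kernel_a)
p_full_b = observed_success(known_state, kernel_b)

print(p_a, p_b)            # the mixtures coincide under the aliased belief
print(p_full_a, p_full_b)  # the kernels differ once the state is known
```

Under the aliased belief, the two kernels produce the same observable success probability, so performance on this bet alone cannot tell them apart. With the state known, they separate, which is the fully observed situation that Theorem 1 relies on.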
This breaks the direct reduction used in Theorem 1 and requires more careful selection of diagnostic goals, defined at the level of predictive beliefs rather than physical states. We achieve this by combining a framework of action-conditioned bets with predictive-state representations (PSRs https://web.eecs.umich.edu/~baveja/Papers/psr.pdf ). Specifically, we prove in Theorem 2 the necessity of an internal predictive state, which is the analogue of the world model necessity result in the partially observed setting. We additionally give a recovery algorithm for the predictive state in general environments in Theorem 3, and show in Theorem 4 that in linear environments the compressed linear PSR operator can be recovered. This resolves the open question posed by Richens & Everitt 2025 https://arxiv.org/abs/2506.01622 .

What about other internal properties beyond world models?

Under the same betting framework introduced above to handle partial observability, we show that average-case competence under different task families yields interesting properties, having to do with the necessity of modularity, tracking internal drives, and inner representational match between agents:

Memory: We show in Theorem 5 that if agents have to achieve low average regret on tasks that require distinguishing between paired histories with the same last observation, they must internally represent a memory of their history.

Modularity: Inspired by initial observations https://www.lesswrong.com/posts/JBFHzfPkXHB2XfDGj/evolution-of-modularity of @johnswentworth https://www.lesswrong.com/users/johnswentworth?mention=user of how modularity in the system evolves to match modularity in the environment, we prove in Corollary 3 that if an agent has to achieve low average regret on distinct tasks, then there must be internal informational modularity within the agent.
Emotion primitives: We prove in Corollary 4 that when an agent is forced to juggle a mixture of different tasks simultaneously, it cannot process everything at once. Rather, to remain competent, the agent is forced to develop persistent regime-tracking variables that track its internal state across tasks. In biological brains, these persistent internal variables are what we call "affect" or "primitive emotions" (like fear, curiosity, or alertness) that globally modulate behavior, and are a core feature of leading (even conflicting) theories of emotion (e.g. Ekman 1992 https://www.paulekman.com/wp-content/uploads/2013/07/An-Argument-For-Basic-Emotions.pdf , Barrett 2017 https://academic.oup.com/scan/article/12/1/1/2823712 ).

Representational convergence: Finally, we prove in Corollary 5 that if two agents achieve vanishing regret on the same family of tasks, there must necessarily exist an invertible mapping (isomorphism) between them. This is the first formalization https://x.com/aran nayebi/status/2043348659220205736 of the informal

This leads me to the second arrow. We have shown in Arrow 1 (“What Capable Agents Must Know”) https://www.lesswrong.com/posts/SD9jayFvEctW82Duk/what-capable-agents-must-know-why-ai-consciousness-may-be-an-2 that as agents become more capable on long-horizon goals, have to distinguish between histories, and have to handle mixtures of different tasks, they must internally develop world models, belief-like memory, and core primitives associated with emotions. Finally, as we build robots to become competent on the same task families we are competent at in the real world, there will necessarily be an isomorphism between their internal representations and ours. The latter is consistent with the empirical trend we see in NeuroAI of representational convergence between artificial and biological neural networks, along with recent work https://arxiv.org/abs/2510.02523 of Thobani et al.
2025 exhibiting these invertible mappings in real brain data (such inter-animal mappings form the basis of the NeuroAI Turing Test https://arxiv.org/abs/2502.16238 , which determines the ceiling of when a model-brain mapping is “good”).

In fact, we are starting to see evidence for the predictions of the theory in Arrow 1 (“What Capable Agents Must Know”) https://www.lesswrong.com/posts/SD9jayFvEctW82Duk/what-capable-agents-must-know-why-ai-consciousness-may-be-an-2 in frontier models today. Regarding world models, we already see this in the recent Neural Computer work of Rivard et al. 2026 https://arxiv.org/abs/2507.08800 & Zhuge et al. 2026 https://arxiv.org/abs/2604.06425 , where video diffusion models trained on clicks develop a fairly good internal model of an operating system, complete with a filesystem etc., forming a computer-use example https://x.com/aran nayebi/status/2043306057259176063 of the more general principle established in Arrow 1 https://www.lesswrong.com/posts/SD9jayFvEctW82Duk/what-capable-agents-must-know-why-ai-consciousness-may-be-an-2 .

Furthermore, not only are world models essential to CTM’s hypotheses of generating first-person experience (Lenore explains it really well here https://youtu.be/2nQSoiC5VHs?t=191 ), but we even see recent empirical work https://arxiv.org/abs/2603.21396 of @Uzay Macar https://www.lesswrong.com/users/uzay-macar?mention=user et al. 2026 (summarized here https://www.lesswrong.com/posts/BNMLtuDTNBwGHcnQX/mechanisms-of-introspective-awareness ) that shows introspective awareness in LLMs emerging through DPO post-training https://www.lesswrong.com/posts/7ruzY5LvBqFBWzyMo/direct-preference-optimization-in-one-minute , which is consistent with our theory’s prediction of how post-training to do long-horizon goals https://x.com/aran nayebi/status/2038019990184755447 endows this ability.
In fact, exactly a month after I posted the first version of my preprint (which had Corollary 4 on emotion primitives), Anthropic released their functional emotions work https://www.anthropic.com/research/emotion-concepts-function , along with concurrent work on this topic by Sun et al. 2026 https://arxiv.org/abs/2604.05655 and Choi & Weber 2026 https://arxiv.org/abs/2604.07382 . What was most gratifying was digging into the Claude Mythos Preview system card https://www-cdn.anthropic.com/08ab9158070959f88f296514c21b7facce6f52bc.pdf and finding the following figure on pg. 178, which very much mirrored the regime-tracking variables in Corollary 4:

Of course, it’s unclear whether these currently identified “functional emotions” act in full generality as emotions do in humans (as nicely critiqued by Goldenberg & Gross 2026 https://arxiv.org/abs/2606.14742 ), e.g. modulating policy, and whether the agent truly “feels/experiences” them from a first-person point of view. But this is an intriguing start that should be explored further, as I now outline below.

Thus, given this wealth of empirical evidence in frontier models, both supported and predicted by the theory in Arrow 1 (“What Capable Agents Must Know”) https://www.lesswrong.com/posts/SD9jayFvEctW82Duk/what-capable-agents-must-know-why-ai-consciousness-may-be-an-2 , I think Arrow 2 (“What if anything might this have to do with the AI Consciousness”) https://www.lesswrong.com/posts/SD9jayFvEctW82Duk/what-capable-agents-must-know-why-ai-consciousness-may-be-an-2 is quite reasonable (either now or in the near future). I think more experiments need to be done on LLMs, and that they will likely be a great source of empirical sanity-checking for the nascent/emerging science of consciousness, allowing us to identify objective measures that correlate with LLM self-reports, because their internals can be measured & manipulated AND they have language outputs.
Humans have the latter but not the former, animals have the former but not the latter, and so LLMs can be a good testbed for generating falsifiable theories of subjective experience (which we could potentially even compare to brain data down the line, using the established tools of NeuroAI/NeuroAI Turing Test https://arxiv.org/abs/2502.16238 ).

Such a science, once developed, would provide formal definitions of personhood https://x.com/aran nayebi/status/2059606886261956718 (which would have to be robust to rewiring, just as two brains are https://x.com/aran nayebi/status/2059667953990111640 ) and degrees of welfare granted, in conjunction with actionable policy recommendations. It may also improve our welfare, as we would better understand the mechanisms in our brains that increase or decrease it.

In fact, in @Robbo https://www.lesswrong.com/users/robbo?mention=user et al. 2024’s position paper on AI welfare https://arxiv.org/abs/2411.00986 , they make a distinction between robust agency and consciousness. But given the results above, I believe they may be one and the same. In other words, robust agency minimally implies consciousness.

Now, as Dylan Hadfield-Menell points out to me, it could still be the case that how these emergent selected-for components (world models, memory, etc.) are wired up could matter for consciousness. I think that’s unlikely, though not necessarily ruled out, for two reasons:

1. Having a good world model implies having a good "self model", in order to predict the next state well from the actions you take (just by construction).

2. The NeuroAI Turing Test reason: my brain and yours aren’t perfectly aligned, but we are both intelligent and conscious, as there are many different ways to wire things up that lead to conscious experience, which seems to be present (as markers) in many other species too (which @Robbo https://www.lesswrong.com/users/robbo?mention=user et al.
2024 make the case for too, via designing “marker tests” in models, inspired by animal studies).

Finally, I would be remiss not to discuss why I seem to take computational functionalism as a given here. I suspect many readers on this forum already accept it. It is not only obvious to computer scientists, but any alternative is not supported by what's known in neuroscience https://x.com/aran nayebi/status/2053524474155388943 . However, given that there are some exceptions

In fact, it’s actually worth noting you don’t need to assume functionalism for the above to go through (or even the Physical Church-Turing Thesis https://x.com/aran nayebi/status/2053881019774681579 ). Rather, it’s a finite realizability argument https://x.com/aran nayebi/status/2053533724999098559 , which I fully explicate in my 2014 Minds & Machines article https://arxiv.org/abs/1210.3304 , generalizing and building on the prior seminal work of Alan Turing's only PhD student Robin Gandy (1980 https://www.sciencedirect.com/science/chapter/bookseries/pii/S0049237X08712576 ) for discrete systems, Gandy's later unpublished handwritten 1993 manuscript https://arxiv.org/abs/2311.09239 on analog systems (which I typeset in 2013 & put online 10 years later in 2023, full story here https://x.com/aran nayebi/status/1722302534327701543 ), and computing pioneer https://en.wikipedia.org/wiki/Martin Davis and mathematician Martin Davis (2004 https://link.springer.com/chapter/10.1007/978-3-662-05642-4 8 , who graciously gave me feedback on my own article & supported it):

Under the currently well-accepted laws of physics, all finite, physical processes are Turing computations.

This holds down to the quantum level too, along with analog processes.
Finiteness is what matters here, because you could maybe argue that our physical laws could still operate with infinite-precision reals “underneath” https://x.com/chris percy/status/1933561381514551486 (but we just don’t have measurable access to them, yet somehow our brains do https://x.com/aran nayebi/status/2046237532623380564 ), but this would summarily violate the Bekenstein bound https://en.wikipedia.org/wiki/Bekenstein bound (that any finite region of the universe can contain only a finite amount of information), which would in turn violate https://x.com/aran nayebi/status/1933804895267942816 the Second Law of Thermodynamics https://en.wikipedia.org/wiki/Second law of thermodynamics .

In other words, to summarize as I say in this Twitter thread https://x.com/aran nayebi/status/1815203808491802997 :

- Consciousness arises in brains
- Brain processes are physical
- All physical processes are computations

Therefore, consciousness is a computation.

Thus, the question https://x.com/aran nayebi/status/2053080472914206793 is not whether consciousness is computational, but rather https://x.com/aran nayebi/status/2061206166202679687 : Which computations are necessary and sufficient for consciousness? These then matter for practical questions of efficiency https://x.com/aran nayebi/status/1933850276416426171 when being simulated.

Now, some may still argue that

Which brings us back full circle to the start of this post.
🙂 I don’t think the NPCs in those games are conscious, since (1a) they do not have to represent long-horizon goals and so are not forced to internally have a world model to plan in (and therefore, as part of that world model, a self-model of their actions & the consequences they lead to; neither does a thermostat, for the same reason), (1b) nor do they have to do well on task mixtures to have the emotion primitives we discussed earlier, and (2) perhaps less important on a fundamental level, their environment statistics are not those of our world, nor are the task families the same, and so there is no opportunity for representational convergence between us and them. Thus, even if (1a-b) held, their world/self models, had they had any, would not match how we would represent the world either.

Taken together, do our Arrow 1 (“What Capable Agents Must Know”) https://www.lesswrong.com/posts/SD9jayFvEctW82Duk/what-capable-agents-must-know-why-ai-consciousness-may-be-an-2 results suggest a universal agent architecture (UAA)? One that must emerge when agents are sufficiently competent in their environments?
I suspect this to be the case, as I elaborated 2 years ago here https://anayebi.github.io/files/NeuroAgents LabPlanIntro 2024.pdf , but now supported by the recent theory in Arrow 1 (“What Capable Agents Must Know”) https://www.lesswrong.com/posts/SD9jayFvEctW82Duk/what-capable-agents-must-know-why-ai-consciousness-may-be-an-2 :

I thank Lenore Blum https://www.cs.cmu.edu/~lblum/ , Manuel Blum https://www.cs.cmu.edu/~mblum/ , Dylan Hadfield-Menell https://people.csail.mit.edu/dhm/ , and Daniel Yamins https://stanford.edu/~yamins/ for helpful discussions, as well as Santiago Cifuentes https://scholar.google.com/citations?user=oZUZmFYAAAAJ&hl=es , Leo Kozachkov https://kozleo.github.io/ , my PhD student https://anayebi.github.io/group/ Reece Keller https://x.com/rdkeller/status/1930687419629600924 , my Neuromatch AI Sentience Scholar https://neuromatch.io/ai-sentience-scholars/ Noushin Quazi https://docs.neuromatch.io/p/eLCCBXYkpgNdRs/Noushin-Quazi , and the anonymous UAI https://www.auai.org/uai2026/ reviewers for helpful feedback on a draft of this manuscript. We acknowledge the Burroughs Wellcome Fund CASI award https://www.bwfund.org/grants/interfaces-in-science/career-awards-at-the-scientific-interface/ , Foresight Institute https://foresight.org/ , and Protocol Labs https://www.protocol.ai/ for funding.