What Capable Agents Must Know: Why AI Consciousness May Be an Inevitable Byproduct of Capability


*[No LLMs were used (or harmed!) in the writing of this blogpost!]* *Technical results can all be found here:* https://arxiv.org/abs/2603.02491

This work, and this post about it, was born out of a frustration. The frustration first emerged over a year ago when I was at a Dave & Buster’s for the first time in years (for an annual ML department event, no less!), surrounded by flashing lights and NPC agents, unable to objectively rule out that they were conscious, even though I strongly *felt* these particular programmed agents weren’t. I wasn’t particularly interested in playing the games there; I mainly sat in confusion the whole time, watching the various games running around me…

So, given my background in NeuroAI (though I prefer the term "natural science of intelligence", but "NeuroAI" is apparently catchier!) and wanting to make claims about the mind and brain quantitative, I set about trying to design empirical tests for leading theories of consciousness (e.g. global workspaces). The idea was that one could falsify these theories in human brains (which we agree are conscious!), potentially corroborate their general signatures in animal brains, and possibly test them in LLMs, where we have self-report and direct access but no consensus on their sentience. The aim of all this was to converge on substrate-independent architectures that give rise to this subjective experience.

After a few months and about 129 (!) pages of failed attempts, I felt deeply dissatisfied with this as any sort of long-term research program, in part because it seemed somewhat ad-hoc and non-normative, tailored to each theory without really knowing why I should *a priori* accept a particular theory's property for deeming something conscious or not. One notable exception here is Integrated Information Theory (IIT), which my friend & longtime quantum complexity collaborator @ScottAaronson showed over a decade ago leads to absurdities, like labelling simple expander graphs "conscious". In fact, even current definitions of IIT's Phi (IIT 4.0) still fall to Scott's loophole! Not to mention recent negative results in Nature currently finding support for

In fact, I used to think AI consciousness (the first-person subjective experience of what it’s like to be oneself) was not guaranteed to “pop out” via task competence, and might require other specific architectural conditions. In other words, I treated intelligence and consciousness as potentially orthogonal axes (a view previously espoused by Anil Seth), so that intelligence did not necessarily imply consciousness.

In this post, I will mention recent results from my UAI 2026 paper **“What Capable Agents Must Know: Selection Theorems for Robust Decision-Making under Uncertainty”** (technical results summarized here), which have updated my views and take a more normative approach to this question. Specifically, I prove selection theorems about what capable agents must *necessarily* have, constituting the first arrow in this diagram:

Capability

(UAI 2026) ↓ {world models, belief-like memory, emotion primitives, representational convergence}

? ↓

first-person subjective experience

I emphasize that the second arrow has *not* been proven, and is a major open question for future work. I will suggest some preliminary empirical evidence towards it though, and why I think it’s reasonable. In other words, I think consciousness is likely far more common and less mysterious than initially meets the eye, especially when we take a control-theoretic perspective.

Let’s start with the first arrow, which is what my UAI paper is about. There were three principal inspirations for this line of work. The first was the point of view established in NeuroAI over the past decade: task-optimized neural networks are the best predictors of large-scale neural population responses, across brain areas and species (check out my PhD thesis & recording to learn more at a high level, though I recommend my more recent talk for the latest in this domain, especially as it connects to agency). In other words, doing well on the AI goal of building more intelligent systems in the world gives rise to better brain models too. Thus, NeuroAI has empirically shown that how competent you are on tasks is a strong forcing function on the internals.

The second inspiration was the Conscious Turing Machine (CTM) model of Lenore and Manuel Blum. In fact, knowing them as personal friends since I came to CMU is what got me thinking about this subject more intently in the first place. They serve as role models of serious scientists who weren’t afraid to engage in a traditionally taboo subject (though a bit less taboo now as AI capabilities have increased!). What I particularly like about their CTM model is that it engages with modern AI ideas such as memory, sensor data, and most notably world models (the idea that organisms need world models dates back, as far as I can ascertain, to the Scottish psychologist Ken Craik in 1943, before the modern digital computer was prominent!). These ideas aren’t present in older theories like the global workspace.

The final inspiration (and missing piece!) was the recent ICML 2025 work of @Jonathan Richens and @tom4everitt showing that under deterministic policies in fully observed settings, agents capable of completing multiple goals under worst-case regret necessarily develop world models. They had related results earlier for causal world models in ICLR 2024, with a nice illustrated summary of that work by @Dalcy here.

So finally, by winter break (December) of 2025, I started to see a more normative, NeuroAI-esque, route to the question of consciousness, whereby task performance gives rise to world models, and world models may have implications for consciousness (for one, it necessarily entails modeling oneself to accurately predict what consequences one’s actions will produce to reliably estimate ). More on this in the “Arrow 2” section below.

However, for this route to really be fruitful, we need to extend this to more realistic domains: e.g. stochastic policies (which are commonly used by modern RL algorithms like Dreamer), average-case regret (rather than worst-case), and *partial observability*. In fact, proving world model necessity under partial observability was an explicitly posed open question of Richens & Everitt 2025, which we resolve in this work. We also show that other internal features besides world models emerge depending on the task family, schematized below:

Taken together, these results formalize a simple principle:

Robust generalization under uncertainty selects for the predictive internal structure tested by the evaluation task family.

After all, no representation theorem can force an agent to distinguish internal states that are never tested by the goals. Note that this is in contrast with classical results in control and reinforcement learning, which show that optimal behavior can be implemented using belief states or world models (Sondik 1971, Kaelbling et al. 1998). In other words, those results show that a predictive internal state is sufficient, but not that it is necessary.

Specifically, we show in Theorem 1 that in the fully observed setting, even under average-case regret with stochastic policies, an agent becomes a better estimator of the world as the number of goals increases, reflecting the fact that longer-horizon goal competence forces the agent to estimate transition dynamics with increasing precision. In contrast, for purely myopic goals, accurate world modeling is not required. This makes explicit one of the pitfalls of the Good Regulator Theorem (Conant and Ashby 1970): trivial or constant policies can suffice for immediate control (e.g. a thermostat!), but fail once multi-step coordination is demanded. We also show in **Corollary 1** that we can recover an interventional world model (Pearl Level 2), but *not* a counterfactual world model (Pearl Level 3), which we show in Corollary 2 requires an explicit structural causal model specifying the exogenous noise and its cross-action coupling, not merely the interventional transition kernel.
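To see why longer horizons demand sharper estimates, here is a toy sketch. It is an illustration under simplifying assumptions, not the construction used in the paper. A hypothetical idealized agent knows a single success probability `p_true` and, for each threshold `k`, picks the likelier of two bets on `n` independent trials: "at most `k` successes" or "more than `k`". An observer who sees only those choices can read off the switch point, which is the binomial median and so lies within `1/n` of `p_true`. With `n = 1` the choices reveal only which side of one half `p_true` falls on. All function names are made up for this example.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(at most k successes in n independent trials with success prob p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def agent_prefers_at_most(k: int, n: int, p_true: float) -> bool:
    """Idealized agent: picks 'at most k successes' iff it is the likelier bet."""
    return binom_cdf(k, n, p_true) >= 0.5


def estimate_from_choices(n: int, p_true: float) -> float:
    """Observer's estimate of p_true, using only the agent's choices.

    The smallest k at which the agent switches to 'at most k' is the
    binomial median, which lies between floor(n * p) and ceil(n * p).
    """
    for k in range(n + 1):
        if agent_prefers_at_most(k, n, p_true):
            return k / n
    return 1.0  # unreachable in exact arithmetic; guards against rounding


if __name__ == "__main__":
    p = 0.37  # hypothetical transition probability, chosen arbitrarily
    for n in (1, 10, 100):
        est = estimate_from_choices(n, p)
        print(f"n={n:3d}  estimate={est:.2f}  error={abs(est - p):.2f}")
```

The point of the design is that the observer never sees `p_true` directly: the precision comes entirely from how finely the goal family probes the agent's choices.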

What about partial observability? This was open because, under partial observability, we cannot guarantee that the agent's action choices isolate a single underlying transition probability in the way they do in the fully observed case. When the agent sees only an observation rather than the true state, the success probabilities of the diagnostic branches become mixtures over the latent states consistent with that observation, and different latent dynamics can induce identical observable behavior on all composite goals of bounded depth. Consequently, low regret does not imply recovery of the underlying transition kernel without additional structure. This breaks the direct reduction used in Theorem 1 and requires a more careful selection of diagnostic goals, defined at the level of predictive beliefs rather than physical states. We achieve this by combining a framework of action-conditioned bets with predictive-state representations (PSRs). Specifically, we prove in **Theorem 2** the necessity of an internal predictive state, which is the analogue of the world model necessity result in the partially observed setting. We additionally give a recovery algorithm for the predictive state in general environments in Theorem 3, and show in Theorem 4 that in linear environments the compressed linear PSR operator can be recovered. This resolves the open question posed by Richens & Everitt 2025.
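The identifiability problem can be seen in a tiny hidden Markov model. The sketch below is action-free (a simplification, since the setting above has actions) and uses made-up numbers and hypothetical helper names. Relabeling the two latent states changes the transition matrix, yet every observation sequence keeps the same probability, so no observable test can tell the two latent kernels apart. Relabeling is only the simplest such symmetry. What an observer can pin down is the prediction over future observations, which is the kind of object a predictive state captures.

```python
from itertools import product


def sequence_prob(init, trans, emit, obs):
    """P(obs) under an HMM: emit from the current state, then transition."""
    n = len(init)
    alpha = [init[s] * emit[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[s] * trans[s][t] for s in range(n)) * emit[t][o]
            for t in range(n)
        ]
    return sum(alpha)


def relabel(init, trans, emit, perm):
    """Rename latent states: new state i is old state perm[i]."""
    n = len(init)
    return (
        [init[perm[i]] for i in range(n)],
        [[trans[perm[i]][perm[j]] for j in range(n)] for i in range(n)],
        [emit[perm[i]] for i in range(n)],
    )


# Made-up two-state model with two observation symbols (0 and 1).
INIT = [0.7, 0.3]
TRANS = [[0.9, 0.1], [0.4, 0.6]]
EMIT = [[0.8, 0.2], [0.3, 0.7]]

INIT2, TRANS2, EMIT2 = relabel(INIT, TRANS, EMIT, perm=[1, 0])


def max_gap(length=4):
    """Largest difference in sequence probability between the two models."""
    return max(
        abs(
            sequence_prob(INIT, TRANS, EMIT, obs)
            - sequence_prob(INIT2, TRANS2, EMIT2, obs)
        )
        for obs in product((0, 1), repeat=length)
    )


if __name__ == "__main__":
    print("latent kernels differ:", TRANS != TRANS2)
    print("max gap over length-4 observation sequences:", max_gap())
```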

What about other internal properties beyond world models?

Under the same betting framework introduced above to handle partial observability, we show that average-case competence under different task families yields further necessary properties: memory, modularity, tracking of internal drives, and an inner representational match between agents:

**Memory:** We show in Theorem 5 that if agents have to achieve low average regret on tasks that require distinguishing between paired histories with the same last observation, they must internally represent a memory of their history.

**Modularity:** Inspired by initial observations of @johnswentworth of how modularity in the system evolves to match modularity in the environment, we prove in Corollary 3 that if an agent has to achieve low average regret on *distinct* tasks, then there must be internal informational modularity within the agent.

**Emotion primitives:** We prove in Corollary 4 that when an agent is forced to juggle a mixture of different tasks simultaneously, it cannot process everything at once. Rather, to remain competent, the agent is forced to develop persistent regime-tracking variables that track its internal state across tasks. In biological brains, these persistent, internal variables are what we call "affect" or "primitive emotions" (like fear, curiosity, or alertness) that globally modulate behavior, and are a core feature of leading (even conflicting!) theories of emotion (e.g. Ekman 1992, Barrett 2017).

**Representational convergence:** Finally, we prove in Corollary 5 that if two agents achieve vanishing regret on the *same* family of tasks, there must necessarily exist an invertible mapping (isomorphism) between them. This is the first formalization of the informal
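The memory result (Theorem 5) above can be illustrated with a toy task. This is a sketch with hypothetical names, not the paper's formal setup. Two equally likely histories end in the same observation, but the correct action depends on the first observation. Every policy that reads only the last observation is wrong on half of the histories, while a policy that keeps one bit of memory is always right.

```python
# Two histories that share their last observation ("blank").
HISTORIES = [("cue_A", "blank"), ("cue_B", "blank")]
ACTIONS = ["A", "B"]


def correct_action(history):
    """The rewarded action is determined by the FIRST observation."""
    return "A" if history[0] == "cue_A" else "B"


def average_regret(policy):
    """Fraction of (equally likely) histories on which the policy errs."""
    errors = sum(policy(h) != correct_action(h) for h in HISTORIES)
    return errors / len(HISTORIES)


def memoryless_policies():
    """All deterministic policies that see only the last observation.

    The last observation is always "blank", so each such policy is a
    constant action.
    """
    for action in ACTIONS:
        # The default argument binds the current action to this lambda.
        yield lambda history, a=action: a


def memory_policy(history):
    """Keeps one bit of memory: which cue appeared first."""
    return "A" if history[0] == "cue_A" else "B"


if __name__ == "__main__":
    for i, policy in enumerate(memoryless_policies()):
        print(f"memoryless policy {i}: regret {average_regret(policy)}")
    print("memory policy: regret", average_regret(memory_policy))
```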

This leads me to the second arrow.

We have shown in Arrow 1 that as agents become more capable on long-horizon goals, on tasks that require distinguishing between histories, and on mixtures of different tasks, they must internally develop world models, belief-like memory, and core primitives associated with emotions. Finally, as we build robots to be competent on the same task families we are competent at in the real world, there will necessarily be an isomorphism between their internal representations and ours.

The latter is consistent with the empirical trend we see in NeuroAI of representational convergence between artificial and biological neural networks, along with recent work of Thobani et al. 2025 exhibiting these invertible mappings in real brain data (such inter-animal mappings form the basis of the NeuroAI Turing Test, which determines the ceiling of when a model-brain mapping is “good”).
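As a minimal picture of what an invertible mapping between two agents' internal states means, consider the sketch below. It assumes, purely for illustration, that the mapping is linear and two-dimensional; Corollary 5 as described above only asserts that some invertible mapping exists. The state vectors, the matrix `M`, and the helper names are all hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(row[j] * vector[j] for j in range(len(vector))) for row in matrix]


def determinant_2x2(matrix):
    (a, b), (c, d) = matrix
    return a * d - b * c


def is_invertible(matrix, tol=1e-12):
    """A 2x2 linear map is invertible iff its determinant is nonzero."""
    return abs(determinant_2x2(matrix)) > tol


def inverse_2x2(matrix):
    (a, b), (c, d) = matrix
    det = determinant_2x2(matrix)
    if abs(det) <= 1e-12:
        raise ValueError("mapping is not invertible")
    return [[d / det, -b / det], [-c / det, a / det]]


# Hypothetical internal states of agent A (e.g. two predicted probabilities).
states_a = [[0.9, 0.2], [0.1, 0.7], [0.5, 0.5]]

# Hypothetical agent B stores the same information in other coordinates.
M = [[2.0, 1.0], [1.0, 1.0]]
states_b = [matvec(M, s) for s in states_a]

# Because M is invertible, agent A's states can be read back out of agent B's.
M_inv = inverse_2x2(M)
recovered = [matvec(M_inv, s) for s in states_b]

if __name__ == "__main__":
    for original, back in zip(states_a, recovered):
        print(original, "->", [round(x, 6) for x in back])
```

A singular matrix such as `[[1.0, 1.0], [1.0, 1.0]]` fails `is_invertible`, and no such read-out exists for it.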

In fact, we are starting to see evidence for the predictions of the theory in Arrow 1 in frontier models today. Regarding world models, we already see this in recent Neural Computer work of Rivard et al. 2026 & Zhuge et al. 2026, where video diffusion models trained on clicks develop a fairly good internal model of an operating system, complete with a filesystem etc., forming a computer-use example of the more general principle established in Arrow 1. Furthermore, not only are world models essential to CTM’s hypotheses of generating first-person experience (Lenore explains it really well here), but we even see recent empirical work of @Uzay Macar et al. 2026 (summarized here) that shows introspective awareness in LLMs emerging through DPO post-training, which is consistent with our theory’s prediction of how post-training to do long-horizon goals endows this ability.

In fact, exactly a month after I posted the first version of my preprint, which had Corollary 4 on emotion primitives, Anthropic released their functional emotions work, along with concurrent work on this topic by Sun et al. 2026 and Choi & Weber 2026. What was most gratifying was digging into the Claude Mythos Preview system card and finding the following figure on pg. 178, which very much mirrored the regime-tracking variables in Corollary 4:

Of course, it’s unclear whether these currently identified “functional emotions” act in full generality as emotions do in humans as nicely critiqued by Goldenberg & Gross 2026, e.g. modulating policy, and if the agent truly “feels/experiences” them from a first-person point of view. But this is an intriguing start that should be explored further, as I now outline below.

Thus, given this wealth of empirical evidence in frontier models, both supported and predicted by the theory in Arrow 1, I think Arrow 2 is quite reasonable (either now or in the near future). I think more experiments need to be done on LLMs, and that they will likely be a great source of empirical sanity-checking for the nascent/emerging science of consciousness, allowing us to identify objective measures that correlate with LLM self-reports, because their internals can be measured & manipulated AND they have language outputs. Humans have the latter but not the former, animals have the former but not the latter, and so LLMs can be a good testbed for generating falsifiable theories of subjective experience we could potentially even compare to brain data down the line, using the established tools of NeuroAI/NeuroAI Turing Test. Such a science, once developed, would provide formal definitions of personhood (which would have to be robust to rewiring, just as two brains are) and degrees of welfare granted, in conjunction with actionable policy recommendations. It may also improve our welfare, as we would better understand the mechanisms in our brains that increase or decrease it.

In fact, in @Robbo et al. 2024’s position paper on AI welfare, they make a distinction between robust agency and consciousness. But given the results above, I believe they may be one and the same — in other words, robust agency (minimally) implies consciousness.

Now, as Dylan Hadfield-Menell points out to me, it could still be the case that how these emergent selected-for components of world models, memory, etc. are wired up could matter for consciousness. I think that’s unlikely (but not necessarily ruled out!) for two reasons: (1) Having a good world model implies having a good "self model", in order to predict the next state well from the actions you take (just by construction). (2) The NeuroAI Turing Test reasoning: my brain and yours aren’t perfectly aligned, yet we are both intelligent and conscious, so there are many different ways to wire things up that lead to conscious experience, and markers of it seem to be present in many other species too (which @Robbo et al. 2024 also make the case for via designing “marker tests” in models, inspired by animal studies).

Finally, I would be remiss not to discuss why I seem to take computational functionalism as a given here. I suspect many readers on this forum already accept it and it is not only obvious to computer scientists, but any alternative is not supported by what's known in neuroscience. However, given that there are some exceptions

In fact, it’s actually worth noting you don’t need to *assume* functionalism for the above to go through (or even the Physical Church-Turing Thesis!). Rather, it’s a finite realizability argument, which I fully explicate in my 2014 Minds & Machines article, generalizing and building on prior seminal work of Alan Turing's only PhD student Robin Gandy 1980 for discrete systems, Gandy's later unpublished handwritten 1993 manuscript on analog systems (which I typeset in 2013 & put online 10 years later in 2023, full story here!), and computing pioneer Martin Davis 2004 (who graciously gave me feedback on my own article & supported it):

Under the currently well-accepted laws of physics, *all* finite, physical processes *are* Turing computations. This holds down to the quantum level too, along with analog processes.

Finiteness is what matters here, because you could maybe argue that our physical laws could still operate with infinite-precision reals “underneath” but we just don’t have measurable access to them (yet somehow our brains do!), but this would summarily violate the Bekenstein bound (that any finite region of the universe can contain only a finite amount of information), which would in turn violate the Second Law of Thermodynamics. In other words, to summarize (as I say in this Twitter thread):

Consciousness arises in brains
Brain processes are physical
All physical processes are computations

Therefore, consciousness is a computation. Thus, the question is not *whether* consciousness is computational, but rather: **Which** computations are necessary and sufficient for consciousness?

These then matter for practical questions of efficiency when being simulated. Now, some may still argue that

Which brings us back full circle to the start of this post. 🙂

I don’t think the NPCs in those games are conscious, since (1a) they do not have to represent long-horizon goals and so are not forced to internally have a world model to plan in (and therefore, as part of that world model, a self-model of their actions and the consequences they lead to), which is the same reason a thermostat isn't either; (1b) they do not have to do well on task mixtures, so nothing forces the emotion primitives we discussed earlier; and (2) perhaps less important on a fundamental level, their environment statistics are not those of our world, nor are the task families the same, so there is no opportunity for representational convergence between us and them. Thus, even if (1a-b) held, their world/self models, had they had any, would not match how we would represent the world either.

Taken together, do our Arrow 1 results suggest a universal agent architecture (UAA)? One that must emerge when agents are sufficiently competent in their environments?

I suspect this to be the case, as I elaborated 2 years ago here, but now supported by the recent theory in Arrow 1:

I thank Lenore Blum, Manuel Blum, Dylan Hadfield-Menell, and Daniel Yamins for helpful discussions, as well as Santiago Cifuentes, Leo Kozachkov, my PhD student Reece Keller, my Neuromatch AI Sentience Scholar Noushin Quazi, and the anonymous UAI reviewers for helpful feedback on a draft of this manuscript.

We acknowledge the Burroughs Wellcome Fund (CASI award), Foresight Institute, and Protocol Labs for funding.

source

lesswrong.com (original article)
