Toward a Kantian refutation of Agent Foundations

A LessWrong post argues that the goals of Agent Foundations (AF) are so far-fetched that progress has not reduced the distance to them, suggesting the goal may be unachievable. The author proposes a principled refutation of AF using a Kantian structure of abstraction levels, which could either refute AF or point to unexplored directions.

This post is a Cunningham's law draft (https://gwern.net/about long-content), less than 50% finished, in some parts mere notes. Consider (a) waiting until this notice has disappeared to read a more coherent post, or (b) criticizing it with a focus on what would be right, not just what is wrong. I haven't firmly made up my mind yet, so at this stage I'm particularly interested in fundamental criticisms of the goal and framing (but of course I also welcome minor corrections).

TL;DR: The goals of Agent Foundations seem so far-fetched that all the progress made in the field doesn't seem to have decreased our distance to them much. One might conclude that the goal is unachievable, the question ill-posed. But even the available rejections of AF seem unprincipled: instead of proving that the task is impossible, [1] we simply fail, and do something else instead. Working toward a principled refutation of Agent Foundations might either indeed refute AF or point to unexplored directions, and both outcomes would be helpful information.

Kant argued that there are three and only three related concepts of God, definable on three different levels: He argued furthermore that none of the three definitions has enough "meat" to allow for a valid proof of its existence, and concluded that everyone should stop wasting their time trying to prove God's existence. I'm not interested in the specifics of Kant's argument, only in its structure:

1. We are dealing with concepts that are to some extent [2] definable without reference to the specific facts of our universe
2. There are different levels of "abstraction from the facts of our universe" in which different definitions can be made
3. These different formulations potentially have connections to each other (e.g. one might be a sub-specification of another)
4. An attempt can be made at showing that all levels and all possible definitions have been listed
5. An attempt can be made at showing, for each level and definition, that there isn't enough for the kind of proof we are looking for

Consider: "Computation" can be defined at the level of what a Turing machine can or cannot produce, at the level of what belongs to different complexity classes, and at the level of what can be computed in our universe. The concept of computation at the topmost level has enough content to allow e.g. for a proof that Turing-machine-computable and lambda-calculus-computable refer to the same thing. At the second level, everything computable belongs to some complexity class. But you can still work on the first level and ignore this extra specification. At the third level, the Cobham-Edmonds thesis claims that only polynomial complexity is tractable in our universe. This is a fact about our universe (other universes with different properties are conceivable), and at the same time it is a very general fact that abstracts from the specifics of our universe, such that other universes that also fulfill this property while being very different from ours are also conceivable. At the bottom-most level, you need a substrate for your computations, and this adds a lot of specifications to what computing means. Across the different levels, computability is, in some sense, the same object, and in some sense three different objects. [3]

Agent Foundations is in search of a paradigm. It's probably worth reflecting on what exactly a paradigm is. My initial approximation is: a self-contained network of concepts and proofs. A second worthwhile question in the search for a paradigm is: at which level of abstraction does the paradigm live? [4] It seems plausible to me that a solution for alignment can be found using several self-contained components, and that not all components are on the same level.
But it seems very implausible that we can cobble together concepts from different levels in a haphazard fashion. So it might be useful to try to create a taxonomy of levels, to locate current agendas within them, [5] and to see what we are missing. The guiding question for locating the level at which a solution lives is: How different are the universes in which this alignment strategy would work? Would this strategy work in a universe with different physics but the same math?

Before we explore a tentative taxonomy of levels, let's try to list the concepts that we are trying to understand. There are two basic aspects to what we are trying to capture under the name of agency: an agent knows the universe and an agent acts in the universe. I will separate the two aspects and, for lack of better terms, speak of inductors and interveners. [6] A natural question to ask is: Is every inductor an intervener, and vice versa? The intervener's interventions can be conceptualized with concepts such as coherence [7] and values. I don't know if there are equivalent concepts for the inductor side. We also have concepts like alignment and control, which seem to be definable on different levels.

I will call a definition of a connection dualistic if it does not specify the things it connects. [8] The introductory definition of a set as a non-ordered list of elements is dualistic: it allows for operations like union, intersection, etc., without having to specify what the elements are. ZFC, on the other hand, is not dualistic: all ZFC-sets are ultimately definable from the empty set, and a set which isn't thus definable isn't allowed. The concept of evolution belongs here, as does non-Many-Worlds quantum mechanics. More relevant to our purposes, Bayesian updating also belongs here, as do Bayesian Neural Networks.

How does AF look inside Dualistic Mathland?
Some results and questions:

The questions that arise in Dualistic Mathland [10] don't have solutions, but in some sense this is irrelevant, because we know that these definitions are unsuitable for our universe: An optimistic scenario would be that something similar happens in the other levels.

See upcoming post Nitpicking on Embeddedness.

Dualistic and Computable Mathland are memorable names meant to facilitate discussion of things that are presumably known to everyone. Now I tentatively propose some levels that might be worth reflecting on.

The first of them is what I'll call the macro-empirical level. By this I mean working at the level of very fundamental empirical descriptions of the universe, which are general enough that research on them resembles research in Dualistic/Computable Mathland, but with the advantage of being automatically relevant for our universe. [14] The Natural Abstractions (https://www.lesswrong.com/posts/gvzW46Z3BsaZsLc25/natural-abstractions-key-claims-theorems-and-critiques-1) and Condensation (https://www.lesswrong.com/posts/BstHXPgQyfeNnLjjp/condensation) agendas seem to me to fit here. Something similar seems to apply to the attempts to define agency from Friston's Free Energy Principle (https://www.lesswrong.com/w/free-energy-principle). [15]

Several results (manifold hypothesis, platonic space hypothesis, simplicity bias) point to the universe being a model of a mathematical object, in ways beyond the trivial way that underlies any instance of empirical science. I'll define the platonico-empirical level as the level at which one can attempt to instrumentalize these facts. Michael Levin (https://lexfridman.com/michael-levin-2-transcript/) is probably the best representative of what this could look like, with his references to non-metaphorical agency inside what he calls the Platonic space, with which we can attempt to interact. [16]

Research around concepts like CEV (https://www.lesswrong.com/w/coherent-extrapolated-volition) aims to reach very general conclusions, but starts from the empirical fact of how humans are. I'm unsure whether this is regular empirical research coupled with speculation, or indeed an additional level in which, beyond CEV, other concepts could be tried. Schelling Goodness (https://www.lesswrong.com/posts/TkBCR8XRGw7qmao6z/schelling-goodness-and-shared-morality-as-a-goal) possibly also belongs here.

Agent Foundations isn't a badge of honor; "not belonging to AF" might simply mean "being a well-posed question with a solution". Thus it is neither a value judgment nor a particular surprise that agendas which are simply doing normal science are not part of Agent Foundations. The only interesting point is clarifying that empirical and non-prosaic aren't synonymous. For instance, I think of Byrnes's agenda (https://www.lesswrong.com/posts/4basF9w9jaPZpoC8R/intro-to-brain-like-agi-safety-1-what-s-the-problem-and-why) as trying to abstract the mechanism by which a human intervener is properly modeled not by a utility function, but by what the human imagines their peers thinking of them. It might be that this concept is abstractable, and if so, it could guide research on different substrates (LLMs and some architecture that we haven't discovered yet, for instance). To that extent, it is doing something beyond looking at current LLMs and trying to understand them, i.e. it's non-prosaic alignment research. But Byrnes isn't trying to locate that particular mechanism, which must sit somewhere in Mathland, through exploration of Mathland. Byrnes is trying to locate it through exploration of our universe (more specifically, of human brains and their effects), i.e. by doing brain science.
And understandably so, since there is no reason to expect that mechanism to be salient in mathland (https://www.lesswrong.com/posts/YkqnPFsBdoyprR7Zt/finding-pi-and-g-in-mathland).

The previous levels are all more or less agnostic with respect to "solving philosophy", i.e. one could for instance work on the Platonic Space without asking what exactly that means. But it is plausible that the confusion regarding this situation acts as a blocker in AF, so that working on clearing up this confusion could itself be one way of working on AF. [17] A major source of confusion is that humans are embedded, non-dualistic parts of the universe, and yet seem to have what Nagel called the "view from nowhere" (https://en.wikipedia.org/wiki/The_View_from_Nowhere), able to find out truths that are valid also outside the universe.

The hierarchy between solving Math [18] and solving Philosophy seems unclear. Solving philosophy seems to gesture at something like Or at the very least having a very convincing argument of why it's very unlikely to be possible. And the crux is precisely to what extent ?

"There's no such thing as computer science. ... At one end you have people who are really mathematicians. ... In the middle you have people working on something like the natural history of computers -- studying the behavior of algorithms for routing data through networks, for example. And then at the other extreme you have the hackers, who are trying to write interesting software, and for whom computers are just a medium of expression, as concrete is for architects or paint for painters."
Paul Graham, Hackers and Painters (https://www.paulgraham.com/hp.html)

This question seems central to distinguishing the specific things that have been tried from the general strategies that have been tried and which could be fulfilled with other specific things.
In particular, it seems to me there is no vocabulary to distinguish between the MIRI and MIRI-inspired research (https://intelligence.org/research-guide/) and the paradigm such research points towards. This post and its comments (https://www.lesswrong.com/posts/Dt4DuCCok3Xv5HEnG/agent-foundations-not-really-math-not-really-science) and this one (https://www.lesswrong.com/posts/FWvzwCDRgcjb9sigb/why-agent-foundations-an-overly-abstract-explanation) are great introductions to the second question, but they are busy rejecting criticism of the field rather than, as I intend here, creating the unified vocabulary to sketch a map of all the things that are and could be part of AF, even if those things contradict each other in the specifics.

Each concept of an agenda could in theory be on a different level, but if the agenda is coherent, we should expect to find the whole of it nested together, unless the agenda consists of several separable coherent sub-components.

From now on I will avoid using the misleading terms agent or agency.

I take the idea of separating the different definitional levels of coherence from this exchange:

Mateusz Bagiński (https://www.lesswrong.com/posts/ceEgAEXcL7cC2Ddiy/anthropic-s-hot-mess-paper-overstates-its-case-and-the-blog coFr3KAJwiMbqoeuM): How well can the entity's behavior be explained as trying to optimize a single fixed utility function? How well aligned is the entity's behavior with a coherent and self-consistent set of goals? To what degree is the entity not a hot mess of self-undermining behavior?

a monograph untangling this coherence mess some more would be valuable. it could do the following things:
- specifying a bunch of a priori different properties that could be called “coherence”
- discussing which ones are equivalent, which ones are correlated, which ones seem pretty independent
- giving good names to the notions or notion-clusters
- discussing which kinds of coherence generically increase/decrease with capabilities, which ones probably increase/decrease with capabilities in practice, which ones can both increase or decrease with capabilities depending on the development/learning process, both around human level and later/eventually, in human-like minds and more generally [2]
- discussing how this relates to AI x risk. like, which kinds of coherence should play a role in a case for AI x risk? what does that look like? or maybe the picture should make one optimistic about some approach to de-AGI-x-risk-ing? or about AGI in general? [3]

I claim this definition has the same spirit as the usual definitions, but is more accurate. See upcoming post Nitpicking about Embeddedness.

I.e. is there something like Q-learning in Dualistic Mathland?

Like "how to align an AIXI inductor-intervener?"

With the possible exception of the ones regarding Multiple Agency, about which I am even more confused than about everything else here.

This is relevant for Bayesian updates, Decision Theory, and possibly other concepts.

Nitpicking on unbounded analysis (https://www.lesswrong.com/w/methodology-of-unbounded-analysis), Yudkowsky writes: "If you can't state a program that solves the problem in principle, you are in some sense confused about the nature of the cognitive work needed to solve the problem." This is true if the problem has already been formulated inside a level lower than the one that allows unbounded analysis, but not if the problem is formulated at that level itself, or is vague enough to be formulable in several levels.
"Not being able to solve alignment as defined in Dualistic Mathland through unbounded analysis" has as little relevance as "not being able to write an algorithm that solves all possible variations of chess". The problem is, barely, well-defined enough to be a question, but not well-defined enough that not having an answer is relevant.

And, as the flip side of that advantage, the disadvantage that if the empirical description turns out to be wrong, they might lose some or all of their relevance.

I'm planning to write a post about what I see as independent claims which are often presented together, and often under the same name: This would be in application of Stuart Armstrong's (https://www.lesswrong.com/posts/XrzQW69HpidzvBxGr/affective-death-spirals?commentId=entyDfAzEPoTXDkhu) advice: "Cut up your Great Thingy into smaller independent ideas, and treat them as independent. For instance a marxist would cut up Marx's Great Thingy into several theories. Then each of them should be assessed independently, and the truth or falsity of one should not halo on the others. If we can do that, we should be safe from the spiral, as each theory is too narrow to start a spiral on its own. Same thing for every other Great Thingy out there." Claim 6 seems particularly relevant to research, because it might point to a more general answer to the question of whether every inductor is an intervener.

Trying to do something like the Natural Abstractions agenda inside that Platonic Space also seems like something potentially worth trying. One very speculative way in which this could work out: Kant sketched an argument for how every free will should act super-rationally towards other free wills. Unfortunately the concept of free will doesn't seem to be compatible with our deterministic universe. But what if we could convince an ASI that it is a free will in the Platonic Space, and we could do something like proving meta-ethical theorems that the ASI would be legitimately convinced it should obey?
Relatedly, it is interesting to note that Kant's defense of free will is much closer to the Block Universe than to Newtonian mechanics.

Also mentioned by Kaarel (https://www.lesswrong.com/posts/ceEgAEXcL7cC2Ddiy/anthropic-s-hot-mess-paper-overstates-its-case-and-the-blog fn-jtFxEkxcJvEnFezFG-2) in the previously mentioned discussion of the definitional levels of coherence.