Aligning Superintelligent Humans

A new approach to AI alignment proposes keeping the optimizer at a manageable capacity by boosting human intelligence and introspection through brain-computer interfaces, rather than trying to control a vastly more powerful artificial superintelligence. The strategy argues that individual humans can have more control over their own value drift than labs will have over future AIs, potentially enabling a pivotal act that stabilizes a long reflection with minimal cultural edits. This native-cyborgism approach aims to attenuate the power asymmetry problem, in which weaker human optimizers cannot reliably steer stronger AI systems.

The last post was an example of how intelligence-correlated tools can stabilize reflection. Here I'll discuss how native-cyborgism attenuates the hard parts of getting to a pivotal act.

Probable ASI is grown end-to-end and evaluated by much less capable humans. Alignment is hard because we're weaker optimizers than what we're trying to steer: this power asymmetry means, in general, that the processes we're trying to control simply route around our measures. A mathematically robust (https://www.lesswrong.com/posts/HfqbjwpAEGep9mHhc/the-plan-2023-version) specification of properties like corrigibility, if translated to PyTorch code, removes the gaps an ASI would otherwise flow through. I'm glad some people (https://www.lesswrong.com/posts/w4aeAFzSAguvqA5qu/how-to-go-from-interpretability-to-alignment-just-retarget) are working on this, but perhaps there are other approaches.

What if we instead kept the optimizer at a manageable capacity [1]? Humans, I think, can have much more control over where their values drift as individuals than labs will over future AIs. We can actively guide our internals because they're apparent to us; value stability scales with introspection.

People don't typically try (https://www.lesswrong.com/posts/WLJwTJ7uGPA5Qphbp/trying-to-try) to do this. Most who do try don't have particularly strong models of cognitive neuroscience, and rarer still [2] are those who correlate evolutionary psychology and other relevant sciences to their introspection.

To rephrase, some form of self-understanding scales with introspection and intelligence, and so boosting both intelligence and introspection would buffer ontological drift in the precise sense of preventing incomprehensibility and "optimizer power mismatch" (which in this case is more like adversarial outputs from the BCI).

"Rolling your own metaethics"

What's a value? Not necessarily affective response; valence is correlated with, but is not precisely, "value".
"The simplest description of what you're optimizing for" sucks when "you" aren't well-defined; e.g. every atom in my body optimizes for stable electron configurations; my neurons often optimize for cortical arousal against my wishes (insomnia); my metacognitive process monitor probably optimizes for sensual and semantic error minimization, but "I" angle for unpredictable situations, etc. This is one example of how unsolved metaphilosophy (https://www.lesswrong.com/posts/fJqP9WcnHXBRBeiBg/meta-questions-about-metaphilosophy) makes alignment hard, even when you start with humans.

Unfortunately, even a long reflection (https://www.lesswrong.com/posts/DJRe5obJd7kqCkvRr/don-t-leave-your-fingerprints-on-the-future) probably suffers this problem class; "human" isn't currently well-defined in this domain, but we must edit cognitive prerequisites (https://www.lesswrong.com/posts/vzHtHHBJoKATi5SeK/empowerment-corrigibility-etc-are-simple-abstractions-of-a) for "values" somewhere during CEV. For example, raising a generation or CEV'ing a humanlike society without war will change how they "value" everything culturally and psychologically downstream of experiencing war. Removing whatever neural machinery differentially grows "values" depending on exposure to war is also a cognitive edit. So it's probably impossible to keep your hands off the future in a mathematically robust way. A long reflection probably still returns great values with minimal pericultural edits; I think we should aim for this.

When I say "value stability" here, I mean something like "the propensity to commit a pivotal act which stabilizes a long reflection under minimal cultural edits". Current population-median humans are bad at value stability in this sense (for example, when newly powerful; see https://www.lesswrong.com/posts/v8rghtzWCziYuMdJ5/why-does-power-corrupt).
But I don't think most people even try to be temporally coherent, at least beyond satisfying some Duty; nor do they have solid models of how reward actually happens in their brain, which are prerequisite to robust self-alignment tools.

I dunno, like, if someone gets clinical depression, they’ll say “I want to enjoy all the things that I used to enjoy”. But they can’t. They don’t know how. It’s probably possible if they find a sufficiently good therapist, but certainly not straightforward.

Most clinical depression is probably a neuroplasticity deficiency, i.e. solved by access to neural self-modification [3]; but there are other

Once, in the ancient ages of MTV and arcade games [4], She found that the electrode hit an erotic circuit; so she kept pressing the button until she was skipping all obligations to keep hitting her right thalamic nucleus. Like depression, wireheading is a behavioral attractor. I expect that most aberrant [5] cognitive attractors under self-modification are debilitating; e.g. sunk costs, narcissism, confirmation bias.

But there do exist non-crippling, worrisome behavioral attractors. Having less empathy for outgroup members, for example, is genetically and memetically coded for in most humans. While harder, the same tools apply here as for irrational basins; see the previous post (https://www.lesswrong.com/posts/AzRmp7NXY4Ethzt5i/beyond-hardcoded-evolutionary-psychology) for examples. Evolutionary psychology more generally has a massive toolset which a smarter augmented group could distill.

As someone's general cognition improves, they get finer felt senses and models supporting self-alignment (neuroscience, social modeling). I'm gonna call this gestalt tooling "introspection". So, "value stability" scales with introspective tooling.

What sort of BCI algorithm could drastically boost cognition while maintaining introspection? I don't have a good answer for this in advance of experimental results about how the heck local neuronal learning occurs.
But one constraint, which I'll sketch later in the engineering portion of this sequence, is that every few layers the model decodes into biological activations before re-encoding. Inasmuch as humans naturally learn to introspect, intermediate readouts let this continue. [6]

If the digital algo is a grown optimizer and part of a general intelligence, how doesn't it foom? Like, if you're doing agent-y things, then your software will learn agent-y outputs, right? And then you get classic instrumental convergence. Unlike RLAIF, we'd be starting from a baseline where:

Unfortunately, there exist unkind people. We can select for people with e.g. long prosocial careers, high introspection, positive interviews with close friends / relatives, cognitomotor [7] symptoms of empathy, etc., but we are still trusting someone to be Good, to stably pursue and execute a minimal pivotal act. Cohorts could further reduce the risk of a sole human superintelligence going in some weird direction, be that via resonant idiosyncrasies or social starvation.

As for attractors which are harder to predict in advance, those seem nasty, and I don't have any advance-predicted tools. Need better models (https://www.lesswrong.com/s/evLkoqsbi79AnM5sz/p/pPWiLGsWCtN92vLwu). If you're interested in gaming/simulating this error class, contact me; it seems high-value not just for superhumans but also for modeling RLAIF-centric takeoff timelines.

LLM labs think they're doing some combination of robust specification and par-capabilities with RLAIF and lots of other tools, but this will kill everyone if they reach substantially superhuman capability; the holes aren't inarguably visible yet, but they can't hold arbitrary pressure. Nor are such methods even robust enough to cleanly execute a pivotal act in the unlikely event corporate leadership avails GPT-moose-8.6-ultra of resources to do so. And that's assuming LLMs, instead of some much more efficient model class.
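The intermediate-readout constraint mentioned above (decode into biological activations every few layers, then re-encode) can be illustrated with a toy forward pass. This is only a sketch under loud assumptions, not the design reserved for the engineering portion of this sequence: the dimensions `BIO_DIM` and `LATENT_DIM`, the random untrained weights, the tanh layers, and the helper names `build_stack` and `forward` are all hypothetical and chosen purely for illustration.

```python
import math
import random

BIO_DIM = 8      # hypothetical: number of biological channels read and written
LATENT_DIM = 32  # hypothetical: width of the purely digital layers


def random_matrix(rows, cols, rng):
    """Small random weights; a stand-in for whatever a trained model would learn."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def layer(weights, vec):
    """One dense layer followed by tanh, written out in plain Python."""
    return [math.tanh(sum(w * x for w, x in zip(row, vec))) for row in weights]


def build_stack(n_blocks, layers_per_block, seed=0):
    """Each block: encode from bio space, a few digital layers, decode back to bio space."""
    rng = random.Random(seed)
    blocks = []
    for _ in range(n_blocks):
        blocks.append({
            "encode": random_matrix(LATENT_DIM, BIO_DIM, rng),
            "hidden": [random_matrix(LATENT_DIM, LATENT_DIM, rng)
                       for _ in range(layers_per_block)],
            "decode": random_matrix(BIO_DIM, LATENT_DIM, rng),
        })
    return blocks


def forward(blocks, bio_activations):
    """Run the stack, collecting the bio-space readout after every block."""
    readouts = []
    x = bio_activations
    for block in blocks:
        h = layer(block["encode"], x)
        for w in block["hidden"]:
            h = layer(w, h)
        # The only path to the next block passes through bio-dimensional activations.
        x = layer(block["decode"], h)
        readouts.append(x)
    return x, readouts


blocks = build_stack(n_blocks=3, layers_per_block=2)
signal = [0.1 * i for i in range(BIO_DIM)]
output, readouts = forward(blocks, signal)
print(len(readouts), len(readouts[0]))  # prints: 3 8
```

The point of the structure is that no block can pass information forward except through a representation of the same shape as the biological activations, so every few layers there is a readout that the human side could, in principle, keep learning to introspect on.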
[2] There are 20,000 or so folks here on LessWrong out of 9,000,000,000. Plausibly many non-LWers train rationality-adjacent skills to moderate effect, including self-alignment.

[3] Higher plasticity alone seems to be about as effective as ECT for depression; augmentees would have access to other hyperparameters (E/I balance, gross connectivity, lateral inhibition), though I don't think this specific example matters much for value stability, nor do I have any idea how to reduce things which look like ASD/ADHD/schizophrenia.

[4] And even earlier, during the pre-Cambrian (https://en.wikipedia.org/wiki/Brain_stimulation_reward)

[5] All grown intelligent circuits are cognitive attractors, so "aberrant" means something more like bad-for-coherence, which needs better agent theory than I can provide.

[6] We could also optimize some statistics of the decode representation to resemble SAEs or similar AI interp tools as an auxiliary method, but I'm more optimistic about biosimilarity than interp crosspollination on the margin.

[7] For example, AIs which use facial micromovements / body language to classify emotional responses more accurately than humans can; current models aren't great at this, see here (https://www.sciencedirect.com/science/article/pii/S0262885624000556). I imagine better versions are already used by intelligence agencies.