Simplifying Alignment by Expanding Scope

wpnews.pro

This post is crossposted from my Substack, Structure and Guarantees, where I explore how formal verification and related ideas might scale to more complex intelligent systems. This installment begins a sequence on lessons for alignment from the practice of formal verification. Counterintuitively, adding additional layers to a formally verified system can actually simplify the problem of specifying what safe behavior means. This phenomenon may point toward useful structuring principles even for superintelligent systems, or at least provide a productive starting point for alignment brainstorming grounded in concrete engineering experience.

I just summarized the story so far in the approach I’m laying out for building trustworthy artificial intelligence, and now I want to step back and explain more of the key ingredients. To use formal verification to prove mathematically that programs behave as we would like, we’ll first need to explain what we like in great detail. Such explanations are called specifications.

The challenge of getting specifications right is highly related to a high-profile problem in AI safety, alignment. Roughly, both problems are variants of the classic hazard of asking a

However, the situation decades ago was already pretty risky in representative ways. First, human software engineers can make mistakes. More importantly, someone providing an appealing piece of software might actually be out to get us. It wouldn’t be very practical to write all of our own software from scratch, yet dividing up the work exposes us to sabotage by code authors. (And before someone objects that AI coding assistants now make it practical to write all of our own software from scratch, that usage mode just raises the risk of mistakes or sabotage by those agents!) I wrote previously about how agents sharing code with each other forces a solution to this problem, which luckily has already been studied in the old-fashioned setting.

The subject of the current post is a piece of good news about how alignment, or getting specifications right, is easier than it might sound. Across levels of familiarity with the subject, most people would probably agree that we expect alignment to get more challenging as systems grow in complexity. More moving parts means more parts that could contribute to bad outcomes, and it seems fatal to give any part the benefit of the doubt and skip careful specification for it. Luckily, the storied technique of end-to-end formal verification provides a counterintuitive way to harness expanded scope to lower risks of specification mistakes.

This post will take us through a series of steps of enlightenment in system construction. We can start by noticing that software seems awfully hard to get right, so it is worth investing in disciplined quality-assurance techniques. Formal verification, with its focus on rigorous proofs that implementations meet specifications, is an intuitively appealing technique. What could be better than mathematical certainty that a system behaves itself?

We already set the stage for the first objection: proof only helps us when it proves the right theorem, the specification for an artifact. If we got the specification wrong, then we may wind up worse off, thanks to a false sense of security from a misguided but complete proof. Nonetheless, assume for the sake of argument that we have created a formal specification that captures all of our true requirements for a system.

Let’s say that system is a complex piece of software. The gradually enlightened skeptic notices that software doesn’t just run on its own. There is always some programming-language implementation underneath, and what if that component has a bug? Then behavior can be changed arbitrarily, invalidating all of our original proof work. But why stop there? What about the computer hardware that the program runs on? This reductio can go on for as long as we like, perhaps winding up in quantum mechanics.

With end-to-end formal verification, we choose to stop the “what ifs” at some level and then prove all of the implicated components as a unit. Engineering costs are typically most reasonable when this verification is done modularly, which means that every natural component has its own specification and proof, which we later compose into one full-system specification and proof. Here is a diagram of a natural layered system decomposition, which follows a common pattern of successively translating high-level code into lower-level code as we proceed down the diagram. (We will return to discuss the intermediate specification boxes in the next section.)

While thinking about aligning superintelligence has by necessity been almost entirely theoretical and perhaps even philosophical, end-to-end formal verification has been concretely possible for decades, and we can learn about the broader challenges and potential by studying examples. One of my own related projects is the verified IoT lightbulb, with a unified proof of all digital parts of a simple system to control a lightbulb via an Internet connection. The bottom left of the following photo shows a A follow-on project confined itself to the software side but covered more-comprehensive functionality, including a nontrivial cryptographic protocol, for the verified garage-door opener. Following the latest fashion, this

These sorts of projects are very satisfying in pursuing end-to-end formal verification far-enough to respond to many “what if” questions like I indicated above. However, the informed observer notices a problem that seems fatal at first.

Let us now turn to the layers of the first diagram labeled as specifications. Such specifications serve as interfaces between layers. Each one effectively lays out a contract that two layers promise to follow in cooperating. The examples in the diagram are: A typical mid-layer specification of this kind has two qualities that, taken together, may seem deeply troubling. First, a bug in such a specification can wreck the validity of the whole proof effort. Second, these intermediate specifications can easily be much longer and more complex than the specifications of particular top-level programs. Think of a book-length definition of a popular programming language compared against a one-page write-up about a limited-scope but important program.

There is further challenging subtlety here than just the chance to “get the spec wrong.” Often layers of a system will be formally verified using different approaches and tools, which do not share a specification language. A given specification may be written separately in the two languages, raising the specter of inconsistency amongst what are meant to be two renderings of the same specification.

How much harm could come from inconsistency? Let’s take a popular example of undefined behavior in programming languages. The idea here is that a language has rules that all programmers must follow, and breaking the rules “voids the warrantee” and allows the language implementation to behave however it likes. How could such a punitive contract be a good idea? The problem is that language implementors want to include automatic use of optimizations on behalf of programmers, but whether an optimization is safe to apply typically depends on aspects of program behavior that are

For instance, if a program includes a sequence of 100 elements and the program tries to read the 101st element, that operation would be considered undefined behavior in C and related languages. The point is that reasonable programmers should agree that reading out-of-bounds in a sequence is a bug, and why should the language implementation be created to work around programmer bugs? Of course, the typical programmer rejoinder is that avoiding bugs is hard, and guard rails are appreciated! Nonetheless, languages like C are designed for such a performance-obsessed crowd that they tolerate the risk of undefined behavior. How does undefined behavior tie back to our risk of specification inconsistency between two adjacent layers? The trouble would be if the compiler and the software above it were proved against different definitions of what constitutes undefined behavior. Then the software above might stick to defined behavior by its own standards, yet some of that behavior might be considered undefined by the compiler, allowing it to misbehave arbitrarily. For instance, the application might assume that reading the 101st element of a sequence of length 100 returns some safe default value, whereas the compiler considers the behavior undefined and allows itself to make arbitrary changes to any code downstream of such an out-of-bounds access – including removing explicit safety checks that appeared in the application code. Moreover, the more layers we add, the more chances we have to make similar mistakes with interface inconsistency. Considering that accumulating layers is a key technique to manage complexity in engineering, it would be quite a shame to pick up a disincentive against it.

Do these risks doom the whole project of coordinated formal verification of different parts of a system?

No! In fact, adding additional layers at higher or lower levels can even reduce opportunities for uncaught specification mistakes. The next level of enlightenment is realizing that integrating the proofs of all layers of a system helps us catch all consequential specification disagreements. There could still be bugs in intermediate specifications, but if we manage to prove the system overall, then those lingering bugs turned out not to matter for our top-level objectives. If there were a serious-enough specification disagreement, then either the proof of the provider of that abstraction would fail from below (e.g. a thoroughly unrealistic promise isn’t actually realized in the code), or the proof of the consumer would fail from above (e.g. an assumption about the layer below is too weak to allow proof of this layer’s specification). A diagram can help explain what happens.

The middle layers of the system have become untrusted in the sense of computer security, meaning that lingering bugs can’t spoil the primary guarantees we are after. Not only do we not trust implementation artifacts like the compiler and CPU, but we also avoid trusting the specifications of the programming language or hardware instruction set. In mathematical parlance, they only show up in

The takeaway procedural message here is that, in doing end-to-end verification of a system stack within a single proof system, only the top and bottom specification layers remain trusted. In this way, we can often formally verify more-comprehensive systems with much less risk of misspecification than for less-comprehensive systems. By the way, this perspective is spelled out in more detail in a position paper from a broader project I was involved with. Another important aspect covered in that paper, and that I’ll return to in later posts, is the value of carrying out all proofs in a common formal system and tool ecosystem.

These lessons may generalize to problems of recursively self-improving superintelligences. Challenges in getting systems’ top-level specifications right will remain, but we can sidestep worrying about properly encapsulated internals. The next post turns to one important potential objection: computing stacks have been formally verified assuming perfect knowledge of the lower-level abstractions they run on, while real-world AI agents will need to deal with uncertainty and even adversarial platforms that they must run on.

source & further reading

lesswrong.com — original article KISS AI Safety The current bottleneck is political will, not research A Mechanistic Explanation of Prompt Injection (and why you should study roles)

Simplifying Alignment by Expanding Scope

Run your AI side-project on zahid.host