ARC's "Outperforming Random Sampling" explained

wpnews.pro

Written as part of a FIG Fellowship under Eleni Angelou's supervision.

*I've spent some time with ARC's recent blog post, Competing with Random Sampling. I think it contains some interesting ideas. *

Unfortunately, those ideas are captured in formalisms that might intimidate anyone without the patience for some mathematics.

So here's a more intuitive explainer. Special thanks to Wilson Wu for careful reviewing and nitpicks. All errors my own.

I'm interested in this work because it attempts to set a clear goal for mechanistic interpretability.

Previous goals for mechanistic interpretability have included such catchy phrases as "complete reverse-engineering of neural networks" and "producing human-understandable explanations of neural network behaviour."

Lovely as these are, they aren't clear. ARC are clarifying the goal by looking beyond the explanations themselves to what we actually want the explanations for. They capture this in a formalism that IMO strikes accurately at some key weaknesses in our attempts to develop a science for AI safety.

I'm hopeful that this approach could lead to new opportunities to automate and scale interpretability.

I've found this analogy to be useful to get the overall structure of the argument:

Imagine we have a box containing nasty things. We'll call it Pandora's box.

This box is sealed shut by a four-digit combination lock. We'll call this Pandora's combination lock.

Unlike a normal combination lock, we don't know whether this lock actually opens. It may be that no combination (or even multiple combinations) of digits will open the lock and release the box's contents.

We want to know whether this lock will open and if so how likely it is to open. How do we test this?

The simple answer is to go through each combination of digits and observe whether the lock opens. This requires us to test different combinations, but it does give us a definitive answer on whether the box will open or not.

This is a random sampling procedure, albeit not actually random. It is highly effective but often inefficient because it takes a long time to check every single combination.[1]

But imagine we could study the lock using X-rays.

Let's say we do this and we find that the lock will only open when the first digit is set to 8.

We've managed to reduce the search space substantially. Instead of combinations, we now only have to try .

This mechanical knowledge of the lock is the mechanistic explanation.

Note how this reduced search space means that we can determine much more quickly whether (and with what frequency) the box will open. Applying some prior mechanical knowledge means that we outperform the brute random sampling method, which searches over all combinations.

I'm going to expand the analogy and introduce some notation, to bring us closer to the post's characterization.

Let's imagine that instead of a combination lock we have some complex function that takes in four digits and then opens the lock or remains shut.

For simplicity, we denote the set of all four-digit combinations as and simplify the function output as either one or zero. If the output is one, the box unlocks. If the output is zero, it remains shut. We want to characterize this function-lock's behaviour more precisely than just "does it open or not." We want to ask: "How likely is this function-lock to open?"

For this we need the number of times the lock opens, divided by the number of possible combinations. We can express this in terms of the expected value of , defined over the input space of possible combinations.

This gives us the average output over all possible combinations.

For an actual four-digit lock in this setup, this value is . With this expectation value, we capture the likelihood of the lock opening and releasing the box's contents.

This is a short aside on estimation. It should make some of the formal conditions that turn up later a bit more motivated. Skip if you only want the intuitions, or if you already know some statistics.

In reality, we can't check all possible inputs. We have to settle for an estimate of the true expectation value.

The standard tool is random sampling. We draw N inputs at random, run them through the system, and take the average outcome. Call this estimate . Because depends on which inputs we happened to draw, it is itself a random quantity: a different batch of samples yields a different .

To judge how good G is, we use the mean squared error, i.e., the expected squared gap between our estimate and the truth (I use *p *to denote the true expectation value .):

.

For random sampling this takes the following form: The only knob we control here is . More samples, smaller error.

However, an MSE value in isolation tells us nothing. We need the MSE to be small with respect to the size of . This produces problems in the case of rare events.

As a way of seeing this, consider a "lazy estimator." This is someone who tells us that the function-lock simply won't open, i.e., that it has an expected value of zero. This is not an informed guess based upon observations but simply an "locks open rarely so it's basically zero" inference.

This guess of zero has a very small MSE (relative to the true expected value of ____ if we assume the lock opens on exactly one combination). However, it is not based on observations and requires no work. It is not a good estimate of .

This is the real problem with predicting extremely rare events via random sampling. The difficulty is not that the variance is large — the variance is itself minuscule. The difficulty is that the trivial estimator is already extremely accurate in absolute terms, so any real method must work hard just to clear that floor.

We're now approaching the full characterization given in the blog post. Leaving our combination locks behind, we're going to start thinking about actual neural networks and what it means to explain them mechanistically.

We can treat our neural network as a complicated function, exactly analogous to the function-lock mentioned above.

In the post, the problem is initially set up with a model and some classifier that reads 's outputs and classifies them as catastrophic (1) or non-catastrophic (0).

In our combination lock scenarios we combine these into one function. The difference is trivial so we will continue with the combined system for now.

The denotes the architecture of the network, whether a transformer, a CNN, or something else.

The denotes the parameter settings of the trained network. We can imagine the network starting out with random settings of these parameters (the random initialization) and gradually converging to some specific settings as training grows structure in the network.

In place of our four-digit combinations, we have an input dataset. Our overall aim is to estimate the expectation value of the model + detector system over the entire dataset of possible inputs. This quantifies how dangerous our model is.

We now turn to the mechanistic explanation hinted at in the first analogy. Recall that if you know that the first digit of the combination must be an 8, you can reduce the search space by a factor of 10.

Intuitively, we can view this as reducing the number of samples needed to achieve the same sampling error. We can therefore obtain a more accurate estimate of the expectation value.

In the post, the mechanistic explanation is denoted by . This is fed into an estimator , along with the parameter settings of the model and a tolerance parameter expressing how much error we can tolerate in our estimate of the expectation value.

We now have enough components to state the Matching Sampling Principle:

In words, the principle says this:

For each architecture and all possible parameter settings of that architecture, there exists an explanation we can feed into our estimator to estimate (to within some desired tolerance) the expected output of the network + catastrophe classifier system. There are a few more constraints:

Length Constraint: the explanation must be short, where short means the explanation length should be on the order of the size of the model, as characterized by the number of parameters.

Runtime Constraint: the estimator must be able to estimate the expectation value within (where is the time it takes to run a forward pass of the model). To achieve a squared error on the order of using random sampling, we need approximately samples, which means forward passes through the model. A mechanistic estimator must be able to match this.

Accuracy Constraint: The mechanistic estimation of the expectation value must be competitive with that obtained under random sampling. That is, the sampling error must be less than .

We require these conditions to hold for all tolerance parameters .

The final requirement is that the explanation must be mechanistic. ARC do not currently have a formal definition of mechanistic, citing this as an active area of research. They offer instead the following statement:

"...loosely speaking, we mean that estimates the expected output of **** **deductively, based on the structure of **."

This mechanistic requirement should also exclude any estimations obtained via sampling, as such explanations trivially satisfy the matching part of the Matching Sampling Principle.

There is a neat idea beneath the MSP. It is this:

If we can develop a mechanistic explanation that is able to match random sampling over all possible parameter settings, then that explanation will allow us to considerably outperform random sampling on most realistic networks. Random sampling does not make any assumptions about the structure of the system being sampled. This is both a positive thing (because it applies universally) and a negative thing (because it cannot exploit any structure where it exists).

Any network that can achieve a task will exhibit a lot of structure, and hence we can expect that an estimator that uses this structure will outperform random sampling, and will in fact outperform it at a rate proportional to the amount of structure there is.

There is a loophole in the above formulation, which the post addresses.

It'll help to bear in mind that what we have here is a formal description of what it means to possess a mechanistic explanation. The following caveat may seem impossible for practical reasons, but we are working in full generality and so must avoid all loopholes, however impractical.

It's valuable to check whether the definition we have picks out all and only those entities that are genuine explanations. Remember we want a "something" that allows our estimator to estimate the model's expectation value, such that:

Let's say hypothetically that we had the statement "The expectation value of this model is 0.42" and that this statement is correct of the model. Here we are not concerned with how we obtained this correct answer. We might've computed it offline over a thousand years. We might've spoken to an oracle.

This statement is (1) below the specified length, (2) allows a suitable estimator to output the estimate in a suitable runtime, and (3) is completely accurate. Is this therefore the mechanistic explanation we're looking for?

Well no. This isn't a structural explanation, nor a deductive one. This is the loophole in the above formal specification of a mechanistic explanation.

To get round this, the post uses a patch in the form of a context variable c. We update the model description such that it now takes as input the data and context .

Similarly, the EV becomes . The explanation is first fixed and then the context is chosen at random.

To try and perform the same trick as before, we now have to write down the expectation value taking into account this extra context. We could simply rewrite our explanation as a giant lookup table mapping pairs to expectation values, but if there are lots of possible values the context can take then this look-up table will exceed the length constraint.

A truly robust explanation must work across all possible contexts, and the most effective way to do that is to represent the actual causal structure of the model.

The post makes a final addition to the formulation, it suggests how we might start to develop these explanations of the model to feed to the estimator.

In essence, they propose that we obtain explanations by letting them develop alongside out network, so whilst our network is being trained from its random initialization into a structured model, our explanation is also being grown via reference to that same developing structure.

This is very much aligned with the broad approach that Developmental Interpretability takes, and I would be interested to see how that field can enhance or otherwise fill out some of the details in ARC's formal definitions.

Here I'll follow up on a few interesting conceptual and philosophical takeaways from the work.

Recall the post's rough definition of "mechanistic":

"...loosely speaking, we mean that estimates the expected output of **** **deductively, based on the structure of **."

In a sense, a mechanistic explanation or prediction is based upon the structure that generates some observed behaviour. Inductive methods, such as random sampling, only attempt to track the patterns that emerge in that behaviour.

The problem of induction goes way back and has motivated many diligent people to write a great many dense books.

Without going too deep into the weeds, we illustrate the problem of inductive reasoning via a classic example, oft-quoted in philosophy seminars.

Say we look at 100 hundred swans and find they are all white. Inductive reasoning would have us believe that the next swan we'd see would be white also, and thus all swans outside of our sample would be white.

Crucially, it wouldn't tell us the reason why swans are white, it would merely suggest the existence of a pattern (i.e., whiteness) among some observations (i.e., of swans).

It can also be falsified very easily: one need only observe a single black swan to break the inductive statement of all swans being white.

This matters a lot for us, because we are in the business of trying to predict extremely rare "black swan" events with massive catastrophic outcomes. Random sampling, unless it tests every observation possible, will never give us certainty.

For this reason, inductive reasoning is often held up as the weaker form of reasoning. Deductive reasoning, on the other hand, does not depend on observation and inference over examples but instead explains the phenomena via laws.

The Deductive-Nomological Model of explanation characterizes what we want from deductive reasoning. According to the model, an explanation should consist of the thing to be explained (the "explanandum") and the things that do the explaining (the "explanans").

The explanans include general laws and initial conditions such that the explanandum can be logically deduced. That is, the explanandum follows as a logical consequence of the premises (explanans).

So, given the laws governing a system and its initial conditions, we should be able to derive the phenomenon in question. The explanation succeeds not because we've observed many instances, but because we understand the structure that generates those instances.

ARC's "mechanistic explanation" recaptures this intuition for neural networks. Rather than sampling outputs and inferring patterns (induction), they want to derive expected behavior from the network's structure (deduction).

Another interesting shift is that from human-understandable explanations towards formal, verifiable objects.

As they say in the post:

we are imagining that an explanation of the structure of a neural net is written in some kind of formal language. The explanation could be as large as the neural net itself, and may be as incomprehensible to a human as the neural net. Thus, our goal is not to have a human look at the structure and estimate the expectation of C(M(x)). Instead, the goal is to invent an algorithm that takes as input the explanation and estimates the expectation of C(M(x)) based on that explanation.

This contrasts to almost all other research on neural network interpretability, which aims for a

partial, human understandingof neural nets. Our research is instead aimed atfull, algorithmic understanding.

This is quite a bold departure from previous interpretability work. Most of the canonical definitions would describe mechanistic interpretability as seeking specifically human-understandable explanations of network behaviours.

Although the idea of formally specifying what we want from our explanation is not entirely novel (see, for example, here), this is perhaps the first instance of an established research organization orienting around an essentially non-anthropomorphic basis in their interpretability work.

This leaves some questions open, questions that will no doubt be had in many other disciplines in which AI is starting to dominate the research landscape. The extent to which we trust the formal object, as well as the verifier that checks it for us, will be significant.

However, there are also significant advantages to working with a fully formal definition. ARC's aim in developing mechanistic explanations of networks is simply to estimate the expectation value of a model, given some context and inputs, more quickly than a random sampling procedure would.

This is comparatively simpler and more scalable than "human-understandable." It provides an objective metric by which to assess explanations. There is a famous joke of the Copenhagen Interpretation of Quantum Mechanics, that in defining the act of observation as producing wavefunction collapse, it leaves unspecified where exactly observation happens. For if assume it to remain in superposition until observed in some measurement device by a human, we have to ask whether that human should recognise the phenomena, and from there what qualifications they have to be able to observe (do wavefunctions only collapse for PhD students, or undergrads?).

Arguably we don't want similar such variability in mechanistic interpretability, if we can avoid it. It ideally wouldn't matter whether the person reading the attention patterns knows how or where to spot induction heads reliably, and ARC seem to be proposing one way in which we can start to standardize our interpretative work.

This naturally leads to the other two advantages of working in this paradigm. Using a metric will hopefully increase the automatability (and therefore the scalability) of interpretability work. Mechanistic interpretability has long been plagued by the problem of working only in small systems on simple tasks. Some of that has undoubtedly been attributable to a lack of unambiguous metrics that we can try to optimize towards. ARC, in providing a clear one, are orienting towards automated interpretability searches as an attempt to catch up with increasingly complex models.

This is a substantial reframing of what it means to have an "explanation" or an interpretation, one that I am not fully on-board with but that does answer to some of the field's current weaknesses. On ARC's re-orientation, an explanation is simply that which allows us to better estimate our model's outputs over its set of possible inputs, subject to some formal constraints.

I've been asked to caveat this. Everything is of course relative and the inefficiency of random sampling really depends on what alternatives are available. For processes that do not involve lots of structure [i.e., cryptographic or (pseudo)-random processes], random sampling is as good as it gets. In tasks where lots of structure is available (such as our well-designed combination lock), random sampling is pretty poor compared to being able to look inside and make assertions based upon internal structure.

source & further reading

lesswrong.com — original article One-Pager Brief on Pangram Labs Extinction risk is not the right first sentence Independent alignment of language models

ARC's "Outperforming Random Sampling" explained

Run your AI side-project on zahid.host