fab: how to do (alignment) research at scale


Over the last month, I've been working on a project I'm calling fab.[1] Nominally, it's an interface that enables a human researcher to make sense of research produced by many agents running in parallel. I haven't "finished" fab (in fact, I'm stuck searching for the crux of building something like this), but I still think it is worth posting a short explanation of the problem it is meant to solve and how it tries to address it.

In what follows I'm imagining this being used for automated alignment research. This is not because I see it as a silver bullet. I am just observing that a lot of current empirical alignment work could be automated (and in fact some of it already has been, just not the more open-ended stuff). If we can shift more work to the left through automation, that's a win. I am deliberately assuming humans as the final decision-makers, and asking how far we can get by augmenting human judgement.[2]

Imagine a near future where you can spin up dozens of agents to do research for you in parallel.[3] You loosely specify a question you're interested in, and they do all the legwork: operationalise that question, look at prior work, run quick experiments to get a mechanistic understanding, run bigger experiments (training models, perhaps, or doing white-box interpretability), analyse the results, position the findings in the broader context. Ideally, they all take slightly different approaches, trying to find different "handles" on the question you specified. When they're done, you have on the order of tens of write-ups to review, with the ultimate goal of updating your understanding of that question in light of new evidence.

I think this is actually really hard, for a couple of reasons that have to do with the interplay between how we do research and current agent failure modes.[4]

Attention is the big problem. There are only so many human researchers in a given field. In alignment, there are maybe a few thousand in total, with maybe 30 or so who are highly productive. You don't want these people reviewing LLM slop. I think this is the binding constraint of the entire system, because it dictates how much work you can fan out. In other words, nothing is stopping you from kicking off 100 research agents right now (okay, maybe token spend), but good luck making sense of all hundred research reports.

In reality, of those 100 research agents, many will not produce anything worthwhile. In my experience the most common reasons for that are:

There are mitigations for all three:

Even when these are taken care of, there is a lot of information to review. A side point: the format we've settled on for academic publication is partly a reflection of bandwidth constraints in human reviewers. That need not be the case in the future, if we can offload some of that to agents. Ideally we would not share just the findings; we would record what was tried, what worked & what didn't, what assumptions were made, as well as any resources used (open-science efforts often focus on these points).

fab is supposed to be the layer/interface that converts lots of agent attempts into one human update. Today fab is mostly exploration. I am assuming that a good agent platform exists externally, such that I can spin up persistent background agents with tools, skills, execution sandboxes, durable filesystems etc. Those agents receive a research contract, do some work, and return a specific kind of artefact bundle.

The research contract is a specification of the research question. It is partly structured, but leaves a lot of room for nuance. Here is an example for weak-to-strong generalisation. I get the sense that, with current-generation agents, investing in a good spec has large returns. Saying "go investigate phenomenon X" tends to not work.
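To make "partly structured, but leaves a lot of room for nuance" concrete, here is a minimal sketch of what such a contract could look like as a data structure. Everything in it is an assumption for illustration: the class name `ResearchContract`, its fields, and the thresholds in `problems()` are hypothetical and are not fab's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class ResearchContract:
    """Hypothetical contract shape; field names are illustrative, not fab's schema."""

    question: str  # the research question, ideally already operationalised
    context: str = ""  # free-form nuance: motivation, prior work, caveats
    deliverables: list[str] = field(default_factory=list)  # what must come back
    out_of_scope: list[str] = field(default_factory=list)  # directions to avoid

    def problems(self) -> list[str]:
        """Return reasons the contract is probably too thin to hand to an agent."""
        issues = []
        # Arbitrary illustrative threshold: very short questions tend to be vague.
        if len(self.question.split()) < 8:
            issues.append("question is very short; consider operationalising it")
        if not self.context.strip():
            issues.append("no context given")
        if not self.deliverables:
            issues.append("no deliverables listed")
        return issues


thin = ResearchContract(question="go investigate phenomenon X")
for issue in thin.problems():
    print(issue)
```

The structured fields give the reviewer something to check the returned work against, while `context` stays free-form so the nuance is not squeezed out.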

The fab artefact bundle is adapted from this paper. It contains results, code, logs, and a report. I also expect the entire agent trace to be at least inspectable, but it is better if there is a rich snapshot of the state so that the agent can be spun up again with the same execution state, trajectory history etc. (this exists today).
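As a rough illustration of checking that a returned bundle has the expected shape, the sketch below verifies that the four parts named above are present and non-empty. The on-disk layout (three directories plus a `report.md` file) and the function name `missing_parts` are assumptions made for this example; the post only states that a bundle contains results, code, logs, and a report.

```python
from pathlib import Path

# Assumed on-disk layout; the post only says a bundle holds these four things.
REQUIRED = ("results", "code", "logs", "report.md")


def missing_parts(bundle_dir: Path) -> list[str]:
    """List the required entries that are absent or empty in a bundle directory."""
    missing = []
    for name in REQUIRED:
        path = bundle_dir / name
        if path.is_dir():
            # A directory counts as present only if it contains something.
            if not any(path.iterdir()):
                missing.append(name)
        elif not path.is_file() or path.stat().st_size == 0:
            # Not a directory: it must be an existing, non-empty file.
            missing.append(name)
    return missing
```

A check like this says nothing about whether the research is any good; it only rejects malformed bundles cheaply, before they cost human attention.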

fab is explicitly not the data layer, and stays out of how agents might want to structure or persist intermediate work. As long as the artefact shape is correct, the rest is open to experimentation. I put together a lightweight knowledge base repo to accompany fab; here is an example of an artefact from a real multi-agent run investigating what happens to a model's support before and after RLVR. That said, I've noticed that you have to be very careful when you get agents to work on a knowledge base. The default outcome is accumulation; agents often behave as librarians, filing things away. I would like them to consolidate that knowledge, pruning the corpus as needed, updating previous notes, etc. I think this append-only bias mostly falls out of how they're trained, so it's not easy to remove.

I've used fab a few times now with different kinds of agent harnesses. It doesn't yet clear the bar for switching from directly interacting with the agents to this added interface. The few runs I did were just okay — but they only used 3 agents, and the core value proposition is when you fan out more.

A fun idea I had is "FellowsBench": how easy would it be to use fab to replicate work from past Anthropic Safety Fellows? It's an easier problem because the work has already been done, but it tests many of the execution paths you would hit in real-world use (e.g. fan out replications of ablations at the same time). It's also easier than the abstract "do research" question because most of the projects are empirical, so more tractable for agents.

I'm planning to run this at a larger scale and see if that works as a forcing function for the design. If using fab causes me to update on a view, that seems like a win; measuring the delta between using fab and using plain agents seems harder. I don't anticipate the architecture or code generation to be big obstacles here; rather I think nailing the ergonomics is most important.

There is a lot of overlap between fab and agent orchestration. For single-agent work, this looks like coming up with a good harness. For multi-agent work, it's trying to figure out how to get swarms to make useful progress. fab is a bit more like the latter, but in my mind it is subtly different in focus. Similarly, fab overlaps but doesn't try to rehash other efforts:

Asking how fab should work led me to other questions like "how does science happen?" or "how is knowledge produced?". There are a few threads worth chasing here.

Within or across paradigms

It could be that a research system that works well within-paradigm does not work well across them, and vice-versa. Within a paradigm there is typically incremental progress, even though you can't trace a neat graph curving up-and-to-the-right. It is still hard to discover the future, but in retrospect it is obvious that those pieces line up. Discovering a new paradigm is not like this; it is a "shock" to the existing body of knowledge that seems to occur when enough anomalies accrue under that body of knowledge. It's plausible that the kind of work that eventually leads to noticing those anomalies, to putting the picture together, is not the same kind of work that yields incremental progress within-paradigm. (One example that comes to mind is adding epicycles to explain planetary orbits under a geocentric model, when the better model, the new paradigm, is heliocentrism.)

I am more sceptical that you can do this kind of discovery by just ramping up the amount of work you do, though the connection is somewhat subtle. To be clear, I do think that even paradigm-changing research requires a lot of attempts, a lot of shots on goal; in that sense parallelising those attempts isn't bad by default. But the function that tells you whether those attempts are coalescing into a broader paradigm is harder to pin down, and not something I expect to "crack" with fab. This recent write-up is relevant!

Research flows

I am finding out that even though there is an archetypal "flow" for research, it varies quite a lot between fields. In ML, even with expensive training runs, our feedback loops are short. Most experiments are computational in nature. That dictates how we iterate, and ultimately how progress gets made. You can view the advances in the last few years as accumulated knowledge from repeated experimentation. The best in the field carry tacit knowledge: heuristics about hyperparameters, intuitions about anomalies, and so on.

The picture is completely different when you're, say, manufacturing viral vectors for gene therapy. Assays take several days. You can't hurry your cells along because you have a deadline; you're mostly operating on their schedule instead. Wet lab work has added overhead, and added complexity from biology not quite behaving according to your model, and both dictate how research is done there.

Can fab support both of these things? Other types of research? Should the research workflow be constrained or open-ended? And there is a trade-off here: more structure means more legibility, but less novelty. For fab, if you allow fully free-form outputs from agents, the verification/oversight mechanism completely breaks down, and you get even less out than if you had, say, forced a particular structure that somewhat restricted what the agents could do.

Debate

We might want to use fab to carry out debate, between:

The hope there would be that the debate contains more information than either side presenting its own arguments. Ideally the debate helps to elicit the crux of the research question — sometimes asking the right question is an extremely valuable contribution.

Which fields are productive?

Automated Alignment Is Harder Than You Think makes the observation that there are differences in productivity between academic fields. Part of the explanation is whether it is possible to discard unproductive hypotheses in a given field. If it isn't possible, or if it takes a long time, progress in that field is extremely slow. That implies we need a way to distinguish between productive and unproductive hypotheses. Historically, reproducibility has been a mechanism to do this. There are other methods, but they are more uneven; for example a great mathematician can probably discern through intuition when an approach is fruitful, but a mediocre mathematician might make matters worse.

Overall I think these are fascinating points which we may have to revisit if agents can do science at scale.[5] I have, however, tried to steer clear of the big-big questions, because I don't want to boil the ocean ("before designing fab, I have to figure out science" is not a workable goal). My plan is to move outward from a useful prototype.

Even though this work isn't ready, I wanted to write about it for a few reasons, most of which are selfish:

[1] As in, semiconductor fab. Also, lowercase, because that's cool these days.

[2] Delegating the judgement unlocks scale, but takes away alignment. I think scalable oversight has some hairy problems, and I am trying to dodge them if I can by not doing scalable oversight.

[3] If that's difficult, imagine you are a principal investigator managing a bunch of capable-but-junior researchers. Or copies of you. Or, in the future, copies of von Neumann.

[4] I am assuming somewhat aligned agents, plus a harness that allows us to implement control techniques, which is why the failure modes are pedestrian with respect to the alignment problem proper.

[5] Or maybe not think about at all, if the agents are too good.

Source: lesswrong.com (original article)
