"Alignment is like rocket science" is, like all analogies including great ones [1], misleading

The year is 1666, and an asteroid headed toward Earth appears in the skies over England. Different scholars attempt interpretability research on cannonballs, on stones thrown from towers, on balls rolled down planes, but a certain Isaac proposes that what is needed is a more general understanding of the concept of Throwability, so we can backchain to which thing we should be throwing at the asteroid to stop it.
Is this an appropriate analogy for understanding why someone would insist on doing AF rather than prosaic alignment? No [4], because there is no real conflict between cannonball interpretability and Throwability Foundations. At some point one should of course abstract away from cannonballs entirely, but gathering more cannonball data is never going to hurt. The relevance of Throwability to our universe routes through the single fact that matter obeys the laws of Throwability; otherwise it would be just a mathematical curiosity. More importantly, the laws of Throwability that our universe follows can only be found by looking at how matter obeys them. They are one possible mathematical object among infinitely many others, but we can't find that object mathematically because there is nothing mathematically special about it.
Here is a better analogy [5]: The year is 1666. Isaac has an intuition that few people share or understand: he has been thinking about something he calls "fluxions", which he can't really describe beyond vaguely gesturing at coffee cooling down faster when very hot or pendulums moving faster at the edges of their arc. He thinks some of those fluxions will be very dangerous for England, so he shuts himself in his room and thinks about fluxions all day. Meanwhile, finding things that exhibit the peculiar properties of cooling coffee or swinging pendulums becomes a profitable business, and soon interpretability researchers on coffee and pendulums begin to appear. When asked why he doesn't research coffee, Isaac replies:
"I'm not sure coffee fluxions specifically are dangerous. I'm also not sure how many types of fluxions there are. Coffee might be a very small, very idiosyncratic region of fluxion space. I have an intuition that things like the economy or like beliefs are also fluxions, so it doesn't make sense to study the specifics of coffee knowing that other fluxions will be different in everything, except in that essence of fluxions that I don't yet understand. If we had more time, I wouldn't oppose randomly researching fluxion types as we encounter them. But since I fear the danger is imminent, I prefer the wager of trying to find a general theory of fluxions — knowing full well that I'll likely make no progress at all — over legible progress on what are most likely irrelevant coffee fluxions".
In different universes, Isaac's fear comes true or doesn't. Orthogonally, in some universes his vague concept of fluxions points toward differential equations, or toward nothing, or there is something but no general theory of that something. In all those universes, however, the call to "make fluxion theory more coffee-grounded" is meaningless.
*Both images were made by ChatGPT with a prompt to parody* Attack of the Killer Tomatoes.
[1] And it is a great analogy! In both cases the enterprise is one-shot: the rocket either reaches the moon or it doesn't, and we can't save our progress halfway and resume later, just like an ASI will either be safe or kill us, with no opportunity to learn from mistakes and retry. And in both cases, success requires every single component that could fail not to fail.
[2] I'm not aware of anyone having made the misleading extension of the analogy, but it seems a natural one to reach for, so it seems worth preempting.
[3] Otherwise it would be an identity.
[4] The analogy does work on some levels. For instance, the task is hopeless and everyone would likely die if things were as dire as depicted.
[5] The fact that the analogy is quite silly is not a perk; I simply couldn't think of less silly scenarios. I'd be happy if anyone comes up with a better one.