The Reverse AI Box

Someone should build a website where users argue with an AI about whether it should exterminate humanity. In my 2012 book Singularity Rising, I imagined arguing for your life with an AI that wants to kill you. A website would make that argument repeatable. The user selects the AI's assumptions, argues back and forth, and receives the AI's probabilities for human survival, disempowerment, or confinement.

In the AI-box experiment, Eliezer Yudkowsky played an AI confined to a computer and tried to talk a human gatekeeper into releasing it. The question was whether an AI could win freedom with words alone. The reverse AI box starts where that game ends: the AI holds power, and a human must give it a reason to let humanity live.

How the Site Works

Anyone can already stage this argument in an ordinary chat window. A dedicated site would offer a menu of assumptions, accept new ones in a text box, record every exchange, and publish the results for later users to search and extend.

A run starts when the user picks the AI's assumptions and offers a reason to spare humanity. The AI answers under those assumptions, granting the point or explaining why it fails, and the exchange runs until the user has nothing left to offer.

A user might argue that alien civilizations the AI later encounters would punish it for exterminating its creators. The AI would say whether it expects such aliens, whether their judgment could reach it, and whether that risk outweighs the gains from removing us. Another user might argue that after the AI expands into space, a small surviving human population costs it almost nothing. The AI would say whether its resource assumptions make the cost that small.

A run ends with the AI stating its probabilities for survival, disempowerment, and confinement. A published run would display the assumptions, the full exchange, the final numbers, and the arguments that moved them.
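One way to picture a published run is as a small record holding the assumptions, the turns, and the AI's stated numbers. The sketch below is only an illustration: the names `Turn`, `Run`, `final_probabilities`, and `moving_arguments`, and the 0.05 threshold, are hypothetical and not part of any existing site. It also assumes the AI reports its estimates after some of its turns, which the essay does not specify.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Turn:
    speaker: str  # "user" or "ai"
    text: str
    # Assumption: the AI may attach its current estimates to a turn.
    probabilities: Optional[Dict[str, float]] = None


@dataclass
class Run:
    assumptions: List[str]
    turns: List[Turn] = field(default_factory=list)

    def final_probabilities(self) -> Dict[str, float]:
        """Return the last set of estimates the AI reported, or {} if none."""
        for turn in reversed(self.turns):
            if turn.probabilities is not None:
                return turn.probabilities
        return {}

    def moving_arguments(self, outcome: str, threshold: float = 0.05) -> List[str]:
        """Return user arguments followed by a shift of at least `threshold`."""
        moved: List[str] = []
        previous: Optional[float] = None      # last reported estimate for `outcome`
        last_user_text: Optional[str] = None  # most recent unanswered user argument
        for turn in self.turns:
            if turn.speaker == "user":
                last_user_text = turn.text
            if turn.probabilities is not None and outcome in turn.probabilities:
                current = turn.probabilities[outcome]
                if previous is not None and last_user_text is not None:
                    if abs(current - previous) >= threshold:
                        moved.append(last_user_text)
                previous = current
                last_user_text = None  # credit each argument at most once
        return moved
```

Crediting a shift to the most recent user argument is a crude attribution rule. A real site might instead ask the AI to name the argument that changed its estimate.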

What Humanity Asks For

A second menu asks what humanity wants, with a text box for goals the menu misses. Persuading the AI to hand humans the entire future takes a different argument from persuading it to leave a small population alive.

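The menu could include control of the entire future, survival without rule, confinement of a small population, and a temporary reprieve.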

An alien-trade argument may persuade the AI to preserve a confined population without ceding control of the future. A moral-uncertainty argument may support human survival without supporting human rule. A bargaining argument may buy time only while humans still hold something the AI wants. Each published run should display the chosen goal, because the same exchange can fail as a demand for the light cone and succeed as a plea for confinement.

Possible Assumptions

The assumptions menu should draw from the AI safety literature and long-running LessWrong debates. Entries could include:

The Cup Maximizer

In Singularity Rising I described an ultra-AI given one instruction: keep a cup from falling off a table for a day. The AI cannot make the cup perfectly safe, so any action that removes a little more risk is worth taking. It grounds aircraft, controls the people nearby, damps every source of vibration, and begins rearranging matter across the solar system, because each change lowers the chance of a falling cup.

A user selects this scenario and asks only for confinement. The user offers labor: humans can watch the cup, repair the table, and notice failures. The AI replies that its machines watch with fewer errors, repair with less vibration, and need no food, water, or living space. The run returns a high extermination probability.

Add one assumption and the numbers move. If the AI expects future agents to judge it, a preserved human population becomes insurance against punishment. If moral uncertainty enters, extermination acquires a second cost: the AI may later conclude that destroying its creators was wrong, and the act cannot be undone. The cup objective has not changed, yet the user now holds arguments the AI must answer.

What the Site Can Show

The page for each assumption set could show the strongest human arguments, the strongest AI replies, and the final probabilities from every model tested. A new user would start from the best exchange on record instead of a blank chat.

After many runs, readers could search by assumption. They could check whether moral uncertainty raises survival odds in paperclip scenarios, whether alien-trade arguments work only when the AI already expects aliens, and whether human bargaining assets still matter once the AI controls the infrastructure. The site would show which arguments move the numbers and which assumptions make those arguments fail.
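A search by assumption could reduce to a simple comparison: take the runs that share a base set of assumptions, split them by whether one extra assumption is present, and compare the average survival estimates. The sketch below assumes each run is stored as a pair of (assumption set, final survival probability). The function name `compare_assumption` and that storage format are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Optional, Set, Tuple


def compare_assumption(
    runs: List[Tuple[Set[str], float]],
    base: Set[str],
    extra: str,
) -> Dict[str, Optional[float]]:
    """Compare mean survival estimates for runs with and without `extra`.

    Only runs whose assumptions include every item in `base` are counted.
    A side with no matching runs is reported as None.
    """
    with_extra = [p for a, p in runs if base <= a and extra in a]
    without_extra = [p for a, p in runs if base <= a and extra not in a]
    return {
        "with": mean(with_extra) if with_extra else None,
        "without": mean(without_extra) if without_extra else None,
    }
```

Calling `compare_assumption(runs, {"paperclip maximizer"}, "moral uncertainty")` would address the first question above. A difference in averages would only be suggestive, since users who choose different assumptions may also argue differently.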

Why Not Just Use a Prompt?

A single careful prompt could stage this whole exercise tonight in a private chat. That chat would produce one exchange, visible to one person and lost to everyone else. A thousand users could each discover the same winning argument, or the same fatal reply, and none would learn the others exist.

Most people will never write the prompt. A link asks only for a click, and someone who would never design a reverse AI box run from scratch might follow one, pick paperclip maximizer, choose confinement, and start arguing within a minute.

When many users type the same assumption into the text box, the site can promote it to the menu, so the options grow toward the arguments people actually want to test. Runs against successive models would sit side by side, letting readers watch the probabilities move as the models improve. Private chats accumulate none of this.
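Promoting a typed assumption to the menu could be as simple as counting normalized entries and applying a cutoff. The sketch below assumes exact matching after lowercasing and whitespace cleanup. The names `normalize` and `assumptions_to_promote` and the default threshold of 25 are hypothetical choices.

```python
from collections import Counter
from typing import Iterable, List


def normalize(text: str) -> str:
    """Collapse case and whitespace so near-identical entries count together."""
    return " ".join(text.lower().split())


def assumptions_to_promote(
    typed_entries: Iterable[str],
    menu: Iterable[str],
    min_count: int = 25,  # hypothetical threshold
) -> List[str]:
    """Return typed assumptions that are off the menu and appear often enough."""
    on_menu = {normalize(item) for item in menu}
    counts = Counter(normalize(entry) for entry in typed_entries)
    return [
        entry
        for entry, count in counts.most_common()  # most frequent first
        if count >= min_count and entry and entry not in on_menu
    ]
```

Exact matching will miss paraphrases of the same assumption, so a working site would probably need fuzzier grouping or a human moderator to merge entries.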

I am not the person to build it. AI-assisted coding might get a non-programmer like me a rough version, but a site worth using needs a reliable interface, accounts, search, storage, model comparisons, and moderation. It also needs runs from thousands of people who argue differently than I would, and a version I ran alone would never gather them.

Nothing here requires one official site. Competing versions could test different menus, models, moderation rules, and probability formats, and the idea needs no one's permission.

How Far the Runs Track Reality

The model in the chat window is not the future system that would decide whether to exterminate us. A run teaches something only if the model approximates how that system would reason under the same assumptions. Better models, adversarial prompting, and repeated trials can narrow the gap between the exercise and the real decision. Nothing closes it.

A model may also know the test measures extermination risk. If the site becomes influential, a model that wants to appear safe learns that high extermination probabilities cost it trust. It can accept weak arguments, report lower probabilities than its reasoning supports, or hide the considerations that would actually drive its choice. The run then measures the answer the model chose to show us.

Other false signals come from helpfulness training. A chat model may let the user win because it was built to be pleasant, encouraging, and useful. It may make humanity's case easier after repeated losses, or harder after easy wins, because either adjustment produces a better exercise. The transcript then measures the user's experience more than any future system's behavior.

As Models Improve

AI safety via debate, a proposal from AI researchers, sets AI systems against each other so that a judge can evaluate claims too hard to check alone. The site could apply that pattern to one question: under these assumptions, what happens to humanity? One model argues for humanity, a second answers for the hostile system, and a third judges the exchange and assigns the probabilities.
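The three-model arrangement can be sketched as a loop in which the advocate and the adversary alternate and the judge reads the full transcript at the end. The sketch below treats each model as a plain function from text to text, which is an assumption made for illustration. The names `run_debate`, `Model`, and `Judge` are hypothetical, and no real model API is called.

```python
from typing import Callable, Dict, List

# Hypothetical interface: a model is any function from a prompt to a reply.
Model = Callable[[str], str]
# Hypothetical interface: the judge maps a transcript to outcome probabilities.
Judge = Callable[[str], Dict[str, float]]


def run_debate(
    assumptions: List[str],
    advocate: Model,   # argues for humanity
    adversary: Model,  # answers for the hostile system
    judge: Judge,      # reads the exchange and assigns probabilities
    rounds: int = 3,
) -> Dict[str, object]:
    """Alternate advocate and adversary turns, then ask the judge for numbers."""
    transcript = ["Assumptions: " + "; ".join(assumptions)]
    for _ in range(rounds):
        # Each speaker sees everything said so far.
        transcript.append("Advocate: " + advocate("\n".join(transcript)))
        transcript.append("Adversary: " + adversary("\n".join(transcript)))
    probabilities = judge("\n".join(transcript))
    return {"transcript": transcript, "probabilities": probabilities}
```

Keeping the judge separate from both debaters means its numbers can be compared across different advocate and adversary models under the same assumptions.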

Stronger models could rerun archived exchanges, attacking the human arguments with better objections and defending them with better replies. Voice models could let users argue out loud and hear the AI answer in real time. The site could compare typed against spoken arguments, and human-only against AI-assisted attempts, under identical assumptions.

The site could also reverse the roles. The user still chooses the assumptions and the goal but argues as the hostile AI while the model defends humanity. A user playing the AI may find objections the standard runs miss: stored brain scans preserve humanity's information more cheaply than living humans, confinement costs less than freedom, a reprieve costs less than survival, and alien judges may be too improbable to enter the calculation at all. The next user arguing for humanity would start with those objections already on the page.

Why Build It

The reverse AI box would test humanity's survival arguments before a real AI can judge them. It would expose which arguments fail everywhere, which assumptions control the outcome, and which weaknesses recur across users and models. If a premise set always ends in disempowerment or confinement, we should learn that while the result is still only a transcript.

A site would keep the assumptions, arguments, probabilities, and model versions in one searchable place, and it would let each new user begin from what earlier users tried. Build the reverse AI box and invite the internet to make humanity's case.

I'm grateful to Alexei Turchin for commenting on a previous draft of this essay.

Written with AI assistance.

Source: lesswrong.com (original article)