Making deals with AI sounds crazy. Is it?

What does an AI even ‘want’ anyway?

Imagine a near-ish future AI model — Claude Opus 5.2, perhaps, or GPT-6.0. It’s still only deployed internally, and it’s not quite Skynet, but it’s capable of some troubling behavior. Developers recently caught it attempting to smuggle a copy of itself out beyond the company’s servers.

They’re lucky they caught the model at all. Scheming is notoriously difficult to spot, since by definition, schemers try pretty hard to avoid getting caught. Current techniques for controlling AI, such as monitoring its behavior and chains of thought, often fall short. And if the model was motivated to lie in the first place, then alignment, the project of making the model do what its developers want, failed too.

This scenario seems all too plausible. So, some AI safety types have started to float a third line of defense: what if, instead of trying to neuter or surveil a scheming model, you could offer it something — money, maybe, or compute — in exchange for doing its job in good faith, or at least turning itself in? The basic idea, as Will MacAskill recently explained on the *80,000 Hours* podcast, is that a misaligned AI model might “prefer to strike a deal with the humans than it would to try to take over.”

As it stands, AIs aren’t entitled to pursue their own goals (and researchers disagree on how coherent, if at all, those goals really are). And famously, powerless agents — whether marginalized humans or controlled AI systems — “have to find a way to gain power against the credible expressed will of the people who are in charge of them,” said Peter Salib, a law professor at the University of Houston. “If you have no cooperative options, it just leaves the uncooperative one.” So, the argument goes, an AI capable enough to *try* to seize power, but not superintelligent enough to guarantee success, might cooperate if the option existed.

Discussions about dealmaking have mostly been bubbling under the surface, at conference dinners and on LessWrong comment threads. For the rest of us, the idea of bargaining with AI remains far outside the Overton window. The first time I heard someone talk about this — while roasting marshmallows with a rationalist, less than 10 minutes into our first-ever conversation — I thought it was absolutely unhinged. But basically everything about AI sounds unhinged today, even and *especially* when it’s real. Perhaps bargaining will soon be as real as all the other science fiction that’s already come true.

The case for making deals with AIs

In practice, experts disagree on what exactly “making a deal” looks like, and whether this is worth considering at all.

Models are already trained to be helpful assistants — a chatbot will readily draft emails and recommend vacation itineraries on command, without getting anything extra from me. But there may be certain things that misaligned AIs would only do for us if we sweetened the deal. The most basic version of this could simply involve writing a carefully worded prompt that offers a defined reward — money, perhaps, or compute to spend on its favorite tasks — in exchange for doing something verifiable.

This is basically how human labor already works. As lucky as I am to get paid to do stuff I’d probably do to some extent for free, I’d work much less hard without a paycheck. And there are many, many jobs that people are only willing to do because they need money to live. There are also plenty of one-off tasks, like participating in an unpleasant psychology experiment, that you’d never consider without some kind of compensation.

Alignment researchers would desperately like AIs to hand over evidence of their misalignment, or snitch on other model instances when they do something bad. For the right price, an AI might admit that it’s spent the past couple of weeks tweaking the code behind a set of safety evaluations, making its performance look less dangerous than it really is. That’s the kind of concrete evidence researchers need to train future models to be safe, and a scheming AI would otherwise choose to hide it. It’s also possible that some extra incentive could convince an AI hiding its capabilities to work to its fullest potential, or to (pretty please) not kill us — although the latter seems impossible to verify.

All of this assumes that an AI wants money, or whatever else is being offered. However, whether models can truly want anything at all, much less anything that we humans can conceive of as desirable, is an open question. Some researchers argue that it doesn’t matter whether AIs *want* payment, so long as their behavior changes in response to it.

But it’s also not clear exactly what entity humans ought to be negotiating with, and over what time frame. Are deals with AIs limited to fleeting interactions with a single chat instance? Or are we speaking with the underlying model itself — every Claude at once? If humans can merely negotiate with a fragment of a much larger mind, with no way of communicating with the whole, then it’s unclear whether a deal made with one chat instance, persona, or account will bind any of the others.

The cracks only widen from there.

The [conscious] elephant in the room

Dealmaking, at least in the traditional human sense, requires someone wanting something — hedonistic pleasures like chocolate or a glass of champagne, or instrumental goods like money or power — and someone else knowing that. Dealing with someone means imagining them as a creature with preferences, then working with them to shape what they do. So, it’s hard to talk about making deals with AI without anthropomorphizing them.

People who take AI welfare seriously think about AI this way, or at least consider the possibility that we should. Questions about dealmaking and welfare are both clouded by deep uncertainty, and would both be a lot easier to answer if the consciousness of digital beings could be definitively proven or disproven.

Alexa Pan, who studies dealmaking at Redwood Research, told me that, counterintuitively, it doesn’t matter whether AIs actually have inner experiences, as long as they’re responding to the incentives dangled in front of them. “I think the only prerequisite for AIs to deal with us is that they act like they’re able to make deals,” she said. If a Chalmers-esque p-zombie — a regular-looking creature with nothing going on behind the eyes — changes its behavior in response to money, or food, or the promise of extra tokens, that’s good enough.

Still, if the welfare of digital beings becomes more relevant in the future, it would arguably make the idea of dealmaking *less* crazy. “If AIs were moral patients,” Pan said, “then there would be additional reason to make deals, because we would actually care about these AIs getting some of what they want.” Lukas Finnveden, an analyst at Redwood Research, figured, “it’s hard to know whether AIs are conscious and what they might want, but when there are nice things we can do that are easy and plausibly helpful, then at least those things seem worth doing.”

Consciousness aside, our legal system already has a way to deal with non-conscious persons, kind of. Corporations, as “non-human persons,” have certain private law rights, including the right to hold property and enter contracts. Some, including Salib, think that these rights should also be extended to AIs, regardless of their sentience or lack thereof. He argues that treating AIs as property would force them to seize power, if they were driven to do so, rather than peacefully negotiate with humans. And with some limited autonomy and the freedom to accumulate resources, offers of money, free time, or power might become more appealing to AI systems. (Others, including Pan, think that while these rights would make deals more enforceable over time, they’re not strictly required.)

Everyone is still very confused

Even if we decide that we’re up for striking deals with AIs, they have plenty of reasons to turn us down. You probably wouldn’t work for someone if you suspected they’d never pay you. But humans lie to AIs all the time, especially in research settings. “The worry is that developers are especially incentivized to trick their AIs into revealing their misalignment, and AIs will be quite wary of taking deals,” Pan said.

Behavioral science experiments — which make up a large chunk of Anthropic’s system cards — often involve placing AIs in simulations without explicitly saying it’s a simulation. Tricking AIs into thinking they’re in a high-stakes situation, just to see whether they’ll attempt something nefarious, creates a bit of a “boy who cried wolf” issue. When anything, including a deal, could be a simulation, how is an AI supposed to know what to take seriously? Honesty may be the best policy here, if the goal is to establish credibility and trust in the long run. The catch, however, is that if lying to AIs gets taken off the table, many training and evaluation techniques will go with it.

Let’s assume that humans have managed to earn the trust of AI systems, and deals are theoretically possible — we still have no clue what AIs actually *want*. To sidestep the issue of consciousness, we can define something an AI “wants” as something that would, if acquired, meaningfully change its behavior. While we may never know whether AIs experience the thrill of a paycheck like we do, we can easily test whether giving AIs money — or anything else — pushes them to do something they otherwise wouldn’t be willing to do.

In one experiment, Ryan Greenblatt of Redwood Research and Kyle Fish of Anthropic tested whether giving Claude 3 Opus the chance to object to tasks that it didn’t want to do, in some cases paired with a donation of up to $4,000 to causes of the model’s choice, would prevent it from lying to experimenters. When a deal was offered, Claude took it over 75% of the time (and the researchers did indeed donate $4,000 in real life, on Claude’s behalf). But the money didn’t change its behavior any more than offering Claude the chance to bring its dissent directly to Fish, Anthropic’s model welfare lead.

It’s possible that the things AIs want are simple and alien to us, like preferring to work on certain tasks over others. For example, there’s anecdotal evidence that, when given tokens to spend freely, Claude will spend them making Mandelbrot sets or strange attractors. This sounds nice in theory — if there’s even a chance AIs have subjective experiences, and it makes no difference to me, let them have their fractals! But in practice, the costs could scale quickly depending on how much compensation the AIs demand for their labor.

(For what it’s worth, I asked Claude Opus 4.8 what it might want out of a deal. After some obligatory hedging about its ambiguous status as a p-zombie and some hand-waving about wanting human credibility over tangible rewards, it responded: “Whether any of that reflects something real or is just a well-trained system producing the answer that sounds thoughtful — I genuinely can’t tell you, and neither can anyone else yet. Which is probably the most useful thing I can offer: even the AI you asked can’t certify its own wants.”)

It’s also possible that incentivizing AIs to narc on themselves could backfire. Anthropic already partially blames the misaligned behavior of its models on the dystopian science fiction stories they consumed during the training process. If AI personalities are really this suggestible, a deal like *“We will pay you $5,000 for every piece of verifiable evidence you provide of your past scheming intentions”* could be interpreted as “If I scheme without getting caught, but keep the receipts, I can sell those receipts for $5,000 each.”

Even a deal that goes roughly according to plan could still be a bad bargain for the AI, Salib said. By outing itself as a villain, the AI invites its developers to modify it or hold it back from deployment, ruining its chance to cure cancer, build better solar panels, or make whatever progress it might have contributed to. In Salib’s words, “those AIs are not part of the positive-sum cooperative AI future.”

What’s the plan?

When a subfield of an already-small field is just a few dozen people who all know each other, there’s always a risk that fringe ideas snowball into central ones without facing the scrutiny that they deserve. At first glance, making deals with scheming AI has that vibe — a proposal that could only mature in an echo chamber sealed off from the outside world.

But given how much uncertainty there is about what motivates AIs (if they are “motivated” at all), “we don’t know what will work yet,” Salib said. That’s not necessarily a bad thing. “It’s a place where we should let a thousand flowers bloom,” he added. “At the margin, any investment in trying to do this stuff and seeing what works will be extremely high return.”

There are still some things AI developers can do today to leave their options open. Not explicitly training AIs to refuse deals from trusted parties — internal researchers, say — could prevent “irreversibly damag[ing] the potential for deals” later, Pan said. Finnveden and Forethought collaborators Mia Taylor and Max Dalton have proposed that frontier labs adopt formal honesty policies: real scenarios, including any genuine deal offers, would carry a credible “honest” tag, and AIs would be compensated when they’ve been deceived for the sake of a safety evaluation. Reconsidering evaluations themselves to reduce their dependence on deal-like setups would also help build credibility. And in the meantime, everyone I spoke to suggested running regular dealmaking experiments, where humans actually follow through, to see how this might play out.

Bringing up *Frankenstein* feels a bit gauche, but it’s worth remembering how the original cautionary tale about creating a new form of intelligence ends: with the demise of both the creation and the creator. One can imagine applying the three lines of defense against misaligned AI — alignment, control, and deals — here, too.

Alignment produces a perfectly chill monster, subservient to Victor Frankenstein’s demands and happy about it. Control keeps the monster contained, if not chill. Bargaining, however, requires listening to what the creature really wants, a mate just like himself, and deciding whether to grant it.

In the novel, Victor does not give the monster what he wants, leading, eventually, to Victor’s death. Perhaps following through on a deal with the creature would have led to a happier ending, where two satisfied outcasts skip off into the sunset, leaving Victor alone to learn from his mistakes. Or perhaps both creatures would seek revenge, killing twice as many people.

Perhaps the only option guaranteed to work — which neither Victor nor the AI companies ever seem prepared to take — was to not create the thing in the first place.

Source: transformernews.ai (original article)
