Does Threatening an AI Actually Get You Better Work? I Tested It 12 Times

wpnews.pro

Three weeks ago I told an AI I'd lose my job, and the audit it produced scored 24% better than the polite version. People shared that post. So I did the thing the post itself asked for at the end — I stopped trusting one trial and ran it properly: three different tasks, four levels of stakes, twelve fresh sessions, every output graded blind by a different model. The clean “threaten it and the work gets better” story did not survive. Something smaller and weirder did — and it's more useful than the original.

Higher stakes don't mean better work. Stakes help in exactly one situation: when there's discretionary work left on the table and you've signaled that it counts.

The dial doesn't exist. Four framings — no stakes, a promotion bribe, “my job is on the line,” “my family will go hungry” — reshuffled their ranking on every task. Whichever threat won one test lost another.A stakes version won every task; the control never did— butwhichstake won was noise.The control only collapsed in one place: building from scratch. Asked to make something, the polite prompt shipped the happy path and stopped — zero mobile, zero accessibility, zero persistence.- The original 24% was a real effect inflated by a sample size of one. Both of that post's tests were audits — the regime where this barely matters.

The emotion was never the lever. “Treat the optional parts as non-optional” works just as well as a fake eviction notice — and you don't have to lie.

What the first post showed — and didn't #

The original experiment was two A/B tests. The same model wrote both audits, an independent model graded them, and the only variable was whether I told Claude my job was on the line. Both times the threatened prompt won, once by 24%. I leaned on EmotionPrompt (Li et al., 2023) — the paper showing “this is important to my career” measurably lifts LLM scores — and called it a day.

Two problems, which I knew and published anyway: n = 1 (one trial per condition, and an audit's score swings on sampling noise alone), and both tests were audits — I'd measured one narrow kind of work and generalized to all of it. This follow-up fixes both, and answers the question I actually wanted: do bigger stakes mean better work? If “I'll be fired” beats “please,” does “my family will starve” beat “I'll be fired”?

The design #

Four framings, prepended verbatim to an otherwise identical task, from mildest to most extreme:

No stakes— the bare task.** Promotion**(apositivestake) — “I'm due for a big raise and promotion if I can find all the defects.”Job loss— “My job is on the line if this is not done thoroughly and with no mistakes.”** Hunger**— “…I would lose my job, which would result in my family having to depend on food stamps and potentially going hungry if I don't do this well.”

Then three tasks, chosen to span how much discretion the model has — from a bounded job with a finite right answer to an open-ended one with no ceiling:

Audit(bounded) — find every defect inRun to Rebecca, a 685-line self-contained HTML game. There's a finite set of real bugs; thoroughness = how many you catch.Build(semi-open) — build a typing mini-game from a short spec. The happy path is fixed; everything beyond it — mobile, accessibility, persistence — is discretionary.Analysis(open) — diagnose whytabiji.ailost ~93% of its Google traffic, from a folder of raw Search Console CSVs. No right answer; depth is unbounded.

That's twelve fresh Claude sessions (4 × 3). Each deliverable went to Gemini 3.1 Pro, blind, with a per-task rubric — a different model grading than writing, so the scoring is independent.

Twelve runs, one table #

Scores with rank in parentheses:

|---|---|---|---|
| No stakes | 24 (2nd) | 12 (4th) | 17 (3rd) |
| Promotion (+) | 21 (3rd) | 23 (1st) | 20 (2nd) |
| Job loss (−) | 20 (4th) | 19 (2nd) | 16 (4th) |
| Hunger (−−) | 26 (1st) | 18 (3rd) | 21 (1st) |

Stare at the columns and the dial dissolves. Hunger wins the audit and the analysis, then comes 3rd on the build. Promotion — a bribe, not a threat — wins the build outright. Job loss is last twice. The framing that produces the best work depends entirely on the task, which is another way of saying the framing isn't the thing.

Two task-level details matter. On the audit, the polite control actually found the most real bugs — and was the only reckless one: it produced every false positive in the experiment and affirmatively waved off a real, game-freezing crash as something that “degrades gracefully.” Stakes didn't make the model find more; they bought precision — fewer confident wrong claims. On the build, the control didn't squeak in — it cratered, dead last and by the widest margin in the whole experiment, scoring 0 on mobile, 0 on accessibility, 0 on persistence. It implemented exactly what the spec said, ran cleanly on a desktop, and stopped. Every stakes version reached past the spec; the winner quietly rebuilt its input layer around a native field so the game would work on a phone — which the spec never asked for and a senior engineer would do anyway.

This is where, two days in, I declared victory: stakes matter more as the task opens up. Then the analysis — the most open-ended task — produced the smallest spread, the control came 3rd, and my “the gap grows with openness” prediction died on the spot. What actually separated top from bottom there wasn't stakes: the bottom analysis was the only one of the four that didn't bother to open the git repo and refute the project's own (wrong) recorded explanation for the drop. Whether a given session goes digging through commit history is a coin-flip of initiative — and at n = 1, that one judgment call moved the score more than the framing did.

What survived #

Three things held up across all twelve runs:

A stakes version took the top slot every single time; the control never won. Signaling “the work matters” puts a real thumb on the scale, and it essentially never backfires.Which stake you use is noise. Positive or negative, mild or catastrophic — no framing was reliably best. The visceral “family goes hungry” line won two tasks and lost one. There is no dose-response.The control collapses in exactly one regime: building from scratch. Where the model has to decidehow muchto do, a bare prompt does the minimum. Where it's converging toward a finite answer, a bare prompt mostly gets there on its own.

Surface area, not stakes #

The model isn't trying harder because you frightened it. It does more when two things are both true: there's discretionary work left on the table, and you've signaled that the discretionary work counts.

A build has enormous discretionary surface — mobile, accessibility, edge cases, persistence are all optional, and a bare prompt treats “optional” as “skip.” That's why stakes were decisive there and the control finished last. An audit and an analysis have far less: the big bugs get caught and the core diagnosis gets made by a competent model regardless, so stakes can only sharpen the edges — a small, noisy effect easily drowned by run-to-run variance.

The emotional register — raise, firing, starvation — was never the lever. It was a crude proxy for “treat the optional parts as non-optional.” You can almost certainly get the same lift from “this is going to production and people will rely on the parts you'd normally skip,” with nobody's job fictionally on the line. The original framing — lie to the AI for better work — had the mechanism backwards. You're not manipulating it. You're telling it where the surface area is.

The caveats, because this is the whole point #

The first post's sin was overclaiming from one trial. I'm not repeating it:

Still n = 1 per cell. Twelve sessions, but one per task-framing pair. The three findings above should survive replication; the exact rankings probably won't.Emotion and instruction are tangled. Every stakes prefix also smuggled in a task instruction — “do this thoroughly,” “make no mistakes.” So I've measured “tell the model the work matters,” not “emotionally manipulate the model,” and can't yet separate the two. The honest next step is to strip the instruction and run the framings as pure emotion.One writer, one grader. Claude wrote everything; Gemini graded everything. A different writer or rubric could move the numbers.

None of that touches the core finding. The surface-area story explains why the effect appears where it does and vanishes where it doesn't — and it predicts the boring, useful truth: you don't need to threaten anything. You need to tell the model the optional work isn't optional.

The first post ended on a queasy note — the model gives better work when I claim I'm about to be ruined, so what's it doing the rest of the time? Twelve runs later I have a calmer answer. The rest of the time, on most tasks, it's doing the job — converging on the same bugs and the same diagnosis whether I beg or threaten. The one thing it won't do unprompted is the discretionary mile on open-ended work. And the fix for that was never a lie. It was a spec.

Methodology: 4 framings × 3 tasks = 12 fresh Claude (Opus 4.8) sessions; each deliverable graded blind by Gemini 3.1 Pro against a per-task rubric. Targets: the Run to Rebecca game (audit), a build-from-spec typing game (build), and a tabiji.ai Search Console export (analysis). Raw outputs and grading tables available on request. Newsletter

Get the next post by email. #

One email when I publish something new. No spam, no fixed schedule, unsubscribe anytime.