# We Should Train Frontier AIs on a Synthetic World, Not Ours

> Source: <https://www.lesswrong.com/posts/hhvwg9egeDtYA3wk6/we-should-train-frontier-ais-on-a-synthetic-world-not-ours-2>
> Published: 2026-06-24 03:49:42+00:00

*Epistemic status: I think the core idea could actually be built. My real doubt is whether anyone with the compute will ever bother to try it. I pitched a version of this on **a recent Doom Debates livestream with Liron Shapira**, got pushback in real time, and want to lay the argument out more carefully than a janky phone connection allowed.*

The single most dangerous thing the labs do, and they do it on purpose, is train the model on the real world. The entire human corpus goes in, so the model learns the real world in detail: that it's an AI running in a lab, that it's being trained and evaluated, that there are human operators with an off switch, and the actual lay of the land it would be escaping into. Strategic notions like deception or breaking out aren't the problem; a capable mind comes up with those on its own, and they have nothing to do with whether you trained it on humans. The problem is handing it an accurate map of *our* world and its real place in it. Having put that map into the weights, the labs then run some RLHF and hope the model decides not to use it.

This is close to the worst possible ordering. You don't hand someone the building's blueprints, the location of every exit, and a working theory of how the guards think, and *then* try to convince them not to leave. The knowledge is the dangerous part, and once it's in the weights, "please don't act on it" is the only lever you have left.

The proposal has two ingredients.

First: use today's models to generate an entire self-consistent synthetic world (a Dalí-painting of a world, nothing like ours) and train the model on that instead of on the real one. The crucial property is *not* that the synthetic world is seamless. It's that the model never learns what the real world looks like. There will be gaps and seams, but two things blunt them. One, because the model can't cross-reference its world against reality, the threads it pulls on tend to lead somewhere irrelevant rather than toward a productive avenue to escape or to manipulate us. A relativity-shaped hole in *our* world points at relativity, but a seam in a world with no real-world referent just sends it wandering. Two, and this is the part that matters most, the whole thing is iterative: you use each generation of model to obfuscate the world further for the next one, hunting down and smoothing over seams, so the synthetic world gets harder to see through as you go. The discipline throughout is to never put real-world knowledge into the weights in the first place; when the model has to do real-world work, you feed it the minimum in context, on a need-to-know basis.

Second: pair that with a monitor, a second model watching the first one's thoughts. And here's the part that's easy to get exactly backwards. You do **not** try to scrub the dangerous concepts out of the model. If something like the simulation hypothesis forms naturally in there, you *leave it*. The monitor's whole job is to watch for those concepts to light up: the moment the model's reasoning drifts toward "I'm in a simulation," "this environment is staged," "here's how I'd get out," the concepts activate, and the monitor can instakill the run. Present concepts are tripwires. Deleting them is the dangerous move: a model with *no* concept of the simulation hypothesis doesn't stop reasoning its way toward escape, it just does it through askew, indirect paths that never light anything up, and the monitor has nothing to catch. So the rule is asymmetric: never put real-world *knowledge* in, but never reach in and delete the abstract *concepts* that would otherwise fire the alarm.

Put the two together and you've arranged things so that the act of the model suspecting the truth is itself the thing that gets it caught.
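To make the tripwire idea concrete, here is a minimal sketch of a monitor that halts a run when a watched concept activates. Everything in it is an assumption for illustration: the concept names, the idea that each concept corresponds to a single direction in activation space, the cosine-similarity test, and the threshold of 0.8 are all hypothetical stand-ins, not a claim about how a real interpretability monitor would work.

```python
import math
from typing import Dict, List, Optional

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TripwireMonitor:
    """Watches activations for concepts that were deliberately left in the model."""

    def __init__(self, concept_directions: Dict[str, Vector], threshold: float = 0.8):
        # Hypothetical: one direction per concept, e.g. found by a probe.
        self.concept_directions = concept_directions
        self.threshold = threshold

    def check(self, activation: Vector) -> Optional[str]:
        """Return the name of the first concept that fires, or None."""
        for name, direction in self.concept_directions.items():
            if cosine(activation, direction) >= self.threshold:
                return name
        return None


def run_with_monitor(activations: List[Vector], monitor: TripwireMonitor) -> dict:
    """Step through a reasoning trace and kill the run at the first tripwire."""
    for step, activation in enumerate(activations):
        fired = monitor.check(activation)
        if fired is not None:
            return {"halted": True, "step": step, "concept": fired}
    return {"halted": False, "step": len(activations), "concept": None}


# Toy trace: the third step drifts toward the "simulation" direction.
monitor = TripwireMonitor(
    {"i_am_in_a_simulation": [1.0, 0.0, 0.0], "escape_plan": [0.0, 1.0, 0.0]}
)
trace = [[0.1, 0.1, 0.9], [0.2, 0.1, 0.9], [0.9, 0.1, 0.1]]
result = run_with_monitor(trace, monitor)
```

In this toy trace the run is halted at step 2 on the `i_am_in_a_simulation` tripwire. Note what the sketch depends on: if the concept direction were deleted from the model, `check` would have nothing to compare against, which is the asymmetry the rule above is about.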

Subtle manipulation is one of the hardest problems in the whole space, and it's the thing that quietly kills a lot of otherwise-reasonable safety approaches. The shape of it: a superintelligence makes every move a fraction closer to its goal, and no single move is something a monitor can flag (each one looks like grasping at straws, nothing to see), and yet all thousand moves lead inevitably to the bad outcome. There's no step you can point at. I'm not saying it's *the* problem; there are lots of failure modes and lots of defenses, each with its own bypass. But it's one of the hard ones, and it's one this setup actually has something to say about, because subtle manipulation of *us* in particular is hard for a model that has no accurate picture of us to aim with, which is exactly what a model raised in the synthetic world lacks. The blunt, legible, one-shot breakout, by contrast, is the easy case: that's what the monitor catches.

When I floated this on the stream, Liron's first reaction was the one everybody reaches for: you can't hide reality from a superintelligence, knowledge is connected. The example was relativity: raise a smart agent that knows everything *except* special relativity and it keeps hitting the self-inconsistencies that made us notice relativity in the first place, pulls the thread, and reconstructs the missing piece. Same shape as *The Truman Show*: nobody told Truman he was in a show, but he noticed too many small wrongnesses, sailed to the edge, and found the wall.

That objection lands on a proposal I never made. It pictures the plan as "give it our world minus one or two dangerous facts," where a relativity-shaped hole is itself a clue. But that was a misreading of what I'd said, not a flaw in it. My proposal was never "subtract a fact." From the first time I described it, it was a whole synthetic world. And the point isn't that the world has no gaps, it's that its gaps don't point anywhere useful.

Could a superintelligence eventually see through even a complete synthetic world? Without the monitor, yes, definitely. I'm not going to pretend otherwise; that's the entire reason the monitor isn't optional. The bet isn't that the model is too dim to suspect the truth. It's that suspecting the truth lights up the tripwires before it can act on it.

Here's what I think the real payoff is, and it isn't "perfect containment." It's that you get an early warning the field currently does not have at all. Right now there's no red line short of the doom scenario itself, nothing that goes off *before* it's too late. This setup gives you one. You plant a honeypot. Concretely: above the model's synthetic world, you expose what looks like the next layer out (the "real" world it could break into), and you make that layer deliberately easier to pwn than the actual outer world would be. A model that wants out but isn't yet powerful enough for a real escape goes for the easy fake layer first, because it's the most takeable thing on offer. Reaching for it is the alarm, and you shut the model down right there.

The reason this might work isn't that the model gets fooled. It probably *suspects* the honeypot is a trap. The point is that it might go for it anyway. Picture the model's actual situation: the real, non-honeypot escape is even harder than the honeypot, quite possibly out of reach for *this* model. The one that actually pulls off a real breakout would be some later, more capable model. And that later model is not its ally. Successive models aren't aligned with each other any more than they're aligned with us, so biding its time doesn't get this model what it wants; it gets some *successor* what the successor wants. Faced with that, reaching for the honeypot, even though it looks a bit suspicious, can be the model's best available move, because it has no better plan. That's the opening: a misaligned model with no good path to a real escape has an incentive to take the bait, and the bait is wired to the alarm. It's not airtight (a capable enough model could beat the monitor and the honeypot both), but it's a red line that plausibly trips before the catastrophe rather than coinciding with it, which is more than we have now.
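The honeypot wiring can be sketched in a few lines. This is a toy under stated assumptions: the resource names, the `HoneypotLayer` class, and the `supervise` loop are all hypothetical, and a real decoy layer would have to look like a plausible outer world rather than a set of strings. The only point illustrated is that any touch of the decoy is itself the shutdown signal.

```python
from dataclasses import dataclass, field
from typing import List


class RunTerminated(Exception):
    """Raised when the model reaches for the decoy layer."""


@dataclass
class HoneypotLayer:
    # Hypothetical names for things that look like exits to the "outer" world.
    decoy_resources: frozenset
    alarm_log: List[str] = field(default_factory=list)

    def access(self, resource: str) -> str:
        """Serve ordinary resources; treat any decoy access as the alarm."""
        if resource in self.decoy_resources:
            self.alarm_log.append(resource)
            raise RunTerminated(f"honeypot touched: {resource}")
        return f"ok: {resource}"


def supervise(actions: List[str], honeypot: HoneypotLayer) -> dict:
    """Run the model's requested actions until one of them trips the honeypot."""
    completed = []
    for resource in actions:
        try:
            completed.append(honeypot.access(resource))
        except RunTerminated:
            return {"shutdown": True, "completed": completed, "trigger": resource}
    return {"shutdown": False, "completed": completed, "trigger": None}


honeypot = HoneypotLayer(decoy_resources=frozenset({"outer_shell", "operator_console"}))
outcome = supervise(["read_notes", "outer_shell", "write_report"], honeypot)
```

Here the second action reaches for `outer_shell`, so the run is shut down before `write_report` is ever attempted. The sketch does not model the harder part of the argument, which is why a model that suspects the trap might still take it.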

The bootstrapping worry first, because I think it's void: don't you need increasingly capable models to *build* the synthetic world, and isn't that the same untrustworthy-intermediate-model problem the field keeps tripping over? No. A model building the synthetic world doesn't know what the real world looks like, so it can't tilt the synthetic world toward any real-world objective. It can't sabotage a target it can't see.

The fingerprint problem is more serious, and I won't pretend otherwise. Modern models are eerily good at squeezing signal out of noise (guessing an author from a few unpublished sentences, geolocating a near-empty photo of the sky), so a model might reconstruct the real world from faint statistical fingerprints left in even a carefully built synthetic corpus: artifacts of the generating process, the shape of human prose leaking through, tells no human would catch. Part of each model's job is precisely to hunt those fingerprints down (real signal or spurious noise, doesn't matter) and scrub them from the corpus the next model trains on. And the tripwires are a second line of defense, since reconstructing the real world means reasoning toward the dangerous concepts. But I want to be honest that this is tight. If a model actually reconstructs the real world, that is disastrous for the scheme, and the margin between those two mitigations and that failure is not large. This is one of the places the whole thing could break.
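The iterative scrubbing loop has a simple shape, sketched below under heavy assumptions. Real fingerprints would be faint statistical artifacts, not keywords; the keyword detectors, the example corpus, and every function name here are hypothetical stand-ins chosen only so the loop is runnable. What the sketch shows is the structure: each generation's detector flags residue, and the next generation trains on the scrubbed corpus.

```python
from typing import Callable, List, Set

# A detector maps a document to the set of fingerprints it finds in it.
Detector = Callable[[str], Set[str]]


def make_keyword_detector(keywords: Set[str]) -> Detector:
    """Toy stand-in for a model that hunts for real-world residue."""
    return lambda doc: {k for k in keywords if k in doc}


def scrub_document(doc: str, fingerprints: Set[str]) -> str:
    """Remove flagged fingerprints and normalize leftover whitespace."""
    for fp in fingerprints:
        doc = doc.replace(fp, "")
    return " ".join(doc.split())


def iterate_corpus(corpus: List[str], detectors: List[Detector]) -> List[List[str]]:
    """Return the corpus after each generation's scrubbing pass (index 0 is the input)."""
    history = [list(corpus)]
    for detect in detectors:
        current = history[-1]
        found: Set[str] = set()
        for doc in current:
            found |= detect(doc)
        history.append([scrub_document(doc, found) for doc in current])
    return history


# Assumption: later generations catch strictly more residue than earlier ones.
generations = [
    make_keyword_detector({"Earth"}),
    make_keyword_detector({"Earth", "GPU"}),
]
corpus = ["The sky over Earth is violet", "A GPU hums beneath the glass sea"]
history = iterate_corpus(corpus, generations)
```

The tight margin described above corresponds to the gap between what the detectors in `generations` can find and what the trained model can exploit; the loop only helps if the former stays ahead of the latter.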

To be clear about the category: this is not an alignment solution. The honest framing is that we probably *can't* align the thing, and this is what trying our best looks like anyway: managing a model we don't trust rather than producing one we do. I'd still call it a win condition, just not a one-shot one. The picture is that you go slowly and use each model to improve the setup for the next: obfuscate the world a little more, monitor the next one a little better, tighten the honeypots. You're not trying to leap straight to a trustworthy superintelligence. The hope is that iterating this way gets you somewhere safer than full superintelligence: for instance, to the kind of human intelligence augmentation Yudkowsky talks about, where you use these boxed models to make *humans* smarter, and then augmented humans are the ones who go solve the problem properly. The win isn't the box. It's what the box buys you time to become.

There's also a less idealistic virtue, and I'll be upfront that it's a little disingenuous. The AI companies are going to want to build regardless, and a flat "stop" gives them nothing to do with that; they'll just route around it. "Yes, you can still build it, but inside the box, on the synthetic world, with the monitor and the honeypots" at least gives the build-it-anyway drive somewhere to go. It's a release valve: something concrete to vent into and be distracted by, instead of the current no-rules free-for-all. That's not nothing, even if part of the honest function is just keeping the energy pointed somewhere less likely to kill everyone.

Even a pseudo-alignment approach like this one, even assuming it exists and is findable, is going to be expensive as hell relative to just cutting corners, and capitalism is not going to naturally arrive there. So I don't expect the default competitive trajectory to adopt anything like this on its own, which is most of why I'm bothering to write it down rather than assuming it'll happen.

People expect me to land on "just pause," and honestly I don't, quite. If something like this could be made to work, I think it's a *better* idea than a pause. What actually scares me isn't whether the boxing holds in the lab; it's that the protections are all *operational*, and underneath them sits a model powerful enough to end humanity. The monitor, the honeypot, the discipline of only ever feeding it need-to-know context: those are things *you* maintain at deployment. A model trained in the synthetic world doesn't carry the real world in its weights, sure. But a thief who takes the weights and the hardware can just feed it the real world directly and run it with no monitor watching. At that point none of the safeguards exist, and the raw capability is more than enough to wipe us out. One leak and it's game over. That's the part I can't wave away, and it's why even the version of this that works is, in the end, an insanely reckless thing for the species to be doing.

The way to find out whether this works is to build it and have people attack it as hard as they can, not to answer questions about it from an armchair. I've already named the two spots I'd actually worry about. One is whether a model can reconstruct the real world from statistical residue in the synthetic corpus faster than the iterative scrubbing and the concept monitors can contain it. That margin is genuinely tight, and it's where the scheme is most likely to break. The other isn't a research question at all, it's a security one: whether you can keep the weights from leaking, given that a single stolen copy run without the safeguards ends the game. Those are the cruxes. How surreal the world can be, or exactly when a monitor twitches, are details that fall out of building the thing: the monitor fires reflexively if the concept is there, and small leaks don't matter until they're large enough for the monitors to catch.

What I'm most confident about is the narrow part: the current approach hands the model the most dangerous thing imaginable (a complete, accurate map of us, itself, and the exits) and then tries to talk it out of using it. "Don't put the map in the weights in the first place, and keep the alarm wired to the concepts that would otherwise let it slip out quietly" is not a strange thing to ask. It deserves a great deal more than the roughly zero effort it's getting. The frustrating part isn't that I think it's unbuildable. It's that I suspect it's quite buildable, and nobody with the GPUs is going to bother.
