Hallucination in World Models Is Predictable and Preventable

Researchers at a major AI lab trained a 350M-parameter generative world model on 210 tasks and found that hallucination in world models is predictable and preventable, primarily a data-coverage problem. They introduced MMBench2, an open-source benchmark with 210 tasks across 10 domains, to study and mitigate these hallucinations, which can lead to incorrect downstream decisions in planning and policy learning.

Live interaction with our 350M-parameter world model trained on 210 tasks. Our hallucination predictors run at every step; a red border indicates that a hallucination is detected.

Modern generative world models render strikingly realistic, action-controllable futures. But the rollouts they produce frequently hallucinate: they stay visually fluent and superficially plausible while drifting away from the ground-truth dynamics. When used downstream for planning or policy learning, model hallucination leads to incorrect decisions. In this work, we train a 350M-parameter generative world model on a large dataset spanning 210 tasks and show that, even at this scale, hallucination is both predictable (we can predict when it will happen) and preventable (the underlying issue is, to a great extent, fixable).

An open-loop rollout from our 350M-parameter base model (right) vs. its ground truth (left). The imagined trajectory looks visually plausible but largely ignores the action sequence it was conditioned on. This is exactly the type of hallucination we set out to study.

We argue that hallucination in world models is, first and foremost, a data-coverage problem, making it both predictable and preventable. Studying coverage needs three things no benchmark offered at once: full control of the training pipeline, behaviorally diverse data across many tasks, and live simulators to probe the gaps online. So we built MMBench2, which includes ground-truth actions, rewards, language instructions, and a live environment for every task. Naturally, MMBench2 is fully open-source.

MMBench2 includes 210 tasks spanning 10 domains. Tasks include locomotion, manipulation, navigation, arcade-style environments, and more. All clips are generated by our 350M-parameter base model trained on MMBench2. If you look closely, you may notice occasional hallucinations.
The corpus contains an equal number of trajectories per task but is imbalanced in terms of frames. Episode lengths range from 25 (ManiSkill3) to 1,000 (Atari) steps, so the frame distribution is heavy-tailed. That non-uniformity is exactly the coverage structure we set out to study.

Per-task frame counts across all 210 tasks, sorted from high to low and colored by domain (log scale). The dashed line marks the per-task median of 65,260 frames.

On MMBench2 we train a 350M-parameter world model that largely follows the Dreamer 4 recipe. It consists of a video tokenizer, an action-conditioned dynamics model, and a video decoder. Any of its three components can fail independently, resulting in hallucination.

A video tokenizer encodes each frame into a continuous latent code $z$, trained jointly with the decoder via masked autoencoding.

A block-causal Transformer predicts the next latent from past latents and an action token, trained with shortcut flow-matching. Encoder and decoder are frozen during dynamics training.

A decoder renders latent codes back to pixels. The decoder is used for supervision during tokenizer training and for human viewing at test time.

Because the stages compose sequentially, a hallucination introduced early (e.g., a corrupted encoding) is propagated and amplified by everything downstream. Naming which stage produced a failure is therefore the first step to fixing it. We identify three types of hallucination, each of which traces to a specific component of the world model. In the following, we contrast stable predictions (✓) with each type of hallucination (×).

Surprisingly, the world model can hallucinate before any dynamics prediction at all. When the encoder/decoder is presented with an unseen observation, it may snap that unfamiliar structure onto the nearest scene it knows, for example by dropping a small object or even reconstructing an unseen maze as a seen one.
For a dynamics model to be useful for decision-making, it needs to respond to actions reliably: a different action should lead to a different outcome. If the training data has limited action diversity, the world model is likely to marginalize over actions, i.e., to generate the same trajectory regardless of the action.

A world model can be expected to suffer from compounding error as the rollout horizon increases. However, we find that, regardless of rollout horizon, dynamics can also diverge abruptly when entering low-coverage regions of the state space. This may result in the agent teleporting or penetrating walls, or in objects suddenly disappearing.

We find that model hallucinations can be detected at runtime. We derive three label-free predictors that are computable on the fly from quantities the model already produces. Based on these metrics, we can then predict and visualize where hallucinations will happen.

Tokenizer round-trip residual, $u_r = \lVert \hat z - \mathrm{Enc}(\mathrm{Dec}(\hat z)) \rVert$: how far a predicted latent moves when its decoded frame is re-encoded. On-manifold predictions survive the round trip; hallucinated ones drift.

Flow instability: how much the denoiser's clean-frame prediction moves between Euler substeps. A well-conditioned step settles fast; an under-conditioned one keeps oscillating.

Inter-seed variance: how much the next-latent prediction varies across independent denoising seeds. Concentrated predictions indicate a well-determined transition; dispersed ones indicate where rollouts diverge.

In practice we use dynamism-normalized variants $u^{\text{norm}} = u/m$, dividing out per-step scene motion $m$ so each hallucination predictor tracks uncertainty relative to how much is happening in the scene.

If we visualize hallucination as measured by our predictors (in this case the tokenizer round-trip residual $u_r$) across the state space, a pattern becomes clear: $u_r$ is high exactly where there is low state density in the training data.
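To make the three predictors concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not our implementation: latents are plain lists of floats, `toy_enc` and `toy_dec` are hypothetical stand-ins for the tokenizer, the exact reductions used for flow instability and inter-seed variance (mean step distance, mean per-dimension variance) are assumed, and the small `eps` in the normalization is added only to avoid division by zero.

```python
import math
from statistics import fmean, pvariance


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def round_trip_residual(z_hat, enc, dec):
    """u_r = || z_hat - Enc(Dec(z_hat)) ||."""
    return l2(z_hat, enc(dec(z_hat)))


def flow_instability(clean_preds):
    """Mean movement of the clean-frame prediction between consecutive
    Euler substeps (assumed reduction)."""
    steps = [l2(a, b) for a, b in zip(clean_preds, clean_preds[1:])]
    return fmean(steps) if steps else 0.0


def inter_seed_variance(seed_preds):
    """Mean per-dimension variance of the next-latent prediction across
    independent denoising seeds (assumed reduction)."""
    return fmean(pvariance(dim) for dim in zip(*seed_preds))


def normalize_by_motion(u, m, eps=1e-8):
    """Dynamism-normalized predictor u / m; eps is an added safeguard."""
    return u / (m + eps)


# Hypothetical toy tokenizer: the "data manifold" is the box [-1, 1]^d.
def toy_dec(z):
    return list(z)


def toy_enc(frame):
    return [max(-1.0, min(1.0, x)) for x in frame]
```

With this toy tokenizer, a latent inside the box survives the round trip with zero residual, while a latent outside it is pulled back to the boundary and yields a positive residual, mirroring the on-manifold vs. hallucinated distinction above.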
Across 9k held-out sequences, each hallucination predictor tracks the realized rollout error at Spearman $\rho \approx 0.8$, without requiring any labels or additional training.

Each point corresponds to a held-out 24-frame trajectory. The purple curve is the median; the dashed line marks the scene-divergence threshold ($\Delta$PSNR = 0).

Perhaps most interestingly, we can also use our three metrics to detect hallucination at runtime. In the following, we visualize normalized values for each of the three predictors as a function of time: the predictors reach their hallucination threshold as the rollout starts diverging. For comparison, we also include examples of rollouts where no hallucination occurs.

Per-frame round-trip residual, flow instability, and inter-seed variance from a single autoregressive model rollout, each shown relative to its hallucination threshold.

If every failure mode is a coverage gap, one data-centric lever should move all three at once. We resample the existing corpus to be uniform across tasks rather than frames, upweighting under-represented tasks and improving results at no additional cost.

Mean change vs. the base model on held-out trajectories across 200 tasks, applying coverage-aware training to both tokenizer and dynamics. We observe sizable improvements in model quality, while all three hallucination predictors decrease.

Since hallucination is found to be a data-coverage gap, a simple yet effective strategy is to correct model error via targeted data collection. During live environment interaction, we roll out candidate trajectories in the world model, score them by predicted hallucination, and execute the most hallucination-prone one. This allows us to reduce model hallucination with just 50 trajectories per task collected autonomously.

Point Maze (OGBench), an unseen layout.
The base model quietly drifts to a seen layout; targeted data collection and finetuning recover the true geometry.

Dungeon Explorer (MiniArcade), an unseen transfer task. The base model drifts into a scene from a visually similar Atari game; after finetuning with just 50 trajectories, the model faithfully captures the dynamics of this new task.

Cup Catch (DMControl), an unseen variant. The base model hallucinates visuals seen during training; finetuning restores the correct visuals.

Reacher Easy (MiniArcade), an unseen task. The base model dissolves the scene entirely; finetuning restores the arm and its target, but the dynamics prediction still diverges eventually.

These clips are qualitative; the finetuned model simply looks right. However, what matters in the end is whether a world model is good enough to act with. To answer this, we evaluate closed-loop planning (MPC) performance with each of 6 finetuned models, varying only the data source (the policy used to collect those 50 trajectories). If curiosity-based data collection using our proposed hallucination predictors is effective, its downstream task performance should approach that of privileged data collection strategies that rely on humans or expert policies.

Why do the data collection strategies rank the way they do? We find that it is, yet again, related to coverage. Trajectories collected via each policy on a maze with a central bottleneck illustrate this: curiosity tends to target walls, which is exactly where the model has been found to hallucinate, e.g., by penetrating them.

In summary, we show that finetuning with just 50 trajectories can greatly improve world modeling in both seen and unseen tasks, and that a majority of these gains can be realized autonomously via curiosity-based exploration targeting hallucination-prone regions.
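The curiosity-based collection step (roll out candidate action sequences in the world model, score each by predicted hallucination, execute the highest-scoring one) can be sketched as follows. This is a simplified illustration, not our pipeline: `pick_exploration_plan`, `toy_imagine`, and `toy_score` are hypothetical names, the 1-D world is a toy, and the score (distance outside a "covered" interval) merely stands in for the hallucination predictors.

```python
def pick_exploration_plan(candidates, imagine, hallucination_score):
    """Return (plan, score) for the candidate whose imagined rollout is
    predicted to be most hallucination-prone. Ties go to the first one."""
    if not candidates:
        raise ValueError("need at least one candidate action sequence")
    scores = [hallucination_score(imagine(actions)) for actions in candidates]
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best], scores[best]


# Hypothetical toy world model: a 1-D position integrating its actions.
def toy_imagine(actions, start=0.0):
    pos, states = start, []
    for a in actions:
        pos += a
        states.append(pos)
    return states


# Hypothetical stand-in for a predictor: how far the rollout strays
# outside the interval that the training data is assumed to cover.
def toy_score(states, covered=(-1.0, 1.0)):
    lo, hi = covered
    return max(max(lo - s, s - hi, 0.0) for s in states)


candidates = [
    [0.1, 0.1, 0.1],    # stays well inside the covered region
    [1.0, 1.0, 1.0],    # leaves the covered region quickly
    [-0.5, -0.5, 0.5],  # touches the boundary but never leaves it
]
plan, score = pick_exploration_plan(candidates, toy_imagine, toy_score)
```

In this toy setting the second candidate is selected, since it is the only one that leaves the covered interval. Executing the selected plan in the live environment and adding the resulting trajectory to the finetuning set is what closes the loop.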
The complete bibliography is available in our paper (paper.pdf).