What-If World: A Causal Benchmark for General World Models in Embodied Scenarios


Source: arxiv.org · Published: 2026-05-28T04:00Z · Topic: artificial-intelligence

Researchers have introduced What-If World, a benchmark of 319 prompt pairs that tests whether video generation models correctly simulate the effect of a single physical change in driving and robotic manipulation scenes. Across nine state-of-the-art models, none exceeds 52% on the paired score, and open-source models cluster near 28%, which suggests that current video generators do not reliably predict how altering one physical variable changes an outcome. Where models do score well, performance appears to track the visual prominence of the change rather than the tractability of its underlying physics, a gap that matters for action-conditioned simulation and model-based planning.

1 min read · Published May 28, 2026

arXiv:2605.27589v1

Abstract: Video generation models are increasingly used as world simulators for tasks like driving and robotic manipulation. What matters in these settings is not whether a single video looks right, but whether the model's output changes when its input changes. We test this by giving a model two prompts describing the same scene with one physical detail varied, and checking whether the two videos diverge the way physics predicts. The wording difference between the prompts is small by design, since only one variable is changed, but the correct physical difference is not. A model that misses this can still produce two videos that each look plausible individually, and existing benchmarks score videos one at a time and cannot detect this failure.

We introduce What-If World, 319 such prompt pairs built on real frames from nuScenes and DROID, organized by a taxonomy of six physical variables shared across driving and manipulation. Each pair is scored with APEO, a four-part rubric checking whether each video follows its prompt (Adherence), is physically consistent (Physics), preserves the shared scene (Environment), and ends in the correct difference (Outcome).

Across nine state-of-the-art models, no system exceeds 52% on the paired score, and open-source models cluster near 28%. Every model tested fails on a large fraction of causal interventions, indicating substantial room before these models can reliably support action-conditioned simulation or model-based planning. Where models do score well, performance appears to track the visual prominence of the intervention rather than the tractability of its underlying physics. Some visually subtle interventions score as low as 14.2%, while visually pronounced ones reach 40.4%.
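To make the idea of a paired score concrete, here is a minimal Python sketch of how APEO-style judgments for a prompt pair might be combined. The abstract does not say how the four checks are aggregated, so the strict rule used here (a pair counts only if both videos pass Adherence, Physics, and Environment, and the pair passes Outcome) is an assumption, not the authors' method. The class names, field names, and the placeholder variable labels `var_a` and `var_b` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoJudgment:
    """Per-video rubric checks (hypothetical field names)."""
    adherence: bool    # video follows its own prompt
    physics: bool      # video is physically consistent
    environment: bool  # shared scene is preserved


@dataclass(frozen=True)
class PairJudgment:
    """Judgments for one prompt pair; `outcome` is a pair-level check."""
    variable: str          # placeholder label for the varied physical variable
    video_a: VideoJudgment
    video_b: VideoJudgment
    outcome: bool          # the two videos end in the correct difference


def pair_passes(pair: PairJudgment) -> bool:
    """Assumed aggregation: strict AND over every check on both videos."""
    per_video = all(
        v.adherence and v.physics and v.environment
        for v in (pair.video_a, pair.video_b)
    )
    return per_video and pair.outcome


def paired_score(pairs: list[PairJudgment]) -> float:
    """Percentage of pairs that pass under the assumed rule."""
    if not pairs:
        raise ValueError("no pairs to score")
    return 100.0 * sum(pair_passes(p) for p in pairs) / len(pairs)


ok = VideoJudgment(adherence=True, physics=True, environment=True)
pairs = [
    PairJudgment("var_a", ok, ok, outcome=True),
    # Each video looks fine on its own, but the difference is wrong.
    PairJudgment("var_a", ok, ok, outcome=False),
    # One video violates physics, so the pair fails.
    PairJudgment("var_b", ok, VideoJudgment(True, False, True), outcome=True),
    PairJudgment("var_b", ok, ok, outcome=True),
]
print(paired_score(pairs))  # 50.0
```

The second pair illustrates the failure the abstract describes: both videos pass every single-video check, so a benchmark that scores videos one at a time would accept them, while the pair-level Outcome check rejects the pair.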

source & further reading

Read original on arxiv.org → arxiv.org/abs/2605.27589
