{"slug": "gem-4d-geometry-enhanced-video-world-models-for-robot-manipulation", "title": "GEM-4D: Geometry-Enhanced Video World Models for Robot Manipulation", "summary": "Researchers have developed GEM-4D, a geometry-grounded video world model that generates physically consistent video predictions for robot manipulation by injecting dense 4D correspondence supervision, distilled from a pretrained geometry foundation model, into the video generative backbone during training. The model achieves state-of-the-art performance on video prediction and geometric consistency, and an inverse dynamics module converts its correspondence-consistent video rollouts into executable robot trajectories, improving real-world manipulation success from 61% to 81%.", "body_md": "arXiv:2605.22882v1 Announce Type: new\nAbstract: Video world models can generate realistic futures from a single instruction, but they often fail to preserve consistent point-level motion over time. As a result, the generated videos appear plausible, yet lack the physical grounding required for reliable action execution, such as robot manipulation. We present GEM-4D, a geometry-grounded video world model that resolves this limitation by injecting dense 4D correspondence supervision, distilled from a pretrained geometry foundation model, into the video generative backbone during training. This supervision enables the model to jointly capture appearance and geometric structure while retaining a single-stream architecture with no additional inference cost. We further introduce an inverse dynamics module that converts correspondence-consistent video rollouts into executable robot trajectories, enabling direct deployment in both real-world and simulated manipulation. GEM-4D achieves state-of-the-art performance on both video prediction and geometric consistency across simulation and realistic scenarios and improves real-world manipulation success from 61% to 81%. 
Additional results are available at the project page: https://anonymous-submission-20.github.io/gem.github.io/.", "url": "https://wpnews.pro/news/gem-4d-geometry-enhanced-video-world-models-for-robot-manipulation", "canonical_source": "https://arxiv.org/abs/2605.22882", "published_at": "2026-05-25 04:00:00+00:00", "updated_at": "2026-05-25 15:19:47.950114+00:00", "lang": "en", "topics": ["robotics", "computer-vision", "generative-ai", "machine-learning", "artificial-intelligence"], "entities": ["GEM-4D"], "alternates": {"html": "https://wpnews.pro/news/gem-4d-geometry-enhanced-video-world-models-for-robot-manipulation", "markdown": "https://wpnews.pro/news/gem-4d-geometry-enhanced-video-world-models-for-robot-manipulation.md", "text": "https://wpnews.pro/news/gem-4d-geometry-enhanced-video-world-models-for-robot-manipulation.txt", "jsonld": "https://wpnews.pro/news/gem-4d-geometry-enhanced-video-world-models-for-robot-manipulation.jsonld"}}