Neural Voxel Dynamics: Learning Implicit 3D Physics via Volumetric Feature Advection

Researchers introduced Neural Voxel Dynamics, a self-supervised framework that learns implicit 3D physics from video by lifting 2D features into a volumetric latent space. The method achieves long-term structural stability and physical plausibility on benchmarks without relying on explicit simulators, offering a scalable path toward general-purpose dynamic world models.

arXiv:2606.26410v1 Announce Type: new Abstract: We present a self-supervised framework for learning implicit 3D physical dynamics directly from video-derived supervisory signals. While current generative video models achieve high visual fidelity, they lack a 3D geometric foundation, which often results in physical inconsistencies and a failure to maintain object permanence. We address this by shifting the predictive bottleneck from 2D image space to a 'lifted' 3D Volumetric Latent Space. Our method unprojects semantic features from a Video Joint-Embedding Predictive Architecture (V-JEPA) into a voxelized grid, grounded by monocular depth priors. This lifting enables Volumetric Feature Advection: learning an action-conditioned transition operator that treats physics as a spatio-temporal state advection problem, i.e., learning implicit 3D physics. Unlike state-of-the-art hybrid models that rely on explicit classical simulators for training and/or inference, our architecture tracks material states implicitly within high-dimensional V-JEPA features. This allows for the emergent simulation of heterogeneous phenomena (e.g., rigid body motion in fluid flow) within a single, unified pipeline. Supervised solely via end-to-end video-derived signals plus action conditions, without access to physics engine internal states, labels, or surrogate models, our model demonstrates good long-term structural stability and physical plausibility on multiple benchmarks (CLEVERER, PhysInOne, PhysGaia). We believe that this work opens a scalable pathway toward general-purpose dynamic world models that internalize the 3D invariants of the physical world solely through passive observation of monocular videos.
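
The two operations the abstract names, lifting 2D features into a voxel grid with depth and advecting the resulting volumetric state, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes a pinhole camera with known intrinsics (`fx`, `fy`, `cx`, `cy`), a metric depth map, and a hand-specified velocity field standing in for whatever a learned, action-conditioned transition operator would predict. The function names `unproject_to_voxels` and `advect` are hypothetical, and the advection uses a simple nearest-neighbor semi-Lagrangian backtrace chosen only for illustration.

```python
import numpy as np


def unproject_to_voxels(features, depth, fx, fy, cx, cy,
                        grid_min, voxel_size, grid_shape):
    """Scatter per-pixel features into a voxel grid (hypothetical helper).

    features: (H, W, C) per-pixel feature map.
    depth:    (H, W) depth along the camera z-axis (assumed metric).
    Returns a (X, Y, Z, C) grid of mean features and a (X, Y, Z) hit count.
    """
    H, W, C = features.shape
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")

    # Pinhole back-projection: pixel (u, v) with depth z -> camera-space point.
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)

    # Convert points to integer voxel indices and drop those outside the grid.
    idx = np.floor((points - np.asarray(grid_min)) / voxel_size).astype(int)
    inside = np.all((idx >= 0) & (idx < np.asarray(grid_shape)), axis=1)
    inside &= depth.reshape(-1) > 0
    idx = idx[inside]
    feats = features.reshape(-1, C)[inside]

    # Accumulate features per voxel, then average where at least one pixel landed.
    grid = np.zeros(tuple(grid_shape) + (C,), dtype=np.float64)
    count = np.zeros(tuple(grid_shape), dtype=np.int64)
    np.add.at(grid, (idx[:, 0], idx[:, 1], idx[:, 2]), feats)
    np.add.at(count, (idx[:, 0], idx[:, 1], idx[:, 2]), 1)
    occupied = count > 0
    grid[occupied] /= count[occupied][:, None]
    return grid, count


def advect(grid, velocity, dt=1.0):
    """One nearest-neighbor semi-Lagrangian advection step (illustrative only).

    grid:     (X, Y, Z, C) voxel features.
    velocity: (X, Y, Z, 3) displacement per step, in voxel units. In a learned
              model this field would be predicted; here it is given by hand.
    """
    X, Y, Z, _ = grid.shape
    shape = np.array([X, Y, Z])
    pos = np.stack(
        np.meshgrid(np.arange(X), np.arange(Y), np.arange(Z), indexing="ij"),
        axis=-1,
    )

    # Backtrace: each voxel pulls its new value from where the content came from.
    src = np.rint(pos - velocity * dt).astype(int)
    valid = np.all((src >= 0) & (src < shape), axis=-1)
    src = np.clip(src, 0, shape - 1)

    out = grid[src[..., 0], src[..., 1], src[..., 2]]  # fancy indexing copies
    out[~valid] = 0.0  # content entering from outside the grid is empty
    return out
```

Calling `unproject_to_voxels` on a feature map and depth map yields a sparse volumetric state. Repeatedly applying `advect` with a velocity field rolls that state forward in time, which is the sense in which dynamics can be framed as state advection. A learned model would replace the fixed velocity field and the hard nearest-neighbor lookup with differentiable, data-driven components.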