DVD-JEPA – a JEPA world model that dreams a bouncing DVD logo

Researchers developed DVD-JEPA, a Joint-Embedding Predictive Architecture world model that learns the physics of a bouncing DVD logo from pixels without explicit coordinates. The model predicts the embedding of the next frame in a latent representation space and can flag anomalies when reality deviates from its expectations. The decoder is optional: a pure JEPA operates without one and outputs only latent vectors.

A small but real Joint-Embedding Predictive Architecture: a context encoder, an EMA target encoder, and a predictor that imagines the future in representation space. It learned the physics of a bouncing DVD logo without ever being told a coordinate. The decoder is optional; a pure JEPA only speaks in vectors. Everything below is the trained model running client-side: no server, no GPU.

[Interactive demo panels: Reality (ground truth); JEPA's expectation (decoded); Predictive surprise (reality vs. expectation); The model's mind (32-d latent z)]

Tip: turn the Decoder off to see what a pure JEPA actually gives you: just the 32 latent bars. It understands the bounce perfectly and refuses to draw it. Turn it back on to render the dream. Hit Inject anomaly to teleport the logo and watch the surprise meter spike.

01 / predict
Future in latent space
The predictor steps one tick forward as a vector, not a picture. Trained to match an EMA target encoder's embedding of the real next frame: the core JEPA objective.

02 / render
The optional decoder
A pure JEPA has no decoder. Bolt one on and the latent dream becomes pixels, turning the model into a future-frame video predictor you can actually watch.

03 / detect
Surprise = anomaly
When reality stops matching the rendered expectation, prediction error spikes. That's a usable anomaly signal, the same job a real egocentric-video world model does.
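The predict and detect steps described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the demo's actual code: the function names (`ema_update`, `latent_loss`, `is_anomaly`), the decay value, the mean-plus-k-standard-deviations threshold, and the toy 4-d vectors are all hypothetical. The page describes surprise as reality versus the rendered expectation; the same error function would apply to flattened pixel frames as well as to latent vectors.

```python
# Illustrative sketch only: names, the decay value, and the threshold rule
# are assumptions, not taken from the DVD-JEPA demo's code.

def ema_update(target_weights, online_weights, decay=0.99):
    """Nudge the target encoder's weights toward the context encoder's weights."""
    return [decay * t + (1.0 - decay) * o
            for t, o in zip(target_weights, online_weights)]


def latent_loss(predicted_z, target_z):
    """Mean squared error between the predicted latent and the target latent.

    In training, target_z would come from the EMA target encoder and be
    treated as a constant (no gradient flows through it).
    """
    return sum((p - t) ** 2 for p, t in zip(predicted_z, target_z)) / len(predicted_z)


def is_anomaly(surprise, recent_surprise, k=4.0):
    """Flag a surprise value far above the recent baseline (mean + k * std)."""
    mean = sum(recent_surprise) / len(recent_surprise)
    var = sum((s - mean) ** 2 for s in recent_surprise) / len(recent_surprise)
    return surprise > mean + k * var ** 0.5


if __name__ == "__main__":
    # Toy 4-d latents standing in for the demo's 32-d latent z.
    predicted = [0.2, -0.1, 0.5, 0.0]
    normal_next = [0.21, -0.09, 0.49, 0.01]   # reality follows the bounce
    teleported = [-0.8, 0.9, -0.3, 0.7]       # logo jumps somewhere unexpected

    # Hypothetical surprise values from recent, well-predicted frames.
    baseline = [0.0001, 0.0002, 0.00015, 0.0001]
    print(is_anomaly(latent_loss(predicted, normal_next), baseline))
    print(is_anomaly(latent_loss(predicted, teleported), baseline))
```

The threshold is relative to recent surprise values rather than a fixed constant, so the detector adapts to whatever error level the predictor settles at; a fixed cutoff would be an equally reasonable choice for a toy setting like this one.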