As the AI industry looks beyond language models, Nvidia is betting big on the buzzy new technology powering physical AI: world models.
At Nvidia GTC Taipei at Computex, the company unveiled Cosmos 3, a new generalist world foundation model that it calls a "fully open omnimodel," capable of reasoning and generation across text, video, images, ambient sound and action. This iteration of the Cosmos world model family builds on previous generations by improving generalization, the lack of which is a major barrier to physical AI development and deployment.
"We wanted to build this Cosmo 3 model to help physical AI developers to build more generalizable physical AI models," Ming-Yu Liu, Nvidia's VP of Cosmos Labs,** **told The Deep View.
Cosmos 3 debuts a number of world model innovations, Liu said:
- The model utilizes a new architecture called "mixture-of-transformers," which combines the best aspects of two types of transformers: one for reasoning and one for generation. This enables it to understand object interactions, motion, and spatiotemporal relationships before generating video or action paths.
- Cosmos 3 also doesn’t treat just one kind of data as a first-class citizen, said Liu. Instead, being omnimodal, it reasons with and generates "image, video, sound, and action, together with text," he said.
- Additionally, Cosmos 3 is trained on one of the largest multimodal datasets for physical AI, spanning 20 trillion tokens, 1 billion images and 400 million authentic and synthetic videos.
The model comes in several sizes: Super, the larger model for high-quality physics and accuracy, and Nano, for more efficient, quick generation needs, both of which are available now. Edge, which offers real-time inference for edge computing, will be available soon.
The models are also open-source, which Liu said offers developers more control and usability in physical AI development, a process that can be "challenging to do with API assets only." That allows enterprises to run them locally, customize them for their needs, and better control data security.
Because the foundation models themselves are "just a starting point for physical AI developers," the goal is to integrate these models into ecosystems to provide a foundation for solving critical problems, he said.
Cosmos 3 is one step toward solving one of physical AI’s most pressing challenges. "We believe that the key problem to solve in physical AI is the generalization capability of the agent," Liu said. "To be clear, [Cosmos] is not yet solving the problem, but I think this architecture provides a great foundation to solve what I think is the holy grail in robotics."
Our Deeper View
With Cosmos, Nvidia is feeding the open model ecosystem, both for the ecosystem's benefit and for its own. Beyond giving developers a foundation for pursuing what Liu calls robotics’ "holy grail," any opportunity to feed a market that will inevitably demand more compute is ultimately an opportunity for Nvidia to make money, and potentially to improve its own chips through extreme hardware co-design. And while the benefits would flow back to Nvidia, a rising tide lifts all boats. As the industry broadly embraces the promise of physical AI, Nvidia's willingness to share its resources and innovations will help stimulate further progress.