{"slug": "develop-physical-ai-reasoning-world-and-action-models-with-nvidia-cosmos-3", "title": "Develop Physical AI Reasoning, World, and Action Models with NVIDIA Cosmos 3", "summary": "NVIDIA has released Cosmos 3, a unified foundation model for physical AI that combines reasoning, world generation, and action generation into a single open-source system. The model uses a Mixture-of-Transformers architecture with separate reasoner and generator towers to interpret multimodal inputs and produce physics-aware video and action outputs. NVIDIA is open-sourcing the model checkpoints, training scripts, deployment tools, and datasets to enable developers to build robotic manipulation systems, autonomous vehicles, and warehouse monitoring solutions.", "body_md": "[Physical AI](https://www.nvidia.com/en-us/glossary/generative-physical-ai/) systems must understand the real world before they can act within it. Robots, autonomous vehicles, and smart spaces need to understand what’s happening in their world, predict what’s likely to happen next, and generate actions for specific environments, embodiments, and tasks.\n\n[NVIDIA Cosmos 3](https://www.nvidia.com/en-us/ai/cosmos/) is a frontier foundation model for physical AI that combines physical reasoning, world generation, and action generation within a single open model.\n\nNVIDIA is open sourcing Cosmos 3 models, training scripts, deployment tools, and datasets to make physical AI development more open and reproducible. 
This blog post covers the fundamentals of Cosmos 3, highlights key concepts from the [technical report](https://research.nvidia.com/labs/cosmos-lab/cosmos3/technical-report.pdf), walks through technical workflows, and shows how teams building robotic manipulation systems, [autonomous vehicles](https://www.nvidia.com/en-us/glossary/autonomous-vehicles/), and warehouse monitoring solutions can get started.\n\nKey highlights of this release include:\n\n- NVIDIA Cosmos 3 Nano and NVIDIA Cosmos 3 Super model checkpoints on [Hugging Face](https://huggingface.co/collections/nvidia/cosmos3), with code on [GitHub](https://github.com/nvidia/Cosmos).\n- Open datasets for physical AI applications like robotics and autonomous driving.\n- Open post-training scripts for adapting Cosmos 3 to your domain.\n- Cosmos NIM microservices for easy, optimized deployment on NVIDIA GPUs.\n\n## What’s new in Cosmos 3\n\nPrevious Cosmos releases separated world generation, physical understanding, and controlled scene generation into different models and workflows. This release unifies those capabilities with a [Mixture-of-Transformers](https://www.nvidia.com/en-us/glossary/mixture-of-transformers/) (MoT) architecture built around two towers.\n\n- **Reasoner tower**: A [vision-language model](https://www.nvidia.com/en-us/glossary/vision-language-models/) (VLM) that interprets multimodal observations like images, videos, and text. This tower uses an autoregressive architecture to interpret the input and understand motion, object interactions, and other physical context. It serves as the ‘brain’ that reasons about the world before any generation happens.\n- **Generator tower**: Generates future observations and action sequences. This tower uses a diffusion-based process to generate physics-aware video and action outputs that are conditioned on the reasoner tower’s understanding. 
The reasoner can be called independently, but the generator always activates both towers for guided generation.\n\nThis architecture enables a single model to handle both reasoning and generation tasks, simplifying development by eliminating orchestration between multiple models and inference pipelines.\n\n### Choose the right model size\n\nTwo Cosmos 3 models are currently available:\n\n- **Cosmos 3 Nano** is the compact version with 16B parameters, optimized for efficient inference. It’s designed to run on workstation-grade compute, like the NVIDIA RTX PRO 6000 GPU, for real-time robotics inference and physical AI applications.\n- **Cosmos 3 Super** is a 64B-parameter model designed for maximum quality and capability. It delivers the highest benchmark scores and targets datacenter deployment on NVIDIA Hopper and NVIDIA Blackwell GPUs, making it suitable for large-scale synthetic data generation and advanced physical reasoning workloads.\n\n### Supported modalities\n\nCosmos 3 supports the following input and output modalities through its unified architecture:\n\n| Input | Output | Application |\n| --- | --- | --- |\n| Text | Image | Physically plausible image generation |\n| Text, Video | Video | World model for rare edge case video data generation |\n| Text, Image | Video | World model for prediction |\n| Text, Image, Video | Text | VLM for reasoning |\n| Action, Video, Text | Video | Action-conditioned world model |\n| Video, Text | Video, Action | World action model, video action model, vision language action model, policy model for robot learning |\n\n*Table 1. Input and output modalities supported by Cosmos 3 for different applications*\n\n### Open datasets for physical AI\n\nWith the Cosmos 3 release, NVIDIA is open-sourcing six synthetic data generation (SDG) datasets on Hugging Face. 
These cover robotics, physics simulation, spatial reasoning, human motion, driving, and warehouse environments, and can be used for post-training Cosmos 3 and other models.\n\nPhysical AI World Model Synthetic Datasets include:\n\n- [Embodied robot scenes](http://huggingface.co/datasets/nvidia/PhysicalAI-SDG-RobotSim)\n- [Physical interaction scenes](http://huggingface.co/datasets/nvidia/PhysicalAI-SDG-PhysxSim)\n- [Spatial reasoning](https://huggingface.co/datasets/nvidia/PhysicalAI-WorldModel-Synthetic-Spatial-Reasoning)\n- [Digital human scenes](http://huggingface.co/datasets/nvidia/PhysicalAI-SDG-SynHuman)\n- [Autonomous driving scenarios](http://huggingface.co/datasets/nvidia/PhysicalAI-SDG-DriveSim)\n- [Warehouse operations scenes](http://huggingface.co/datasets/nvidia/PhysicalAI-SDG-WareHouse)\n\n## NVIDIA Cosmos Human Evaluation benchmark\n\nThe NVIDIA Cosmos Human Evaluation (HUE) framework assesses Cosmos 3 generator quality across representative domain tasks.\n\nAs SOTA video generation models saturate existing automated leaderboards, score differences between releases are often too narrow for meaningful comparison. HUE shifts evaluation from subjective grading to objective fact verification, enabling fine-grained comparison between top-tier models. The result is a more reliable quality signal for both rapid iteration and rigorous release decisions backed by full human evaluation.\n\nHUE evaluates video generation quality using atomic binary verification. Each generated video is decomposed into single-fact yes/no questions across four dimensions (semantic alignment, physical laws, geometric reasoning, and visual integrity), spanning seven Physical AI domains, including robotics, autonomous vehicles, and physics. 
These questions are generated by a VLM pipeline, refined by human experts, and released as open source on [Hugging Face](https://huggingface.co/datasets/nvidia/Cosmos-HumanEval-v1).\n\n## Benchmark results\n\nCosmos 3 has been evaluated across multiple benchmark suites covering physical AI reasoning, generation quality, and domain-specific performance.\n\n**Reasoning benchmarks**\n\nCosmos 3 Super and Cosmos 3 Nano lead on VANTAGE-Bench at the 32B tier and the 8B tier, respectively:\n\n- [VANTAGE-Bench](https://huggingface.co/spaces/clemson-computing/VANTAGE-Bench-Leaderboard): First public benchmark for evaluating vision-language models on real-world fixed-camera footage across warehouses, transportation, and smart spaces.\n- [Traffic Anomaly Reasoning](https://eval.aicitychallenge.org/aicity2026/submission/leaderboard?trackId=3&type=general) (TAR): A new leaderboard for detecting and reasoning about anomalous events in transportation footage, and the official leaderboard for AI City Challenge 2026 Track 3.\n\n**Generator benchmarks**\n\nCosmos 3 is the open-source SOTA and currently leads on PAI-Bench, R-Bench, Physics-IQ, and RoboLab across public leaderboards:\n\n- [Artificial Analysis](https://artificialanalysis.ai/): A benchmarking platform that ranks AI models for text, image, and video generation. Cosmos 3 is the leading open source model on the [Text to Image leaderboard](https://artificialanalysis.ai/image/leaderboard/text-to-image) and [Image to Video (no audio) leaderboard](https://artificialanalysis.ai/video/leaderboard/image-to-video?audio-output=false).\n- [R-Bench](https://github.com/DAGroup-PKU/ReVidgen/): A benchmark for evaluating video-based world models in robotic video generation. 
It assesses task completion and visual quality through sub-metrics like structural consistency, physical plausibility, and execution completeness.\n- [PAI-Bench](https://github.com/SHI-Labs/physical-ai-bench): A unified benchmark evaluating physical AI across video understanding and video generation, spanning domains like robotics, autonomous vehicles, and physics common sense.\n- [Physics-IQ](https://physics-iq.github.io/): A benchmark of real-world videos that tests whether generative video models truly understand physical principles, rather than just achieving visual realism.\n- [RoboLab](https://research.nvidia.com/labs/srl/projects/robolab/): A simulation benchmark for evaluating task-generalist robot policies.\n\n**Training recipes**\n\nA central component of the Cosmos 3 release is a fully open set of training recipes. Beyond model checkpoints, this release provides code, configs, and workflows for adapting Cosmos 3 to new domains, embodiments, and datasets.\n\n**Supervised Fine-Tuning post-training**\n\nSupervised Fine-Tuning (SFT) enables developers to adapt a Cosmos 3 model to their own data. The released recipes include vision generation post-training for custom video datasets, as well as action-oriented recipes for robotics and physical AI workflows. Developers can customize Cosmos 3 for their target domains across robotics, autonomous driving, and warehouse automation.\n\nThe [post-training code and configs](https://github.com/NVIDIA/cosmos-framework/blob/main/docs/training.md) are available on GitHub.\n\n**Action post-training**\n\nAction post-training adapts Cosmos 3 for action-aware Physical AI applications, including forward dynamics, inverse dynamics, and policy generation. Developers can post-train Cosmos 3 on action-labeled data. 
For robotics applications, this includes several important workflows: generating future observations conditioned on robot actions, inferring the actions behind observed demonstrations, and predicting action sequences from current observations and task prompts. This makes Cosmos 3 a strong foundation for world action modeling and policy learning.\n\n## Deploy with NVIDIA NIM Microservices\n\nCosmos 3 models are also available as [NVIDIA NIM microservices](https://build.nvidia.com/) for optimized, production-ready deployment. NIM microservices package the model with optimized inference runtimes, delivering high performance without the need to manually tune serving infrastructure. For inference workflows, NIM microservices are easier to use than the Cosmos 3 repo on GitHub, which is preferred for post-training workflows.\n\nThe Cosmos 3 Reasoner NIM is available today, delivering the reasoning capabilities of the Cosmos 3 model. Stay tuned for the Cosmos 3 Generator NIM, which will provide the full generation capabilities of the Cosmos 3 model.\n\n**Optimizations made to accelerate inference**\n\n- **Quantization:** Cosmos 3 NIM supports selecting **BF16, FP8, or NVFP4** quantized checkpoints. The NVFP4 quantization reduces the model’s numerical precision from BF16 to 4-bit floating point, achieving up to 2x inference speedup.\n- **vLLM:** An open source inference engine that uses techniques like continuous batching, paged attention, and tensor parallelism to serve LLMs efficiently. The Cosmos 3 Reasoner NIM serving stack is built on vLLM for higher throughput compared to conventional serving approaches. Cosmos 3 Nano is ready to run with vLLM-omni and NVIDIA Dynamo for top performance.\n- **Efficient Video Sampling (EVS):** This technique reduces the number of video tokens fed into the VLM during inference, speeding up the Cosmos Reason NIM. EVS works at the chunk level, keeping the most unique chunks of each frame and pruning the rest. 
Smaller GPUs tend to benefit more from this technique.\n\n**How to run the NIM**\n\nAn NVIDIA NGC API key is required to pull the containers and download the Cosmos 3 models from NGC.\n\nThe following command pulls and runs the Cosmos 3 Nano Reasoner NIM. For the Cosmos 3 Super Reasoner NIM, specify `NIM_MODEL_SIZE=super`.\n\n```\ndocker run --gpus=all \\\n  -e NGC_API_KEY=$NGC_API_KEY \\\n  -e NIM_MODEL_SIZE=nano \\\n  -p 8000:8000 \\\n  nvcr.io/nim/nvidia/cosmos3-reasoner:latest\n```\n\nFind details on API usage and more in the [documentation](https://docs.nvidia.com/nim/vision-language-models/latest/introduction.html).\n\n**Get started**\n\n- Download the Cosmos 3 Nano and Super checkpoints on [Hugging Face](https://huggingface.co/collections/nvidia/cosmos3).\n- Find examples and code on the [Cosmos 3 GitHub](https://github.com/nvidia/Cosmos).\n- Try the [Cosmos 3 Nano Reasoner model experience](https://build.nvidia.com/nvidia/cosmos3-nano-reasoner) and the [Cosmos 3 Nano model experience](https://build.nvidia.com/nvidia/cosmos3-nano).\n
- Join the community, open issues, and contribute to the Cosmos ecosystem on GitHub and\n[Discord](https://discord.com/invite/nvidiaomniverse).\n\n**Acknowledgments**\n\n*Cosmos 3 is the result of amazing collaboration between many teams and people across NVIDIA, including Aditi, Niket Agarwal, Arslan Ali, Jon Allen, Martin Antolini, Adeline Aubame, Alisson Azzolini, Junjie Bai, Maciej Bala, Yogesh Balaji, Josh Bapst, Aarti Basant, Mukesh Beladiya, Mohammad Qazim Bhat, Zaid Pervaiz Bhat, Dan Blick, Vanni Brighella, Han Cai, Tiffany Cai, Eric Cameracci, Jiaxin Cao, Yulong Cao, Mark Carlson, Carlos Casanova, Ting-Yun Chang, Yan Chang, Yu-Wei Chao, Prithvijit Chattopadhyay, Roshan Chaudhari, Chieh-Yun Chen, Junyu Chen, Ke Chen, Qizhi Chen, Wenkai Chen, Xiaotong Chen, Yu Chen, An-Chieh Cheng, Click Cheng, Xiu Chia, Jeana Choi, Chaeyeon Chung, Wenyan Cong, Yin Cui, Magdalena Dadela, Nalin Dadhich, Wenliang Dai, Joyjit Daw, Alperen Degirmenci, Rodrigo Vieira Del Monte, Robert Denomme, Sameer Dharur, Marco Di Lucca, Ke Ding, Wenhao Ding, Yifan Ding, Yuzhu Dong, Nicole Drumheller, Yilun Du, Aigul Dzhumamuratova, Aleksandr Efitorov, Hamid Eghbalzadeh, Naomi Eigbe, Imad El Hanafi, Hassan Eslami, Benedikt Falk, Jiaojiao Fan, Jim Fan, Amol Fasale, Sergiy Fefilatyev, Liang Feng, Francesco Ferroni, Sanja Fidler, Xiao Fu, Vikram Fugro, Prashant Gaikwad, TJ Galda, Katelyn Gao, Yihuai Gao, Wenhang Ge, Sreyan Ghosh, Arushi Goel, Vivek Goel, Akash Gokul, Rama Govindaraju, Jinwei Gu, Miguel Guerrero, Elfie Guo, Aryaman Gupta, Siddharth Gururani, Hugo Hadfield, Song Han, Ankur Handa, Zekun Hao, Mohammad Harrim, Ali Hassani, Nathan Hayes-Roth, Yufan He, Chris Helvig, Cyrus Hogg, Madison Huang, Michael Huang, Sophia Huang, Yufan Huang, Jacob Huffman, DeLesley Hutchins, Suneel Indupuru, Boris Ivanovic, Arihant Jain, Joel Jang, Ryan Ji, Yanan Jian, Dongfu Jiang, Jingyi Jin, Atharva Joshi, Nikhilesh Joshi, Pranjali Joshi, Jaehun Jung, Weiwei Kang, Scott Kassekert, Jan Kautz, Ashna Khetan, 
Julia Kiczka, Slawek Kierat, Gwanghyun Kim, Kuno Kim, Sunny Kim, Kezhi Kong, Xin Kong, Zhifeng Kong, Tomasz Kornuta, Egor Krivov, Hui Kuang, Saurav Kumar, Chia-Wen Kuo, George Kurian, Wojciech Kutak, JF Lafleche, Himangshu Lahkar, Omar Laymoun, Jayjun Lee, Sanggil Lee, Gabriele Leone, Boyi Li, Freya Li, Jiajun Li, Jinfeng Li, Ling Li, Pengcheng Li, Shangru Li, Tingle Li, Xiaolong Li, Xuan Li, Zhaoshuo Li, Zhiqi Li, Hao Liang, Maosheng Liao, Chen-Hsuan Lin, Tsung-Yi Lin, Ming-Yu Liu, Sifei Liu, Zihan Liu, Hai Loc Lu, Xiangyu Lu, Alice Luo, Ruipu Luo, Wenjie Luo, Jiangran Lyu, Martin Ding Ma, Nic Ma, Qianli Ma, Dawid Majchrowski, Louis Marcoux, Miguel Martin, Qing Miao, Ashkan Mirzaei, Shreyas Misra, Kaichun Mo, Durra Mohsin, Hyejin Moon, Pawel Morkisz, Saeid Motiian, Kirill Motkov, Seungjun Nah, Yashraj Narang, Deepak Narayanan, Thabang Ngazimbi, Julian Ouyang, David Page, Yatian Pang, Sehwi Park, Mahesh Patekar, Mostofa Patwary, Marco Pavone, Trung Pham, Wei Ping, Soha Pouya, Shrimai Prabhumoye, Varun Praveen, Delin Qu, Hesam Rabeti, Morteza Ramezanali, Marilyn Reeb, Xuanchi Ren, Kristen Rumley, Wojciech Rymer, Jun Saito, Yeongho Seol, John Shao, Piyush Shekdar, Tianwei Shen, Humphrey Shi, Min Shi, Stella Shi, Kevin Shih, Mohammad Shoeybi, Mateusz Sieniawski, Shuran Song, Alexander Sotelo, Amir Sotoodeh, Sunil Srinivasa, Vignesh Srinivasakumar, Bartosz Stefaniak, Rahul Heinrich Steiger, Shangkun Sun, Jiaxiang Tang, Shitao Tang, Yangyang Tang, Yue Tang, Tolou Tavakkoli, Kayley Ting, Krzysztof Tomala, Wei-Cheng Tseng, Jibin Varghese, Sergei Vasilev, Thomas Volk, Raju Wagwani, Roger Waleffe, Andrew Z. 
Wang, Boxiang Wang, Haoxiang Wang, Qiao Wang, Shihao Wang, Shijie Wang, Ting-Chun Wang, Yan Wang, Yu Wang, David Wehr, Fangyin Wei, Xinshuo Weng, Jay Zhangjie Wu, Kedi Wu, Hongchi Xia, Summer Xiao, Tianjun Xiao, Kevin Xie, Daguang Xu, Jiashu Xu, Mengyao Xu, Ruqing Xu, Xingqian Xu, Yao Xu, Dinghao Yang, Dong Yang, Hans Yang, Xiaodong Yang, Xuning Yang, Yichu Yang, Yurong You, Zhiding Yu, Hao Yuan, Simon Yuen, Xiaohui Zeng, Pengcuo Zeren, Cindy Zha, Haotian Zhang, Jenny Zhang, Jing Zhang, Liangkai Zhang, Paris Zhang, Shun Zhang, Xuanmeng Zhang, Zhizheng Zhang, Ann Zhao, Yilin Zhao, Yuliya Zhautouskaya, Charles Zhou, Fengzhe Zhou, Shilin Zhu, Yuke Zhu, Dima Zhylko, and Artur Zolkowski*.", "url": "https://wpnews.pro/news/develop-physical-ai-reasoning-world-and-action-models-with-nvidia-cosmos-3", "canonical_source": "https://developer.nvidia.com/blog/develop-physical-ai-reasoning-world-and-action-models-with-nvidia-cosmos-3/", "published_at": "2026-06-01 04:43:58+00:00", "updated_at": "2026-06-03 18:07:47.791019+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "robotics", "autonomous-vehicles", "generative-ai"], "entities": ["NVIDIA", "NVIDIA Cosmos 3", "Hugging Face", "GitHub"], "alternates": {"html": "https://wpnews.pro/news/develop-physical-ai-reasoning-world-and-action-models-with-nvidia-cosmos-3", "markdown": "https://wpnews.pro/news/develop-physical-ai-reasoning-world-and-action-models-with-nvidia-cosmos-3.md", "text": "https://wpnews.pro/news/develop-physical-ai-reasoning-world-and-action-models-with-nvidia-cosmos-3.txt", "jsonld": "https://wpnews.pro/news/develop-physical-ai-reasoning-world-and-action-models-with-nvidia-cosmos-3.jsonld"}}