How to Post-Train Autonomous Vehicle Models in Closed-Loop with NVIDIA Alpamayo

NVIDIA released Alpamayo, an open portfolio of AI models and simulation frameworks for autonomous vehicle development, including the AlpaGym closed-loop training framework that enables post-training of AV policies through reinforcement learning. The framework connects AlpaSim simulator rollouts directly to the policy training loop, allowing models to learn from the consequences of their own actions in simulation rather than optimizing only against logged expert trajectories. This addresses the critical gap between open-loop training and closed-loop deployment, where small prediction or planning errors can compound over time in real-world driving scenarios.

Developing autonomous vehicle (AV, https://www.nvidia.com/en-us/solutions/autonomous-vehicles/) policies requires bridging an important gap between training and deployment. Vision-language-action (VLA, https://www.nvidia.com/en-us/glossary/reasoning-vision-language-action/) models that can reason over more complex driving scenes and produce richer intermediate reasoning are predominantly trained in open-loop, where model outputs are directly compared to ground-truth behaviors without considering their effect on the environment. In deployment, however, a driving policy runs in closed-loop, where every braking, steering, and navigation decision affects the environment, and small errors can compound over time.

NVIDIA Alpamayo (https://www.nvidia.com/en-us/solutions/autonomous-vehicles/alpamayo/), an open portfolio of AI models, simulation frameworks, and physical AI (https://www.nvidia.com/en-us/glossary/generative-physical-ai/) datasets for AV development, provides a systematic way to address this challenge. Alpamayo includes the AlpaSim (https://github.com/NVlabs/alpasim) AV simulation platform and the AlpaGym closed-loop training framework (coming soon).

This post explains how to train AV models in closed-loop with NVIDIA Alpamayo. Specifically, it walks through how to:

- Install and configure AlpaGym
- Define closed-loop rewards
- Launch closed-loop training
- Export the post-trained checkpoint for downstream use

Closed-loop post-training with AlpaGym extends AV training workflows by turning AlpaSim rollouts into training experience. Rather than treating simulation only as a final evaluation stage, AlpaGym connects simulator feedback directly to the policy training loop.

How to use AlpaGym for closed-loop reinforcement learning

Reinforcement learning (RL, https://www.nvidia.com/en-us/glossary/reinforcement-learning/) can be used to improve a policy that was initially trained in open-loop.
Instead of optimizing only against logged expert trajectories, the model can now learn from the consequences of its own actions in simulation. This shift is critical for AV development, where small prediction or planning errors can compound over time. In closed-loop training, each braking, steering, and navigation decision affects the next state of the environment, revealing failure modes that static datasets or open-loop evaluation may miss.

However, enabling closed-loop RL comes with its own challenges. Running model inference, simulation, and training in parallel, while also syncing weight updates, communicating across instances, and moving data, is complex. It requires orchestration and efficient use of compute resources in a robust yet flexible manner.

To address these challenges, AlpaGym connects policy training to AlpaSim closed-loop rollouts and provides an open source, high-throughput framework for closed-loop RL. The system combines AlpaSim simulator microservices (https://github.com/NVlabs/alpasim), NVIDIA Physical AI Open Datasets (https://huggingface.co/collections/nvidia/physical-ai), and the distributed NVIDIA Cosmos-RL (https://github.com/nvidia-cosmos/cosmos-rl) training framework into a scalable post-training pipeline.

Built to scale from a single GPU to multi-node GPU clusters, AlpaGym supports efficient large-scale training through an asynchronous and stable distributed RL pipeline, without requiring changes to user code. It integrates AlpaSim and Cosmos-RL as its runtime and orchestration layer, uses group relative policy optimization (GRPO) as the default algorithm, and includes reference reward functions tested with Alpamayo models and the Physical AI AV NuRec dataset (https://huggingface.co/datasets/nvidia/PhysicalAI-Autonomous-Vehicles-NuRec).

To get started with AlpaGym post-training, follow the steps outlined below.
Step 1: Install and configure AlpaGym

To install AlpaGym from the Alpamayo checkout, install the native CUDA dependencies and Redis on the host, then sync the uv workspace:

    sudo apt-get update
    sudo apt-get install -y libcudnn9-dev-cuda-12 \
        libnccl-dev=2.26.2-1+cuda12.8 libnccl2=2.26.2-1+cuda12.8 \
        redis-server git-lfs
    git lfs install
    git lfs pull
    huggingface-cli login  # Or export HF_TOKEN=...
    uv sync --all-packages

The Python environment is managed by uv, but cuDNN, NCCL, and the redis-server binary are host dependencies used by the CUDA model stack and Cosmos-RL. Alternatively, a suitable Dockerfile is provided. Hugging Face authentication is required to download the scene artifacts.

An AlpaGym run is a Hydra configuration. It specifies the policy checkpoint, the AlpaSim scene set, rollout parallelism, reward function, and Cosmos-RL training parameters. In this workflow, the starting checkpoint is an Alpamayo model.

Step 2: Define the closed-loop reward

The reward should match the behavior you want to improve in closed-loop. For trajectory-quality post-training, common reward terms include progress, lane keeping, collision avoidance, offroad rate, comfort, and distance to a reference trajectory.

A practical first reward is intentionally simple: combine progress with penalties for safety-critical failures. In AlpaGym, this can be expressed as a small sum of terms, using AlpaSim metrics where possible:

    # reward/progress_safety.yaml
    terms:
      - kind: metric
        metric_name: progress
        scale: 1.0
      - kind: metric
        metric_name: collision_any
        scale: -10.0
      - kind: metric
        metric_name: offroad
        scale: -5.0

Once the pipeline is stable, add more targeted terms for the failure modes observed in AlpaSim videos and metrics.

Step 3: Launch closed-loop post-training

Start AlpaGym training from your model checkpoint.
Alpamayo serves as an example model here.

    uv run -m alpagym_host.cli \
        policy=alpamayo \
        policy.model.kind=alpamayo_r1 \
        policy.model.path=/path/to/checkpoint \
        reward=progress_safety

This brings up AlpaGym with AlpaSim on a single GPU. Stay tuned for detailed instructions on how to use your own AV model.

During training, AlpaGym requests scene rollouts from AlpaSim, collects per-episode artifacts, computes rewards, and updates the policy. Useful training signals include mean reward, reward variance, failure rates, policy loss, rollout throughput, and how far the weights that generated the rollouts lag behind the latest policy weights.

In this recipe, these rollout artifacts and training signals are the primary outputs of the post-training run. They help you confirm that closed-loop learning is running correctly and select checkpoints for downstream evaluation on your own held-out AlpaSim scenario suites.

Step 4: Export the post-trained checkpoint

After training, place the AlpaGym-produced checkpoint and config files into a folder that the AlpaSim driver can access (your Hugging Face model cache, for example). Then create a new driver config that points to that folder (called alpamayo1_CLRL here). The following snippet shows which fields to edit to specify custom paths in a driver YAML config:

    ...
    model:
      model_type: alpamayo1
      checkpoint_path: "/root/.cache/huggingface/alpasim_models/alpamayo1_CLRL/step_NNNNNN"
      device: "cuda"
    ...

This makes the AlpaGym post-trained policy runnable inside AlpaSim for closed-loop rollouts.

Next, run the exported model on a representative scenario to verify that the policy, driver, and simulation loop are connected correctly. At this stage, you can inspect how the policy behaves when its own actions affect the next state of the environment.
    uv run alpasim_wizard deploy=local topology=1gpu driver=alpamayo1_CLRL \
        wizard.log_dir=$PWD/tutorial_alpamayo_CLRL \
        scenes.scene_ids=clipgt-9ea70552-6dcb-4ee8-a368-9a906a333f6e

A closed-loop rollout provides useful qualitative signals: whether the model produces stable trajectories and remains within the drivable area, how it reacts to nearby traffic agents, and which failure modes should be targeted during post-training.

With this checkpoint, teams can inspect rollout videos, per-episode metrics, reward traces, and failure cases collected during training. These artifacts are useful for debugging reward design, checking rollout stability, and selecting checkpoints for later held-out evaluation in AlpaSim.

Get started post-training AV models

Closed-loop post-training provides a practical path for iterating on end-to-end driving policies. AlpaGym uses closed-loop rollouts to post-train AV policies in simulation, enabling them to learn from the consequences of their actions. You can use these tools together with the other components of the NVIDIA Alpamayo Open Platform to develop reasoning models that can be run, inspected, and post-trained in a closed-loop simulation workflow. Extend the same recipe with your own rewards, scenarios, and evaluation suites.

Ready to get started? Check out the NVlabs/alpamayo-recipes (https://github.com/NVlabs/alpamayo-recipes) GitHub repo to adapt the recipe in this post for your own use cases. To evaluate your model on a public leaderboard, see the two open AV challenges NVIDIA launched at CVPR 2026.

To learn more, see Expanding the Alpamayo Open Platform for Developing Reasoning AVs Across Models, Data, and Simulation (https://huggingface.co/blog/drmapavone/nvidia-alpamayo-1-5).