Scale Robot Reinforcement Learning with NVIDIA Isaac Lab on Amazon SageMaker AI

NVIDIA Isaac Lab on Amazon SageMaker AI now enables robotics teams to train reinforcement learning policies for humanoid robots at scale, using either SageMaker HyperPod for persistent cluster training or SageMaker Training Jobs for ephemeral compute. The solution, demonstrated with the Unitree H1 humanoid, addresses the compute-intensive nature of robot policy training by offloading infrastructure management to AWS, allowing engineers to focus on developing robot behaviors rather than maintaining compute clusters.

Physical AI is moving from research into production. Robots are increasingly trained in high-fidelity simulation before being deployed to factories, warehouses, and logistics centers, because training in the real world is slow, expensive, and often unsafe, while GPU-accelerated simulation can compress months of learning into hours.

This shifts the challenge to compute. Reinforcement learning (RL) for complex behaviors like humanoid locomotion on rough terrain is compute-intensive, with single-node training runs stretching from hours to days. Robotics teams need to iterate quickly during research and also run production-grade, long-horizon training jobs without the operational burden of maintaining compute clusters.

In this post, we show how to train robot policies for the Unitree H1 humanoid with NVIDIA Isaac Lab on Amazon SageMaker AI across two compute options: Amazon SageMaker HyperPod and Amazon SageMaker Training Jobs. The full code of this solution is available in the accompanying GitHub repository (https://github.com/awslabs/awsome-distributed-ai/tree/main/3.test_cases/pytorch/nvidia-isaac-lab).

Image credit: NVIDIA

1. Why Amazon SageMaker AI for Physical AI training

Amazon SageMaker AI removes the undifferentiated heavy lifting of managing compute infrastructure for machine learning (ML) training. The service provisions instances, configures drivers and networking, monitors node health, and tears down resources when jobs finish, so engineering effort stays on developing the robot policy rather than on the infrastructure underneath it.

This is especially relevant for robot policy RL, which is infrastructure heavy: runs are long, GPU intensive, and often distributed across multiple nodes. Development typically involves two phases: short iterative experiments to tune reward functions, observation spaces, and model architectures, and longer production runs that train a tuned configuration to convergence. SageMaker AI provides two compute options that fit these phases.

Cluster resiliency and control with SageMaker HyperPod

SageMaker HyperPod (https://aws.amazon.com/sagemaker/ai/hyperpod/) is purpose-built, managed infrastructure for distributed training and inference of large-scale foundation models.

Resiliency is at the core of SageMaker HyperPod. Hardware failures become an issue at scale, and each failure in a multi-node RL run means lost training progress plus time to detect the fault, replace the node, and restart from the last checkpoint. SageMaker HyperPod runs a health-monitoring agent on each node that performs basic and deep health checks. When the agent detects a fault, HyperPod automatically reboots or replaces the faulty instance. With auto-resume functionality, the training job restarts from the last checkpoint after the replacement node is ready, with no manual intervention.

Orchestrated with Amazon Elastic Kubernetes Service (Amazon EKS) or Slurm, HyperPod provides direct access to cluster nodes and a stable environment that persists across runs.
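Auto-resume restarts the job, but the training script still has to locate the newest checkpoint when it starts again. The following is a minimal sketch of that lookup, not the solution's actual code. It assumes checkpoints are written to a shared directory with names of the form agent_<timestep>.pt; both the naming scheme and the helper name find_latest_checkpoint are hypothetical and should be adapted to how your trainer saves checkpoints.

```python
import re
from pathlib import Path
from typing import Optional

# Assumed (hypothetical) checkpoint naming scheme: agent_<timestep>.pt
_CHECKPOINT_PATTERN = re.compile(r"^agent_(\d+)\.pt$")


def find_latest_checkpoint(checkpoint_dir: str) -> Optional[Path]:
    """Return the checkpoint with the highest timestep, or None if there is none."""
    root = Path(checkpoint_dir)
    if not root.is_dir():
        # First run, or the shared file system is not mounted yet.
        return None

    best_step = -1
    best_path: Optional[Path] = None
    for path in root.iterdir():
        match = _CHECKPOINT_PATTERN.match(path.name)
        if match is None or not path.is_file():
            continue  # Skip files that do not follow the assumed naming scheme.
        step = int(match.group(1))
        if step > best_step:
            best_step, best_path = step, path
    return best_path
```

A restarted job could call this helper at startup and pass the returned path to the trainer's checkpoint option, or start from scratch when the result is None. Comparing timesteps numerically rather than sorting file names avoids picking agent_900.pt over agent_2000.pt.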
The HyperPod observability add-on publishes hundreds of cluster, node, and job metrics to Amazon Managed Service for Prometheus and visualizes them in pre-built Amazon Managed Grafana dashboards. Teams get GPU utilization, memory pressure, network throughput, and task-level performance without setting up a metrics pipeline.

HyperPod task governance, built on Kueue, lets administrators carve the cluster into namespace-scoped queues with compute quotas, priorities, and preemption. Allocations can be defined per instance, per whole GPU, or per GPU partition with NVIDIA Multi-Instance GPU (MIG). Fine-grained quotas cover accelerators, vCPU, and memory.

Ephemeral compute with SageMaker Training Jobs

SageMaker Training Jobs (https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-training.html) are a fully managed, on-demand way to run containerized training workloads without maintaining any long-lived compute. Each job provisions GPU instances, pulls the container from Amazon Elastic Container Registry (Amazon ECR), runs the training script, uploads artifacts to Amazon Simple Storage Service (Amazon S3), and terminates the instances when the job finishes. There is no idle compute cost between runs.

This model fits the iteration phase of policy development, where reward functions, observation spaces, and network architectures change frequently between short runs. It is also a good fit for hyperparameter tuning sweeps, where many short runs execute in parallel and then release their compute.

2. NVIDIA Isaac Lab and the training task

NVIDIA Isaac Lab (https://developer.nvidia.com/isaac/lab) is an open-source robot learning framework built on NVIDIA Isaac Sim (https://developer.nvidia.com/isaac/sim). It uses GPU-parallel simulation to run thousands of robot instances simultaneously on one or multiple GPUs, turning what would be months of real-world experience into hours of simulated training.
Isaac Lab provides structured APIs to define tasks, observation and action spaces, reward functions, and training loops for both reinforcement learning and imitation learning.

Image credit: NVIDIA

The sample training task in this post is Isaac-Velocity-Rough-H1-v0, where a Unitree H1 humanoid robot (https://www.unitree.com/h1/) learns to track velocity commands while walking across rough terrain. The robot must coordinate its 19 joints to maintain balance over procedurally generated uneven surfaces. Training uses Proximal Policy Optimization (PPO) through skrl (https://skrl.readthedocs.io/), one of several RL frameworks supported by Isaac Lab. Scaling to multiple nodes multiplies the number of parallel environments, producing more diverse experience per policy update and accelerating convergence.

You can extend the scripts and configuration provided in this solution to other robot learning tasks.

3. Solution overview

The solution in the accompanying GitHub repository (https://github.com/awslabs/awsome-distributed-ai/tree/main/3.test_cases/pytorch/nvidia-isaac-lab) consists of two main parts: (1) a single Docker image that runs the training code on both SageMaker HyperPod and SageMaker Training Jobs, and (2) a generator script that renders the Kubernetes manifests and the SageMaker launch script from a shared configuration file. The two service options differ only in how the image is launched: as a Kubernetes PyTorchJob on SageMaker HyperPod, or through a CreateTrainingJob API call for a SageMaker Training Job.

The H1 locomotion task used here is the same as in the NVIDIA Isaac Lab on AWS workshop (https://catalog.us-east-1.prod.workshops.aws/workshops/075ce3fe-6888-4ea9-986e-5bdd1b767ef7/en-US/introduction), which runs the workload on Amazon Elastic Compute Cloud (Amazon EC2) and AWS Batch. Moving to SageMaker AI keeps the training code unchanged and adds managed clusters, integrated fault recovery, and serverless training job execution.
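To illustrate the Training Jobs launch path mentioned above, here is a minimal sketch of assembling a CreateTrainingJob request as a plain dictionary. It is an illustration under stated assumptions, not the launch script the repository generates: the helper name build_training_job_request, the default volume size, and the default runtime limit are all hypothetical. The dictionary keys themselves are parameters of the CreateTrainingJob API.

```python
from typing import Any, Dict


def build_training_job_request(
    job_name: str,
    image_uri: str,
    role_arn: str,
    s3_output_path: str,
    instance_type: str = "ml.g6.12xlarge",
    instance_count: int = 2,
    volume_size_gb: int = 200,          # assumed default, adjust to your data size
    max_runtime_seconds: int = 86400,   # assumed default of 24 hours
) -> Dict[str, Any]:
    """Assemble keyword arguments for the SageMaker CreateTrainingJob API."""
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")

    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            # The same container image is used on both backends.
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "OutputDataConfig": {"S3OutputPath": s3_output_path},
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
            "VolumeSizeInGB": volume_size_gb,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": max_runtime_seconds},
    }
```

With boto3 installed and credentials configured, the result can be passed to boto3.client("sagemaker").create_training_job(**request). Keeping the request construction separate from the API call makes it easy to inspect or unit test the configuration before any instances are provisioned.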
Training image

The training container image is built from nvcr.io/nvidia/isaac-sim:5.1.0. The provided Dockerfile clones Isaac Lab v2.3.2, installs it into Isaac Sim’s bundled Python environment, and copies in the entrypoint script that parses the SageMaker Training Jobs resource config to launch torchrun. The full Dockerfile is in docker/Dockerfile. Both service options use the same image.

Experiment tracking

When a tracking server is configured, training metrics are streamed to Amazon SageMaker managed MLflow (https://docs.aws.amazon.com/sagemaker/latest/dg/mlflow.html) for persistent, searchable experiment tracking across both backends. MLflow is opt-in: leave the tracking URI empty to disable it entirely. Section 4.5 covers the configuration.

Configuration and the generator script

The generator script is configured through environment-specific variables defined in config.yaml. The generate.py script reads the configuration and renders the templates in templates/ into ready-to-apply files under generated/. Running the generator takes a single command.

The specific files used by each backend are covered in the Section 4 and Section 5 walkthroughs for SageMaker HyperPod and SageMaker Training Jobs, respectively.

Training topology across backends

In the provided solution, both paths end with the same torchrun invocation of Isaac Lab’s skrl trainer on the same image. The primary difference is how each environment provides the topology to the container.

On SageMaker HyperPod, the Kubeflow Training Operator injects MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE into each pod. These describe the pod-level topology (WORLD_SIZE is the pod count, RANK is the per-pod index). The entrypoint forwards them to torchrun, which spawns one process per GPU within each pod.
The per-pod launchers rendezvous through MASTER_ADDR:MASTER_PORT to form the global process group.

On SageMaker Training Jobs, SageMaker writes the host list to /opt/ml/input/config/resourceconfig.json, and the container’s entrypoint parses it at startup.

GPU instance compatibility

Isaac Sim is built on NVIDIA Omniverse and uses the Omniverse RTX Renderer, which requires GPUs with hardware RT Cores. The G family of AWS GPU instances is suitable for Isaac Lab workloads. The P family is not, because it uses data center GPUs without RT Cores. See the Isaac Sim 5.1 requirements page (http://docs.isaacsim.omniverse.nvidia.com/5.1.0/installation/requirements.html) for the full list of supported and unsupported hardware.

| Instance family | GPU type and generation | RT Cores / Isaac Sim compatibility |
|---|---|---|
| ml.g5 | NVIDIA A10G (Ampere) | Yes |
| ml.g6 | NVIDIA L4 (Ada Lovelace) | Yes |
| ml.g6e | NVIDIA L40S (Ada Lovelace) | Yes |
| ml.g7e | NVIDIA RTX PRO 6000 (Blackwell) | Yes |
| ml.p4d, ml.p4de, ml.p5, ml.p5e, ml.p5en, ml.p6-b200, ml.p6-b300, ml.p6e-gb200 | NVIDIA A100 (Ampere), H100 / H200 (Hopper), B200 / B300 / GB200 (Blackwell) | No |

The examples in this post use ml.g6.12xlarge throughout. You can change the instance type in config.yaml.

The ml.g6, ml.g6e, and ml.g7e families support Elastic Fabric Adapter (EFA) at the 8xlarge size and above, which gives NCCL a kernel-bypass, RDMA-capable transport for multi-node collectives. Enabling EFA on HyperPod requires the AWS EFA device plugin and requesting vpc.amazonaws.com/efa resources in the pod spec. On SageMaker Training Jobs, EFA must be configured in the container image and in the virtual private cloud (VPC) configuration. The solution handles this configuration automatically for both the SageMaker HyperPod and SageMaker Training Jobs backends. The EFA setup for SageMaker Training Jobs is described in the documentation (https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-efa.html).
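To make the topology handling described under "Training topology across backends" more concrete, the following is a minimal sketch of how an entrypoint could normalize both sources into the same torchrun arguments. It is not the repository's entrypoint. The helper names, the default port 29500, and the choice to sort the host list are assumptions for illustration; the hosts and current_host fields are part of the resource configuration file that SageMaker writes for training containers.

```python
import json
from typing import Any, Dict, List, Mapping, Optional

RESOURCE_CONFIG_PATH = "/opt/ml/input/config/resourceconfig.json"


def topology_from_env(env: Mapping[str, str]) -> Optional[Dict[str, Any]]:
    """HyperPod path: read the variables injected by the Kubeflow Training Operator."""
    required = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")
    if not all(key in env for key in required):
        return None
    return {
        "nnodes": int(env["WORLD_SIZE"]),   # pod count
        "node_rank": int(env["RANK"]),      # per-pod index
        "master_addr": env["MASTER_ADDR"],
        "master_port": int(env["MASTER_PORT"]),
    }


def load_resource_config(path: str = RESOURCE_CONFIG_PATH) -> Dict[str, Any]:
    """Training Jobs path: load the resource configuration written by SageMaker."""
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)


def topology_from_resource_config(
    config: Mapping[str, Any], master_port: int = 29500  # assumed default port
) -> Dict[str, Any]:
    """Derive node count, node rank, and rendezvous address from the host list."""
    # Sorting gives every node the same ordering, so ranks are consistent.
    hosts = sorted(config["hosts"])
    return {
        "nnodes": len(hosts),
        "node_rank": hosts.index(config["current_host"]),
        "master_addr": hosts[0],
        "master_port": master_port,
    }


def torchrun_args(topology: Mapping[str, Any], gpus_per_node: int) -> List[str]:
    """Build the torchrun launcher arguments shared by both backends."""
    return [
        "torchrun",
        f"--nnodes={topology['nnodes']}",
        f"--node_rank={topology['node_rank']}",
        f"--nproc_per_node={gpus_per_node}",
        f"--master_addr={topology['master_addr']}",
        f"--master_port={topology['master_port']}",
    ]
```

An entrypoint built this way could try topology_from_env(os.environ) first and fall back to the resource configuration file when the variables are absent, then append the training script and its arguments to the returned list. On two ml.g6.12xlarge nodes, gpus_per_node would be 4.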
Setup: Clone the repository and build the image

Two setup steps are shared across both walkthroughs: cloning the accompanying repository and building the training image.

Clone the solution’s repository. The repository contains the Dockerfile, the configuration template, the generator, and the entrypoint scripts used by both backends.

Build the image from the repository root and push it to Amazon ECR:

- Define the environment variables according to your setup.
- Check whether the corresponding ECR repository exists, and create it if not.
- Authenticate with Amazon ECR.
- Build and tag the Docker image.
- Push the Docker image to Amazon ECR.

If you want to use Training Jobs instead, jump to Section 5.

4. Walkthrough: training on SageMaker HyperPod with Amazon EKS

For this walkthrough, we use an existing SageMaker HyperPod cluster orchestrated by Amazon EKS, with a GPU instance group of two ml.g6.12xlarge nodes (4× NVIDIA L4 each, 8 GPUs total). The goal is a distributed training job for the H1 locomotion task, with live metrics in SageMaker managed MLflow (https://docs.aws.amazon.com/sagemaker/latest/dg/mlflow.html) and the resulting checkpoints written to Amazon FSx for Lustre.

4.1 Prerequisites

The solution requires the following prerequisites to be in place:

- Sufficient service quota for the cluster and the chosen GPU instance type in the target Region. HyperPod clusters consume the corresponding ml.g6 (or other GPU family) quota for SageMaker HyperPod. Request an increase through AWS Service Quotas (https://console.aws.amazon.com/servicequotas/) before creating or scaling the cluster.
- A SageMaker HyperPod cluster orchestrated by Amazon EKS with a GPU instance group of two ml.g6.12xlarge nodes. See Creating a SageMaker HyperPod cluster with Amazon EKS orchestration (https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-operate-console-ui-create-cluster.html).
- kubectl configured against the cluster, and the Kubeflow Training Operator (https://github.com/kubeflow/training-operator) installed in it so that PyTorchJob custom resources are recognized.
- The FSx for Lustre CSI Driver (https://github.com/kubernetes-sigs/aws-fsx-csi-driver) installed, and an Amazon FSx for Lustre file system in the same VPC and subnet as the HyperPod nodes. This file system stores the logs and checkpoints written by the training job.

4.2 Configure and generate manifests

- Copy the example configuration.
- Fill in your environment values: AWS account ID, Region, and cluster details.

Important configuration fields include the following: — these are used to form the container image URI aws , ecr