# Distributed AI on AWS

> Source: <https://www.day1training.com/>
> Published: 2026-06-17 09:11:09+00:00

# AWSome Distributed AI

Reference architectures, test cases, and best practices for training large-scale models with PyTorch, Megatron-LM, NeMo, JAX, and more on AWS infrastructure.

## Training Frameworks

Production-ready examples grouped by framework. Each includes Dockerfiles, Slurm scripts, and Kubernetes manifests.

[ 🔥 FSDP · DDP · DeepSpeed · TorchTitan · Picotron · vLLM · TRL · OpenRLHF ](/frameworks/pytorch)

### PyTorch

Native distributed training with DDP, FSDP, TorchTitan, DeepSpeed, and more. Covers LLMs, vision, robotics, and RLHF.

[ ⚡ Megatron-LM · NeMo · NeMo RL · BioNeMo ](/frameworks/megatron)

### Megatron

NVIDIA Megatron-LM and NeMo for large-scale LLM pre-training with tensor, pipeline, and expert parallelism.
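To make the parallelism axes concrete, here is a minimal sketch of how a fixed pool of GPUs is usually divided: the tensor-parallel and pipeline-parallel sizes multiply to form one model replica, and whatever remains becomes the data-parallel dimension. The `ParallelLayout` class and its field names are hypothetical and exist only for illustration; they are not part of Megatron-LM or NeMo, and expert parallelism is left out for simplicity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelLayout:
    """Hypothetical helper: how a GPU pool splits across parallelism axes."""

    world_size: int         # total number of GPUs in the job
    tensor_parallel: int    # GPUs that share each layer's weight matrices
    pipeline_parallel: int  # number of sequential pipeline stages

    @property
    def model_parallel(self) -> int:
        # GPUs needed to hold one full copy of the model.
        return self.tensor_parallel * self.pipeline_parallel

    @property
    def data_parallel(self) -> int:
        # Number of model replicas that each see a different data shard.
        if self.world_size % self.model_parallel != 0:
            raise ValueError(
                f"world_size={self.world_size} is not divisible by "
                f"tensor_parallel * pipeline_parallel={self.model_parallel}"
            )
        return self.world_size // self.model_parallel


# Example: 8 nodes x 8 GPUs, tensor parallelism within a node,
# two pipeline stages across nodes.
layout = ParallelLayout(world_size=64, tensor_parallel=8, pipeline_parallel=2)
print(layout.data_parallel)  # 64 // (8 * 2) = 4
```

Keeping tensor parallelism inside a single node is a common choice because it is the most communication-heavy axis, but the right split depends on model size and interconnect.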

[ 🧬 PaxML · XLA · TPU/GPU ](/frameworks/jax)

### JAX

Google JAX with PaxML for distributed training leveraging XLA compilation and automatic parallelism.

[ 🧠 NeuronX · Optimum Neuron · Trainium ](/frameworks/neuron)

### AWS Neuron / Trainium

NeuronX Distributed for training on AWS Trainium & Inferentia chips with optimized compilers.

[ 🤖 Isaac Lab · OpenVLA · V-JEPA 2 · nanoVLM ](/frameworks/physical-ai)

### Physical AI & Robotics

Embodied AI training with NVIDIA Isaac Lab, OpenVLA, V-JEPA 2, and other vision-language-action models.

[ 🎯 TRL · vERL · SLIME · PPO · DPO ](/frameworks/reinforcement-learning)

### Reinforcement Learning

RLHF, DPO, PPO, and scalable RL frameworks for LLM alignment and post-training.

[ 🧪 Distillation · Compression · Transfer Learning ](/frameworks/model-customisation)

### Model Customisation

Knowledge distillation, compression, and model adaptation techniques for production.

## Reference Architectures

CloudFormation templates and deployment guides for every AWS compute platform.

## Get Started in Minutes

Three steps to launch your first distributed training job.

### Deploy Infrastructure

Launch a cluster using our CloudFormation templates for HyperPod, ParallelCluster, or EKS.

### Build Container

Use our Dockerfiles to build a training container with your framework of choice.

### Launch Training

Submit your job with Slurm or Kubernetes using our ready-made launch scripts.
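As a rough illustration of what a Slurm launch script for a multi-node PyTorch job can look like, the sketch below renders a minimal `sbatch` file that starts `torchrun` on every node. It is not one of the repository's ready-made scripts: the `render_sbatch` helper, the `train.py` entry point, the job name, and the rendezvous port `29500` are all assumptions made for this example.

```python
def render_sbatch(job_name: str, nodes: int, gpus_per_node: int,
                  script: str = "train.py") -> str:
    """Hypothetical helper: build a minimal sbatch script as a string.

    One Slurm task runs per node; torchrun then spawns one worker per GPU.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        "#SBATCH --ntasks-per-node=1",
        "#SBATCH --exclusive",
        "",
        "# Use the first allocated node as the rendezvous host.",
        'head_node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)',
        "",
        "srun torchrun \\",
        f"  --nnodes={nodes} \\",
        f"  --nproc_per_node={gpus_per_node} \\",
        "  --rdzv_backend=c10d \\",
        '  --rdzv_endpoint="${head_node}:29500" \\',  # port is an assumption
        f"  {script}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print a script for 2 nodes with 8 GPUs each; redirect to a file
    # and submit it with `sbatch`.
    print(render_sbatch("demo-train", nodes=2, gpus_per_node=8))
```

A real launch script would typically also run the training container built in the previous step and set network-related environment variables for the cluster; those details are omitted here.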
