AWSome Distributed AI
Reference architectures, test cases, and best practices for training large-scale models with PyTorch, Megatron-LM, NeMo, JAX, and more on AWS infrastructure.
Training Frameworks
Production-ready examples grouped by framework. Each includes Dockerfiles, Slurm scripts, and Kubernetes manifests.
🔥 FSDP · DDP · DeepSpeed · TorchTitan · Picotron · vLLM · TRL · OpenRLHF
PyTorch
Native distributed training with DDP, FSDP, TorchTitan, DeepSpeed, and more. Covers LLMs, vision, robotics, and RLHF.
⚡ Megatron-LM · NeMo · NeMo RL · BioNeMo
Megatron
NVIDIA Megatron-LM and NeMo for large-scale LLM pre-training with tensor, pipeline, and expert parallelism.
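As a rough illustration of how tensor and pipeline parallelism divide up a cluster, the sketch below computes how many data-parallel replicas remain once each model copy has been split across GPUs. It assumes the common layout in which the world size is the product of the tensor-, pipeline-, and data-parallel degrees, with no context parallelism; expert parallelism is left out for simplicity. The function name `data_parallel_size` is hypothetical and is not taken from Megatron-LM or NeMo.

```python
def data_parallel_size(world_size: int, tensor_parallel: int, pipeline_parallel: int) -> int:
    """Return the number of model replicas that fit on `world_size` GPUs.

    Assumes world_size = tensor_parallel * pipeline_parallel * data_parallel
    (an illustrative simplification; no context or expert parallelism).
    """
    if world_size < 1 or tensor_parallel < 1 or pipeline_parallel < 1:
        raise ValueError("all sizes must be positive integers")
    # GPUs needed to hold one complete copy of the model.
    gpus_per_replica = tensor_parallel * pipeline_parallel
    if world_size % gpus_per_replica != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by "
            f"tensor_parallel * pipeline_parallel = {gpus_per_replica}"
        )
    return world_size // gpus_per_replica


if __name__ == "__main__":
    # 64 GPUs, each replica split 8-way (tensor) by 4-way (pipeline).
    print(data_parallel_size(64, tensor_parallel=8, pipeline_parallel=4))  # 2
```

Checking divisibility up front is a cheap way to catch a mismatched cluster size before a job is queued.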
JAX
Google JAX with PaxML for distributed training leveraging XLA compilation and automatic parallelism.
🧠 NeuronX · Optimum Neuron · Trainium
AWS Neuron / Trainium
NeuronX Distributed for training on AWS Trainium & Inferentia chips with optimized compilers.
🤖 Isaac Lab · OpenVLA · V-JEPA 2 · nanoVLM
Physical AI & Robotics
Embodied AI training with NVIDIA Isaac Lab, OpenVLA, V-JEPA 2, and vision-language-action models.
Reinforcement Learning
RLHF, DPO, PPO, and scalable RL frameworks for LLM alignment and post-training.
🧪 Distillation · Compression · Transfer Learning
Model Customisation
Knowledge distillation, compression, and model adaptation techniques for production.
Reference Architectures
CloudFormation templates and deployment guides for every AWS compute platform.
Get Started in Minutes
Three steps to launch your first distributed training job.
Deploy Infrastructure
Launch a cluster using our CloudFormation templates for HyperPod, ParallelCluster, or EKS.
Build Container
Use our Dockerfiles to build a training container with your framework of choice.
Launch Training
Submit your job with Slurm or Kubernetes using our ready-made launch scripts.
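Whichever launcher submits the job, the training script usually needs to know its own rank and the total number of processes. The sketch below shows one way a script might discover them: it first looks for the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` variables that `torchrun` exports, then falls back to the `SLURM_PROCID`, `SLURM_LOCALID`, and `SLURM_NTASKS` variables that Slurm sets for tasks started with `srun`. The helper `resolve_rank` and the `RankInfo` type are hypothetical names used for illustration; they are not part of the launch scripts described here.

```python
import os
from typing import Mapping, NamedTuple, Optional


class RankInfo(NamedTuple):
    rank: int        # global index of this process
    local_rank: int  # index of this process on its node
    world_size: int  # total number of processes in the job


def resolve_rank(env: Optional[Mapping[str, str]] = None) -> RankInfo:
    """Work out this process's position in the job from environment variables."""
    env = os.environ if env is None else env

    # torchrun exports RANK, LOCAL_RANK, and WORLD_SIZE for each worker.
    if "RANK" in env and "WORLD_SIZE" in env:
        return RankInfo(
            rank=int(env["RANK"]),
            local_rank=int(env.get("LOCAL_RANK", "0")),
            world_size=int(env["WORLD_SIZE"]),
        )

    # Slurm sets these for every task launched with srun.
    if "SLURM_PROCID" in env and "SLURM_NTASKS" in env:
        return RankInfo(
            rank=int(env["SLURM_PROCID"]),
            local_rank=int(env.get("SLURM_LOCALID", "0")),
            world_size=int(env["SLURM_NTASKS"]),
        )

    # No launcher detected: treat this as a single-process run.
    return RankInfo(rank=0, local_rank=0, world_size=1)


if __name__ == "__main__":
    print(resolve_rank())
```

Accepting the environment as a parameter keeps the function easy to test without modifying the real process environment.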