How to Speed Up Transformer Training Using NVIDIA Apex (FusedAdam, FusedLayerNorm) and Native torch.amp

NVIDIA Apex's FusedAdam optimizer and FusedLayerNorm normalization layers can accelerate Transformer training by up to 30% compared with standard PyTorch implementations, according to benchmark tests; actual gains depend on the GPU, the model size, and the parameter count. The gains come from Apex's fused CUDA kernels, which reduce memory traffic and kernel-launch overhead, but they are available only when Apex is built from source with its CUDA and C++ extensions. The tutorial's end-to-end experiment combines Apex's fused operations with PyTorch's native torch.amp mixed-precision training and compares the result against a vanilla FP32 baseline, so the measured speedup reflects both the fused kernels and mixed precision.

In this tutorial, we work through an implementation of NVIDIA Apex (https://github.com/NVIDIA/apex), focusing on the components that still matter in modern GPU training workflows. Instead of treating Apex as a general mixed-precision library, we separate the older parts from the still-useful ones and test them directly. We begin by checking the CUDA runtime, building Apex with the required CUDA and C++ extensions, and detecting which fused kernels are actually available in the environment. This matters because a Python-only Apex installation can appear successful while silently missing the high-performance kernels that make Apex useful. After the setup, we benchmark FusedAdam against PyTorch AdamW, compare FusedLayerNorm and FusedRMSNorm with standard normalization layers, and run both legacy apex.amp and modern torch.amp examples. We then bring everything together in a small Transformer training experiment, where we compare a vanilla FP32 PyTorch path with a fused Apex-plus-AMP path to assess the real effect on throughput.

```python
import os, sys, time, subprocess, importlib
import torch

assert torch.cuda.is_available(), (
    "No CUDA GPU found. In Colab: Runtime > Change runtime type > Hardware accelerator = GPU"
)
DEV = torch.device("cuda")
print(f"[env] torch {torch.__version__} | CUDA {torch.version.cuda} | "
      f"GPU {torch.cuda.get_device_name(0)}")


def module_present(name: str) -> bool:
    """Return True if `name` can actually be imported in this environment."""
    try:
        importlib.import_module(name)
        return True
    except Exception:
        return False


def build_apex():
    print("[apex] building from source with CUDA + C++ extensions "
          "(~10-20 min on first run; grab a coffee)...")
    subprocess.run([sys.executable, "-m", "pip", "install", "-q", "ninja", "packaging"], check=True)
    if not os.path.isdir("apex"):
        subprocess.run(["git", "clone", "--depth", "1", "https://github.com/NVIDIA/apex"], check=True)
    # These environment variables ask the Apex build to compile the extensions.
    env = os.environ.copy()
    env["APEX_CPP_EXT"] = "1"
    env["APEX_CUDA_EXT"] = "1"
    env["MAX_JOBS"] = "4"
    env["NVCC_APPEND_FLAGS"] = "--threads 4"
    cmd = [sys.executable, "-m", "pip", "install", "-v",
           "--no-build-isolation", "--no-cache-dir", "./apex"]
    proc = subprocess.run(cmd, env=env)
    if proc.returncode != 0:
        print("[apex] CUDA build failed - falling back to PYTHON-ONLY install "
              "(fused kernels will be unavailable, tutorial still runs).")
        # Same command, but without the extension variables in the environment.
        subprocess.run(cmd, check=False)


# Build only if the compiled extension is not already importable.
if not module_present("amp_C"):
    build_apex()

HAS_AMP_C = module_present("amp_C")                # multi-tensor kernels used by FusedAdam
HAS_FLN = module_present("fused_layer_norm_cuda")  # kernels behind FusedLayerNorm / FusedRMSNorm
HAS_RMS = False                                    # defined up front so later checks never hit a NameError

try:
    import apex
    from apex.optimizers import FusedAdam
    from apex.normalization import FusedLayerNorm
    try:
        from apex.normalization import FusedRMSNorm
        HAS_RMS = True
    except Exception:
        HAS_RMS = False
    from apex import amp
    APEX_OK = True
except Exception as e:
    print(f"[apex] import failed: {e}")
    APEX_OK = False

print("\n[capabilities]")
print(f"  apex importable    : {APEX_OK}")
print(f"  FusedAdam kernels  : {HAS_AMP_C}")
print(f"  FusedLayerNorm krnl: {HAS_FLN}")
print(f"  FusedRMSNorm       : {APEX_OK and HAS_RMS}")
print("=" * 78)


def bench(fn, iters=50, warmup=10):
    """Average milliseconds per call of `fn`, with warmup and CUDA synchronization."""
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters * 1e3
```

We start by preparing the CUDA environment, checking GPU availability, and printing the active PyTorch, CUDA, and GPU details. We then build NVIDIA Apex from source with CUDA and C++ extensions so that the fused kernels can be used directly rather than relying on a limited Python-only installation.
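
The setup code decides whether the fused kernels exist by importing each compiled extension module. As a lighter-weight complement, the sketch below asks the import system whether a module can be located without executing it. It uses only the standard library; the helper name `extension_available` and the dictionary labels are hypothetical, while the module names `amp_C` and `fused_layer_norm_cuda` come from the tutorial's setup code. Finding a module file does not guarantee that it loads (a binary built against a different CUDA or PyTorch version can still fail), so a real import remains the stronger check.

```python
import importlib.util


def extension_available(name: str) -> bool:
    """Return True if the import system can locate `name` without importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # ImportError: dotted name whose parent package is missing.
        # ValueError: module already loaded but with no usable __spec__.
        return False


# Module names from the tutorial's setup code; the labels are illustrative.
APEX_EXTENSIONS = {
    "FusedAdam kernels": "amp_C",
    "FusedLayerNorm kernels": "fused_layer_norm_cuda",
}


def report() -> dict:
    """Map each label to whether its extension module can be located."""
    return {label: extension_available(mod) for label, mod in APEX_EXTENSIONS.items()}


if __name__ == "__main__":
    for label, found in report().items():
        print(f"{label:24s}: {found}")
```

On a machine without Apex, both entries simply report `False`, which is the same situation the tutorial's fallback branches are written for.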

We also detect whether FusedAdam, FusedLayerNorm, FusedRMSNorm, and legacy AMP are available, and define a reusable benchmarking helper for subsequent tests.

```python
print("\n[SECTION A] FusedAdam vs AdamW")


def make_many_param_model(n_layers=60, dim=512):
    # Many small layers mean many parameter tensors, which makes optimizer overhead visible.
    return torch.nn.Sequential(
        *[torch.nn.Linear(dim, dim) for _ in range(n_layers)]
    ).to(DEV)


def opt_step_factory(optimizer, model, dim=512):
    x = torch.randn(64, dim, device=DEV)

    def step():
        optimizer.zero_grad(set_to_none=True)
        out = model(x).pow(2).mean()
        out.backward()
        optimizer.step()

    return step


m1 = make_many_param_model()
torch_adam = torch.optim.AdamW(m1.parameters(), lr=1e-3)
ms_torch = bench(opt_step_factory(torch_adam, m1))
print(f"  torch.optim.AdamW : {ms_torch:6.2f} ms / step")

if HAS_AMP_C and APEX_OK:
    m2 = make_many_param_model()
    m2.load_state_dict(m1.state_dict())  # identical starting weights for a fair comparison
    fused_adam = FusedAdam(m2.parameters(), lr=1e-3)
    ms_fused = bench(opt_step_factory(fused_adam, m2))
    print(f"  apex.FusedAdam    : {ms_fused:6.2f} ms / step "
          f"(~{ms_torch/ms_fused:0.2f}x on optimizer-bound step)")
else:
    print("  apex.FusedAdam    : SKIPPED (cuda ext not built)")
```

We benchmark PyTorch AdamW against Apex FusedAdam using a model with many linear layers to make optimizer overhead visible. We run the same optimizer step pattern for both methods, so the comparison focuses on update speed rather than model differences. We then report the step time and speedup to assess whether the fused multi-tensor optimizer provides a practical benefit in the current GPU runtime.
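
The timing pattern behind these numbers (warm up, synchronize, time, synchronize again) matters because CUDA kernels are launched asynchronously: without the synchronization points, the clock would mostly measure how long it takes to queue the work. The sketch below restates that pattern in a device-agnostic form using only the standard library. The names `bench_ms`, `speedup`, and the `sync` parameter are hypothetical and not part of the tutorial's code; on a GPU one would pass `torch.cuda.synchronize` as `sync`.

```python
import time


def bench_ms(fn, iters=50, warmup=10, sync=None):
    """Average wall-clock milliseconds per call of `fn`.

    `sync` is an optional zero-argument callable that blocks until queued
    device work has finished (for CUDA, torch.cuda.synchronize).
    """
    sync = sync or (lambda: None)
    for _ in range(warmup):   # warmup calls are not timed
        fn()
    sync()                    # drain pending work before starting the clock
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    sync()                    # wait for the timed work before stopping the clock
    return (time.perf_counter() - t0) / iters * 1e3


def speedup(baseline_ms, candidate_ms):
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


if __name__ == "__main__":
    # CPU stand-in workload, purely for illustration.
    ms = bench_ms(lambda: sum(range(10_000)), iters=20, warmup=5)
    print(f"{ms:.3f} ms / call")
```

Keeping the synchronization hook as a parameter lets the same helper time CPU code in tests and GPU code in the actual benchmark.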

```python
print("\n[SECTION B] FusedLayerNorm / FusedRMSNorm")
B, T, H = 32, 512, 1024
x = torch.randn(B, T, H, device=DEV, requires_grad=True)

torch_ln = torch.nn.LayerNorm(H).to(DEV)


def ln_torch():
    y = torch_ln(x)
    y.sum().backward()


ms_ln_torch = bench(ln_torch)
print(f"  nn.LayerNorm       : {ms_ln_torch:6.2f} ms / fwd+bwd")

if HAS_FLN and APEX_OK:
    fused_ln = FusedLayerNorm(H).to(DEV)
    # Copy the affine parameters so both layers compute the same function.
    with torch.no_grad():
        fused_ln.weight.copy_(torch_ln.weight)
        fused_ln.bias.copy_(torch_ln.bias)
    diff = (fused_ln(x.detach()) - torch_ln(x.detach())).abs().max().item()
    print(f"  max|fused - torch| = {diff:.2e} (should be ~1e-3 or smaller)")

    def ln_fused():
        y = fused_ln(x)
        y.sum().backward()

    ms_ln_fused = bench(ln_fused)
    print(f"  apex.FusedLayerNorm: {ms_ln_fused:6.2f} ms / fwd+bwd "
          f"(~{ms_ln_torch/ms_ln_fused:0.2f}x)")

    if HAS_RMS:
        fused_rms = FusedRMSNorm(H).to(DEV)

        def rms_fused():
            y = fused_rms(x)
            y.sum().backward()

        print(f"  apex.FusedRMSNorm  : {bench(rms_fused):6.2f} ms / fwd+bwd "
              f"(RMSNorm: no mean-subtraction, used by LLaMA-style models)")
else:
    print("  apex.FusedLayerNorm: SKIPPED (cuda ext not built)")
```

We compare the standard PyTorch LayerNorm with Apex FusedLayerNorm on a large tensor resembling transformer hidden states. We first check numerical correctness by copying the same affine parameters and measuring the maximum difference between fused and standard outputs. We then benchmark forward and backward passes and, when available, test FusedRMSNorm to demonstrate how Apex supports normalization layers used in LLaMA-style models.
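
To make the difference between the two normalizations concrete, here is a minimal reference sketch in plain Python that operates on a single feature vector. It is not how Apex or PyTorch implement these layers; the function names are hypothetical, the learned scale and bias are omitted, and the `eps` value of 1e-5 is an assumption chosen to match common defaults. LayerNorm subtracts the mean and divides by the standard deviation, while RMSNorm only divides by the root mean square.

```python
import math


def layer_norm(x, eps=1e-5):
    """LayerNorm without affine parameters: subtract the mean, divide by the std."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)  # biased (population) variance
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def rms_norm(x, eps=1e-5):
    """RMSNorm without a learned scale: divide by the root mean square only."""
    mean_sq = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_sq + eps) for v in x]


if __name__ == "__main__":
    vec = [1.0, 2.0, 3.0, 4.0]
    print("layer_norm:", [round(v, 4) for v in layer_norm(vec)])
    print("rms_norm  :", [round(v, 4) for v in rms_norm(vec)])
```

For an input whose mean is already zero, the two functions return the same values, which is one way to see that RMSNorm is LayerNorm with the mean-centering step removed.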

```python
print("\n[SECTION C] mixed precision: apex.amp opt-levels (DEPRECATED)")


def tiny_net():
    return torch.nn.Sequential(
        torch.nn.Linear(256, 256), torch.nn.ReLU(),
        torch.nn.Linear(256, 256), torch.nn.ReLU(),
        torch.nn.Linear(256, 10),
    ).to(DEV)


if APEX_OK:
    for level in ("O0", "O1", "O2"):
        net = tiny_net()
        optimizer = (FusedAdam(net.parameters(), lr=1e-3) if HAS_AMP_C
                     else torch.optim.AdamW(net.parameters(), lr=1e-3))
        # Legacy API: amp.initialize wraps the model and optimizer for the chosen opt level.
        net, optimizer = amp.initialize(net, optimizer, opt_level=level, verbosity=0)
        xb = torch.randn(128, 256, device=DEV)
        yb = torch.randint(0, 10, (128,), device=DEV)
        lossfn = torch.nn.CrossEntropyLoss()
        for _ in range(20):
            optimizer.zero_grad()
            loss = lossfn(net(xb), yb)
            with amp.scale_loss(loss, optimizer) as scaled_loss:
                scaled_loss.backward()
            optimizer.step()
        print(f"  opt_level={level}: final loss = {loss.item():.4f}")
else:
    print("  apex.amp: SKIPPED (apex not importable)")

print("\n  Modern recommended equivalent (torch.amp, no Apex needed):")
net = tiny_net()
optimizer = torch.optim.AdamW(net.parameters(), lr=1e-3)
scaler = torch.amp.GradScaler("cuda")
xb = torch.randn(128, 256, device=DEV)
yb = torch.randint(0, 10, (128,), device=DEV)
lossfn = torch.nn.CrossEntropyLoss()
for _ in range(20):
    optimizer.zero_grad()
    with torch.amp.autocast("cuda", dtype=torch.float16):
        loss = lossfn(net(xb), yb)
    scaler.scale(loss).backward()  # scale the loss so small fp16 gradients do not underflow
    scaler.step(optimizer)         # unscales gradients, skips the step if they contain inf/NaN
    scaler.update()                # adjusts the scale factor for the next iteration
print(f"  torch.amp: final loss = {loss.item():.4f}")
```

We demonstrate the legacy apex.amp mixed-precision workflow by running small training loops across different opt levels, such as O0, O1, and O2. We use amp.initialize and amp.scale_loss to show how Apex handles model wrapping and loss scaling in the older API. We then run the same kind of mixed-precision training with modern torch.amp, which is the recommended approach for new PyTorch code.
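
Both APIs rely on loss scaling, and a small standard-library experiment shows why. The sketch below rounds a value to IEEE 754 half precision with the `struct` module's `"e"` format and compares a tiny gradient before and after scaling. The helper name `to_fp16`, the example gradient of 1e-8, and the constant `SCALE` are illustrative assumptions; 65536 is used because it matches the documented default initial scale of PyTorch's `GradScaler`. This illustrates the numerical motivation only, not the internals of either library.

```python
import struct


def to_fp16(value: float) -> float:
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


SCALE = 65536.0   # 2**16, illustrative scale factor
tiny_grad = 1e-8  # hypothetical gradient magnitude below the fp16 range

# Stored directly in fp16, the gradient is too small and rounds to zero.
unscaled = to_fp16(tiny_grad)

# Scaled first, it lands in the representable range; dividing by the scale
# afterwards (in higher precision) recovers approximately the original value.
recovered = to_fp16(tiny_grad * SCALE) / SCALE

if __name__ == "__main__":
    print("fp16 without scaling:", unscaled)
    print("fp16 with scaling   :", recovered)
```

The opposite failure also exists: a scale that is too large pushes gradients past the fp16 maximum, which is why a dynamic scaler checks for inf/NaN values, skips the affected step, and lowers the scale.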

```python
print("\n[SECTION D] end-to-end Transformer: vanilla fp32 vs Apex fused + AMP")
VOCAB, D, NHEAD, LAYERS, SEQ, BATCH, STEPS = 2000, 256, 4, 4, 128, 32, 60


class Block(torch.nn.Module):
    def __init__(self, d, nhead, norm_cls):
        super().__init__()
        self.attn = torch.nn.MultiheadAttention(d, nhead, batch_first=True)
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(d, 4 * d), torch.nn.GELU(), torch.nn.Linear(4 * d, d)
        )
        self.n1, self.n2 = norm_cls(d), norm_cls(d)

    def forward(self, x):
        # Pre-norm residual block; no attention mask, since this is a throughput test only.
        h = self.n1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.ff(self.n2(x))


class TinyTransformer(torch.nn.Module):
    def __init__(self, norm_cls):
        super().__init__()
        self.emb = torch.nn.Embedding(VOCAB, D)
        self.blocks = torch.nn.ModuleList(
            [Block(D, NHEAD, norm_cls) for _ in range(LAYERS)]
        )
        self.norm = norm_cls(D)
        self.head = torch.nn.Linear(D, VOCAB)

    def forward(self, idx):
        x = self.emb(idx)
        for b in self.blocks:
            x = b(x)
        return self.head(self.norm(x))


# Fixed random token batch so both runs see identical data.
g = torch.Generator(device="cpu").manual_seed(0)
data = torch.randint(0, VOCAB, (BATCH, SEQ + 1), generator=g).to(DEV)
inp, tgt = data[:, :-1], data[:, 1:]
lossfn = torch.nn.CrossEntropyLoss()


def run_training(use_apex):
    torch.manual_seed(0)
    norm_cls = FusedLayerNorm if (use_apex and HAS_FLN and APEX_OK) else torch.nn.LayerNorm
    model = TinyTransformer(norm_cls).to(DEV)
    if use_apex and HAS_AMP_C and APEX_OK:
        optimizer = FusedAdam(model.parameters(), lr=3e-4)
    else:
        optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
    # With enabled=False the scaler and autocast are no-ops, giving a plain fp32 path.
    scaler = torch.amp.GradScaler("cuda", enabled=use_apex)

    def one_step():
        optimizer.zero_grad(set_to_none=True)
        with torch.amp.autocast("cuda", dtype=torch.float16, enabled=use_apex):
            logits = model(inp)
            loss = lossfn(logits.reshape(-1, VOCAB), tgt.reshape(-1))
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
        return loss

    for _ in range(5):  # warmup steps, not timed
        one_step()
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(STEPS):
        loss = one_step()
    torch.cuda.synchronize()
    dt = time.perf_counter() - t0
    return loss.item(), STEPS * BATCH * SEQ / dt, dt


loss_v, tps_v, dt_v = run_training(use_apex=False)
print(f"  vanilla (fp32, nn.LayerNorm, AdamW)        : "
      f"{dt_v:5.2f}s | {tps_v:9.0f} tok/s | final loss {loss_v:.3f}")

if APEX_OK and (HAS_AMP_C or HAS_FLN):
    loss_a, tps_a, dt_a = run_training(use_apex=True)
    print(f"  apex (fp16, FusedLayerNorm, FusedAdam)     : "
          f"{dt_a:5.2f}s | {tps_a:9.0f} tok/s | final loss {loss_a:.3f}")
    print(f"  ---- speedup: {tps_a / tps_v:0.2f}x throughput")
else:
    print("  apex path SKIPPED (no fused kernels built)")

print("\n" + "=" * 78)
print("DONE. Key takeaways:")
print("  - FusedAdam/FusedLayerNorm/FusedRMSNorm are the still-relevant Apex pieces;")
print("    speedups grow with model size & parameter count (tiny demo understates it).")
print("  - apex.amp is deprecated - prefer torch.amp.autocast + torch.amp.GradScaler.")
print("  - FusedAdam composes cleanly with native torch.amp (Section D).")
print("  - On real workloads, also try a larger model and bf16 autocast (no scaler needed).")
print("=" * 78)
```

We build a small Transformer with attention blocks, feed-forward layers, embeddings, and normalization to test Apex in an end-to-end training workload. We train it once with vanilla FP32 PyTorch using AdamW and standard LayerNorm, then train it again with fused Apex components and native PyTorch AMP when the kernels are available. We finally compare runtime, token throughput, final loss, and speedup to understand how fused kernels affect real training performance. Because the second run changes both the numerical precision and the kernels, the reported speedup reflects their combined effect rather than the fused kernels alone.

In conclusion, we have a clear and practical understanding of where NVIDIA Apex still fits in a 2026 deep learning workflow. We saw that Apex is no longer primarily about mixed precision, since native PyTorch AMP now handles that aspect more cleanly. However, its fused optimizer and fused normalization kernels can still be useful when the environment supports a proper CUDA extension build. We also learned how to write Apex-aware code that does not break when fused kernels are unavailable, making the tutorial more reliable across Colab runtimes.
The final Transformer benchmark gives us a complete view of how FusedAdam, FusedLayerNorm, and torch.amp can work together in an end-to-end training loop. Also, we used this tutorial to move beyond installation and API usage, and we evaluated Apex as it should be evaluated: by checking kernel availability, comparing against PyTorch baselines, and measuring performance in an actual training workload.

Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.