{"slug": "accelerating-transformers-fine-tuning-with-nvidia-nemo-automodel", "title": "Accelerating Transformers Fine-Tuning with NVIDIA NeMo AutoModel", "summary": "NVIDIA released NeMo AutoModel, an open-source library that accelerates fine-tuning of Mixture-of-Experts (MoE) transformer models by 3.4-3.7x in training throughput and reduces GPU memory usage by 29-32% compared to native HuggingFace Transformers v5, while maintaining API compatibility. The library builds on Transformers v5's MoE support, adding Expert Parallelism, DeepEP fused all-to-all dispatch, and TransformerEngine kernels to optimize distributed training for models like NVIDIA Nemotron and Qwen3.", "body_md": "# Accelerating Transformers Fine-Tuning with NVIDIA NeMo AutoModel\n\nThe [Transformers v5](https://github.com/huggingface/transformers/releases/tag/v5.0.0) release strengthened the library with first-class support for Mixture-of-Experts (MoE) models, now the dominant architecture for [frontier models](https://www.nvidia.com/en-us/glossary/frontier-models/). v5 ships the MoE foundations: expert backends, dynamic weight loading, and distributed execution that make MoE extensible and easy to build on.\n\n[NVIDIA NeMo AutoModel](https://github.com/NVIDIA-NeMo/Automodel) is an open-source library, part of the [NVIDIA NeMo framework](https://github.com/NVIDIA-NeMo), for building custom generative AI models at scale. NeMo AutoModel builds cleanly on top of v5, adding Expert Parallelism, DeepEP fused all-to-all dispatch, and TransformerEngine kernels, and it leans on v5's dynamic weight loading to bring those optimizations to a broad and growing set of model families. 
The payoff is **3.4-3.7x higher training throughput** and **29-32% less GPU memory** on fine-tuning MoE models than native Transformers v5, using the same from_pretrained() API: a single import line, with no other code changes.\n\nThis blog details how this combination works and how users can fine-tune MoE models faster without changing their APIs.\n\n# Background\n\nThe rise of MoE models has introduced new challenges to efficient training: Routing tokens across hundreds of experts, fusing expert matmuls into a single kernel, sharding weights across GPUs, and overlapping communication with computation all require infrastructure beyond what a general-purpose library provides out of the box.\n\n[Transformers v5](https://github.com/huggingface/transformers/releases/tag/v5.0.0) (“v5”) introduced first-class MoE support such as [expert backends](https://huggingface.co/docs/transformers/en/experts_interface), [dynamic weight loading](https://huggingface.co/docs/transformers/en/weightconverter), and tensor parallel plans for distributed execution. In addition, v5 made distributed training first-class by integrating PyTorch's DeviceMesh directly into from_pretrained().\n\n[NeMo AutoModel](https://github.com/NVIDIA-NeMo/Automodel) builds on top of v5 by subclassing AutoModelForCausalLM, and adding Expert Parallelism (EP), DeepEP fused all-to-all dispatch, and TransformerEngine kernels. DeepEP is the piece v5 doesn't have yet: it overlaps communication with expert compute. 
And because NeMo AutoModel rides v5's reversible weight conversion to load each model, it can focus its engineering on these reusable core ops instead of per-model checkpoint plumbing, while save_pretrained() still emits standard HF checkpoints that tools like vLLM and SGLang can load.\n\nThe next section walks through how the two work together and the performance gains we measured, from full fine-tuning [NVIDIA Nemotron 3 Ultra 550B A55B](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16) across 16 nodes down to single-node models such as Qwen3-30B-A3B and [Nemotron 3 Nano 30B A3B](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16).\n\n## NeMo AutoModel: Same API, More Performance\n\nOne of NeMo AutoModel's goals is API compatibility with HuggingFace Transformers, so that the open-source community can adopt it without rewriting code. NeMoAutoModelForCausalLM subclasses AutoModelForCausalLM, so any code that works with HF models works with NeMo AutoModel too.\n\nLoading a model looks the same in both; only the import changes: `from transformers import AutoModelForCausalLM` becomes `from nemo_automodel import NeMoAutoModelForCausalLM`.\n\nThat single import does a lot of work. For popular MoE architectures like Qwen3, [NVIDIA Nemotron](https://developer.nvidia.com/nemotron), GPT-OSS, and DeepSeek V3, NeMo AutoModel ships [hand-tuned implementations](https://github.com/NVIDIA-NeMo/Automodel/blob/main/nemo_automodel/_transformers/registry.py) with TransformerEngine attention, fused linear layers, and custom expert kernels. For everything else, it falls back to vanilla HF while still applying optimizations like [Liger kernel](https://github.com/linkedin/Liger-Kernel) patching, among others. And whichever path it takes, the resulting model is ready to scale: pass a device_mesh and you have multi-GPU training without further rewrites.\n\nWhere NeMo AutoModel really shines is scaling MoE models to multi-GPU training. 
To train [Nemotron 3 Nano 30B A3B](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16) with Expert Parallelism across 8 GPUs, one adds the distributed mesh configuration:\n\n``` python\nimport os\nimport torch\nimport torch.distributed as dist\nfrom nemo_automodel import NeMoAutoModelForCausalLM\nfrom nemo_automodel.recipes._dist_utils import create_distributed_setup_from_config\n\ndist.init_process_group(backend=\"nccl\")\ntorch.manual_seed(0)\ntorch.cuda.set_device(int(os.environ.get(\"LOCAL_RANK\", 0)))\n\ndist_setup = create_distributed_setup_from_config(\n    {\n        \"strategy\": \"fsdp2\",\n        \"ep_size\": 8,\n    },\n)\n\nmodel = NeMoAutoModelForCausalLM.from_pretrained(\n    \"nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16\",\n    dtype=torch.bfloat16,\n    distributed_setup=dist_setup,\n)\n\ndist.destroy_process_group()\n```\n\nThis gives speed, scalability and memory-optimizations with FSDP2, Expert Parallelism, TransformerEngine kernels and DeepEP dispatch, all from a from_pretrained() call.\n\n## Performance Comparison\n\nWe evaluated NeMo AutoModel in two regimes: full fine-tuning a frontier-scale 550B model across 16 nodes, and training two 30B MoE models on a single node. The 550B result shows why Expert Parallelism is essential at scale; the 30B results quantify the per-GPU speedup over Transformers v5.\n\n### Nemotron 3 Ultra 550B A55B (full fine-tune, multi-node)\n\n[Nemotron 3 Ultra 550B A55B](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16) is a 550B-parameter hybrid model shipping with Mamba2, LatentMoE, and Multi-Token Prediction (MTP). 
We benchmark a **full fine-tune**: every parameter is updated and the Adam optimizer state is materialized, which at this scale spans **16 H100 nodes (128 GPUs)**.\n\n**Methodology:**\n\n| Parameter | Value |\n|---|---|\n| Hardware | 16 nodes, 8x H100 80GB each (128 GPUs) |\n| Expert Parallelism | EP=64 |\n| Local batch size | 2 |\n| Sequence length | 4,096 |\n| Features | MTP, activation checkpointing, fused linear cross-entropy |\n| Kernels | DeepEP dispatch + torch_mm experts + TransformerEngine |\n\n| Metric | NeMo AutoModel (EP=64) |\n|---|---|\n| TPS/GPU (avg) | 815 |\n| TFLOP/s/GPU | ~293 |\n| Peak Memory | 58.2 GiB |\n\n**Why there is no Transformers v5 column.** Transformers v5 runs out of memory at this scale, so there is no v5 number to report here. NeMo AutoModel's Expert Parallelism shards the experts across GPUs to bring the footprint within budget, which is what lets the full fine-tune run. The 30B comparisons below show the same advantage where v5 fits.\n\n### Single-node 30B MoE benchmarks\n\nWe benchmarked three approaches on a single node with 8x H100 80GB GPUs: HF Transformers v4 (hub code), HF Transformers v5 (with best available optimizations), and NeMo AutoModel (EP=8 + custom kernels).\n\n**Methodology:**\n\n| Parameter | Value |\n|---|---|\n| Hardware | 8x H100 80GB (single node) |\n| Sequence length | 4,096 |\n| Local batch size | 1 |\n\n**A note on the routing gate.** The NeMo AutoModel numbers below use a balanced routing gate, which forces tokens to be distributed uniformly across experts. This emulates the *ideal* operating point an MoE is trained toward: a well-trained model's load-balancing loss drives expert utilization to near-uniform, so balanced routing reflects the steady-state a real workload converges to (and removes the straggler noise that random dummy tokens otherwise inject into expert parallelism). v4/v5 run their native router on the same dummy tokens. 
The balanced gate therefore measures NeMo AutoModel at its target MoE operating point, and the v4/v5 columns reflect their out-of-the-box behavior.\n\n### Qwen3-30B-A3B\n\n| Metric | v4 | v5 (FA2 + grouped_mm) | NeMo AutoModel (EP=8) | v5 → NeMo AutoModel |\n|---|---|---|---|---|\n| TPS/GPU (avg) | deadlock | 3,075 | 11,340 | 3.69x |\n| Peak Memory | — | 68.2 GiB | 48.1 GiB | -29% |\n| Avg Forward+Loss | — | 582 ms | 194 ms | 3.00x |\n| Avg Backward | — | 758 ms | 178 ms | 4.26x |\n\n**Why v4 deadlocks:** Transformers v4 stores Qwen3 MoE experts as a ModuleList of 128 individual MLP modules, each separately FSDP-wrapped. The forward pass uses a data-dependent loop that only iterates experts that received tokens. With different data per rank, different ranks skip different experts, causing mismatched FSDP AllGather/ReduceScatter collectives and an indefinite hang. Transformers v5 fixes this by storing experts as fused 3D parameter tensors (no per-expert modules, no per-expert FSDP collectives).\n\n### Nemotron 3 Nano 30B A3B\n\n| Metric | v4 (hub code) | v5 (FA2 + grouped_mm + Mamba CUDA) | NeMo AutoModel (EP=8) | v5 → NeMo AutoModel |\n|---|---|---|---|---|\n| TPS/GPU (avg) | 1,807 | 4,583 | 15,421 | 3.36x |\n| Peak Memory | 61.9 GiB | 62.1 GiB | 42.5 GiB | -32% |\n| Avg Forward+Loss | 1,024 ms | 283 ms | 109 ms | 2.60x |\n| Avg Backward | 1,246 ms | 611 ms | 157 ms | 3.89x |\n\n**v4 config:** trust_remote_code=True (NVIDIA's hub modeling code). The hub code's expert loop is FSDP-safe (iterates all experts regardless of token assignment), so it doesn't deadlock like Qwen3 v4.\n\n### Where the speedup comes from\n\nThe 3.4-3.7x speedup from NeMo AutoModel over Transformers v5 comes from three sources:\n\n**Expert Parallelism reduces memory pressure.** EP=8 distributes expert weights across GPUs, cutting the per-GPU MoE footprint by 8x. For Qwen3, this drops peak memory from 68.2 GiB to 48.1 GiB (-29%). 
For Nemotron Nano, it drops from 62.1 GiB to 42.5 GiB (-32%), freeing headroom for larger batch sizes or longer sequences.\n\n**DeepEP fuses communication with computation.** Instead of separate AllGather/ReduceScatter collectives for expert routing, DeepEP fuses token dispatch and combine into optimized GPU kernels, overlapping communication with expert computation.\n\n**TransformerEngine kernels accelerate core operations.** TE's fused attention, linear layers, and RMSNorm implementations provide consistent speedups over their PyTorch/Flash Attention equivalents across all layer types, not just MoE layers.\n\n## Transformers v5 Features Leveraged by NeMo AutoModel\n\n### Expert Backends\n\nOne of the most impactful features in Transformers v5 is the [experts_implementation](https://huggingface.co/docs/transformers/en/experts_interface) parameter, which selects among three expert backends:\n\n| Backend | Description | Best for |\n|---|---|---|\n| eager | For-loop over selected experts | Debugging, compatibility, and correctness. Also available in v4. |\n| batched_mm | Duplicates expert params, single batched GEMM via torch.bmm | Small inputs, fast with torch.compile. Added in v5. |\n| grouped_mm | Orders tokens by expert, single grouped GEMM via torch.nn.functional.grouped_mm | Training (memory efficient, no param duplication). Added in v5. |\n\nThe grouped_mm backend is the key training optimization: instead of looping over experts one by one, it sorts tokens by their assigned expert and executes a single fused grouped matrix multiplication.\n\nNeMo AutoModel takes this further. For models with custom implementations, it uses DeepEP fused all-to-all dispatch combined with grouped GEMM kernels and TransformerEngine linear layers. 
The progression looks like:\n\n```\nv4 (eager for-loop) → v5 (grouped_mm) → NeMo AutoModel (DeepEP + GMM + TE)\n```\n\nIn NeMo AutoModel, the expert backend is configured through BackendConfig:\n\n``` python\nfrom nemo_automodel.components.models.common.utils import BackendConfig\n\nbackend = BackendConfig(\n    attn=\"te\",           # TransformerEngine attention\n    linear=\"te\",         # TransformerEngine linear layers\n    experts=\"torch_mm\",  # Grouped expert matmul\n    dispatcher=\"deepep\", # DeepEP fused all-to-all\n)\n```\n\n### Expert Parallelism and DeepEP\n\nTransformers v5 also ships an [Expert Parallelism path](https://huggingface.co/docs/transformers/en/expert_parallelism). It shards expert weights across GPUs. The [GroupedGemmParallel](https://github.com/huggingface/transformers/blob/v5.10.2/src/transformers/integrations/tensor_parallel.py#L1078) style loads only each device's local experts, and [RouterParallel](https://github.com/huggingface/transformers/blob/v5.10.2/src/transformers/integrations/tensor_parallel.py#L1123) routes tokens and combines results with an all_reduce. It's neatly built on v5's existing tensor-parallel machinery. Enabling it makes the model's tp_plan return its [expert plan](https://github.com/huggingface/transformers/blob/v5.10.2/src/transformers/modeling_utils.py#L1448), so expert parallelism shares the device budget with data parallelism (ep × dp = world_size). For the single-node 30B benchmarks here, we found plain data-parallel v5 (dp=8, ep=1) to be the fastest v5 configuration, so that's the v5 setup we report.\n\nNeMo AutoModel takes a complementary approach tuned for multi-GPU MoE training. It makes EP its own parallelism dimension, a dedicated moe_mesh alongside (rather than carved from) the data-parallel mesh, using PyTorch's DTensor with Shard(0). Because the expert mesh is orthogonal to data parallelism, the two compose on the same devices. 
On 8 GPUs NeMo AutoModel runs ep=8 and dp=8 together, so every GPU trains on its own data shard while holding only 1/8 of the experts. Expert weights are physically sharded across GPUs along the expert dimension.\n\n``` python\n# From nemo_automodel/components/moe/parallelizer.py\nfrom torch.distributed.tensor import Shard, distribute_tensor\n\n# Each GPU holds only 1/ep_size of the expert weights\ndistribute_tensor(param, device_mesh, [Shard(0)])\n```\n\nFor a model like Nemotron-3-Nano-30B-A3B with ~55 GiB of expert weights, ep_size=8 on 8 GPUs reduces the per-GPU expert footprint from ~55 GiB to ~6.8 GiB, making training possible where FSDP-only approaches run out of memory.\n\nOn top of EP, NeMo AutoModel integrates [DeepEP](https://github.com/deepseek-ai/DeepEP), which fuses token routing into optimized GPU kernels and delivers significant speedups when combined with grouped GEMM for expert computation. In our [large-scale MoE benchmarks](https://github.com/NVIDIA-NeMo/Automodel/discussions/916), DeepEP + grouped GEMM reduced cost per iteration by 47% on the full DeepSeek V3 671B model compared to all-gather + looped expert baselines.\n\n### Dynamic Weight Loading\n\nTransformers v5 also introduced a [dynamic weight loading](https://huggingface.co/docs/transformers/en/weightconverter) system through WeightConverter and WeightRenaming. This lets MoE checkpoints be loaded into fused 3D tensors for more efficient execution. The WeightConverter applies composable operations to transform checkpoint tensors on-the-fly during from_pretrained().\n\nNeMo AutoModel is a direct consumer of this v5 API. Over [20 model types](https://github.com/NVIDIA-NeMo/Automodel/blob/main/nemo_automodel/components/checkpoint/conversion_mapping.py) use this mechanism through MODELS_REQUIRING_TENSOR_MERGING, including Mixtral, Qwen2 MoE, Qwen3 MoE, DeepSeek V2/V3, OLMoE, and more. 
The conversions are fully reversible: save_pretrained() produces standard HF-format checkpoints that any downstream tool can load.\n\n## Getting Started\n\nTo try NeMo AutoModel, please visit our official documentation page to [get started](https://docs.nvidia.com/nemo/automodel/latest/get-started/installation).\n\nFor more details, see:\n\n- [NeMo AutoModel HuggingFace API Compatibility Guide](https://docs.nvidia.com/nemo/automodel/latest/get-started/hf-compatibility)\n- [NeMo AutoModel Model Coverage](https://docs.nvidia.com/nemo/automodel/latest/model-coverage/overview)\n- [NeMo AutoModel Performance Summary](https://docs.nvidia.com/nemo/automodel/latest/performance/performance-summary)\n- [NeMo AutoModel on HuggingFace](https://huggingface.co/docs/transformers/en/community_integrations/nemo_automodel_finetuning)\n\n## Conclusion\n\nNVIDIA NeMo AutoModel is the natural next step for HuggingFace users scaling up model training. By building directly on Transformers v5, AutoModel provides a zero-friction upgrade path: change one import line and get a model instance that is more than three times as fast.\n\nOn Qwen3-30B-A3B and Nemotron 3 Nano 30B-A3B, this delivers 3.4-3.7x higher training throughput with 29-32% less GPU memory compared to the best Transformers v5 configuration. And because true Expert Parallelism shards experts across GPUs, the same path scales up to full fine-tuning a 550B model like Nemotron 3 Ultra across 16 nodes, the regime where Expert Parallelism becomes essential to fit the model in memory. 
Because NeMo AutoModel checkpoints are standard HF-format safetensors, you can deploy them on inference frameworks like vLLM and SGLang.\n\nThe code, configs, and benchmark scripts are all available in the [NeMo AutoModel repository](https://github.com/NVIDIA-NeMo/Automodel/tree/blog/transformers-v5-automodel/blog_experiments).\n\n## Acknowledgements\n\nCore contributors to this work, listed alphabetically by last name: Adil Asif, Hemil Desai, Alexandros Koumparoulis, and Huiying Li.", "url": "https://wpnews.pro/news/accelerating-transformers-fine-tuning-with-nvidia-nemo-automodel", "canonical_source": "https://huggingface.co/blog/nvidia/accelerating-fine-tuning-nvidia-nemo-automodel", "published_at": "2026-06-24 16:00:13+00:00", "updated_at": "2026-06-24 16:11:33.724647+00:00", "lang": "en", "topics": ["large-language-models", "ai-infrastructure", "ai-tools", "ai-research", "ai-products"], "entities": ["NVIDIA", "NeMo AutoModel", "HuggingFace Transformers", "DeepEP", "TransformerEngine", "Qwen3", "Nemotron", "Mixture-of-Experts"], "alternates": {"html": "https://wpnews.pro/news/accelerating-transformers-fine-tuning-with-nvidia-nemo-automodel", "markdown": "https://wpnews.pro/news/accelerating-transformers-fine-tuning-with-nvidia-nemo-automodel.md", "text": "https://wpnews.pro/news/accelerating-transformers-fine-tuning-with-nvidia-nemo-automodel.txt", "jsonld": "https://wpnews.pro/news/accelerating-transformers-fine-tuning-with-nvidia-nemo-automodel.jsonld"}}