PyTorch AOTInductor Hybrid Lowering

PyTorch AOTInductor now compiles exported programs with hybrid CPU-GPU execution plans into a single executable package, eliminating the need to manually split models into separate device sub-models. The `aoti_compile_and_package` API produces a unified package that runs the entire model end-to-end, with operators executing on their designated devices and automatic device transfers handled internally. This capability enables developers to deploy models with mixed-device execution without requiring separate compilation or runtime orchestration for each device.

Introduction

In my previous blog post “PyTorch Fake Export” (/blog/PyTorch-Fake-Export/), I mentioned that a PyTorch exported program allows operators on different devices and explicit device transfer operators in the same graph. This means that a PyTorch exported program can have a hybrid device execution plan, where some operators are executed on CPU and some on GPU, with explicit device transfers in between.

In this blog post, I would like to discuss how PyTorch AOTInductor can compile a PyTorch exported program with a hybrid device execution plan into a single executable package. The package runs the whole model end-to-end, with some operators running on CPU and others on GPU, without the need to manually split the model into separate CPU and GPU sub-models.

PyTorch AOTInductor Hybrid Lowering

The following example follows the one used in my previous blog post “PyTorch Fake Export” (/blog/PyTorch-Fake-Export/). However, instead of exporting fake models, this time we have to export the model with actual data, because the compiled package needs real weights. The example input tensors can remain fake tensors though.

The exported programs are compiled with AOTInductor using the `torch._inductor.aoti_compile_and_package` API, which produces a single executable package for each exported program. The compiled AOTInductor package can be loaded with the `torch._inductor.aoti_load_package` API, which returns a Python callable that can be invoked with real input tensors to run the model end-to-end.

```python
from pathlib import Path

import torch
import torch.nn as nn
import torch.profiler
from torch._inductor import aoti_compile_and_package, aoti_load_package
from torch._subclasses.fake_tensor import FakeTensorMode


class MLP(nn.Module):
    """MLP configurable across CPU, GPU, or a CPU-GPU hybrid split.

    fc1 + GELU is placed on ``fc1_device``; fc2 is placed on ``fc2_device``.
    When the two devices differ the forward pass inserts an explicit device
    transfer, preserved as an ``aten._to_copy`` node in the exported graph.
    When they are the same the transfer is a no-op.
    """

    def __init__(
        self,
        in_features: int,
        hidden_features: int,
        out_features: int,
        fc1_device: torch.device = torch.device("cpu"),
        fc2_device: torch.device = torch.device("cpu"),
    ) -> None:
        super().__init__()
        with torch.device(fc1_device):
            self.fc1 = nn.Linear(in_features, hidden_features)
            self.act = nn.GELU()
        with torch.device(fc2_device):
            self.fc2 = nn.Linear(hidden_features, out_features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.act(self.fc1(x))
        # Transfer to fc2's device (no-op when fc1 and fc2 share the same device).
        h = h.to(self.fc2.weight.device)
        return self.fc2(h)


def aoti_compile(model: nn.Module, x: torch.Tensor, package_path: str) -> object:
    """Export and AOTInductor-compile a model.

    A fake input with the same shape/dtype/device as ``x`` is used so that
    torch.export can trace the graph without allocating real activation
    memory. Works for any device (cpu, cuda) and any model topology.
    """
    with FakeTensorMode():
        fake_input = torch.empty(x.shape, dtype=x.dtype, device=x.device)
    ep = torch.export.export(model, (fake_input,))
    compiled_package = aoti_compile_and_package(ep, package_path=package_path)
    return aoti_load_package(compiled_package, run_single_threaded=True)


def profile_runner(
    runner,
    x: torch.Tensor,
    trace_path: str,
    label: str,
    warmup: int = 3,
    steps: int = 5,
) -> None:
    """Profile an AOTI runner and export a Chrome trace to ``trace_path``.

    Note: AOTI runners call compiled C++ directly, bypassing the ATen
    dispatcher's profiling hooks. As a result, no ``cpu_op`` events (e.g.
    aten::mm, aten::gelu) appear in the trace; the runner executes as an
    opaque native call from the profiler's perspective. What the trace does
    capture are CUDA runtime events (cudaLaunchKernel) and, when CUPTI is
    available, actual GPU kernel execution on the device timeline.
    """
    activities = [
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ]
    schedule = torch.profiler.schedule(
        wait=0, warmup=warmup, active=steps, repeat=1
    )
    with torch.profiler.profile(
        activities=activities,
        schedule=schedule,
        record_shapes=True,
        with_flops=True,
    ) as prof:
        for step in range(warmup + steps):
            with torch.profiler.record_function(f"step_{step}"):
                runner(x)
            prof.step()
    prof.export_chrome_trace(trace_path)
    print(f"{label} trace written to {trace_path}")


if __name__ == "__main__":
    cpu_device = torch.device("cpu")
    gpu_device = torch.device("cuda")

    artifacts_dir = Path(__file__).parent / "aoti_artifacts"
    artifacts_dir.mkdir(exist_ok=True)

    # CPU-only configuration.
    model_cpu = MLP(128, 256, 10, fc1_device=cpu_device, fc2_device=cpu_device).eval()
    x_cpu = torch.randn(4, 128, device=cpu_device)
    runner_cpu = aoti_compile(model_cpu, x_cpu, str(artifacts_dir / "cpu.pt2"))
    torch.testing.assert_close(runner_cpu(x_cpu), model_cpu(x_cpu))
    print("AOTInductor compile (CPU) succeeded.")
    profile_runner(
        runner_cpu, x_cpu, str(artifacts_dir / "cpu_trace.json"), "AOTInductor (CPU)"
    )

    # GPU-only configuration.
    model_gpu = MLP(128, 256, 10, fc1_device=gpu_device, fc2_device=gpu_device).eval()
    x_cuda = torch.randn(4, 128, device=gpu_device)
    runner_gpu = aoti_compile(model_gpu, x_cuda, str(artifacts_dir / "cuda.pt2"))
    torch.testing.assert_close(runner_gpu(x_cuda), model_gpu(x_cuda))
    print("AOTInductor compile (GPU) succeeded.")
    profile_runner(
        runner_gpu, x_cuda, str(artifacts_dir / "cuda_trace.json"), "AOTInductor (GPU)"
    )

    # CPU-GPU hybrid configuration: fc1 + GELU on CPU, fc2 on GPU.
    model_hybrid = MLP(
        128, 256, 10, fc1_device=cpu_device, fc2_device=gpu_device
    ).eval()
    x_hybrid = torch.randn(4, 128, device=cpu_device)
    runner_hybrid = aoti_compile(
        model_hybrid, x_hybrid, str(artifacts_dir / "hybrid.pt2")
    )
    torch.testing.assert_close(runner_hybrid(x_hybrid), model_hybrid(x_hybrid))
    print("AOTInductor compile (CPU-GPU hybrid) succeeded.")
    profile_runner(
        runner_hybrid,
        x_hybrid,
        str(artifacts_dir / "hybrid_trace.json"),
        "AOTInductor (CPU-GPU hybrid)",
    )
```

Using the NVIDIA NGC Docker container `nvcr.io/nvidia/pytorch:26.04-py3`, we can run the script above to export the concrete MLP model for the CPU, GPU, and CPU-GPU hybrid device configurations and compile the exported programs with AOTInductor. That the hybrid AOTInductor package really executes partly on CPU and partly on GPU can be verified by examining the profiling traces.

```bash
$ python test_torch_hybrid_lowering.py
/usr/lib/python3.12/copyreg.py:99: FutureWarning: `isinstance(treespec, LeafSpec)` is deprecated, use `isinstance(treespec, TreeSpec) and treespec.is_leaf()` instead.
  return cls.__new__(cls, *args)
/usr/local/lib/python3.12/dist-packages/torch/utils/_config_module.py:540: FutureWarning: torch._dynamo.config.skip_code_recursive_on_recompile_limit_hit is deprecated and does not do anything. It will be removed in a future version of PyTorch.
  config[key] = copy.deepcopy(getattr(self, key))
AOTInductor compile (CPU) succeeded.
/usr/local/lib/python3.12/dist-packages/torch/profiler/profiler.py:229: UserWarning: Warning: Profiler clears events at the end of each cycle.Only events from the current cycle will be reported.To keep events across cycles, set acc_events=True.
  warn_once(
AOTInductor (CPU) trace written to /mnt/aoti_artifacts/cpu_trace.json
/usr/lib/python3.12/copyreg.py:99: FutureWarning: `isinstance(treespec, LeafSpec)` is deprecated, use `isinstance(treespec, TreeSpec) and treespec.is_leaf()` instead.
  return cls.__new__(cls, *args)
AOTInductor compile (GPU) succeeded.
AOTInductor (GPU) trace written to /mnt/aoti_artifacts/cuda_trace.json
/usr/lib/python3.12/copyreg.py:99: FutureWarning: `isinstance(treespec, LeafSpec)` is deprecated, use `isinstance(treespec, TreeSpec) and treespec.is_leaf()` instead.
  return cls.__new__(cls, *args)
AOTInductor compile (CPU-GPU hybrid) succeeded.
AOTInductor (CPU-GPU hybrid) trace written to /mnt/aoti_artifacts/hybrid_trace.json
```
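
The traces written by the script are Chrome trace JSON files, so besides opening them in a trace viewer they can also be inspected programmatically. The following is a minimal sketch of such a check, not part of the original script. The helper names `summarize_trace` and `has_gpu_activity` are hypothetical, and the category names `kernel` and `gpu_memcpy` are assumptions about what the profiler emits for GPU kernels and device copies; they should be checked against an actual trace before relying on them.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_trace(trace_path: str) -> Counter:
    """Count complete ("X") events per category in a Chrome trace file."""
    data = json.loads(Path(trace_path).read_text())
    # A Chrome trace is either a bare list of events or an object that
    # stores the list under the "traceEvents" key.
    events = data["traceEvents"] if isinstance(data, dict) else data
    return Counter(
        event.get("cat", "unknown")
        for event in events
        if event.get("ph") == "X"  # keep complete (duration) events only
    )


def has_gpu_activity(counts: Counter) -> bool:
    """Return True if any assumed GPU-side category has at least one event."""
    # Assumed category names; adjust them to match the actual trace.
    return any(counts.get(cat, 0) > 0 for cat in ("kernel", "gpu_memcpy"))
```

Calling `summarize_trace` on `cpu_trace.json`, `cuda_trace.json`, and `hybrid_trace.json` and comparing the per-category counts gives a quick view of which configurations launched GPU work. One would expect the CPU-only trace to report no GPU activity and the hybrid trace to report some, but that expectation should be confirmed against the real traces.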