PyTorch Fake Export

PyTorch introduced a "fake export" method that allows developers to verify the exportability of large deep learning models using `torch.export` APIs without requiring actual GPU memory. The approach uses fake tensors within `FakeTensorMode` to simulate model parameters on specific devices, enabling developers to test GPU export compatibility even when models are too large to fit on a single GPU. This technique addresses the fundamental difference between CPU and GPU exported programs, providing a practical verification path for multi-device model deployment.

Introduction

The PyTorch `torch.export` APIs produce a standardized, single-graph representation of a deep learning model designed for deployment in Python-less environments. Unlike other model export APIs and graph representations, such as the `torch.onnx.export` APIs and ONNX, a PyTorch exported program is not a pure device graph. Within one PyTorch exported program, the graph can contain operators on different devices, and even explicit device transfer operators. Consequently, the CPU exported program and the GPU exported program of the same PyTorch model differ.

Deep learning models are becoming much larger and more complex, and it is common for a model not to fit on a single GPU. The host, by contrast, usually has enough CPU memory to hold and run one large model. So a natural question is: how can the developer verify that the large model being developed can be successfully exported to a GPU exported program using the `torch.export` APIs? Moving the model to CPU and running `torch.export` on CPU is not a valid way to verify this, even if the model runs successfully on CPU, because of the difference between CPU and GPU exported programs.

To address this problem, PyTorch provides a way to construct a fake model whose parameters are fake tensors on specific devices, such as CPU or GPU, that have no actual data. To verify the exportability of a large fake model, the developer can also use fake tensors as example inputs when running the `torch.export` APIs for tracing.

In this blog post, I would like to discuss how to run PyTorch export (/blog/PyTorch-Custom-ONNX-Operator-Export/) for fake models with fake tensors to verify the `torch.export` compatibility of a large model.

PyTorch Fake Export

In the following example, we define a simple MLP model with two linear layers and a GELU activation in between. The first linear layer and the activation are placed on `fc1_device`, and the second linear layer is placed on `fc2_device`.
When executing the model, the output of the activation is transferred to the device of the second linear layer before being fed into it. To instantiate a fake model, we create the model inside `FakeTensorMode`. To specify the device placement of the model, we can use the `torch.device` context manager when constructing the linear layers, instead of using the PyTorch `to` API, because moving a tensor with `to` would require accessing its actual data.

```python
import torch
import torch.nn as nn
from torch._subclasses.fake_tensor import FakeTensor, FakeTensorMode


class MLP(nn.Module):
    """MLP configurable across CPU, GPU, or a CPU-GPU hybrid split.

    fc1 + GELU is placed on ``fc1_device``; fc2 is placed on ``fc2_device``.
    When the two devices differ the forward pass inserts an explicit device
    transfer, preserved as an ``aten.to.dtype_layout`` node in the exported graph.
    When they are the same the transfer is a no-op.
    """

    def __init__(
        self,
        in_features: int,
        hidden_features: int,
        out_features: int,
        fc1_device: torch.device = torch.device("cpu"),
        fc2_device: torch.device = torch.device("cpu"),
    ) -> None:
        super().__init__()
        with torch.device(fc1_device):
            self.fc1 = nn.Linear(in_features, hidden_features)
            self.act = nn.GELU()
        with torch.device(fc2_device):
            self.fc2 = nn.Linear(hidden_features, out_features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.act(self.fc1(x))
        # Transfer to fc2's device (no-op when fc1 and fc2 share the same device).
        h = h.to(self.fc2.weight.device)
        return self.fc2(h)


def fake_export(
    fc1_device: torch.device,
    fc2_device: torch.device,
) -> torch.export.ExportedProgram:
    """Export the MLP with fake tensors for the given device configuration.

    Both parameters and the example input are fake tensors created inside
    FakeTensorMode, so no real memory is allocated on either device.
    """
    with FakeTensorMode():
        model = MLP(
            in_features=128,
            hidden_features=256,
            out_features=10,
            fc1_device=fc1_device,
            fc2_device=fc2_device,
        ).eval()
        assert all(isinstance(p, FakeTensor) for p in model.parameters()), (
            "Model parameters were unexpectedly materialized (not FakeTensor)"
        )
        example_input = torch.randn(4, 128, device=fc1_device)
        return torch.export.export(model, (example_input,))


if __name__ == "__main__":
    cpu_device = torch.device("cpu")
    gpu_device = torch.device("cuda")

    # PyTorch export specializes the graph on the device configuration.
    ep_cpu = fake_export(fc1_device=cpu_device, fc2_device=cpu_device)
    print("MLP export CPU succeeded.")
    print(ep_cpu)

    ep_gpu = fake_export(fc1_device=gpu_device, fc2_device=gpu_device)
    print("MLP export GPU succeeded.")
    print(ep_gpu)

    ep_hybrid = fake_export(fc1_device=cpu_device, fc2_device=gpu_device)
    print("CPU-GPU hybrid export succeeded.")
    # The graph contains both cpu and cuda ops plus an aten.to.dtype_layout transfer node.
    print(ep_hybrid)
```

Using the NVIDIA NGC Docker container `nvcr.io/nvidia/pytorch:26.04-py3`, we can run the above script and see the successful export of the MLP fake model for the CPU, GPU, and CPU-GPU hybrid device configurations. No actual data is allocated for the model parameters or the example input during the export, thanks to the use of fake tensors.
```bash
$ python test_torch_fake_export.py
MLP export CPU succeeded.
ExportedProgram:
    class GraphModule(torch.nn.Module):
        def forward(self, p_fc1_weight: "f32[256, 128]", p_fc1_bias: "f32[256]", p_fc2_weight: "f32[10, 256]", p_fc2_bias: "f32[10]", x: "f32[4, 128]"):
            # File: /usr/local/lib/python3.12/dist-packages/torch/nn/modules/linear.py:134 in forward, code: return F.linear(input, self.weight, self.bias)
            linear: "f32[4, 256]" = torch.ops.aten.linear.default(x, p_fc1_weight, p_fc1_bias);  x = p_fc1_weight = p_fc1_bias = None

            # File: /usr/local/lib/python3.12/dist-packages/torch/nn/modules/activation.py:816 in forward, code: return F.gelu(input, approximate=self.approximate)
            gelu: "f32[4, 256]" = torch.ops.aten.gelu.default(linear);  linear = None

            # File: /mnt/test_torch_fake_export.py:33 in forward, code: h = h.to(self.fc2.weight.device)
            _assert_tensor_metadata_default = torch.ops.aten._assert_tensor_metadata.default(gelu, dtype = torch.float32, device = device(type='cpu'), layout = torch.strided);  _assert_tensor_metadata_default = None
            to: "f32[4, 256]" = torch.ops.aten.to.dtype_layout(gelu, dtype = torch.float32, layout = torch.strided, device = device(type='cpu'));  gelu = None

            # File: /usr/local/lib/python3.12/dist-packages/torch/nn/modules/linear.py:134 in forward, code: return F.linear(input, self.weight, self.bias)
            linear_1: "f32[4, 10]" = torch.ops.aten.linear.default(to, p_fc2_weight, p_fc2_bias);  to = p_fc2_weight = p_fc2_bias = None
            return (linear_1,)

Graph signature:
    # inputs
    p_fc1_weight: PARAMETER target='fc1.weight'
    p_fc1_bias: PARAMETER target='fc1.bias'
    p_fc2_weight: PARAMETER target='fc2.weight'
    p_fc2_bias: PARAMETER target='fc2.bias'
    x: USER_INPUT

    # outputs
    linear_1: USER_OUTPUT

Range constraints: {}

MLP export GPU succeeded.
ExportedProgram:
    class GraphModule(torch.nn.Module):
        def forward(self, p_fc1_weight: "f32[256, 128]", p_fc1_bias: "f32[256]", p_fc2_weight: "f32[10, 256]", p_fc2_bias: "f32[10]", x: "f32[4, 128]"):
            # File: /usr/local/lib/python3.12/dist-packages/torch/nn/modules/linear.py:134 in forward, code: return F.linear(input, self.weight, self.bias)
            linear: "f32[4, 256]" = torch.ops.aten.linear.default(x, p_fc1_weight, p_fc1_bias);  x = p_fc1_weight = p_fc1_bias = None

            # File: /usr/local/lib/python3.12/dist-packages/torch/nn/modules/activation.py:816 in forward, code: return F.gelu(input, approximate=self.approximate)
            gelu: "f32[4, 256]" = torch.ops.aten.gelu.default(linear);  linear = None

            # File: /mnt/test_torch_fake_export.py:33 in forward, code: h = h.to(self.fc2.weight.device)
            _assert_tensor_metadata_default = torch.ops.aten._assert_tensor_metadata.default(gelu, dtype = torch.float32, device = device(type='cuda', index=0), layout = torch.strided);  _assert_tensor_metadata_default = None
            to: "f32[4, 256]" = torch.ops.aten.to.dtype_layout(gelu, dtype = torch.float32, layout = torch.strided, device = device(type='cuda', index=0));  gelu = None

            # File: /usr/local/lib/python3.12/dist-packages/torch/nn/modules/linear.py:134 in forward, code: return F.linear(input, self.weight, self.bias)
            linear_1: "f32[4, 10]" = torch.ops.aten.linear.default(to, p_fc2_weight, p_fc2_bias);  to = p_fc2_weight = p_fc2_bias = None
            return (linear_1,)

Graph signature:
    # inputs
    p_fc1_weight: PARAMETER target='fc1.weight'
    p_fc1_bias: PARAMETER target='fc1.bias'
    p_fc2_weight: PARAMETER target='fc2.weight'
    p_fc2_bias: PARAMETER target='fc2.bias'
    x: USER_INPUT

    # outputs
    linear_1: USER_OUTPUT

Range constraints: {}

CPU-GPU hybrid export succeeded.
ExportedProgram:
    class GraphModule(torch.nn.Module):
        def forward(self, p_fc1_weight: "f32[256, 128]", p_fc1_bias: "f32[256]", p_fc2_weight: "f32[10, 256]", p_fc2_bias: "f32[10]", x: "f32[4, 128]"):
            # File: /usr/local/lib/python3.12/dist-packages/torch/nn/modules/linear.py:134 in forward, code: return F.linear(input, self.weight, self.bias)
            linear: "f32[4, 256]" = torch.ops.aten.linear.default(x, p_fc1_weight, p_fc1_bias);  x = p_fc1_weight = p_fc1_bias = None

            # File: /usr/local/lib/python3.12/dist-packages/torch/nn/modules/activation.py:816 in forward, code: return F.gelu(input, approximate=self.approximate)
            gelu: "f32[4, 256]" = torch.ops.aten.gelu.default(linear);  linear = None

            # File: /mnt/test_torch_fake_export.py:33 in forward, code: h = h.to(self.fc2.weight.device)
            _assert_tensor_metadata_default = torch.ops.aten._assert_tensor_metadata.default(gelu, dtype = torch.float32, device = device(type='cpu'), layout = torch.strided);  _assert_tensor_metadata_default = None
            to: "f32[4, 256]" = torch.ops.aten.to.dtype_layout(gelu, dtype = torch.float32, layout = torch.strided, device = device(type='cuda', index=0));  gelu = None

            # File: /usr/local/lib/python3.12/dist-packages/torch/nn/modules/linear.py:134 in forward, code: return F.linear(input, self.weight, self.bias)
            linear_1: "f32[4, 10]" = torch.ops.aten.linear.default(to, p_fc2_weight, p_fc2_bias);  to = p_fc2_weight = p_fc2_bias = None
            return (linear_1,)

Graph signature:
    # inputs
    p_fc1_weight: PARAMETER target='fc1.weight'
    p_fc1_bias: PARAMETER target='fc1.bias'
    p_fc2_weight: PARAMETER target='fc2.weight'
    p_fc2_bias: PARAMETER target='fc2.bias'
    x: USER_INPUT

    # outputs
    linear_1: USER_OUTPUT

Range constraints: {}
```