{"slug": "pytorch-fake-export", "title": "PyTorch Fake Export", "summary": "PyTorch introduced a \"fake export\" method that allows developers to verify the exportability of large deep learning models using `torch.export` APIs without requiring actual GPU memory. The approach uses fake tensors within `FakeTensorMode` to simulate model parameters on specific devices, enabling developers to test GPU export compatibility even when models are too large to fit on a single GPU. This technique addresses the fundamental difference between CPU and GPU exported programs, providing a practical verification path for multi-device model deployment.", "body_md": "# PyTorch Fake Export\n\n## Introduction\n\nThe PyTorch `torch.export` APIs produce a standardized, single-graph representation of a deep learning model designed for deployment in Python-less environments. Unlike other model export APIs and graph representations, such as the `torch.onnx.export` APIs and ONNX, a PyTorch exported program is not a pure single-device graph. Within one PyTorch exported program, the graph can contain operators on different devices, and even explicit device transfer operators. Consequently, the CPU exported program and the GPU exported program of the same PyTorch model differ.\n\nDeep learning models keep getting larger and more complex, and it is common for a model not to fit on a single GPU. The host, on the other hand, usually has enough CPU memory to fit and run one large model. So a natural question is: how can the developer verify that the large model being developed can be successfully exported to a GPU exported program using the `torch.export` APIs? Moving the model to CPU and running `torch.export` on CPU is not a valid way to verify this, even if the model runs successfully on CPU, because the CPU and GPU exported programs differ. 
To address this problem, PyTorch provides a way to construct a fake model whose parameters are fake tensors that live on specific devices, such as CPU or GPU, but hold no actual data. To verify the exportability of a large fake model, the developer can also use fake tensors as example inputs when running the `torch.export` APIs for tracing.\n\nIn this blog post, I would like to discuss how to run [PyTorch export](/blog/PyTorch-Custom-ONNX-Operator-Export/) on fake models with fake tensors to verify the `torch.export` compatibility of a large model.\n\n## PyTorch Fake Export\n\nIn the following example, we define a simple MLP model with two linear layers and a GELU activation in between. The first linear layer and the activation are placed on `fc1_device`, and the second linear layer is placed on `fc2_device`. When executing the model, the output of the activation is transferred to the device of the second linear layer before being fed into it.\n\nTo instantiate a fake model, we create the model inside `FakeTensorMode`. To specify the device placement of the model, we use the `torch.device` context manager when constructing the linear layers, instead of the PyTorch `to` API, because moving a model with `to` would require accessing the actual data of its tensors.\n\n```python\nimport torch\nimport torch.nn as nn\nfrom torch._subclasses.fake_tensor import FakeTensor, FakeTensorMode\n\n\nclass MLP(nn.Module):\n    \"\"\"MLP configurable across CPU, GPU, or a CPU-GPU hybrid split.\n\n    fc1 (+ GELU) is placed on *fc1_device*; fc2 is placed on *fc2_device*.\n    When the two devices differ the forward pass inserts an explicit device\n    transfer, preserved as an aten.to.dtype_layout node in the exported graph.\n    When they are the same the transfer is a no-op.\n    \"\"\"\n\n    def __init__(\n        self,\n        in_features: int,\n        hidden_features: int,\n        out_features: int,\n        fc1_device: torch.device = torch.device(\"cpu\"),\n        fc2_device: torch.device = torch.device(\"cpu\")\n    ) -> None:\n        super().__init__()\n        with torch.device(fc1_device):\n            self.fc1 = nn.Linear(in_features, hidden_features)\n            self.act = nn.GELU()\n        with torch.device(fc2_device):\n            self.fc2 = nn.Linear(hidden_features, out_features)\n\n    def forward(self, x: torch.Tensor) -> torch.Tensor:\n        h = self.act(self.fc1(x))\n        # Transfer to fc2's device (no-op when fc1 and fc2 share the same device).\n        h = h.to(self.fc2.weight.device)\n        return self.fc2(h)\n\n\ndef fake_export(fc1_device: torch.device,\n                fc2_device: torch.device) -> torch.export.ExportedProgram:\n    \"\"\"Export the MLP with fake tensors for the given device configuration.\n\n    Both parameters and the example input are fake tensors created inside\n    FakeTensorMode, so no real memory is allocated on either device.\n    \"\"\"\n    with FakeTensorMode():\n        model = MLP(in_features=128,\n                    hidden_features=256,\n                    out_features=10,\n                    fc1_device=fc1_device,\n                    fc2_device=fc2_device).eval()\n        assert all(isinstance(p, FakeTensor) for p in model.parameters(\n        )), \"Model parameters were unexpectedly materialized (not FakeTensor)\"\n        example_input = torch.randn(4, 128, device=fc1_device)\n\n    return torch.export.export(model, (example_input, ))\n\n\nif __name__ == \"__main__\":\n    cpu_device = torch.device(\"cpu\")\n    gpu_device = torch.device(\"cuda\")\n\n    # PyTorch export specializes the graph on the device configuration.\n    ep_cpu = fake_export(fc1_device=cpu_device, fc2_device=cpu_device)\n    print(\"MLP export (CPU) succeeded.\")\n    print(ep_cpu)\n\n    ep_gpu = fake_export(fc1_device=gpu_device, fc2_device=gpu_device)\n    print(\"MLP export (GPU) succeeded.\")\n    print(ep_gpu)\n\n    ep_hybrid = fake_export(fc1_device=cpu_device, fc2_device=gpu_device)\n    print(\"CPU-GPU hybrid export succeeded.\")\n    # The graph contains both cpu and cuda ops plus an explicit aten.to transfer node.\n    print(ep_hybrid)\n```\n\nUsing the NVIDIA NGC Docker container `nvcr.io/nvidia/pytorch:26.04-py3`, we can run the above script and see the MLP fake model export successfully for the CPU, GPU, and CPU-GPU hybrid device configurations. Thanks to the use of fake tensors, no actual data is allocated for the model parameters or the example input during the export.\n\n```bash\n$ python test_torch_fake_export.py\nMLP export (CPU) succeeded.\nExportedProgram:\n    class GraphModule(torch.nn.Module):\n        def forward(self, p_fc1_weight: \"f32[256, 128]\", p_fc1_bias: \"f32[256]\", p_fc2_weight: \"f32[10, 256]\", p_fc2_bias: \"f32[10]\", x: \"f32[4, 128]\"):\n            # File: /usr/local/lib/python3.12/dist-packages/torch/nn/modules/linear.py:134 in forward, code: return F.linear(input, self.weight, self.bias)\n            linear: \"f32[4, 256]\" = torch.ops.aten.linear.default(x, p_fc1_weight, p_fc1_bias);  x = p_fc1_weight = p_fc1_bias = None\n\n            # File: /usr/local/lib/python3.12/dist-packages/torch/nn/modules/activation.py:816 in forward, code: return F.gelu(input, approximate=self.approximate)\n            gelu: \"f32[4, 256]\" = torch.ops.aten.gelu.default(linear);  linear = None\n\n            # File: /mnt/test_torch_fake_export.py:33 in forward, code: h = h.to(self.fc2.weight.device)\n            
_assert_tensor_metadata_default = torch.ops.aten._assert_tensor_metadata.default(gelu, dtype = torch.float32, device = device(type='cpu'), layout = torch.strided);  _assert_tensor_metadata_default = None\n            to: \"f32[4, 256]\" = torch.ops.aten.to.dtype_layout(gelu, dtype = torch.float32, layout = torch.strided, device = device(type='cpu'));  gelu = None\n\n            # File: /usr/local/lib/python3.12/dist-packages/torch/nn/modules/linear.py:134 in forward, code: return F.linear(input, self.weight, self.bias)\n            linear_1: \"f32[4, 10]\" = torch.ops.aten.linear.default(to, p_fc2_weight, p_fc2_bias);  to = p_fc2_weight = p_fc2_bias = None\n            return (linear_1,)\n\nGraph signature:\n    # inputs\n    p_fc1_weight: PARAMETER target='fc1.weight'\n    p_fc1_bias: PARAMETER target='fc1.bias'\n    p_fc2_weight: PARAMETER target='fc2.weight'\n    p_fc2_bias: PARAMETER target='fc2.bias'\n    x: USER_INPUT\n\n    # outputs\n    linear_1: USER_OUTPUT\n\nRange constraints: {}\n\nMLP export (GPU) succeeded.\nExportedProgram:\n    class GraphModule(torch.nn.Module):\n        def forward(self, p_fc1_weight: \"f32[256, 128]\", p_fc1_bias: \"f32[256]\", p_fc2_weight: \"f32[10, 256]\", p_fc2_bias: \"f32[10]\", x: \"f32[4, 128]\"):\n            # File: /usr/local/lib/python3.12/dist-packages/torch/nn/modules/linear.py:134 in forward, code: return F.linear(input, self.weight, self.bias)\n            linear: \"f32[4, 256]\" = torch.ops.aten.linear.default(x, p_fc1_weight, p_fc1_bias);  x = p_fc1_weight = p_fc1_bias = None\n\n            # File: /usr/local/lib/python3.12/dist-packages/torch/nn/modules/activation.py:816 in forward, code: return F.gelu(input, approximate=self.approximate)\n            gelu: \"f32[4, 256]\" = torch.ops.aten.gelu.default(linear);  linear = None\n\n            # File: /mnt/test_torch_fake_export.py:33 in forward, code: h = h.to(self.fc2.weight.device)\n            _assert_tensor_metadata_default = torch.ops.aten._assert_tensor_metadata.default(gelu, dtype = torch.float32, device = device(type='cuda', index=0), layout = torch.strided);  _assert_tensor_metadata_default = None\n            to: \"f32[4, 256]\" = torch.ops.aten.to.dtype_layout(gelu, dtype = torch.float32, layout = torch.strided, device = device(type='cuda', index=0));  gelu = None\n\n            # File: /usr/local/lib/python3.12/dist-packages/torch/nn/modules/linear.py:134 in forward, code: return F.linear(input, self.weight, self.bias)\n            linear_1: \"f32[4, 10]\" = torch.ops.aten.linear.default(to, p_fc2_weight, p_fc2_bias);  to = p_fc2_weight = p_fc2_bias = None\n            return (linear_1,)\n\nGraph signature:\n    # inputs\n    p_fc1_weight: PARAMETER target='fc1.weight'\n    p_fc1_bias: PARAMETER target='fc1.bias'\n    p_fc2_weight: PARAMETER target='fc2.weight'\n    p_fc2_bias: PARAMETER target='fc2.bias'\n    x: USER_INPUT\n\n    # outputs\n    linear_1: USER_OUTPUT\n\nRange constraints: {}\n\nCPU-GPU hybrid export succeeded.\nExportedProgram:\n    class GraphModule(torch.nn.Module):\n        def forward(self, p_fc1_weight: \"f32[256, 128]\", p_fc1_bias: \"f32[256]\", p_fc2_weight: \"f32[10, 256]\", p_fc2_bias: \"f32[10]\", x: \"f32[4, 128]\"):\n            # File: /usr/local/lib/python3.12/dist-packages/torch/nn/modules/linear.py:134 in forward, code: return F.linear(input, self.weight, self.bias)\n            linear: \"f32[4, 256]\" = torch.ops.aten.linear.default(x, p_fc1_weight, p_fc1_bias);  x = p_fc1_weight = p_fc1_bias = None\n\n            # File: /usr/local/lib/python3.12/dist-packages/torch/nn/modules/activation.py:816 in forward, code: return F.gelu(input, approximate=self.approximate)\n            gelu: \"f32[4, 256]\" = torch.ops.aten.gelu.default(linear);  linear = None\n\n            # File: /mnt/test_torch_fake_export.py:33 in forward, code: h = h.to(self.fc2.weight.device)\n            _assert_tensor_metadata_default = torch.ops.aten._assert_tensor_metadata.default(gelu, dtype = torch.float32, device = device(type='cpu'), layout = torch.strided);  _assert_tensor_metadata_default = None\n            to: \"f32[4, 256]\" = torch.ops.aten.to.dtype_layout(gelu, dtype = torch.float32, layout = torch.strided, device = device(type='cuda', index=0));  gelu = None\n\n            # File: /usr/local/lib/python3.12/dist-packages/torch/nn/modules/linear.py:134 in forward, code: return F.linear(input, self.weight, self.bias)\n            linear_1: \"f32[4, 10]\" = torch.ops.aten.linear.default(to, p_fc2_weight, p_fc2_bias);  to = p_fc2_weight = p_fc2_bias = None\n            return (linear_1,)\n\nGraph signature:\n    # inputs\n    p_fc1_weight: PARAMETER target='fc1.weight'\n    p_fc1_bias: PARAMETER target='fc1.bias'\n    p_fc2_weight: PARAMETER target='fc2.weight'\n    p_fc2_bias: PARAMETER target='fc2.bias'\n    x: USER_INPUT\n\n    # outputs\n    linear_1: USER_OUTPUT\n\nRange constraints: {}\n```\n", "url": "https://wpnews.pro/news/pytorch-fake-export", "canonical_source": "https://leimao.github.io/blog/PyTorch-Fake-Export/", "published_at": "2026-05-17 07:00:00+00:00", "updated_at": "2026-06-06 22:37:53.381846+00:00", "lang": "en", "topics": ["machine-learning", "neural-networks", "ai-tools", "ai-infrastructure"], "entities": ["PyTorch", "ONNX"], "alternates": {"html": "https://wpnews.pro/news/pytorch-fake-export", "markdown": "https://wpnews.pro/news/pytorch-fake-export.md", "text": "https://wpnews.pro/news/pytorch-fake-export.txt", "jsonld": "https://wpnews.pro/news/pytorch-fake-export.jsonld"}}