PyTorch Custom Operation

PyTorch users can now implement custom operations in C++ and CUDA for use in both Python and C++ inference programs, with automatic device dispatch between CPU and CUDA implementations. The approach supports both stateless custom functions registered via `TORCH_LIBRARY_IMPL` and stateful custom classes using `torch::CustomClassHolder` that can hold parameters and be embedded in `torch.nn.Module` models. This enables developers to create high-performance custom operations that work seamlessly with PyTorch's AOTInductor compiled inference pipeline.
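
To make the idea of device dispatch concrete, here is a toy, PyTorch-free sketch of a kernel table keyed by operator name and device type. It is only an illustration under stated assumptions: `register_impl`, `dispatch`, and the list-based "tensors" are hypothetical names invented for this sketch, and PyTorch's real dispatcher is considerably more involved.

```python
from typing import Callable, Dict, List, Tuple

# Toy kernel table keyed by (operator name, device type).
# Hypothetical illustration only; this is not PyTorch's dispatcher.
_KERNELS: Dict[Tuple[str, str], Callable[[List[float]], List[float]]] = {}


def register_impl(op_name: str, device: str):
    """Register a kernel for one (operator, device) pair."""

    def decorator(fn):
        _KERNELS[(op_name, device)] = fn
        return fn

    return decorator


@register_impl("my_ops::identity_conv_op", "cpu")
def identity_cpu(values: List[float]) -> List[float]:
    # Stand-in for the CPU implementation: return a copy of the input.
    return list(values)


@register_impl("my_ops::identity_conv_op", "cuda")
def identity_cuda(values: List[float]) -> List[float]:
    # Stand-in for a CUDA kernel launch: copy element by element.
    return [v for v in values]


def dispatch(op_name: str, device: str, values: List[float]) -> List[float]:
    """Pick the kernel that matches the device of the input and call it."""
    kernel = _KERNELS.get((op_name, device))
    if kernel is None:
        raise NotImplementedError(f"{op_name} has no kernel for device '{device}'")
    return kernel(values)


result = dispatch("my_ops::identity_conv_op", "cpu", [1.0, 2.0, 3.0])
```

The caller never names a kernel directly. It names the operator, and the table lookup on the device selects the implementation, which is the same division of labour that the registration macros in the sections below set up.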

Introduction

Custom operations are common in PyTorch models. PyTorch custom operations can be custom classes and custom functions implemented in C++ and CUDA and used in both Python and C++ inference programs. In this blog post, I would like to share how to implement PyTorch custom operations in C++ and CUDA, and how to use them in PyTorch models and AOTInductor compiled inference programs, using a simple identity convolution example (https://github.com/leimao/AOTInductor-Custom-Operator-Example).

PyTorch Custom Function

PyTorch custom functions can be implemented in C++ and CUDA and registered using the `TORCH_LIBRARY_IMPL` macro. Both CPU and CUDA implementations can be provided, and PyTorch will dispatch to the correct implementation based on the device of the input tensors.

```cpp
// ---------------------------------------------------------------------------
// CPU implementation: plain element-wise copy via clone().
// ---------------------------------------------------------------------------
torch::Tensor identity_conv_cpu_impl(const torch::Tensor& input)
{
    TORCH_CHECK(!input.is_cuda(),
                "identity_conv_cpu_impl: input must be a CPU tensor");
    return input.clone();
}

// ---------------------------------------------------------------------------
// Host-side dispatcher.
// ---------------------------------------------------------------------------
torch::Tensor identity_conv_cuda_impl(const torch::Tensor& input)
{
    TORCH_CHECK(input.is_cuda(),
                "identity_conv_cuda_impl: input must be a CUDA tensor");

    // Output has the same shape, dtype, and strides as input.
    auto output = torch::empty_like(input);

    const int64_t numel = input.numel();
    if (numel == 0)
        return output;

    // Upload shape and strides to the device so the kernel can read them.
    const int ndim = input.dim();
    const auto opts =
        torch::TensorOptions().dtype(torch::kInt64).device(input.device());
    const auto shape_dev = torch::tensor(
        std::vector<int64_t>(input.sizes().begin(), input.sizes().end()), opts);
    const auto strides_dev = torch::tensor(
        std::vector<int64_t>(input.strides().begin(), input.strides().end()),
        opts);

    constexpr int kThreads = 256;
    const int blocks = static_cast<int>((numel + kThreads - 1) / kThreads);

    AT_DISPATCH_FLOATING_TYPES_AND2(
        at::ScalarType::Half, at::ScalarType::BFloat16, input.scalar_type(),
        "identity_conv_cuda_impl",
        [&]
        {
            identity_kernel<scalar_t><<<blocks, kThreads>>>(
                input.data_ptr<scalar_t>(), output.data_ptr<scalar_t>(),
                shape_dev.data_ptr<int64_t>(), strides_dev.data_ptr<int64_t>(),
                ndim, numel);
        });

    C10_CUDA_KERNEL_LAUNCH_CHECK();
    return output;
}
```

```cpp
// CUDA kernel implementation for my_ops::identity_conv_op.
TORCH_LIBRARY_IMPL(my_ops, CUDA, m)
{
    m.impl("identity_conv_op", identity_conv_cuda_impl);
}

// CPU fallback.
TORCH_LIBRARY_IMPL(my_ops, CPU, m)
{
    m.impl("identity_conv_op", identity_conv_cpu_impl);
}
```

PyTorch Custom Class

PyTorch custom functions are stateless and cannot hold any parameters. If we would like to implement a custom class that holds some parameters and has a `forward` method that can be called from Python, we can use `torch::CustomClassHolder` to define a custom class in C++ and register it with the `TORCH_LIBRARY` macro.

```cpp
// ---------------------------------------------------------------------------
// IdentityConvClass
//
// A custom class registered with torch.classes so that it can be embedded
// in a torch.nn.Module, exported with torch.export, and compiled with
// AOTInductor.
//
// The forward() method delegates to the CUDA identity kernel. The
// channels_ field is preserved for semantic completeness and is serialised
// via def_pickle so that the class survives export/import round-trips.
// ---------------------------------------------------------------------------
struct IdentityConvClass : torch::CustomClassHolder
{
    int64_t channels_;

    explicit IdentityConvClass(int64_t channels) : channels_(channels) {}

    torch::Tensor forward(const torch::Tensor& x)
    {
        return x.is_cuda() ? identity_conv_cuda_impl(x)
                           : identity_conv_cpu_impl(x);
    }

    int64_t get_channels() const { return channels_; }
};
```

```cpp
// ---------------------------------------------------------------------------
// Operator / class registration
//
// This file has no pybind11 dependency and is compiled into
// libidentity_conv_ops.so, which can be dlopen'd by a pure C++ binary
// without needing libpython.
// ---------------------------------------------------------------------------
TORCH_LIBRARY(my_ops, m)
{
    // Register IdentityConvClass so Python can instantiate it as
    // torch.classes.my_ops.IdentityConvClass(channels).
    m.class_<IdentityConvClass>("IdentityConvClass")
        .def(torch::init<int64_t>())
        .def("forward", &IdentityConvClass::forward)
        .def("get_channels", &IdentityConvClass::get_channels)
        // __obj_flatten__ is called by torch.export's non-strict tracer on
        // the real C++ object before it switches to FakeTensor mode.
        // Must return a tuple of (str, value) pair-tuples so that
        // _check_valid_flat_script_obj passes (it checks isinstance(item,
        // tuple) for every element in the flat sequence). We encode channels
        // as a single named entry; there are no tensor leaves.
        .def("__obj_flatten__",
             [](const c10::intrusive_ptr<IdentityConvClass>& self)
             {
                 return std::make_tuple(std::make_tuple(
                     std::string("channels"), self->channels_));
             })
        // def_pickle enables TorchScript serialisation.
        .def_pickle(
            [](const c10::intrusive_ptr<IdentityConvClass>& self) -> int64_t
            { return self->channels_; },
            [](int64_t channels) -> c10::intrusive_ptr<IdentityConvClass>
            { return c10::make_intrusive<IdentityConvClass>(channels); });

    // Register the schema for identity_conv_op.
    m.def("identity_conv_op(Tensor x) -> Tensor");
}
```

Using Custom Operations and Classes In PyTorch

The PyTorch custom classes, functions, and their registrations in C++ are built into a shared library `libidentity_conv_ops.so` that can be loaded and registered in PyTorch using `torch.ops.load_library`. For `torch.compile` and `torch.export` compatibility, we also need to register “fake” abstract versions of the custom classes and functions in PyTorch using `@register_fake_class` and `@torch.library.register_fake` so that the FakeTensor-based symbolic tracing can work correctly without having to execute the actual C++/CUDA code during tracing.

```python
"""
custom_ops.py
=============
Loads the C++ / CUDA shared library and sets up all custom PyTorch operations
used by the IdentityModel:

  1. torch.classes.my_ops.IdentityConvClass  (registered by the shared library)
       - A fake/abstract version is registered here so that torch.export can
         trace through module attributes that hold an instance of this class.

  2. my_ops::identity_conv_op  (schema + CPU + CUDA registered by the shared
     library)
       - register_fake: abstract implementation for torch.export / FakeTensor.
"""

import os

import torch
import torch.library

# ---------------------------------------------------------------------------
# 1. Load the C++ / CUDA shared library.
#
# This triggers the TORCH_LIBRARY(my_ops, ...) static initialiser which
# registers torch.classes.my_ops.IdentityConvClass into PyTorch's global
# operator registry.
#
# The library path can be overridden via the IDENTITY_CONV_OPS_LIB environment
# variable; it defaults to ../ext/libidentity_conv_ops.so relative to this
# file.
# ---------------------------------------------------------------------------
_default_lib = os.path.join(
    os.path.dirname(os.path.abspath(__file__)),
    "..",
    "ext",
    "libidentity_conv_ops.so",
)
_lib_path = os.path.abspath(
    os.environ.get("IDENTITY_CONV_OPS_LIB", _default_lib)
)
torch.ops.load_library(_lib_path)

# ---------------------------------------------------------------------------
# 2. Register a "fake" (abstract) version of IdentityConvClass for
#    torch.export tracing.
#
# torch.export uses FakeTensor-based symbolic tracing. When it encounters a
# custom-class attribute on a module it looks for:
#   • __obj_flatten__    - returns (leaves, context) for pytree flattening
#   • __obj_unflatten__  - reconstructs the object from (leaves, context)
#
# These are provided by the @register_fake_class-decorated Python class.
# ---------------------------------------------------------------------------
from torch._library.fake_class_registry import register_fake_class


@register_fake_class("my_ops::IdentityConvClass")
class FakeIdentityConvClass:
    """Abstract counterpart of IdentityConvClass used during torch.export."""

    def __init__(self, channels: int) -> None:
        self.channels = channels

    # -- pytree protocol required by torch.export ----------------------------

    def __obj_flatten__(self):
        # Must return a tuple of (str, value) pair-tuples, matching the C++
        # __obj_flatten__ which returns (("channels", N),).
        return (("channels", self.channels),)

    @classmethod
    def __obj_unflatten__(cls, flat):
        # flat is the (possibly tensor-fakified) sequence of (key, value)
        # pairs produced by maybe_to_fake_obj. Reconstruct from it.
        return cls(dict(flat)["channels"])

    # -- abstract method implementations (operate on FakeTensors) ------------

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Shape / dtype mirrors the input - correct abstract behaviour.
        return torch.empty_like(x)

    def get_channels(self) -> int:
        return self.channels


# ---------------------------------------------------------------------------
# 3. Register the fake (abstract) implementation of identity_conv_op for
#    torch.export tracing.
#
# The schema and both implementations (CUDA and CPU) are already registered
# by the C++ extension via TORCH_LIBRARY / TORCH_LIBRARY_IMPL. Python only
# needs to provide the abstract shape/dtype computation so that torch.export's
# FakeTensor interpreter can trace through the op.
# ---------------------------------------------------------------------------
@torch.library.register_fake("my_ops::identity_conv_op")
def _identity_conv_op_fake(x: torch.Tensor) -> torch.Tensor:
    """Abstract implementation used by torch.export / FakeTensor tracing."""
    return torch.empty_like(x)


# Convenience alias so other modules can do:
#   from custom_ops import identity_conv_op
identity_conv_op = torch.ops.my_ops.identity_conv_op
```

PyTorch custom classes can be loaded using `torch.classes` and PyTorch custom functions can be loaded using `torch.ops` after the shared library is loaded.
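
The flatten/unflatten protocol that the fake class in `custom_ops.py` implements can be exercised without PyTorch. The following is a minimal sketch, assuming only what the comments in the source describe: the flat form is a tuple of `(str, value)` pair-tuples and the object is rebuilt from it. `FakeIdentityConvSketch` and `round_trip` are hypothetical names for this illustration, and the structural check is a simplified stand-in for the validation PyTorch performs internally.

```python
from typing import Any, Sequence, Tuple


class FakeIdentityConvSketch:
    """Torch-free stand-in that mirrors the fake class's pytree protocol."""

    def __init__(self, channels: int) -> None:
        self.channels = channels

    def __obj_flatten__(self) -> Tuple[Tuple[str, Any], ...]:
        # One named entry, no tensor leaves.
        return (("channels", self.channels),)

    @classmethod
    def __obj_unflatten__(
        cls, flat: Sequence[Tuple[str, Any]]
    ) -> "FakeIdentityConvSketch":
        # Rebuild the object from the (key, value) pairs.
        return cls(dict(flat)["channels"])


def round_trip(obj: FakeIdentityConvSketch) -> FakeIdentityConvSketch:
    """Flatten an object, sanity-check the flat form, and rebuild it."""
    flat = obj.__obj_flatten__()
    # Simplified structural check: every item is a (str, value) tuple.
    for item in flat:
        if not (isinstance(item, tuple) and len(item) == 2
                and isinstance(item[0], str)):
            raise TypeError(f"invalid flat entry: {item!r}")
    return type(obj).__obj_unflatten__(flat)


restored = round_trip(FakeIdentityConvSketch(channels=3))
print(restored.channels)  # prints 3
```

If the C++ `__obj_flatten__` and the Python `__obj_unflatten__` disagree on the key name or on the nesting of the tuples, the rebuild step is where the mismatch would surface, so a round-trip check of this kind is a cheap way to catch it early.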
```python
"""
model.py
========
Defines the four-layer IdentityModel used in the AOTInductor demo.

Layer layout
------------
  layer1 : IdentityConv             - native PyTorch operators
  layer2 : IdentityConvCustomClass  - torch.classes C++/CUDA custom class
  layer3 : IdentityConvCustomOp     - torch.ops C++/CUDA custom op
  layer4 : IdentityConv             - native PyTorch operators

Every layer is an identity transformation, so model(x) == x for any input x.
"""

import torch
import torch.nn as nn

# Importing custom_ops registers the C++ extension, the fake class, and the
# custom op - this must happen before any model is instantiated.
from custom_ops import identity_conv_op  # noqa: F401


# ---------------------------------------------------------------------------
# Layer 1 / 4 - native PyTorch depthwise 1×1 convolution (identity weights)
# ---------------------------------------------------------------------------
class IdentityConv(nn.Module):
    """Identity convolution implemented with native PyTorch operators.

    Uses a depthwise Conv2d with kernel_size=1 and weight=1.0, which is
    equivalent to a no-op (output == input). This layer is compatible with
    torch.export and AOTInductor out of the box.
    """

    def __init__(self, channels: int) -> None:
        super().__init__()
        self.conv = nn.Conv2d(
            in_channels=channels,
            out_channels=channels,
            kernel_size=(1, 1),
            stride=(1, 1),
            padding=(0, 0),
            dilation=(1, 1),
            groups=channels,
            bias=False,
        )
        # Set all weights to 1.0 so that the convolution acts as identity.
        self.conv.weight.data = torch.ones(channels, 1, 1, 1)
        # Freeze the weights - they are constants, not learnable parameters.
        self.conv.weight.requires_grad = False

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv(x)


# ---------------------------------------------------------------------------
# Layer 2 - custom C++/CUDA class via torch.classes
# ---------------------------------------------------------------------------
class IdentityConvCustomClass(nn.Module):
    """Identity convolution backed by a torch.classes C++/CUDA custom class.

    At runtime the forward call is dispatched to the CUDA kernel registered
    inside IdentityConvClass (csrc/identity_conv.cpp + .cu). For torch.export
    compatibility a FakeIdentityConvClass is registered in custom_ops.py via
    @register_fake_class so that symbolic tracing works.
    """

    def __init__(self, channels: int) -> None:
        super().__init__()
        self.obj = torch.classes.my_ops.IdentityConvClass(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.obj.forward(x)


# ---------------------------------------------------------------------------
# Layer 3 - custom C++/CUDA op via torch.ops
# ---------------------------------------------------------------------------
class IdentityConvCustomOp(nn.Module):
    """Identity convolution backed by the my_ops::identity_conv_op custom op.

    The schema and the CPU/CUDA kernels of my_ops::identity_conv_op are
    registered by the C++ extension; custom_ops.py adds:
      • a register_fake implementation for torch.export tracing
    """

    def __init__(self, channels: int) -> None:
        super().__init__()
        self.channels = channels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return identity_conv_op(x)


# ---------------------------------------------------------------------------
# Full model
# ---------------------------------------------------------------------------
class IdentityModel(nn.Module):
    """Four-layer identity model for AOTInductor demo."""

    def __init__(self, channels: int) -> None:
        super().__init__()
        self.layer1 = IdentityConv(channels)
        self.layer2 = IdentityConvCustomClass(channels)
        self.layer3 = IdentityConvCustomOp(channels)
        self.layer4 = IdentityConv(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)
        return x


def create_model(channels: int = 3) -> IdentityModel:
    """Return an IdentityModel in eval mode on the default CUDA device."""
    return IdentityModel(channels=channels).cuda().eval()
```

PyTorch Model Export and Lowering

The PyTorch model using custom classes and custom functions can be exported with `torch.export` if fake abstract versions of all custom classes and functions are registered for `torch.export` symbolic tracing.

```python
#!/usr/bin/env python3
"""
export_compile.py
=================
Exports the IdentityModel with torch.export and compiles it with
torch._inductor.aoti_compile_and_package.

The resulting package (model.pt2) is written to the artifacts/ directory
and can be loaded by both run_inference.py (Python) and the C++ inference
binary.

Usage (run from the python/ directory):
    python export_compile.py
"""

import os
import sys

# Ensure the python/ directory is on the path so that local modules are found.
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))

import torch

# Importing custom_ops loads the C++ extension and registers all custom ops.
import custom_ops  # noqa: F401
from model import create_model

# ---------------------------------------------------------------------------
# Configuration
# ---------------------------------------------------------------------------
CHANNELS = 3
BATCH_SIZE = 1
HEIGHT = 224
WIDTH = 224

# Save the compiled package in the artifacts/ directory at the project root.
PACKAGE_PATH = os.path.join(
    os.path.dirname(os.path.abspath(__file__)),
    "..",
    "artifacts",
    "model.pt2",
)


def main() -> None:
    print("=" * 64)
    print("AOTInductor - Export & Compile")
    print("=" * 64)

    # ------------------------------------------------------------------
    # Step 1: Instantiate model and verify correctness on eager execution
    # ------------------------------------------------------------------
    print(f"\n[1/4] Creating IdentityModel (channels={CHANNELS}) ...")
    model = create_model(channels=CHANNELS)

    x = torch.randn(
        BATCH_SIZE, CHANNELS, HEIGHT, WIDTH, device="cuda", dtype=torch.float32
    )

    with torch.no_grad():
        out = model(x)

    assert torch.equal(x, out), (
        f"Eager pre-export check FAILED "
        f"(max diff = {(x - out).abs().max().item():.2e})"
    )
    print("      Eager verification PASSED (bitwise identical)")

    # ------------------------------------------------------------------
    # Step 2: Export with torch.export
    # ------------------------------------------------------------------
    print("\n[2/4] Exporting model with torch.export.export ...")
    with torch.no_grad():
        exported_program = torch.export.export(
            model,
            (x,),
        )
    print("      Export DONE")
    print(f"\n      Exported graph:\n{exported_program.graph}")

    # ------------------------------------------------------------------
    # Step 3: Compile with AOTInductor
    # ------------------------------------------------------------------
    print("\n[3/4] Compiling with torch._inductor.aoti_compile_and_package ...")
    package_path = torch._inductor.aoti_compile_and_package(
        exported_program,
        package_path=PACKAGE_PATH,
    )
    print("      Compilation DONE")
    print(f"      Package saved to: {os.path.abspath(package_path)}")

    # ------------------------------------------------------------------
    # Step 4: Quick sanity check - load the package and run inference
    # ------------------------------------------------------------------
    print("\n[4/4] Quick sanity check: loading package and running inference ...")
    compiled_model = torch._inductor.aoti_load_package(package_path)

    with torch.no_grad():
        out_compiled = compiled_model(x)

    # aoti_load_package returns a callable whose output is a list of tensors.
    if isinstance(out_compiled, (list, tuple)):
        out_compiled = out_compiled[0]

    assert torch.equal(x, out_compiled), (
        f"Compiled model sanity check FAILED "
        f"(max diff = {(x - out_compiled).abs().max().item():.2e})"
    )
    print("      Compiled model verification PASSED (bitwise identical)")

    print("\n" + "=" * 64)
    print(f"SUCCESS  Package: {os.path.abspath(package_path)}")
    print("=" * 64)


if __name__ == "__main__":
    main()
```

From the exported graph we can see that the custom class method `IdentityConvClass.forward` is represented as a call to `torch.ops.higher_order.call_torchbind`. The custom op `identity_conv_op` is represented as a call to `torch.ops.my_ops.identity_conv_op`.
```
graph():
    %p_layer1_conv_weight : [num_users=1] = placeholder[target=p_layer1_conv_weight]
    %p_layer4_conv_weight : [num_users=1] = placeholder[target=p_layer4_conv_weight]
    %obj_layer2_obj : [num_users=1] = placeholder[target=obj_layer2_obj]
    %x : [num_users=1] = placeholder[target=x]
    %conv2d : [num_users=1] = call_function[target=torch.ops.aten.conv2d.default](args = (%x, %p_layer1_conv_weight, None, [1, 1], [0, 0], [1, 1], 3), kwargs = {})
    %call_torchbind : [num_users=1] = call_function[target=torch.ops.higher_order.call_torchbind](args = (%obj_layer2_obj, forward, %conv2d), kwargs = {})
    %identity_conv_op : [num_users=1] = call_function[target=torch.ops.my_ops.identity_conv_op.default](args = (%call_torchbind,), kwargs = {})
    %conv2d_1 : [num_users=1] = call_function[target=torch.ops.aten.conv2d.default](args = (%identity_conv_op, %p_layer4_conv_weight, None, [1, 1], [0, 0], [1, 1], 3), kwargs = {})
    return (conv2d_1,)
```

The exported program can be compiled and packaged with `torch._inductor.aoti_compile_and_package` to produce a `model.pt2` package that can be loaded by both Python and C++ inference programs. The custom class and custom op implementations will be loaded from the shared library and correctly dispatched at runtime when the compiled model is executed.

```python
#!/usr/bin/env python3
"""
run_inference.py
================
Loads the AOTInductor-compiled IdentityModel package (model.pt2) and runs
inference to verify correctness.

The output of the identity model must equal the input within a tight
floating-point tolerance.

Usage (run from the python/ directory after export_compile.py):
    python run_inference.py [MODEL_PATH] [OP_LIB_PATH]

Arguments:
    MODEL_PATH   Path to the compiled model package (.pt2).
                 Defaults to ../artifacts/model.pt2 relative to this script.
    OP_LIB_PATH  Path to the custom-op shared library (.so).
                 When provided the library path is forwarded to custom_ops.py
                 via the IDENTITY_CONV_OPS_LIB environment variable so that
                 torch.ops.load_library uses that file instead of the default
                 ../ext/libidentity_conv_ops.so.
"""

import os
import sys

# Ensure the python/ directory is on the path so that local modules are found.
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))

import torch
import torch._inductor.codecache  # required before aoti_load_package

# ---------------------------------------------------------------------------
# Parse CLI arguments
# ---------------------------------------------------------------------------
_DEFAULT_PACKAGE_PATH = os.path.join(
    os.path.dirname(os.path.abspath(__file__)),
    "..",
    "artifacts",
    "model.pt2",
)
PACKAGE_PATH = sys.argv[1] if len(sys.argv) > 1 else _DEFAULT_PACKAGE_PATH
OP_LIB_PATH = sys.argv[2] if len(sys.argv) > 2 else None

# ---------------------------------------------------------------------------
# If an explicit library path was given, pass it to custom_ops.py via an
# environment variable so that torch.ops.load_library uses that file.
# ---------------------------------------------------------------------------
if OP_LIB_PATH is not None:
    os.environ["IDENTITY_CONV_OPS_LIB"] = os.path.abspath(OP_LIB_PATH)

# Importing custom_ops loads the shared library and registers all custom ops
# BEFORE the compiled model is loaded.
import custom_ops  # noqa: F401

# ---------------------------------------------------------------------------
# Configuration - must match the values used in export_compile.py
# ---------------------------------------------------------------------------
CHANNELS = 3
BATCH_SIZE = 1
HEIGHT = 224
WIDTH = 224


def main() -> None:
    print("=" * 64)
    print("AOTInductor - Python Inference")
    print("=" * 64)

    # ------------------------------------------------------------------
    # Step 1: Load the compiled model package
    # ------------------------------------------------------------------
    pkg = os.path.abspath(PACKAGE_PATH)
    if OP_LIB_PATH is not None:
        print(f"      Op library : {os.path.abspath(OP_LIB_PATH)}")
    print(f"\n[1/3] Loading compiled model from:\n      {pkg}")
    compiled_model = torch._inductor.aoti_load_package(pkg)
    print("      Model loaded successfully.")

    # ------------------------------------------------------------------
    # Step 2: Prepare input
    # ------------------------------------------------------------------
    x = torch.randn(
        BATCH_SIZE, CHANNELS, HEIGHT, WIDTH, device="cuda", dtype=torch.float32
    )
    print(
        f"\n[2/3] Input  shape={list(x.shape)}  dtype={x.dtype}  "
        f"device={x.device}"
    )

    # ------------------------------------------------------------------
    # Step 3: Run inference and verify
    # ------------------------------------------------------------------
    print("\n[3/3] Running inference ...")
    with torch.no_grad():
        output = compiled_model(x)

    # aoti_load_package returns a callable whose output is a list of tensors.
    if isinstance(output, (list, tuple)):
        output = output[0]

    print(f"      Output shape={list(output.shape)}  dtype={output.dtype}")

    if torch.equal(x, output):
        print("\n      Verification PASSED (bitwise identical)")
    else:
        max_diff = (x - output).abs().max().item()
        print(
            f"\n      Verification FAILED (max diff = {max_diff})"
            f" - expected bitwise identical output"
        )
        sys.exit(1)

    print("\n" + "=" * 64)
    print("SUCCESS  AOTInductor Python inference verified.")
    print("=" * 64)


if __name__ == "__main__":
    main()
```

The custom class and custom function shared library loading and registration can be performed using `dlopen` in a pure C++ inference program without any pybind11 or libpython dependency.

```cpp
/*
 * main.cpp
 * ========
 * C++ inference program for the AOTInductor-compiled IdentityModel.
 *
 * Prerequisites
 * -------------
 *   • Build libidentity_conv_ops.so (top-level CMakeLists.txt) first.
 *   • Run export_compile.py to produce model.pt2.
 *
 * The custom operator library (libidentity_conv_ops.so) must be loaded before
 * the compiled model is executed so that torch.classes.my_ops and
 * my_ops::identity_conv_op are present in the operator registry.
 *
 * libidentity_conv_ops.so has no pybind11 dependency and does not link
 * libtorch_python.so, so no libpython pre-loading is required.
 *
 * Usage
 * -----
 *   ./run_inference <path/to/model.pt2> <path/to/libidentity_conv_ops.so>
 *
 * Verification
 * ------------
 * The model is an identity transform, so output should equal the random input
 * within floating-point rounding tolerance (< 1e-5).
 */

#include <cstdlib>
#include <iostream>
#include <stdexcept>
#include <string>
#include <vector>

#include <dlfcn.h> // dlopen / dlclose

#include <torch/csrc/inductor/aoti_package/model_package_loader.h>
#include <torch/torch.h>

// ---------------------------------------------------------------------------
// Hyper-parameters - must match the values used in export_compile.py
// ---------------------------------------------------------------------------
static constexpr int64_t kBatchSize = 1;
static constexpr int64_t kChannels = 3;
static constexpr int64_t kHeight = 224;
static constexpr int64_t kWidth = 224;

int main(int argc, char* argv[])
{
    if (argc < 3)
    {
        std::cerr << "Usage: " << argv[0] << " <path/to/model.pt2>"
                  << " <path/to/libidentity_conv_ops.so>" << std::endl;
        return EXIT_FAILURE;
    }

    const std::string model_path = argv[1];
    const std::string custom_op_lib = argv[2];

    std::cout << "================================================\n"
              << "AOTInductor - C++ Inference\n"
              << "================================================\n";

    try
    {
        // ------------------------------------------------------------------
        // Step 1: Load the custom operator shared library.
        //
        // libidentity_conv_ops.so contains only TORCH_LIBRARY registrations
        // and the CPU/CUDA kernels. It has no pybind11 dependency and does
        // not link libtorch_python.so, so no libpython pre-loading is needed.
        // ------------------------------------------------------------------
        std::cout << "\n[1/4] Loading custom op library:\n      "
                  << custom_op_lib << std::endl;

        void* lib_handle =
            dlopen(custom_op_lib.c_str(), RTLD_NOW | RTLD_GLOBAL);
        if (!lib_handle)
        {
            throw std::runtime_error(std::string("dlopen failed: ") +
                                     dlerror());
        }
        std::cout << "      Library loaded." << std::endl;

        // ------------------------------------------------------------------
        // Step 2: Load the compiled model package.
        //
        // AOTIModelPackageLoader unpacks the .pt2 archive and prepares the
        // AOTIModelContainerRunner for the target device.
        // ------------------------------------------------------------------
        std::cout << "\n[2/4] Loading model package:\n      " << model_path
                  << std::endl;

        torch::inductor::AOTIModelPackageLoader loader(model_path);
        auto* runner = loader.get_runner();
        std::cout << "      Model loaded." << std::endl;

        // ------------------------------------------------------------------
        // Step 3: Prepare a random input tensor on CUDA.
        // ------------------------------------------------------------------
        auto options = torch::TensorOptions()
                           .dtype(torch::kFloat32)
                           .device(torch::kCUDA, 0);
        auto input =
            torch::randn({kBatchSize, kChannels, kHeight, kWidth}, options);

        std::cout << "\n[3/4] Input  shape=[" << kBatchSize << ", "
                  << kChannels << ", " << kHeight << ", " << kWidth << "]"
                  << "  dtype=float32  device=cuda" << std::endl;

        // ------------------------------------------------------------------
        // Step 4: Run inference and verify correctness.
        // ------------------------------------------------------------------
        std::cout << "\n[4/4] Running inference ..." << std::endl;

        std::vector<at::Tensor> inputs = {input};
        auto outputs = runner->run(inputs);

        const auto& output = outputs[0];
        bool passed = input.equal(output);
        float max_diff = (input - output).abs().max().item<float>();

        std::cout << "      Output shape=[" << output.size(0) << ", "
                  << output.size(1) << ", " << output.size(2) << ", "
                  << output.size(3) << "]" << std::endl;
        std::cout << "      Max |input - output| = " << max_diff << std::endl;

        dlclose(lib_handle);

        if (passed)
        {
            std::cout << "\n      Verification PASSED (bitwise identical)"
                      << std::endl;
        }
        else
        {
            std::cerr << "\n      Verification FAILED (max diff = " << max_diff
                      << ")" << std::endl;
            return EXIT_FAILURE;
        }
    }
    catch (const std::exception& e)
    {
        std::cerr << "\nError: " << e.what() << std::endl;
        return EXIT_FAILURE;
    }

    std::cout << "\n================================================\n"
              << "SUCCESS  AOTInductor C++ inference verified.\n"
              << "================================================\n";

    return EXIT_SUCCESS;
}
```
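
The `dlopen` step in `main.cpp` uses the flags `RTLD_NOW | RTLD_GLOBAL` and turns a failed load into an error message. The same flags are exposed by Python's standard library on Unix-like systems, so the load-and-report pattern can be sketched in a few lines. This is only an illustration: `try_dlopen` is a hypothetical helper, and actually loading `libidentity_conv_ops.so` this way assumes that the dynamic loader can locate its dependencies.

```python
import ctypes
import os
from typing import Tuple


def try_dlopen(path: str) -> Tuple[bool, str]:
    """Try to load a shared library with RTLD_NOW | RTLD_GLOBAL.

    Returns (True, "library loaded") on success, or (False, message) when the
    loader reports an error, mirroring the check performed in main.cpp.
    """
    mode = os.RTLD_NOW | os.RTLD_GLOBAL  # available on Unix-like systems
    try:
        ctypes.CDLL(path, mode=mode)
    except OSError as exc:
        return False, f"dlopen failed: {exc}"
    return True, "library loaded"


# A path that does not exist exercises the failure branch.
ok, message = try_dlopen("/nonexistent/libidentity_conv_ops.so")
print(ok, message)
```

`RTLD_NOW` asks the loader to resolve all symbols at load time, so unresolved references fail here rather than on first use. `RTLD_GLOBAL` makes the library's symbols available to libraries loaded afterwards.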