PyTorch Custom Operation

PyTorch users can now implement custom operations in C++ and CUDA for use in both Python and C++ inference programs, with automatic device dispatch between CPU and CUDA implementations. The approach supports both stateless custom functions registered via `TORCH_LIBRARY_IMPL` and stateful custom classes using `torch::CustomClassHolder` that can hold parameters and be embedded in `torch.nn.Module` models. This enables developers to create high-performance custom operations that work seamlessly with PyTorch's AOTInductor compiled inference pipeline.

Introduction

Custom operations are common in PyTorch models. They can be custom classes or custom functions, implemented in C++ and CUDA, and used in both Python and C++ inference programs.

In this blog post, I would like to share how to implement PyTorch custom operations in C++ and CUDA, and how to use them in PyTorch models and AOTInductor compiled inference programs, using a simple identity convolution example (https://github.com/leimao/AOTInductor-Custom-Operator-Example).

PyTorch Custom Function

PyTorch custom functions can be implemented in C++ and CUDA and registered using the `TORCH_LIBRARY_IMPL` macro. Both CPU and CUDA implementations can be provided, and PyTorch will dispatch to the correct implementation based on the device of the input tensors.
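To make the device-dispatch idea concrete before looking at the real implementation, here is a minimal conceptual sketch in plain C++. It is a toy model and does not use libtorch: `ToyTensor`, `ToyDispatcher`, `identity_cpu`, `identity_cuda`, and `make_dispatcher` are all hypothetical names invented for this illustration. The only point it makes is that one operator name can map to several kernels, and that the device of the input selects which one runs.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration only; these are NOT libtorch types.
enum class Device { CPU, CUDA };

struct ToyTensor {
    Device device;
    std::vector<float> data;
    std::string produced_by;  // records which kernel produced this tensor
};

using Kernel = std::function<ToyTensor(const ToyTensor&)>;

// Toy dispatch table keyed by (operator name, device).
class ToyDispatcher {
public:
    void register_impl(const std::string& op, Device device, Kernel kernel) {
        table_[std::make_pair(op, device)] = std::move(kernel);
    }

    // Look up the kernel that matches the device of the input and call it.
    ToyTensor call(const std::string& op, const ToyTensor& input) const {
        const auto it = table_.find(std::make_pair(op, input.device));
        if (it == table_.end()) {
            throw std::runtime_error("no kernel registered for op: " + op);
        }
        return it->second(input);
    }

private:
    std::map<std::pair<std::string, Device>, Kernel> table_;
};

// "CPU kernel": copies the data, analogous to clone().
ToyTensor identity_cpu(const ToyTensor& input) {
    return ToyTensor{Device::CPU, input.data, "identity_cpu"};
}

// "CUDA kernel": a real implementation would launch a device kernel here;
// this toy version only copies on the host.
ToyTensor identity_cuda(const ToyTensor& input) {
    return ToyTensor{Device::CUDA, input.data, "identity_cuda"};
}

// Register both implementations under the same operator name.
ToyDispatcher make_dispatcher() {
    ToyDispatcher dispatcher;
    dispatcher.register_impl("identity_conv", Device::CPU, identity_cpu);
    dispatcher.register_impl("identity_conv", Device::CUDA, identity_cuda);
    return dispatcher;
}
```

Calling `make_dispatcher().call("identity_conv", x)` runs `identity_cpu` when `x.device` is `Device::CPU` and `identity_cuda` when it is `Device::CUDA`. In libtorch, per-backend registration is what `TORCH_LIBRARY_IMPL` provides, and the lookup is handled by PyTorch's own dispatcher rather than a hand-written table like this one.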
    // ---------------------------------------------------------------------------
    // CPU implementation: plain element-wise copy via clone().
    // ---------------------------------------------------------------------------
    torch::Tensor identity_conv_cpu_impl(const torch::Tensor& input)
    {
        // Reject CUDA tensors: this kernel only handles CPU inputs.
        TORCH_CHECK(!input.is_cuda(),
                    "identity_conv_cpu_impl: input must be a CPU tensor");
        return input.clone();
    }

    // ---------------------------------------------------------------------------
    // Host-side dispatcher.
    // ---------------------------------------------------------------------------
    torch::Tensor identity_conv_cuda_impl(const torch::Tensor& input)
    {
        TORCH_CHECK(input.is_cuda(),
                    "identity_conv_cuda_impl: input must be a CUDA tensor");

        // Output has the same shape, dtype, and strides as input.
        auto output = torch::empty_like(input);

        // Nothing to copy for an empty tensor.
        const int64_t numel = input.numel();
        if (numel == 0)
            return output;

        // Upload shape and strides to the device so the kernel can read them.
        const int ndim = input.dim();
        const auto opts =
            torch::TensorOptions().dtype(torch::kInt64).device(input.device());
        const auto shape_dev = torch::tensor(std::vector