Develop High-Performance GPU Kernels in C++ with NVIDIA CUDA Tile NVIDIA released CUDA 13.3, adding support for writing tile-based GPU kernels in C++ through the CUDA Tile programming model. The new capability allows developers to build highly optimized kernels using tile abstractions that automatically leverage NVIDIA hardware features including tensor cores and shared memory, without requiring direct hardware targeting. This expansion from the previously Python-only support enables integration of tile-based programming into existing C++ GPU codebases. Developers can now use NVIDIA CUDA Tile programming within large existing C++ GPU codebases to develop highly optimized GPU kernels using tile-based abstractions. NVIDIA CUDA Tile, launched with NVIDIA CUDA 13.1 https://developer.nvidia.com/blog/nvidia-cuda-13-1-powers-next-gen-gpu-programming-with-nvidia-cuda-tile-and-performance-gains , introduced tile-based programming https://developer.nvidia.com/blog/focus-on-your-algorithm-nvidia-cuda-tile-handles-the-hardware for GPUs. Designed with a top-level language layer and another intermediate layer that any high-level programming language can target, CUDA Tile automatically makes use of the advanced capabilities of NVIDIA hardware—including tensor cores, shared memory, and tensor memory accelerators—without requiring the application to target them directly. Python was the first language supported for tile-based GPU applications. The newly released CUDA 13.3 https://developer.nvidia.com/blog/nvidia-cuda-13-3-enhances-gpu-development-with-tile-programming-in-c-compiler-autotuning-and-python-updates adds support for writing tile kernels in C++, enabling developers to build highly optimized GPU kernels. What is CUDA Tile C++? CUDA Tile C++ https://docs.nvidia.com/cuda/cuda-programming-guide/02-basics/writing-tile-kernels.html is an expression of the CUDA Tile programming model in C++, built on top of the CUDA Tile IR specification https://docs.nvidia.com/cuda/tile-ir/latest/ . It enables developers to write tile kernels in C++ and express GPU kernels using a tile-based model, rather than or in addition to a single instruction, multiple threads SIMT model. As a refresher, in the tile model: - Multi-dimensional arrays are the primary data storage. - Tiles are portions of arrays that kernels operate on. - Kernels are functions that are executed in parallel by blocks. - Blocks are subsets of the GPU; operations on tiles are parallelized across all the threads in each block. CUDA Tile C++ automates parallelism within blocks, along with asynchrony, memory movement, and other low-level details of GPU programming. CUDA Tile C++ is portable across different NVIDIA GPU architectures, enabling developers to use the latest hardware features without having to rewrite code. CUDA Tile C++ vector add example Developers familiar with CUDA C++ for SIMT have likely encountered the canonical vector addition kernel. Assuming the data is already on the GPU, a vector add kernel in CUDA SIMT takes two vectors and adds them together element-wise to produce a third vector. This is one of the simplest CUDA kernels to write. It looks as follows. global void vecAdd float A, float B, float C, int vectorLength { / calculate my thread index / int workIndex = threadIdx.x + blockIdx.x blockDim.x; if workIndex < vectorLength { / perform the vector addition / C workIndex = A workIndex + B workIndex ; } } In this kernel, each thread’s work is explicitly specified, and the programmer, when launching this kernel, will specify the number of blocks and threads to be launched. Looking at the equivalent code written in CUDA Tile C++, there’s no need to specify what each thread does. Just break the data into tiles and specify the mathematical operations for these tiles. Everything else is handled. The CUDA Tile C++ kernel looks like the following: include "cuda tile.h" tile global void vectorAdd float a, float b, float out, size t n { / set up the namespace / namespace ct = cuda::tiles; using namespace ct::literals; / attach shape to raw pointers / auto aSpan = ct::tensor span{a, ct::extents{n}}; auto bSpan = ct::tensor span{b, ct::extents{n}}; auto oSpan = ct::tensor span{out, ct::extents{n}}; / partition each span into tiles of size 8 / auto aView = ct::partition view{aSpan, ct::shape{8 ic}}; auto bView = ct::partition view{bSpan, ct::shape{8 ic}}; auto oView = ct::partition view{oSpan, ct::shape{8 ic}}; / load the a and b tiles from global memory / int bx = ct::bid .x; auto aTile = aView.load bx ; // load bx-th tile auto bTile = bView.load bx ; / add the two tiles together, elementwise / auto oTile = aTile + bTile; / store the result tile to the output partition. / oView.store oTile, bx ; } This looks like a lot of code for a simple vectorAdd kernel. Don’t be alarmed. This overly verbose kernel is used to show all the steps in order. A simplified version doing the same thing with fewer lines of code follows. - The first difference is using tile global to signify to the compiler that this is a tile kernel. The array pointers and the array size are passed in as arguments, just as they are in the SIMT kernel. tile global void vectorAdd float a, float b, float out, std::size t n { - Then, set up the namespace for cuda::tiles and ct::literals . namespace ct = cuda::tiles; using namespace ct::literals; - Create a tensor span, using this code ct::tensor span for each of the three arrays. A tensor span is essentially a pointer to a multi-dimensional array in memory, similar to a C++23 std::mdspan . The tensor span carries information about the shape extents of the array as well as the layout of the array elements for example, row major or column major . The ct::extents{} tells the tensor span what the dimensions of the array are. A 1D array uses n . auto aSpan = ct::tensor span{a, ct::extents{n}}; auto bSpan = ct::tensor span{b, ct::extents{n}}; auto oSpan = ct::tensor span{out, ct::extents{n}} - Now create a partition view from a tensor span and a tile shape. A partition view is a wrapper around a tensor span that presents the array as a series of non-overlapping, fixed-sized partitions. The size of each partition is specified by the shape argument, which must be a compile-time argument. In this example, 8 ic is an integer constant that is defined by ct::literals.ct::shape<8 {} and ct::shape{8 ic} are equivalent in this context. The partition view that’s created is essentially the original array, sliced into chunks of 8, which is the tile size. auto aView = ct::partition view{aSpan, ct::shape{8 ic}}; auto bView = ct::partition view{bSpan, ct::shape{8 ic}}; auto oView = ct::partition view{oSpan, ct::shape{8 ic}}; - Load input tiles by obtaining the block index in the X dimension with ct::bid .x. If working with multi-dimensional blocks, use the Y and Z dimensions as well. Then load the a and b tiles. Use auto for convenience, but to be explicit, aTile and bTile are of type ct::tile