Develop High-Performance GPU Kernels in C++ with NVIDIA CUDA Tile

NVIDIA released CUDA 13.3, adding support for writing tile-based GPU kernels in C++ through the CUDA Tile programming model. The new capability allows developers to build highly optimized kernels using tile abstractions that automatically leverage NVIDIA hardware features including tensor cores and shared memory, without requiring direct hardware targeting. This expansion from the previously Python-only support enables integration of tile-based programming into existing C++ GPU codebases.

Developers can now use NVIDIA CUDA Tile programming within large existing C++ GPU codebases to develop highly optimized GPU kernels using tile-based abstractions. NVIDIA CUDA Tile, launched with NVIDIA CUDA 13.1 https://developer.nvidia.com/blog/nvidia-cuda-13-1-powers-next-gen-gpu-programming-with-nvidia-cuda-tile-and-performance-gains , introduced tile-based programming https://developer.nvidia.com/blog/focus-on-your-algorithm-nvidia-cuda-tile-handles-the-hardware for GPUs. Designed with a top-level language layer and another intermediate layer that any high-level programming language can target, CUDA Tile automatically makes use of the advanced capabilities of NVIDIA hardware—including tensor cores, shared memory, and tensor memory accelerators—without requiring the application to target them directly. Python was the first language supported for tile-based GPU applications. The newly released CUDA 13.3 https://developer.nvidia.com/blog/nvidia-cuda-13-3-enhances-gpu-development-with-tile-programming-in-c-compiler-autotuning-and-python-updates adds support for writing tile kernels in C++, enabling developers to build highly optimized GPU kernels. What is CUDA Tile C++? CUDA Tile C++ https://docs.nvidia.com/cuda/cuda-programming-guide/02-basics/writing-tile-kernels.html is an expression of the CUDA Tile programming model in C++, built on top of the CUDA Tile IR specification https://docs.nvidia.com/cuda/tile-ir/latest/ . It enables developers to write tile kernels in C++ and express GPU kernels using a tile-based model, rather than or in addition to a single instruction, multiple threads SIMT model. As a refresher, in the tile model: - Multi-dimensional arrays are the primary data storage. - Tiles are portions of arrays that kernels operate on. - Kernels are functions that are executed in parallel by blocks. - Blocks are subsets of the GPU; operations on tiles are parallelized across all the threads in each block. CUDA Tile C++ automates parallelism within blocks, along with asynchrony, memory movement, and other low-level details of GPU programming. CUDA Tile C++ is portable across different NVIDIA GPU architectures, enabling developers to use the latest hardware features without having to rewrite code. CUDA Tile C++ vector add example Developers familiar with CUDA C++ for SIMT have likely encountered the canonical vector addition kernel. Assuming the data is already on the GPU, a vector add kernel in CUDA SIMT takes two vectors and adds them together element-wise to produce a third vector. This is one of the simplest CUDA kernels to write. It looks as follows. global void vecAdd float A, float B, float C, int vectorLength { / calculate my thread index / int workIndex = threadIdx.x + blockIdx.x blockDim.x; if workIndex < vectorLength { / perform the vector addition / C workIndex = A workIndex + B workIndex ; } } In this kernel, each thread’s work is explicitly specified, and the programmer, when launching this kernel, will specify the number of blocks and threads to be launched. Looking at the equivalent code written in CUDA Tile C++, there’s no need to specify what each thread does. Just break the data into tiles and specify the mathematical operations for these tiles. Everything else is handled. The CUDA Tile C++ kernel looks like the following: include "cuda tile.h" tile global void vectorAdd float a, float b, float out, size t n { / set up the namespace / namespace ct = cuda::tiles; using namespace ct::literals; / attach shape to raw pointers / auto aSpan = ct::tensor span{a, ct::extents{n}}; auto bSpan = ct::tensor span{b, ct::extents{n}}; auto oSpan = ct::tensor span{out, ct::extents{n}}; / partition each span into tiles of size 8 / auto aView = ct::partition view{aSpan, ct::shape{8 ic}}; auto bView = ct::partition view{bSpan, ct::shape{8 ic}}; auto oView = ct::partition view{oSpan, ct::shape{8 ic}}; / load the a and b tiles from global memory / int bx = ct::bid .x; auto aTile = aView.load bx ; // load bx-th tile auto bTile = bView.load bx ; / add the two tiles together, elementwise / auto oTile = aTile + bTile; / store the result tile to the output partition. / oView.store oTile, bx ; } This looks like a lot of code for a simple vectorAdd kernel. Don’t be alarmed. This overly verbose kernel is used to show all the steps in order. A simplified version doing the same thing with fewer lines of code follows. - The first difference is using tile global to signify to the compiler that this is a tile kernel. The array pointers and the array size are passed in as arguments, just as they are in the SIMT kernel. tile global void vectorAdd float a, float b, float out, std::size t n { - Then, set up the namespace for cuda::tiles and ct::literals . namespace ct = cuda::tiles; using namespace ct::literals; - Create a tensor span, using this code ct::tensor span for each of the three arrays. A tensor span is essentially a pointer to a multi-dimensional array in memory, similar to a C++23 std::mdspan . The tensor span carries information about the shape extents of the array as well as the layout of the array elements for example, row major or column major . The ct::extents{} tells the tensor span what the dimensions of the array are. A 1D array uses n . auto aSpan = ct::tensor span{a, ct::extents{n}}; auto bSpan = ct::tensor span{b, ct::extents{n}}; auto oSpan = ct::tensor span{out, ct::extents{n}} - Now create a partition view from a tensor span and a tile shape. A partition view is a wrapper around a tensor span that presents the array as a series of non-overlapping, fixed-sized partitions. The size of each partition is specified by the shape argument, which must be a compile-time argument. In this example, 8 ic is an integer constant that is defined by ct::literals.ct::shape<8 {} and ct::shape{8 ic} are equivalent in this context. The partition view that’s created is essentially the original array, sliced into chunks of 8, which is the tile size. auto aView = ct::partition view{aSpan, ct::shape{8 ic}}; auto bView = ct::partition view{bSpan, ct::shape{8 ic}}; auto oView = ct::partition view{oSpan, ct::shape{8 ic}}; - Load input tiles by obtaining the block index in the X dimension with ct::bid .x. If working with multi-dimensional blocks, use the Y and Z dimensions as well. Then load the a and b tiles. Use auto for convenience, but to be explicit, aTile and bTile are of type ct::tile<float, ct::shape<8 . They’re 1D tiles of size 8, with elements of type float. With the partition view, it’s easy to pass in the block index. The load function automatically fetches the correct chunk of the array and loads it into a tile. int bx = ct::bid .x; auto aTile = aView.load bx ; auto bTile = bView.load bx ; - Adding and store the results. This is one line of code does element-wise addition on input tiles and stores them in an output tile. Store that output tile to the oView partition view, indexed by the same block index in the X dimension, bx . / add the two tiles together, elementwise. / auto oTile = aTile + bTile; / store the result tile to the output partition. / oView.store oTile, bx ; Complete vector add example The following example show how to call this vector add kernel in C++ through a complete, runnable piece of code. There are a few things to note to help the compiler make optimizations. First, for best performance, the input and output arrays should only be accessed through their respective pointers while the kernel is running. There is no aliasing—access using another pointer or symbol—to the arrays when this is true. Labeling array pointers with the restrict decorator conveys this to the compiler. Using arrays with base pointers aligned to 16-byte boundaries helps the compiler generate more efficient memory access patterns. Tell the compiler that pointers are aligned by calling ct::assume aligned<16 on each of the kernel arguments. Use the return values of these calls for the compiler to take advantage of this alignment. Pointers returned by cudaMalloc or similar CUDA APIs always fulfill this, as they have 256-byte alignment https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/ a-sequential-but-misaligned-access-pattern . Finally, use a tile size much larger than 8. Make these adjustments to the runnable code that follows and add the use of load masked and store masked , which handles data that might not be divisible by the tile size. The following is the complete code, including the kernel and main function. Notice the applied optimizations and reduced verbosity. include <cstdio include <cstdlib include "cuda tile.h" tile global void vectorAdd float restrict a, float restrict b, float restrict out, size t n { namespace ct = cuda::tiles; using namespace ct::literals; a = ct::assume aligned a, 16 ic ; b = ct::assume aligned b, 16 ic ; out = ct::assume aligned out, 16 ic ; int bx = ct::bid .x; / create partition views for the input tiles and load them / auto aTile = ct::partition view{ct::tensor span{a, ct::extents{n}}, ct::shape{1024 ic}}.load masked bx ; auto bTile = ct::partition view{ct::tensor span{b, ct::extents{n}}, ct::shape{1024 ic}}.load masked bx ; / add the two tiles together, elementwise. / auto oTile = aTile + bTile; / create the partition view for the output tile and then store the output tile / auto oView = ct::partition view{ct::tensor span{out, ct::extents{n}}, ct::shape{1024 ic}}; oView.store masked oTile, bx ; } / define a macro to check for CUDA errors / define checkCudaError X do {\ auto ret = X;\ if ret = cudaSuccess {\ printf "\n error on line %d, CUDART error string : %s", LINE , cudaGetErrorString ret ;\ exit 1 ;\ }\ } while 0 int main { constexpr size t N = 2ULL << 25; constexpr int TILE SIZE = 1024; constexpr int BLOCKS = N + TILE SIZE - 1 / TILE SIZE; / declare and allocate the host arrays / float h a = float malloc sizeof float N ; float h b = float malloc sizeof float N ; float h out = float malloc sizeof float N ; / initialize the host arrays / for size t idx = 0; idx < N; ++idx { h a idx = float rand / RAND MAX; h b idx = float rand / RAND MAX; h out idx = -1.0f; } / declare the device arrays / float d a{nullptr}; float d b{nullptr}; float d out{nullptr}; / allocate the device arrays / checkCudaError cudaMalloc &d a, sizeof float N ; checkCudaError cudaMalloc &d b, sizeof float N ; checkCudaError cudaMalloc &d out, sizeof float N ; / copy the host arrays to the device arrays / checkCudaError cudaMemcpy d a, h a, sizeof float N, cudaMemcpyHostToDevice ; checkCudaError cudaMemcpy d b, h b, sizeof float N, cudaMemcpyHostToDevice ; / initialize the device output array to 0 / checkCudaError cudaMemset d out, -1, sizeof float N ; / launch the kernel / vectorAdd<<<BLOCKS, 1 d a, d b, d out, N ; / synchronize the device and check for errors / checkCudaError cudaDeviceSynchronize ; / copy the device array out back to the host / checkCudaError cudaMemcpy h out, d out, sizeof float N, cudaMemcpyDeviceToHost ; / compare the results to host results / float max err = 0.0f; for size t idx = 0; idx < N; ++idx { float expected = h a idx + h b idx ; max err = fmaxf max err, fabsf h out idx - expected ; } printf "N: %zu\n", N ; printf "Max error: %e\n", max err ; checkCudaError cudaFree d a ; checkCudaError cudaFree d b ; checkCudaError cudaFree d out ; free h a ; free h b ; free h out ; } If familiar with launching SIMT kernels, the process is similar, but requires a specific modification. This kernel was launched with: vectorAdd<<<BLOCKS, 1 d a, d b, d out, N ; When launching a tile kernel, the first argument in the <<< is the number of tile blocks in SIMT, this would be the number of thread blocks . The second argument must be 1. The number of threads used to execute the kernel is determined by the compiler; always put 1 as this argument when launching a tile kernel. Running CUDA 13.3 or later on compute capability 8.0 with NVIDIA Ampere architecture or newer GPU, these commands create the following output. Adjust the -arch sm 120 command to match the architecture, include -std=c++20 when cuda tile.h is used, and the --enable-tile option to compile tile kernels. bash $ nvcc -std=c++20 --enable-tile -arch sm 120 -o vectorAdd vectorAdd.cu $ ./vectorAdd N: 67108864 Max error: 0.000000e+00 This completes the first CUDA Tile C++ program. Developer tools Tile C++ kernels can be profiled with NVIDIA Nsight Compute in the same way as SIMT kernels. The following command shows how to create a profile using Nsight Compute. bash $ ncu -o VecAddProfile --set detailed ./vectorAdd Once created and opened with the graphical version of Nsight Compute: - Select the vectorAdd kernel from the dropdown menu. - Choose the Details tab - Expand the Tile Statistics report section Figure 1 shows the profile generated from Nsight Compute. Notice the Tile Statistics report section includes the number of tile blocks specified, block size chosen by compiler , and other tile-specific information. The source page also supports tile kernels and performance metrics at the source-line level, just like CUDA C++ kernels. Matrix multiply An earlier example showed vectorAdd with the details of loading and storing partition views. This matrix multiply example illustrates how to express matrix multiply using very simple code. This kernel executes an MxK by KxN matrix multiply to compute an MxN matrix. In this kernel, M=8, N=16, and K can be variable, provided it’s a multiple of 8. Set K=24. These very small sizes are used to illustrate the concepts only. The complete kernel follows, along with an overview of the high points. include "cuda tile.h" / this kernel multiplies MxK and KxN matrices, where M=8 and N=16. K is variable but must be divisible by 8. / tile global void kernel float restrict a, float restrict b, size t length, float restrict c { namespace ct = cuda::tiles; using namespace ct::literals; a = ct::assume aligned a, 16 ic ; b = ct::assume aligned b, 16 ic ; c = ct::assume aligned c, 16 ic ; auto aShape = ct::extents{8 ic, length}; auto bShape = ct::extents{length, 16 ic}; auto cShape = ct::extents{8 ic, 16 ic}; auto aSpan = ct::tensor span{a, aShape}; auto bSpan = ct::tensor span{b, bShape}; auto cSpan = ct::tensor span{c, cShape}; auto aView = ct::partition view{aSpan, ct::shape{4 ic, 8 ic}}; auto bView = ct::partition view{bSpan, ct::shape{8 ic, 4 ic}}; auto cView = ct::partition view{cSpan, ct::shape{4 ic, 4 ic}}; using f32x4x4 = ct::tile<float, ct::shape<4, 4 ; auto accTile = ct::full<f32x4x4 0 ; auto xBlock, yBlock, dummy = ct::bid ; for auto idx : ct::irange 0, 1 + int length - 1 / 8 { auto aTile = aView.load masked xBlock, idx ; auto bTile = bView.load masked idx, yBlock ; accTile = ct::mma aTile, bTile, accTile ; } cView.store masked accTile, xBlock, yBlock ; } - Create extents with ct::extents objects for the a, b, and c matrices. Use either compile or runtime values. M=8 and N=16, but K is variable. These are used to create the tensor spans in the next step. auto aShape = ct::extents{8 ic, length}; auto bShape = ct::extents{length, 16 ic}; auto cShape = ct::extents{8 ic, 16 ic}; - Create tensor spans. This carries information about a , b , and c to create the partition views. auto aSpan = ct::tensor span{a, aShape}; auto bSpan = ct::tensor span{b, bShape}; auto cSpan = ct::tensor span{c, cShape}; - Create partition views of a , b , and c with a partitioned as a 4×8 and the view b as an 8×4 view. Adjustments can be made, provided they divide properly into a and b values. These dimensions also determine that the c view is 4×4. auto aView = ct::partition view{aSpan, ct::shape{4 ic, 8 ic}}; auto bView = ct::partition view{bSpan, ct::shape{8 ic, 4 ic}}; auto cView = ct::partition view{cSpan, ct::shape{4 ic, 4 ic}}; The 2D partitions are indexed in 2 dimensions. The a matrix is 8×24, and the partition view is 4×8, as shown in Figure 2. The partition view sizes of aView and bView also determine the shape of accTile , the tile used to accumulate results during matrix multiplication. In this example, accTile is a 4×4 tile, matching the shape of cView . using f32x4x4 = ct::tile<float, ct::shape<4, 4 ; auto accTile = ct::full<f32x4x4 0 ; - Execute the loop with ct::bid to obtain the block indices in the three dimensions. The loop iterates from 0 to the length / 8, corresponding to the overall K dimension divided by 8. The division by 8 matches the K-dimension of aView and bView is 8. Inside the loop, tiles from a and b are loaded using load masked , and call ct::mma performs the matrix multiply, accumulating the result in accTile . auto xBlock, yBlock, dummy = ct::bid ; for auto idx : ct::irange 0, int length / 8 { auto aTile = aView.load masked xBlock, idx ; auto bTile = bView.load masked idx, yBlock ; accTile = ct::mma aTile, bTile, accTile ; } - Store the value of the accTile into the partition view of c , the cView . And that’s it. Most of the kernel code is involved with setting up views for the data and loading/storing the data. The compute portion of the kernel is simple. cView.store masked accTile, xBlock, yBlock ; - Launch the kernel. Use dim3 2,4 because of the dimensions of cView . cView is 4×4, meaning each block is computing a 4×4 chunk of the C matrix. Since C is 8×16, divide the cView dimensions into the C matrix dimensions. Since 8/4=2, and 16/4=4 launch the kernel with dim3 2,4 . kernel<<<dim3 2, 4 , 1 d a, d b, K, d c ; Get started today with CUDA Tile C++ The following are required to run CUDA Tile C++ programs: - A GPU with compute capability 8.x or newer. - NVIDIA Driver R580 or later. If JIT compilation is required for tile kernels, the NVIDIA driver version must be equivalent to or newer than the version associated with the CUDA Toolkit used to generate the code. For example, CUDA Toolkit 13.3 requires an R610 driver or newer. - CUDA Toolkit 13.3 The power of tile-based programming is now available to C++ developers. Check out the documentation https://docs.nvidia.com/cuda/cuda-programming-guide/02-basics/writing-tile-kernels.html , the API reference manual https://docs.nvidia.com/cuda/cuda-tile-cpp-api-reference/index.html , and CUDA Toolkit 13.3 http://developer.nvidia.com/cuda-downloads today to start writing tile C++ kernels and experience the new standard for accelerated computing. Acknowledgements Thanks to NVIDIA contributors Jaydeep Marathe and Ezra Stein.