{"slug": "develop-high-performance-gpu-kernels-in-c-with-nvidia-cuda-tile", "title": "Develop High-Performance GPU Kernels in C++ with NVIDIA CUDA Tile", "summary": "NVIDIA released CUDA 13.3, adding support for writing tile-based GPU kernels in C++ through the CUDA Tile programming model. The new capability allows developers to build highly optimized kernels using tile abstractions that automatically leverage NVIDIA hardware features including tensor cores and shared memory, without requiring direct hardware targeting. This expansion from the previously Python-only support enables integration of tile-based programming into existing C++ GPU codebases.", "body_md": "Developers can now use NVIDIA CUDA Tile programming within large existing C++ GPU codebases to develop highly optimized GPU kernels using tile-based abstractions.\n\nNVIDIA CUDA Tile, launched with [NVIDIA CUDA 13.1](https://developer.nvidia.com/blog/nvidia-cuda-13-1-powers-next-gen-gpu-programming-with-nvidia-cuda-tile-and-performance-gains), introduced [tile-based programming](https://developer.nvidia.com/blog/focus-on-your-algorithm-nvidia-cuda-tile-handles-the-hardware) for GPUs. Designed with a top-level language layer and another intermediate layer that any high-level programming language can target, CUDA Tile automatically makes use of the advanced capabilities of NVIDIA hardware—including tensor cores, shared memory, and tensor memory accelerators—without requiring the application to target them directly.\n\nPython was the first language supported for tile-based GPU applications. 
The newly released [CUDA 13.3](https://developer.nvidia.com/blog/nvidia-cuda-13-3-enhances-gpu-development-with-tile-programming-in-c-compiler-autotuning-and-python-updates) adds support for writing tile kernels in C++, enabling developers to build highly optimized GPU kernels.\n\n## What is CUDA Tile C++?\n\n[CUDA Tile C++](https://docs.nvidia.com/cuda/cuda-programming-guide/02-basics/writing-tile-kernels.html) is an expression of the CUDA Tile programming model in C++, built on top of the [CUDA Tile IR specification](https://docs.nvidia.com/cuda/tile-ir/latest/). It enables developers to write tile kernels in C++ and express GPU kernels using a tile-based model, rather than or in addition to a single instruction, multiple threads (SIMT) model.\n\nAs a refresher, in the tile model:\n\n- Multi-dimensional arrays are the primary data storage.\n- Tiles are portions of arrays that kernels operate on.\n- Kernels are functions that are executed in parallel by blocks.\n- Blocks are subsets of the GPU; operations on tiles are parallelized across all the threads in each block.\n\nCUDA Tile C++ automates parallelism within blocks, along with asynchrony, memory movement, and other low-level details of GPU programming. CUDA Tile C++ is portable across different NVIDIA GPU architectures, enabling developers to use the latest hardware features without having to rewrite code.\n\n## CUDA Tile C++ vector add example\n\nDevelopers familiar with CUDA C++ for SIMT have likely encountered the canonical vector addition kernel. Assuming the data is already on the GPU, a vector add kernel in CUDA SIMT takes two vectors and adds them together element-wise to produce a third vector. This is one of the simplest CUDA kernels to write. 
It looks as follows.\n\n```\n__global__ void vecAdd(float* A, float* B, float* C, int vectorLength)\n{\n /* calculate my thread index */\n int workIndex = threadIdx.x + blockIdx.x*blockDim.x;\n\n if(workIndex < vectorLength)\n {\n  /* perform the vector addition */\n  C[workIndex] = A[workIndex] + B[workIndex];\n }\n}\n```\n\nIn this kernel, each thread’s work is explicitly specified, and the programmer, when launching this kernel, will specify the number of blocks and threads to be launched.\n\nLooking at the equivalent code written in CUDA Tile C++, there’s no need to specify what each thread does. Just break the data into tiles and specify the mathematical operations for these tiles. Everything else is handled.\n\nThe CUDA Tile C++ kernel looks like the following:\n\n```\n#include \"cuda_tile.h\"\n__tile_global__ void vectorAdd(float* a, float* b, float* out, size_t n) {\n\n/* set up the namespace */\n  namespace ct = cuda::tiles;\n  using namespace ct::literals;\n\n/* attach shape to raw pointers */\n  auto aSpan = ct::tensor_span{a,   ct::extents{n}};\n  auto bSpan = ct::tensor_span{b,   ct::extents{n}};\n  auto oSpan = ct::tensor_span{out, ct::extents{n}};\n\n/* partition each span into tiles of size 8 */\n  auto aView = ct::partition_view{aSpan, ct::shape{8_ic}};\n  auto bView = ct::partition_view{bSpan, ct::shape{8_ic}};\n  auto oView = ct::partition_view{oSpan, ct::shape{8_ic}};\n\n/* load the a and b tiles from global memory */\n  int bx = ct::bid().x;\n  auto aTile = aView.load(bx);          // load bx-th tile\n  auto bTile = bView.load(bx);\n\n/* add the two tiles together, elementwise */\n  auto oTile = aTile + bTile;\n\n/* store the result tile to the output partition. */\n  oView.store(oTile, bx); \n}\n```\n\nThis looks like a lot of code for a simple vectorAdd kernel. Don’t be alarmed. This overly verbose kernel is used to show all the steps in order. 
A shorter version that does the same work in fewer lines appears in the complete example later in this post. The steps are as follows.\n\n- The first difference is the use of `__tile_global__` to signal to the compiler that this is a tile kernel. The array pointers and the array size are passed in as arguments, just as they are in the SIMT kernel.\n\n```\n__tile_global__ void vectorAdd(float* a, float* b, float* out, std::size_t n) {\n```\n\n- Then, set up the namespace alias for `cuda::tiles` and bring `ct::literals` into scope.\n\n```\n  namespace ct = cuda::tiles;  \n  using namespace ct::literals;\n```\n\n- Create a tensor span with `ct::tensor_span` for each of the three arrays. A tensor span is essentially a pointer to a multi-dimensional array in memory, similar to a C++23 `std::mdspan`.\n\nThe tensor span carries information about the shape (extents) of the array as well as the layout of the array elements (for example, row major or column major).\n\nThe `ct::extents{}` argument tells the tensor span what the dimensions of the array are. A 1D array of length `n` uses `ct::extents{n}`.\n\n```\nauto aSpan = ct::tensor_span{a,   ct::extents{n}};\nauto bSpan = ct::tensor_span{b,   ct::extents{n}};\nauto oSpan = ct::tensor_span{out, ct::extents{n}};\n```\n\n- Now create a partition view from a tensor span and a tile shape. A partition view is a wrapper around a tensor span that presents the array as a series of non-overlapping, fixed-size partitions.\n\nThe size of each partition is specified by the shape argument, which must be a compile-time constant.\n\nIn this example, `8_ic` is an integer constant defined by `ct::literals`. `ct::shape<8>{}` and `ct::shape{8_ic}` are equivalent in this context. 
The resulting partition view is essentially the original array sliced into chunks of 8, which is the tile size.\n\n```\n  auto aView = ct::partition_view{aSpan, ct::shape{8_ic}};\n  auto bView = ct::partition_view{bSpan, ct::shape{8_ic}};\n  auto oView = ct::partition_view{oSpan, ct::shape{8_ic}};\n```\n\n- Load the input tiles. First obtain the block index in the X dimension with `ct::bid().x`. If working with multi-dimensional blocks, use the Y and Z dimensions as well.\n\nThen load the `a` and `b` tiles. The example uses `auto` for convenience, but to be explicit, `aTile` and `bTile` are of type `ct::tile<float, ct::shape<8>>`. They’re 1D tiles of size 8, with elements of type `float`. With the partition view, it’s enough to pass in the block index: the load function automatically fetches the correct chunk of the array and loads it into a tile.\n\n```\nint bx = ct::bid().x;\nauto aTile = aView.load(bx);         \nauto bTile = bView.load(bx);\n```\n\n- Add the tiles and store the result. A single line of code performs element-wise addition on the input tiles and produces an output tile. Then store that output tile to the `oView` partition view, indexed by the same block index in the X dimension, `bx`.\n\n```\n/* add the two tiles together, elementwise. */\nauto oTile = aTile + bTile;\n\n/* store the result tile to the output partition. */\noView.store(oTile, bx);\n```\n\n## Complete vector add example\n\nThe following example shows how to call this vector add kernel from C++ in a complete, runnable program.\n\nA few details help the compiler optimize the kernel.\n\nFirst, for best performance, the input and output arrays should only be accessed through their respective pointers while the kernel is running. When this holds, there is no aliasing of the arrays, that is, no access through another pointer or symbol. 
Labeling array pointers with the `__restrict__` qualifier conveys this to the compiler.\n\nUsing arrays with base pointers aligned to 16-byte boundaries helps the compiler generate more efficient memory access patterns. Tell the compiler that the pointers are aligned by calling `ct::assume_aligned` with an alignment of 16 on each of the kernel arguments, and use the return values of these calls so that the compiler can take advantage of the alignment. Pointers returned by `cudaMalloc` or similar CUDA APIs always satisfy this requirement, as they [have 256-byte alignment](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/#a-sequential-but-misaligned-access-pattern).\n\nFinally, use a tile size much larger than 8; the code that follows uses 1024. It also applies the adjustments above and switches to `load_masked` and `store_masked`, which handle arrays whose length might not be divisible by the tile size.\n\nThe following is the complete code, including the kernel and main function. Notice the applied optimizations and reduced verbosity.\n\n```\n#include <cmath>\n#include <cstdio>\n#include <cstdlib>\n#include \"cuda_tile.h\"\n\n__tile_global__ void vectorAdd(float* __restrict__ a, float* __restrict__ b, float* __restrict__ out, size_t n) {\n  namespace ct = cuda::tiles;\n  using namespace ct::literals;\n\n  a   = ct::assume_aligned(a,   16_ic);\n  b   = ct::assume_aligned(b,   16_ic);\n  out = ct::assume_aligned(out, 16_ic);  \n\n  int bx     = ct::bid().x;\n  \n/* create partition views for the input tiles and load them */\n  auto aTile = ct::partition_view{ct::tensor_span{a,   ct::extents{n}}, ct::shape{1024_ic}}.load_masked(bx);\n  auto bTile = ct::partition_view{ct::tensor_span{b,   ct::extents{n}}, ct::shape{1024_ic}}.load_masked(bx);\n  \n/* add the two tiles together, elementwise. 
*/\n  auto oTile = aTile + bTile;\n\n/* create the partition view for the output tile and then store the output tile */  \n  auto oView = ct::partition_view{ct::tensor_span{out, ct::extents{n}}, ct::shape{1024_ic}};  \n  oView.store_masked(oTile, bx);\n}  \n\n/* define a macro to check for CUDA errors */\n#define checkCudaError(X) do {\\\n  auto ret = (X);\\\n  if (ret != cudaSuccess) {\\\n    printf(\"\\n error on line %d, CUDART error string : %s\", __LINE__, cudaGetErrorString(ret));\\\n    exit(1);\\\n  }\\\n} while (0)\n\nint main() {\n  /* 2ULL << 25 == 2^26 == 67108864 elements */\n  constexpr size_t N = 2ULL << 25;\n  constexpr int TILE_SIZE = 1024;\n  /* ceiling division so a partial last tile still gets a block */\n  constexpr int BLOCKS = (N + TILE_SIZE - 1) / TILE_SIZE;\n\n/* declare and allocate the host arrays */\n  float* h_a   = (float*)malloc(sizeof(float) * N);\n  float* h_b   = (float*)malloc(sizeof(float) * N);\n  float* h_out = (float*)malloc(sizeof(float) * N);\n\n/* initialize the host arrays */\n  for (size_t idx = 0; idx < N; ++idx) {\n    h_a[idx] = (float)rand() / RAND_MAX;\n    h_b[idx] = (float)rand() / RAND_MAX;\n    h_out[idx] = -1.0f;\n  }\n\n/* declare the device arrays */\n  float* d_a{nullptr};\n  float* d_b{nullptr};\n  float* d_out{nullptr};\n\n/* allocate the device arrays */\n  checkCudaError(cudaMalloc(&d_a, sizeof(float) * N));\n  checkCudaError(cudaMalloc(&d_b, sizeof(float) * N));\n  checkCudaError(cudaMalloc(&d_out, sizeof(float) * N));\n\n/* copy the host arrays to the device arrays */\n  checkCudaError(cudaMemcpy(d_a, h_a, sizeof(float) * N, cudaMemcpyHostToDevice));\n  checkCudaError(cudaMemcpy(d_b, h_b, sizeof(float) * N, cudaMemcpyHostToDevice));\n\n/* fill the device output array with 0xFF bytes (a sentinel pattern, not 0) */\n  checkCudaError(cudaMemset(d_out, -1, sizeof(float) * N));\n\n/* launch the kernel */\n  vectorAdd<<<BLOCKS, 1>>>(d_a, d_b, d_out, N);\n\n/* synchronize the device and check for errors */\n  checkCudaError(cudaDeviceSynchronize());\n\n/* copy the device array out back to the host */\n  checkCudaError(cudaMemcpy(h_out, d_out, sizeof(float) * 
N, cudaMemcpyDeviceToHost));\n\n/* compare the results to host results */\n\n  float max_err = 0.0f;\n  for (size_t idx = 0; idx < N; ++idx) {\n    float expected = h_a[idx] + h_b[idx];\n    max_err = fmaxf(max_err, fabsf(h_out[idx] - expected));\n  }\n\n  printf(\"N: %zu\\n\", N);\n  printf(\"Max error: %e\\n\", max_err);\n\n  checkCudaError(cudaFree(d_a));\n  checkCudaError(cudaFree(d_b));\n  checkCudaError(cudaFree(d_out));\n\n  free(h_a);\n  free(h_b);\n  free(h_out);\n}\n```\n\nFor developers familiar with launching SIMT kernels, the process is similar, with one specific change. This kernel was launched with:\n\n```\nvectorAdd<<<BLOCKS, 1>>>(d_a, d_b, d_out, N);\n```\n\nWhen launching a tile kernel, the first argument inside the `<<<>>>` is the number of tile blocks (in SIMT, this would be the number of thread blocks). The second argument must always be 1, because the number of threads used to execute the kernel is determined by the compiler.\n\nWith CUDA 13.3 or later and a GPU of compute capability 8.0 (NVIDIA Ampere architecture) or newer, the following commands produce the output shown.\n\nAdjust the `-arch sm_120` option to match the target architecture, include `-std=c++20` when `cuda_tile.h` is used, and pass the `--enable-tile` option to compile tile kernels.\n\n``` bash\n$ nvcc -std=c++20 --enable-tile -arch sm_120 -o vectorAdd vectorAdd.cu\n$ ./vectorAdd\nN: 67108864\nMax error: 0.000000e+00\n```\n\nThis completes the first CUDA Tile C++ program.\n\n## Developer tools\n\nTile C++ kernels can be profiled with NVIDIA Nsight Compute in the same way as SIMT kernels. The following command shows how to create a profile using Nsight Compute.\n\n``` bash\n$ ncu -o VecAddProfile --set detailed ./vectorAdd\n```\n\nOnce the profile is created and opened in the graphical version of Nsight Compute:\n\n- Select the `vectorAdd` kernel from the dropdown menu. 
\n- Choose the **Details** tab.\n- Expand the **Tile Statistics** report section.\n\nFigure 1 shows the profile generated from Nsight Compute.\n\nNotice that the **Tile Statistics** report section includes the number of tile blocks specified, the block size (chosen by the compiler), and other tile-specific information.\n\nThe source page also supports tile kernels and shows performance metrics at the source-line level, just as for CUDA C++ SIMT kernels.\n\n## Matrix multiply\n\nThe earlier `vectorAdd` example showed the details of loading and storing through partition views. This matrix multiply example shows how to express a matrix multiply in very little code.\n\nThis kernel multiplies an M×K matrix by a K×N matrix to compute an M×N matrix. In this kernel, M=8, N=16, and K is variable, provided it’s a multiple of 8. This example sets K=24. These very small sizes are used only to illustrate the concepts.\n\nThe complete kernel follows, along with a walkthrough of the main steps.\n\n```\n#include \"cuda_tile.h\"\n\n/* this kernel multiplies MxK and KxN matrices, where M=8 and N=16.  
K is variable but must be divisible by 8.*/\n__tile_global__ void kernel(float* __restrict__ a, float* __restrict__ b, size_t length, float* __restrict__ c) {\n    namespace ct = cuda::tiles;\n    using namespace ct::literals;\n\n    a = ct::assume_aligned(a, 16_ic);\n    b = ct::assume_aligned(b, 16_ic);\n    c = ct::assume_aligned(c, 16_ic);\n\n    auto aShape = ct::extents{8_ic, length};\n    auto bShape = ct::extents{length, 16_ic};\n    auto cShape = ct::extents{8_ic, 16_ic};\n\n    auto aSpan = ct::tensor_span{a, aShape};\n    auto bSpan = ct::tensor_span{b, bShape};\n    auto cSpan = ct::tensor_span{c, cShape};\n\n    auto aView = ct::partition_view{aSpan, ct::shape{4_ic, 8_ic}};\n    auto bView = ct::partition_view{bSpan, ct::shape{8_ic, 4_ic}};\n    auto cView = ct::partition_view{cSpan, ct::shape{4_ic, 4_ic}};\n    \n    using f32x4x4 = ct::tile<float, ct::shape<4, 4>>;\n    auto accTile = ct::full<f32x4x4>(0);\n\n    auto [xBlock, yBlock, dummy] = ct::bid();\n    /* for length >= 1, 1 + (length - 1) / 8 is ceil(length / 8): the number of K-dimension tiles */\n    for (auto idx : ct::irange(0, 1 + int(length - 1) / 8)) {\n        auto aTile = aView.load_masked(xBlock, idx);\n        auto bTile = bView.load_masked(idx, yBlock);\n        accTile = ct::mma(aTile, bTile, accTile);\n    }\n\n    cView.store_masked(accTile, xBlock, yBlock);\n}\n```\n\n- Create `ct::extents` objects for the `a`, `b`, and `c` matrices. Extents can hold either compile-time or runtime values: M=8 and N=16 are compile-time constants, while K is passed at runtime as `length`. These extents are used to create the tensor spans in the next step.\n\n```\n auto aShape = ct::extents{8_ic, length};\n auto bShape = ct::extents{length, 16_ic};\n auto cShape = ct::extents{8_ic, 16_ic};\n```\n\n- Create tensor spans. 
These carry the shape information for `a`, `b`, and `c` that is needed to create the partition views.\n\n```\n    auto aSpan = ct::tensor_span{a, aShape};\n    auto bSpan = ct::tensor_span{b, bShape};\n    auto cSpan = ct::tensor_span{c, cShape};\n```\n\n- Create partition views of `a`, `b`, and `c`, with `a` partitioned into 4×8 tiles and `b` into 8×4 tiles. Other tile shapes can be used, provided they divide evenly into the dimensions of `a` and `b`. These shapes also determine that the `c` view uses 4×4 tiles.\n\n```\n  auto aView = ct::partition_view{aSpan, ct::shape{4_ic, 8_ic}};\n  auto bView = ct::partition_view{bSpan, ct::shape{8_ic, 4_ic}};\n  auto cView = ct::partition_view{cSpan, ct::shape{4_ic, 4_ic}};\n```\n\nThe 2D partitions are indexed in two dimensions. The `a` matrix is 8×24 and its partition view uses 4×8 tiles, so it is split into a 2×3 grid of tiles, as shown in Figure 2.\n\nThe tile shapes of `aView` and `bView` also determine the shape of `accTile`, the tile used to accumulate results during matrix multiplication. In this example, `accTile` is a 4×4 tile, matching the tile shape of `cView`.\n\n```\n    using f32x4x4 = ct::tile<float, ct::shape<4, 4>>;\n    auto accTile = ct::full<f32x4x4>(0);\n```\n\n- Obtain the block indices in the three dimensions with `ct::bid`, then run the loop. The loop iterates from 0 up to `length / 8`, that is, the overall K dimension divided by 8. The divisor is 8 because the K-dimension of the `aView` and `bView` tiles is 8. 
Inside the loop, tiles from `a` and `b` are loaded using `load_masked`, and `ct::mma` performs the matrix multiply, accumulating the result in `accTile`. The complete kernel writes the loop bound as `1 + int(length - 1) / 8`, which gives the same iteration count whenever `length` is a multiple of 8.\n\n```\n    auto [xBlock, yBlock, dummy] = ct::bid();\n\n    for (auto idx : ct::irange(0, int(length / 8))) {\n        auto aTile = aView.load_masked(xBlock, idx);\n        auto bTile = bView.load_masked(idx, yBlock);\n        accTile = ct::mma(aTile, bTile, accTile);\n    }\n```\n\n- Store `accTile` into `cView`, the partition view of `c`. And that’s it. Most of the kernel code sets up views of the data and loads or stores tiles; the compute portion of the kernel is a single `ct::mma` call.\n\n```\n  cView.store_masked(accTile, xBlock, yBlock);\n```\n\n- Launch the kernel. The grid is `dim3(2,4)` because of the tile shape of `cView`. Each `cView` tile is 4×4, meaning each block computes a 4×4 chunk of the C matrix. Since C is 8×16, divide the C matrix dimensions by the tile dimensions: 8/4=2 and 16/4=4, so launch the kernel with `dim3(2,4)`.\n\n```\n  kernel<<<dim3(2, 4), 1>>>(d_a, d_b, K, d_c);\n```\n\n## Get started today with CUDA Tile C++\n\nThe following are required to run CUDA Tile C++ programs:\n\n- A GPU with compute capability 8.x or newer.\n- NVIDIA Driver R580 or later. If JIT compilation is required for tile kernels, the NVIDIA driver version must be equivalent to or newer than the version associated with the CUDA Toolkit used to generate the code. For example, CUDA Toolkit 13.3 requires an R610 driver or newer.\n- CUDA Toolkit 13.3 or later.\n\nThe power of tile-based programming is now available to C++ developers. 
Check out the [documentation](https://docs.nvidia.com/cuda/cuda-programming-guide/02-basics/writing-tile-kernels.html), the [API reference manual](https://docs.nvidia.com/cuda/cuda-tile-cpp-api-reference/index.html), and [CUDA Toolkit 13.3](http://developer.nvidia.com/cuda-downloads) today to start writing tile C++ kernels and experience the new standard for accelerated computing.\n\n### Acknowledgements\n\n*Thanks to NVIDIA contributors Jaydeep Marathe and Ezra Stein.*", "url": "https://wpnews.pro/news/develop-high-performance-gpu-kernels-in-c-with-nvidia-cuda-tile", "canonical_source": "https://developer.nvidia.com/blog/develop-high-performance-gpu-kernels-in-cpp-with-nvidia-cuda-tile/", "published_at": "2026-05-26 21:40:16+00:00", "updated_at": "2026-05-29 08:07:39.077325+00:00", "lang": "en", "topics": ["ai-infrastructure", "ai-chips", "ai-tools"], "entities": ["NVIDIA", "CUDA Tile", "CUDA 13.1", "CUDA 13.3", "CUDA Tile C++", "CUDA Tile IR"], "alternates": {"html": "https://wpnews.pro/news/develop-high-performance-gpu-kernels-in-c-with-nvidia-cuda-tile", "markdown": "https://wpnews.pro/news/develop-high-performance-gpu-kernels-in-c-with-nvidia-cuda-tile.md", "text": "https://wpnews.pro/news/develop-high-performance-gpu-kernels-in-c-with-nvidia-cuda-tile.txt", "jsonld": "https://wpnews.pro/news/develop-high-performance-gpu-kernels-in-c-with-nvidia-cuda-tile.jsonld"}}