{"slug": "nvidia-cuda-13-3-enhances-gpu-development-with-tile-programming-in-c-compiler", "title": "NVIDIA CUDA 13.3 Enhances GPU Development with Tile Programming in C++, Compiler Autotuning, and Python Updates", "summary": "NVIDIA released CUDA 13.3, introducing tile programming in C++ that automates low-level GPU management for optimized kernel development across all supported architectures. The update also includes CUDA Python 1.0 with green contexts and process checkpointing, plus the CompileIQ compiler autotuning framework that delivers up to 15% speedup on critical kernels like GEMM and attention.", "body_md": "NVIDIA CUDA 13.3 brings new capabilities and performance optimizations to developers across the CUDA ecosystem. The launch of **NVIDIA CUDA Tile programming in C++** enables high-level, tile-based kernel development that automatically manages complex low-level GPU details for optimal performance and portability. CUDA Tile programming is now also supported on Compute Capability 9.0 (NVIDIA Hopper) GPUs, alongside all other supported GPU architectures.\n\nWe are also releasing CUDA Python 1.0, solidifying the support and stability of the CUDA Python software ecosystem, and introducing critical features like green contexts and process checkpointing.\n\nFor performance enthusiasts, the newly launched NVIDIA **CompileIQ** compiler auto-tuning framework delivers up to a 15% speedup on critical kernels like GEMM and attention. This release also features official **C++23 support** in NVCC, expanded tensor interoperability with **DLPack/mdspan** in CCCL 3.3, and numerous updates to the math libraries (cuBLAS, cuSPARSE, cuSOLVER) and profiling tools (Nsight Compute and Nsight Systems).\n\n## Release of CUDA Tile C++\n\nWith the release of CUDA 13.3, CUDA Tile support is extended to C++, enabling the large existing C++ codebase and developer base to create highly optimized GPU tile kernels. 
This model automates parallelism, memory movement, asynchrony, and other low-level details, resulting in C++ code that is portable across NVIDIA GPU architectures. For more information, check out our [blog post](https://developer.nvidia.com/blog/develop-high-performance-gpu-kernels-in-cpp-with-nvidia-cuda-tile/).\n\n## Release of CUDA Python 1.0\n\nCUDA Python is a set of libraries that expose CUDA to the Python programming language. With the 1.0 release, we are committing to semantic versioning: breaking API changes occur only in major-version releases, minor releases add features, and patch releases contain bug fixes. Any public API scheduled for removal is first deprecated in a minor release with a clear replacement path.\n\nThe following is more information on the software components included in CUDA Python 1.0.\n\n| Library | Description | Next major version |\n| --- | --- | --- |\n| `cuda.bindings` | Low-level Python bindings to CUDA C APIs. | 13.3.0 |\n| `cuda.core` | Pythonic access to CUDA Runtime and other core functionality. | 1.0.0 |\n| `cuda-cccl` | Pythonic access to CCCL’s highly efficient and customizable parallel algorithms. | 1.0.0 |\n| `cuda-pathfinder` | Utilities for locating CUDA components installed in the user’s Python environment. | 1.6 |\n\n`cuda.coop` is also available in the `cuda-cccl` package under the `_experimental` namespace, which is subject to API changes. `cuda.coop` provides the reusable block-wide and warp-wide *device* primitives for use within Numba CUDA kernels.\n\n### `cuda.core` is now stable\n\n`cuda.core` provides a Pythonic interface to the CUDA runtime, including devices, streams, programs, linkers, memory resources, and graphs. Version 1.0 consolidates APIs that have been stabilizing over the previous release cycles into a single supported surface. 
At the same time, we added support for green contexts, CUDA checkpointing, and more.\n\n- **Green contexts**: Split a GPU’s SMs into disjoint partitions, each with its own context and streams, so latency-sensitive kernels are shielded from long-running throughput kernels in the same process.\n- **Process checkpointing**: Snapshot the full CUDA state of a running process, including device allocations, streams, and context, and restore it later. This unlocks CRIU-style workflows for GPU processes: fault-tolerant long jobs, preemption and migration on shared clusters, and fast warm-start of inference workers. Available only on Linux.\n- **Inter-process sharing (IPC)**: Share GPU memory across Python processes without copying through the host. One process allocates, and others map the same physical VRAM into their own address space. Ideal for multi-process ML serving and zero-copy producer/consumer pipelines.\n\nThe following are quick examples of how to use `cuda.core` APIs.\n\n``` python\n# Illustrative snippets: names such as src, args, obj1, and obj2 are placeholders\nfrom cuda.core import Device, Stream, Program, ProgramOptions, LaunchConfig, launch\n\n# pick and activate a GPU\ndev = Device()\ndev.set_current()\n\n# create a CUDA stream\nstream = dev.create_stream()\n\n# NVRTC compile + lookup\nprog = Program(src, code_type=\"c++\", options=ProgramOptions(arch=f\"sm_{dev.arch}\"))\nkernel = prog.compile(\"cubin\").get_kernel(\"my_kernel\")\n\n# launch a kernel\nlaunch(stream, LaunchConfig(grid=64, block=256), kernel, *args)\n\n# JIT-LTO linking\nfrom cuda.core import Linker, LinkerOptions\n\nmodule = Linker(\n    [obj1, obj2],\n    options=LinkerOptions(arch=f\"sm_{dev.arch}\")\n).link(\"cubin\")\n\n# NVRTC precompiled headers\nfrom cuda.core import ProgramOptions\n\nopts = ProgramOptions(std=\"c++17\", arch=f\"sm_{dev.arch}\", create_pch=True, pch_dir=\"/tmp/pch\")\n\n# Memory resources, incl. 
NUMA-aware pools\nfrom cuda.core import DeviceMemoryResource, PinnedMemoryResource, PinnedMemoryResourceOptions, ManagedMemoryResource, ManagedMemoryResourceOptions\n\n# NUMA-pinned host memory\npinned = PinnedMemoryResource(PinnedMemoryResourceOptions(numa_id=0))\n\n# CUDA graphs: stream capture and explicit construction\nfrom cuda.core.graph import GraphBuilder, GraphDef\n\ngb = stream.create_graph_builder()\ngb.begin_building()\ngraph = gb.end_building().complete()\ngraph.launch(stream)\ngdef = GraphDef()\ngdef.add_kernel_node(kernel, LaunchConfig(grid=64, block=256), args=args)\n\n# IPC: share GPU memory across Python processes\nfrom cuda.core import DeviceMemoryResource, DeviceMemoryResourceOptions\n\nmr = DeviceMemoryResource(dev,\n        options=DeviceMemoryResourceOptions(max_size=1 << 20, ipc_enabled=True))\nbuffer = mr.allocate(nbytes)   # buffer is picklable and can be sent over mp.Queue\n\n# Green contexts: partition SMs into disjoint groups\nfrom cuda.core import ContextOptions, SMResourceOptions\nsm = dev.resources.sm\nlong_grp, crit_grp = sm.split(SMResourceOptions(count=(sm.sm_count - 16, 16)))[0]\nctx_crit = dev.create_context(ContextOptions(resources=[crit_grp]))\ns_crit = ctx_crit.create_stream()\n\n# Process checkpoint / restore (Linux)\nimport os\nfrom cuda.core import checkpoint\n\nproc = checkpoint.Process(os.getpid())\nproc.lock(timeout_ms=5000)\nproc.checkpoint()\nproc.restore()\nproc.unlock()\n# device allocations and context are restored\n\n# TMA / TensorMapDescriptor\nfrom cuda.core import StridedMemoryView, TensorMapDescriptor\ntmap = StridedMemoryView(tensor).as_tensor_map(box_shape=(128,))\n\n# DLPack-friendly strided views\nfrom cuda.core.utils import StridedMemoryView\nview = StridedMemoryView(torch_tensor)\ncapsule = view.__dlpack__()\n\n# System info (NVML)\nfrom cuda.core import system\nprint(system.num_devices, system.driver_version)\n\n# cuda.bindings.nvml\nfrom cuda.bindings import nvml\n\nnvml.init()\nname = 
nvml.device_get_name(nvml.device_get_handle_by_index_v2(0))\n\n# cuda.bindings.nvfatbin\nfrom cuda.bindings import nvfatbin\n\nhandle = nvfatbin.create()\n```\n\n### CCCL Python release 1.0.0: `cuda.compute`\n\n`cuda.compute` brings the highly tuned parallel algorithms of the CUDA Core Compute Libraries (CCCL), including sort, scan, reduce, transform, unique, histogram, and top-k, to Python as host-callable building blocks. Changes since the last release include:\n\n- Python lambdas can be used as algorithm operators, reducing boilerplate for simple reductions, scans, transforms, and predicates.\n- Algorithms support operators with side effects (state), enabling use cases like running accumulators and conditional transforms.\n- New `cuda.compute.upper_bound` and `cuda.compute.lower_bound` APIs expose CUB’s parallel binary search to Python.\n- Consolidated caching across all algorithms for faster repeated invocations.\n\n``` python\nimport cupy as cp\nimport numpy as np\n\nimport cuda.compute\nfrom cuda.compute import OpKind\n\nd_input = cp.arange(1, 1_000_001, dtype=cp.int32)\nd_output = cp.empty(1, dtype=cp.int32)\nh_init = np.array([0], dtype=np.int32)\n\n# Reduction with a built-in operator\ncuda.compute.reduce_into(\n    d_input, d_output, OpKind.PLUS, d_input.size, h_init\n)\n\n# Reduction with a Python lambda as the operator (maximum)\ncuda.compute.reduce_into(\n    d_input, d_output,\n    lambda a, b: a if a > b else b,\n    d_input.size, h_init,\n)\n```\n\n`cuda.coop` exposes CCCL’s warp-wide and block-wide cooperative primitives for use inside Numba CUDA kernels. 
At the moment, this module is under the `_experimental` namespace and may have API changes that don’t follow semantic versioning.\n\n``` python\nimport numba\nimport numpy as np\nfrom numba import cuda\n\nimport cuda.coop._experimental as coop\n\nTHREADS = 128\nblock_sum = coop.block.make_sum(numba.int32, THREADS)\n\n@cuda.jit(link=block_sum.files)\ndef reduce_kernel(data, out):\n    # Each thread contributes one element to the block-wide reduction\n    total = block_sum(data[cuda.threadIdx.x])\n    if cuda.threadIdx.x == 0:\n        out[0] = total\n\nh_in = np.ones(THREADS, dtype=np.int32)\nd_in = cuda.to_device(h_in)\nd_out = cuda.device_array(1, dtype=np.int32)\n\nreduce_kernel[1, THREADS](d_in, d_out)\n\nassert d_out.copy_to_host()[0] == THREADS  # 128\n```\n\n### New Numba CUDA MLIR backend\n\nNumba CUDA MLIR is a new Numba-compatible kernel generator for Python, written from the ground up on top of MLIR and the modern NVVM toolchain. It preserves the familiar `@cuda.jit` programming model from Numba-CUDA while delivering lower compile latency, better diagnostics, and a cleaner path to target new GPU architectures and features as they land in the NVVM stack. Numba CUDA MLIR can be used as a drop-in replacement for `numba.cuda` by simply replacing the import statement:\n\n``` python\n# Before\nfrom numba import cuda\n\n# After\nfrom numba_cuda_mlir import cuda\n\n@cuda.jit\ndef vector_add(a, b, out):\n    i = cuda.grid(1)\n    if i < out.shape[0]:\n        out[i] = a[i] + b[i]\n```\n\nBeyond existing Numba-CUDA compatibility, Numba CUDA MLIR also features:\n\n[Faster JIT compile](https://github.com/NVIDIA/numba-cuda-mlir/blob/main/tests/benchmarks/README.md). 
Across a suite of real kernels (vector add, softmax, Cholesky, attention, Black-Scholes, FFT, matmul), warm JIT compile times are ~1.4x faster on geomean and up to ~2x faster on individual kernels versus Numba-CUDA.\n\n[Lower launch latency](https://github.com/NVIDIA/numba-cuda-mlir/blob/main/tests/benchmarks/launch_latency_ubench/README.md). Host-side kernel dispatch overhead drops by roughly 2-3.5x for typical kernels and up to ~17x for kernels with many scalar arguments, where argument packing previously dominated.\n\nYou can test Numba CUDA MLIR 0.3 by installing `numba-cuda-mlir[cu13]` from PyPI, and follow its development on GitHub.\n\n## Try CUDA Python today\n\nInstall the CUDA Python stack directly from PyPI:\n\n```\npip install cuda-python cuda-cccl numba-cuda-mlir[cu13]\n```\n\nThis pulls in `cuda.bindings 13.3.0`, `cuda.core 1.0.0`, and `cuda.compute 1.0.0`, along with `cuda-pathfinder` for library discovery.\n\n## CompileIQ launched\n\nCompileIQ, a new compiler auto-tuning framework for maximizing GPU kernel performance, launches with CUDA 13.3. GPU compilers apply generic optimization heuristics that are broadly effective but aren’t necessarily optimal for specific kernels. CompileIQ flips this dynamic by using evolutionary and genetic algorithms to generate specialized compiler configurations custom-tailored to each kernel.\n\nThis unlocks extra performance. For example, for critical kernels like GEMM and attention, which account for over 90% of LLM inference compute, CompileIQ delivers up to a **15% speedup** on already-optimized Triton attention and CUTLASS GEMM kernels. 
Read more about CompileIQ, including how it works and how to use it, in this blog post.\n\n## Math libraries\n\nThe core CUDA math libraries in CUDA 13.3 gain several new features and notable performance improvements, including:\n\n- cuSPARSE:\n  - Support for CSC format in SpSV and SpSM.\n  - Support for mixed precision in SpMVOp.\n  - Support for mixed index type (64-bit offset, 32-bit index) CSR matrices in SpMVOp computation.\n  - Improved `cusparseSpMVOp_createDescr()` performance by 2.5x.\n  - Introduced new API SPMVOP_ALG1, which supports:\n    - Updating matrix values while maintaining the same sparsity pattern.\n    - Optimized buffer size.\n    - Reduced preprocess overhead.\n\n- cuBLAS:\n  - CUDA green context support.\n  - Performance improvement to FP4 matmuls on NVIDIA Blackwell Ultra.\n  - Performance improvement to TF32 matmuls on NVIDIA Blackwell and Blackwell Ultra.\n  - SYMV performance improvements for NVIDIA Hopper, Blackwell, and Blackwell Ultra.\n  - Improved user experience for FP64 emulated matmuls by enforcing a fixed workspace size that is constant across the problem space.\n\n- cuSOLVER:\n  - Public 64-bit interface `cusolverDnXpolar`, which exposes the QDWH algorithm implementation for polar decomposition in cuSOLVERDn (available in 13.2 U1).\n  - Public 64-bit interface `cusolverDnXstedc`, which computes the eigenvalues and, optionally, eigenvectors of a symmetric tridiagonal matrix using the divide and conquer method (available in 13.2 U1). 
\n  - Performance improvements for `cusolverDnXgeev` with eigenvectors by moving the eigenvector post-processing from the host to the device.\n  - `cusolverDn[D,Z]syevj` uses low-precision preconditioning, which typically improves the time-to-solution by 20% for mid-sized and large matrices on B200, and by even more on GPUs with a large FP32:FP64 ratio.\n\n## CCCL\n\nCUDA 13.3 ships with CCCL 3.3. Highlights include DLPack/mdspan interoperability, a comprehensive random number distribution library, new search and segmented scan algorithms, and a flexible N-to-M transform.\n\n### Tensor interoperability\n\nDeep learning frameworks speak in tensors, but CUDA C++ code often has to work one level lower, with raw pointers, shapes, strides, and hand-written indexing. CCCL makes it easier to preserve that tensor structure across the boundary between Python frameworks and CUDA C++. With DLPack interoperability, tensors from frameworks such as PyTorch, JAX, and CuPy can be converted into `cuda::std::mdspan` views with [`cuda::to_device_mdspan`](https://nvidia.github.io/cccl/unstable/libcudacxx/extended_api/mdspan/dlpack_to_mdspan.html#conversion-functions) for use in C++ kernels, and `cuda::std::mdspan` views can be converted back to DLPack with [`cuda::to_dlpack_tensor`](https://nvidia.github.io/cccl/unstable/libcudacxx/extended_api/mdspan/dlpack_to_mdspan.html#conversion-functions).\n\nCCCL also extends this tensor-view model inside kernels with [`cuda::shared_memory_mdspan`](https://nvidia.github.io/cccl/unstable/libcudacxx/extended_api/mdspan/shared_memory_accessor.html). Instead of treating shared memory as a flat buffer, developers can create multi-dimensional views over shared-memory tiles, making indexing clearer and less error-prone. 
The shared-memory specialization also provides address-space safety checks and guarantees that accesses use shared-memory load/store instructions.\n\n### Random number distributions\n\nCCCL 3.3 adds a comprehensive set of device-compatible random distributions to [`<cuda/std/random>`](https://nvidia.github.io/cccl/unstable/libcudacxx/standard_api/numerics_library/random.html), bringing libcu++ to near-parity with the C++ standard library’s `<random>` header. The set comprises 17 distributions from the uniform, normal, Poisson, and Bernoulli families. In addition, CCCL 3.3 backports the `cuda::std::philox4x32` and `cuda::std::philox4x64` engines from C++26 to C++17 and adds `cuda::pcg64` as an extension in `<cuda/random>`. PCG64 is the default PRNG in NumPy and provides a good balance between quality and performance.\n\n```\n#include <cuda/random>\n#include <cuda/std/random>\n\n__global__ void sample_kernel() {\n    // Seed one generator per thread from its thread index\n    cuda::pcg64 rng(threadIdx.x);\n    // Standard normal distribution: mean 0, standard deviation 1\n    cuda::std::normal_distribution<float> dist(0.0f, 1.0f);\n    float sample = dist(rng);\n}\n```\n\n### Search: `cub::DeviceFind::FindIf`\n\nCCCL 3.3 adds [`cub::DeviceFind::FindIf`](https://nvidia.github.io/cccl/unstable/cub/api/structcub_1_1DeviceFind.html#_CPPv4I0000EN3cub10DeviceFind6FindIfE11cudaError_tPvR6size_t14InputIteratorT15OutputIteratorT7ScanOpT9NumItemsT12cudaStream_t), a new speed-of-light device-wide search algorithm for finding the first element that satisfies a predicate.\n\n```\ncub::DeviceFind::FindIf(\n  d_temp, temp_bytes, input, output,\n  [] __device__ (int value) {\n    return value > 42;\n  }, num_items);\n```\n\nThis algorithm delivers up to 7x speedup compared to the search implementation used in CCCL 3.2 and accelerates Thrust’s search and predicate-query algorithms, including `thrust::find_if`, `thrust::all_of`, `thrust::any_of`, `thrust::none_of`, `thrust::equal`, `thrust::mismatch`, `thrust::is_sorted`, `thrust::partition_point`, and more.\n\nMore 
new algorithms in CCCL 3.3 include:\n\n- Segmented scan: `cub::DeviceSegmentedScan` provides a segmented version of a parallel scan that efficiently computes a scan operation over multiple independent segments.\n- Binary search: `cub::DeviceFind::LowerBound`/`UpperBound` performs a parallel search for multiple values in an ordered sequence.\n- Transform: `cub::DeviceTransform` now supports transforming N input sequences into M output sequences.\n\n## Compilers/NVCC\n\n- C++23 support: Full C++23 integration in `nvcc` and `nvrtc` empowers developers to use the latest language standard. This release modernizes the CUDA development experience, ensuring codebase consistency with modern standards while significantly improving cross-platform portability.\n- Enhanced `nvrtc` out-of-the-box experience: By bundling standard CUDA C++ headers, NVRTC streamlines the runtime compilation process and reduces prerequisite setup. This update simplifies include-path management, enabling faster implementation of portable and robust runtime compilation workflows.\n- Integrated nvprune in `nvcc`: The inclusion of pruning capabilities directly within the compiler allows for more efficient artifact management and simplified multi-arch deployment.\n\n## More CUDA 13.3 enhancements\n\nMore enhancements in CUDA 13.3 are detailed in this section.\n\n### MPS partial error isolation\n\nMPS has added support for partial error isolation. When using this feature, the CUDA driver can attribute the error to the faulting partition/client and terminate that client’s work, while other clients in other partitions that did not cause the fault won’t be terminated. 
For more info on how to use this feature, see the [release notes](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html).\n\n### Enable graph recapture to an existing graph\n\nIn CUDA graphs, a new API, [`cudaStreamBeginRecaptureToGraph()`](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html#group__CUDART__STREAM_1g980baa726cb9a77b21ed8f58a1e75b97), enables you to initiate a stream capture into an existing source graph. As the graph is recaptured, any changed node parameters are applied to the corresponding existing nodes.\n\n### Default stream creation is optional in green contexts\n\nGreen contexts used in the CUDA Driver API no longer require the creation of the default (NULL) stream via the `CU_GREEN_CTX_DEFAULT_STREAM` flag. Creation of this stream is now optional.\n\n### NVML reports inactive remapped rows\n\nA new NVML API, `nvmlDeviceGetRemappedRows_v2`, reports the number of inactive row remappings, while the older API, `nvmlDeviceGetRemappedRows`, now returns only the number of active row remappings.\n\n### Added `mmap()` support\n\nThis release extends `mmap()` support, providing a low-latency CPU mapping of discrete GPU memory in environments where it may be disadvantageous to install [GDRCopy](https://github.com/NVIDIA/gdrcopy) kernel drivers.\n\n## Get started\n\n[Download CUDA Toolkit 13.3](https://developer.nvidia.com/cuda-downloads) and get started today.\n\n### Acknowledgments\n\n*Thanks to NVIDIA contributors Andy Terrel, Rob Armstrong, Jackson Marusarz, Becca Zandstein, Mridula Prakash, Daniel Rodriguez, and Georgii Evtushenko.*", "url": "https://wpnews.pro/news/nvidia-cuda-13-3-enhances-gpu-development-with-tile-programming-in-c-compiler", "canonical_source": "https://developer.nvidia.com/blog/nvidia-cuda-13-3-enhances-gpu-development-with-tile-programming-in-c-compiler-autotuning-and-python-updates/", "published_at": "2026-05-26 21:39:17+00:00", "updated_at": "2026-05-29 08:07:46.787595+00:00", "lang": 
"en", "topics": ["ai-infrastructure", "ai-chips", "ai-tools", "ai-research", "machine-learning"], "entities": ["NVIDIA", "CUDA", "Hopper", "CompileIQ", "NVCC", "CCCL", "Nsight Compute", "Nsight Systems"], "alternates": {"html": "https://wpnews.pro/news/nvidia-cuda-13-3-enhances-gpu-development-with-tile-programming-in-c-compiler", "markdown": "https://wpnews.pro/news/nvidia-cuda-13-3-enhances-gpu-development-with-tile-programming-in-c-compiler.md", "text": "https://wpnews.pro/news/nvidia-cuda-13-3-enhances-gpu-development-with-tile-programming-in-c-compiler.txt", "jsonld": "https://wpnews.pro/news/nvidia-cuda-13-3-enhances-gpu-development-with-tile-programming-in-c-compiler.jsonld"}}