NVIDIA CUDA 13.3 Enhances GPU Development with Tile Programming in C++, Compiler Autotuning, and Python Updates

NVIDIA released CUDA 13.3, introducing tile programming in C++ that automates low-level GPU management for optimized kernel development across all supported architectures. The update also includes CUDA Python 1.0 with green contexts and process checkpointing, plus the CompileIQ compiler autotuning framework that delivers up to 15% speedup on critical kernels like GEMM and attention.

NVIDIA CUDA 13.3 brings new capabilities and performance optimizations to developers across the CUDA ecosystem. The launch of NVIDIA CUDA Tile programming in C++ enables high-level, tile-based kernel development that automatically manages complex low-level GPU details for optimal performance and portability. CUDA Tile programming is also now supported on Compute Capability 9.0 NVIDIA Hopper GPUs, in addition to all other supported GPU architectures. We are also releasing CUDA Python 1.0, solidifying the support and stability of the CUDA Python software ecosystem and introducing critical features like green contexts and process checkpointing. For performance enthusiasts, the newly launched NVIDIA CompileIQ compiler auto-tuning framework delivers up to a 15% speedup on critical kernels like GEMM and attention. This release also features official C++23 support in NVCC, expanded tensor interoperability with DLPack/mdspan in CCCL 3.3, and numerous updates to the math libraries (cuBLAS, cuSPARSE, cuSOLVER) and the profiling tools Nsight Compute and Nsight Systems.

Release of CUDA Tile C++

With the release of CUDA 13.3, CUDA Tile support is extended to C++, enabling the large existing C++ codebase and developer base to create highly optimized GPU tile kernels. This model automates parallelism, memory movement, asynchrony, and other low-level details, resulting in C++ code that is portable across NVIDIA GPU architectures. For more information, check out our blog post: https://developer.nvidia.com/blog/develop-high-performance-gpu-kernels-in-cpp-with-nvidia-cuda-tile/

Release of CUDA Python 1.0

CUDA Python is a set of libraries that expose CUDA to the Python programming language. By providing the 1.0 release, we are committing to semantic versioning: breaking API changes happen only in major-version releases, minor releases add features, and patch releases contain bug fixes.
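As a rough illustration of what this policy means when pinning dependencies, the sketch below classifies an upgrade by which version component changed. The helper names (`parse`, `upgrade_kind`) and the version strings are hypothetical and are not part of CUDA Python. It assumes plain `MAJOR.MINOR.PATCH` strings with no pre-release tags.

```python
from typing import NamedTuple


class Version(NamedTuple):
    major: int
    minor: int
    patch: int


def parse(version: str) -> Version:
    """Parse a plain 'MAJOR.MINOR.PATCH' string (assumption: no pre-release tags)."""
    major, minor, patch = (int(part) for part in version.split("."))
    return Version(major, minor, patch)


def upgrade_kind(installed: str, candidate: str) -> str:
    """Classify an upgrade under semantic versioning (hypothetical helper)."""
    old, new = parse(installed), parse(candidate)
    if new <= old:
        return "not an upgrade"
    if new.major != old.major:
        return "major: may contain breaking API changes"
    if new.minor != old.minor:
        return "minor: new features, no breaking changes expected"
    return "patch: bug fixes only"


# Hypothetical version strings, used only to exercise the helper.
print(upgrade_kind("1.0.0", "1.0.3"))
print(upgrade_kind("1.0.3", "1.2.0"))
print(upgrade_kind("1.2.0", "2.0.0"))
```

Under this reading, a requirement such as `>=1.0,<2` accepts new features and fixes while excluding the next major release.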
Any public API scheduled for removal is first deprecated in a minor release with a clear replacement path.

The following table lists the software components included in CUDA Python 1.0.

| Library | Description | Next major version |
| cuda.bindings | Low-level Python bindings to CUDA C APIs | 13.3.0 |
| cuda.core | Pythonic access to CUDA Runtime and other core functionality | 1.0.0 |
| cuda-cccl | Pythonic access to CCCL's highly efficient and customizable parallel algorithms | 1.0.0 |
| cuda-pathfinder | Utilities for locating CUDA components installed in the user's Python environment | 1.6 |

cuda.coop is also available in the cuda-cccl package under the experimental namespace, which is subject to API changes. cuda.coop provides reusable block-wide and warp-wide device primitives for use within Numba CUDA kernels.

cuda.core is now stable

cuda.core provides a Pythonic interface to the CUDA runtime, including devices, streams, programs, linkers, memory resources, and graphs. Version 1.0 consolidates APIs that have been stabilizing over the previous release cycles into a single supported surface. At the same time, we added support for green contexts, CUDA checkpointing, and more.

- Green contexts: Split a GPU's SMs into disjoint partitions, each with its own context and streams, so latency-sensitive kernels are shielded from long-running throughput kernels in the same process.
- Process checkpointing: Snapshot the full CUDA state of a running process (including device allocations, streams, and context) and restore it later. This unlocks CRIU-style workflows for GPU processes: fault-tolerant long jobs, preemption and migration on shared clusters, and fast warm-start of inference workers. Only available on Linux.
- Inter-process sharing (IPC): Share GPU memory across Python processes without copying through the host. One process allocates, and others map the same physical VRAM into their own address space.
Ideal for multi-process ML serving and zero-copy producer/consumer pipelines.

The following are quick examples of how to use cuda.core APIs.

    # src, args, obj1, obj2, nbytes, tensor, and torch_tensor are placeholders supplied by the caller
    import os

    from cuda.core import Device, Stream, Program, ProgramOptions, LaunchConfig, launch

    # pick and activate a GPU
    dev = Device()
    dev.set_current()

    # create a CUDA stream
    stream = dev.create_stream()

    # NVRTC compile + lookup
    prog = Program(src, code_type="c++", options=ProgramOptions(arch=f"sm_{dev.arch}"))
    kernel = prog.compile("cubin").get_kernel("my_kernel")

    # launch a kernel
    launch(stream, LaunchConfig(grid=64, block=256), kernel, *args)

    # JIT-LTO linking
    from cuda.core import Linker, LinkerOptions
    module = Linker(obj1, obj2, options=LinkerOptions(arch=f"sm_{dev.arch}")).link("cubin")

    # NVRTC precompiled headers
    from cuda.core import ProgramOptions
    opts = ProgramOptions(std="c++17", arch=f"sm_{dev.arch}", create_pch=True, pch_dir="/tmp/pch")

    # Memory resources, incl. NUMA-aware pools
    from cuda.core import (DeviceMemoryResource, PinnedMemoryResource, PinnedMemoryResourceOptions,
                           ManagedMemoryResource, ManagedMemoryResourceOptions)
    # NUMA-pinned host memory
    pinned = PinnedMemoryResource(PinnedMemoryResourceOptions(numa_id=0))

    # CUDA graphs: stream capture and explicit construction
    from cuda.core.graph import GraphBuilder, GraphDef
    gb = stream.create_graph_builder()
    gb.begin_building()
    graph = gb.end_building().complete()
    graph.launch(stream)
    gdef = GraphDef()
    gdef.add_kernel_node(kernel, LaunchConfig(grid=64, block=256), args=args)

    # IPC: share GPU memory across Python processes
    from cuda.core import DeviceMemoryResource, DeviceMemoryResourceOptions
    mr = DeviceMemoryResource(dev, options=DeviceMemoryResourceOptions(max_size=1 << 20, ipc_enabled=True))
    buffer = mr.allocate(nbytes)  # buffer is picklable and can be sent over mp.Queue

    # Green contexts: partition SMs into disjoint groups
    from cuda.core import ContextOptions, SMResourceOptions
    sm = dev.resources.sm
    long_grp, crit_grp = sm.split(SMResourceOptions(count=(sm.sm_count - 16, 16)))[0]
    ctx_crit = dev.create_context(ContextOptions(resources=[crit_grp]))
    s_crit = ctx_crit.create_stream()

    # Process checkpoint / restore (Linux)
    from cuda.core import checkpoint
    proc = checkpoint.Process(os.getpid())
    proc.lock(timeout_ms=5000)
    proc.checkpoint()
    proc.restore()
    proc.unlock()  # device allocations and context are restored

    # TMA / TensorMapDescriptor
    from cuda.core import StridedMemoryView, TensorMapDescriptor
    tmap = StridedMemoryView(tensor).as_tensor_map(box_shape=(128,))

    # DLPack-friendly strided views
    from cuda.core.utils import StridedMemoryView
    view = StridedMemoryView(torch_tensor); capsule = view.__dlpack__()

    # System info (NVML)
    from cuda.core import system
    print(system.num_devices, system.driver_version)

    # cuda.bindings.nvml
    from cuda.bindings import nvml
    nvml.init()
    name = nvml.device_get_name(nvml.device_get_handle_by_index_v2(0))

    # cuda.bindings.nvfatbin
    from cuda.bindings import nvfatbin
    handle = nvfatbin.create()

CCCL Python release 1.0.0: cuda.compute

cuda.compute brings the highly tuned parallel algorithms of the CUDA Core Compute Libraries (CCCL), such as sort, scan, reduce, transform, unique, histogram, top-k, and more, to Python as host-callable building blocks. Changes since the last release include:

- Python lambdas can be used as algorithm operators, reducing boilerplate for simple reductions, scans, transforms, and predicates.
- Algorithms support operators with side effects (state), enabling use cases like running accumulators and conditional transforms.
- New cuda.compute.upper_bound and cuda.compute.lower_bound APIs expose CUB's parallel binary search to Python.
- Consolidated caching across all algorithms for faster repeated invocations.
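For readers unfamiliar with the two binary-search variants named in the list above, here is a host-side reference sketch of the conventional lower-bound and upper-bound semantics, built on Python's standard `bisect` module. It does not call cuda.compute, whose exact signatures are not shown in this post. It assumes the new APIs follow the usual definitions, as in C++ `std::lower_bound` and `std::upper_bound`. The function and variable names are illustrative only.

```python
from bisect import bisect_left, bisect_right


def lower_bound(sorted_values, needles):
    """For each needle, index of the first element >= needle."""
    return [bisect_left(sorted_values, n) for n in needles]


def upper_bound(sorted_values, needles):
    """For each needle, index of the first element > needle."""
    return [bisect_right(sorted_values, n) for n in needles]


haystack = [1, 3, 3, 3, 7, 9]   # must already be sorted
needles = [0, 3, 8, 10]

lo = lower_bound(haystack, needles)
hi = upper_bound(haystack, needles)

# The gap between the two bounds is the number of occurrences of each needle.
counts = [h - l for l, h in zip(lo, hi)]
print(lo, hi, counts)
```

A device-wide version performs the same lookup for many needles in parallel, which is why the sorted input sequence is a precondition.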
    import cupy as cp
    import numpy as np

    import cuda.compute
    from cuda.compute import OpKind

    d_input = cp.arange(1, 1_000_001, dtype=cp.int32)
    d_output = cp.empty(1, dtype=cp.int32)
    h_init = np.array([0], dtype=np.int32)

    # Sum reduction with a built-in operator
    cuda.compute.reduce_into(d_input, d_output, OpKind.PLUS, d_input.size, h_init)

    # Max reduction with a Python lambda as the operator
    cuda.compute.reduce_into(
        d_input, d_output, lambda a, b: a if a > b else b, d_input.size, h_init,
    )

cuda.coop exposes CCCL's warp-wide and block-wide cooperative primitives for use inside Numba CUDA kernels. At the moment, this module is under the experimental namespace and may have API changes that don't follow semantic versioning.

    import numba
    import numpy as np
    from numba import cuda
    from cuda.coop._experimental import block, warp

    THREADS = 128
    block_sum = block.make_sum(numba.int32, THREADS)

    @cuda.jit(link=block_sum.files)
    def reduce_kernel(data, out):
        # Each thread contributes one element to the block-wide reduction
        total = block_sum(data[cuda.threadIdx.x])
        if cuda.threadIdx.x == 0:
            out[0] = total

    h_in = np.ones(THREADS, dtype=np.int32)
    d_in = cuda.to_device(h_in)
    d_out = cuda.device_array(1, dtype=np.int32)
    reduce_kernel[1, THREADS](d_in, d_out)
    assert d_out.copy_to_host()[0] == THREADS  # 128

New Numba CUDA MLIR backend

Numba CUDA MLIR is a new Numba-compatible kernel generator for Python, written from the ground up on top of MLIR and the modern NVVM toolchain. It preserves the familiar @cuda.jit programming model from Numba-CUDA while delivering lower compile latency, better diagnostics, and a cleaner path to target new GPU architectures and features as they land in the NVVM stack. Numba CUDA MLIR can be used as a drop-in replacement for numba.cuda by replacing the import statement:

    # Before
    from numba import cuda

    # After
    from numba_cuda_mlir import cuda

    @cuda.jit
    def vector_add(a, b, out):
        i = cuda.grid(1)
        if i < out.shape[0]:
            out[i] = a[i] + b[i]

Beyond existing Numba-CUDA compatibility, Numba CUDA MLIR also features:

- Faster JIT compile (https://github.com/NVIDIA/numba-cuda-mlir/blob/main/tests/benchmarks/README.md):
Across a suite of real kernels (vector add, softmax, Cholesky, attention, Black-Scholes, FFT, matmul), warm JIT compile times are ~1.4x faster on geomean and up to ~2x faster on individual kernels versus Numba-CUDA.
- Lower launch latency (https://github.com/NVIDIA/numba-cuda-mlir/blob/main/tests/benchmarks/launch_latency_ubench/README.md): Host-side kernel dispatch overhead drops by roughly 2-3.5x for typical kernels and up to ~17x for kernels with many scalar arguments, where argument packing previously dominated.

You can test Numba CUDA MLIR 0.3 by installing it from PyPI (numba-cuda-mlir[cu13]) and follow its development on GitHub.

Try CUDA Python today

Install the CUDA Python stack directly from PyPI:

    pip install cuda-python cuda-cccl numba-cuda-mlir[cu13]

This pulls in cuda.bindings (13.3.0), cuda.core (1.0.0), and cuda.compute (1.0.0), along with cuda-pathfinder for library discovery.

CompileIQ launched

CompileIQ, a new compiler auto-tuning framework for maximizing GPU kernel performance, launches with CUDA 13.3. GPU compilers apply generic optimization heuristics that are broadly effective but aren't necessarily optimal for specific kernels. CompileIQ flips this dynamic by using evolutionary and genetic algorithms to generate specialized compiler configurations tailored to each kernel, which unlocks extra performance. For example, for critical kernels like GEMM and attention, which account for over 90% of LLM inference compute, CompileIQ delivers up to a 15% speedup on already-optimized Triton attention and CUTLASS GEMM kernels. Read more about CompileIQ, including how it works and how to use it, in this blog post.

Math libraries

The core CUDA math libraries in CUDA 13.3 include several new features and notable performance improvements:

- cuSPARSE:
  - Support for CSC format in SpSV and SpSM.
  - Support for mixed precision in SpMVOp.
  - Support for mixed index type (64-bit offset, 32-bit index) CSR matrices in SpMVOp computation.
  - Improved cusparseSpMVOp_createDescr performance by 2.5x.
  - Introduced new API SPMVOP_ALG1, which supports:
    - Updating matrix values while maintaining the same sparsity pattern.
    - Optimized buffer size.
    - Reduced preprocess overhead.
- cuBLAS:
  - CUDA green context support.
  - Performance improvement to FP4 matmuls on NVIDIA Blackwell Ultra.
  - Performance improvement to TF32 matmuls on NVIDIA Blackwell and Blackwell Ultra.
  - SYMV performance improvements for NVIDIA Hopper, Blackwell, and Blackwell Ultra.
  - Improved user experience for FP64 emulated matmuls by enforcing a fixed workspace size that is constant across the problem space.
- cuSOLVER:
  - Public 64-bit interface cusolverDnXpolar, which exposes the QDWH algorithm implementation for polar decomposition in cuSOLVERDn (available in 13.2 U1).
  - Public 64-bit interface cusolverDnXstedc, which computes the eigenvalues and, optionally, eigenvectors of a symmetric tridiagonal matrix using the divide and conquer method (available in 13.2 U1).
  - Performance improvements for cusolverDnXgeev with eigenvectors by moving the eigenvector post-processing from the host to the device.
  - cusolverDn[D,Z]syevj uses low-precision preconditioning, which typically improves the time-to-solution by 20% for mid-sized and large matrices on B200, and by even more on GPUs with a large FP32:FP64 ratio.

CCCL

CUDA 13.3 ships with CCCL 3.3.
Highlights include DLPack/mdspan interoperability, a comprehensive random number distribution library, new search and segmented scan algorithms, and a flexible N-to-M transform.

Tensor interoperability

Deep learning frameworks speak in tensors, but CUDA C++ code often has to work one level lower, with raw pointers, shapes, strides, and hand-written indexing. CCCL makes it easier to preserve that tensor structure across the boundary between Python frameworks and CUDA C++. With DLPack interoperability, tensors from frameworks such as PyTorch, JAX, and CuPy can be converted into cuda::std::mdspan views with cuda::to_device_mdspan (https://nvidia.github.io/cccl/unstable/libcudacxx/extended_api/mdspan/dlpack_to_mdspan.html#conversion-functions) for use in C++ kernels, and cuda::std::mdspan views can be converted back to DLPack with cuda::to_dlpack_tensor (https://nvidia.github.io/cccl/unstable/libcudacxx/extended_api/mdspan/dlpack_to_mdspan.html#conversion-functions).

CCCL also extends this tensor-view model inside kernels with cuda::shared_memory_mdspan (https://nvidia.github.io/cccl/unstable/libcudacxx/extended_api/mdspan/shared_memory_accessor.html). Instead of treating shared memory as a flat buffer, developers can create multi-dimensional views over shared-memory tiles, making indexing clearer and less error-prone. The shared-memory specialization also provides address-space safety checks and guarantees shared-memory load/store instructions.

Random number distributions

CCCL 3.3 adds a comprehensive set of device-compatible random distributions to <cuda/std/random> (https://nvidia.github.io/cccl/unstable/libcudacxx/standard_api/numerics_library/random.html), bringing libcu++ to near-parity with the C++ standard library's <random> header. In total, it provides 17 uniform, normal, Poisson, and Bernoulli distributions.
In addition, CCCL 3.3 backports the cuda::std::philox4x32 and cuda::std::philox4x64 engines from C++26 to C++17 and adds cuda::pcg64 as an extension in <cuda/random>. PCG64 is the default PRNG in NumPy and provides a good balance between quality and performance.

    #include <cuda/random>
    #include <cuda/std/random>

    __global__ void sample_kernel() {
      // Seed one generator per thread, then draw from a standard normal distribution
      cuda::pcg64 rng(threadIdx.x);
      cuda::std::normal_distribution<float> dist(0.0f, 1.0f);
      float sample = dist(rng);
    }

Search: cub::DeviceFind::FindIf

CCCL 3.3 adds cub::DeviceFind::FindIf (https://nvidia.github.io/cccl/unstable/cub/api/structcub_1_1DeviceFind.html), a new speed-of-light device-wide search algorithm for finding the first element that satisfies a predicate.

    cub::DeviceFind::FindIf(d_temp, temp_bytes, input, output,
                            [] __device__ (int value) { return value > 42; },
                            num_items);

This algorithm delivers up to 7x speedup compared to the search implementation used in CCCL 3.2 and accelerates Thrust's search and predicate-query algorithms, including thrust::find_if, thrust::all_of, thrust::any_of, thrust::none_of, thrust::equal, thrust::mismatch, thrust::is_sorted, thrust::partition_point, and more.

More new algorithms in CCCL 3.3 include:

- Segmented scan: cub::DeviceSegmentedScan provides a segmented version of a parallel scan that efficiently computes a scan operation over multiple independent segments.
- Binary search: cub::DeviceFind::LowerBound / UpperBound performs a parallel search for multiple values in an ordered sequence.
- Transform: cub::DeviceTransform now supports transforming N input sequences into M output sequences.

Compilers/NVCC

- C++23 support: Full C++23 integration in nvcc and nvrtc empowers developers to use the latest language standard. This release modernizes the CUDA development experience, ensuring codebase consistency with modern standards while significantly improving cross-platform portability.
- Enhanced nvrtc out-of-the-box experience: By bundling standard CUDA C++ headers, NVRTC streamlines the runtime compilation process and reduces prerequisite setup. This update simplifies include-path management, enabling faster implementation of portable and robust runtime compilation workflows.
- Integrated nvprune in nvcc: The inclusion of pruning capabilities directly within the compiler allows for more efficient artifact management and simplified multi-arch deployment.

More CUDA 13.3 enhancements

More enhancements in CUDA 13.3 are detailed in this section.

MPS partial error isolation

MPS has added support for partial error isolation. When using this feature, the CUDA driver can attribute an error to the faulting partition/client and terminate that client's work, while clients in other partitions that did not cause the fault won't be terminated. For more information on how to use this feature, see the release notes (https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html).

Enable graph recapture to an existing graph

In CUDA graphs, a new API, cudaStreamBeginRecaptureToGraph (https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html#group__CUDART__STREAM_1g980baa726cb9a77b21ed8f58a1e75b97), enables you to initiate a stream capture into an existing source graph. As the graph is recaptured, any changed node parameters are applied to the existing nodes.

Default stream creation is optional in green contexts

Green contexts used in the CUDA Driver API no longer require the creation of the default NULL stream via the CU_GREEN_CTX_DEFAULT_STREAM flag. Creation of this stream is now optional.
NVML reports inactive remapped rows

A new NVML API, nvmlDeviceGetRemappedRows_v2, can acquire the number of inactive row remappings, while the old API, nvmlDeviceGetRemappedRows, now returns only the number of active row remappings.

Added mmap support

This release extends mmap support, providing a low-latency CPU mapping of discrete GPU memory in environments where it may be disadvantageous to install GDRCopy (https://github.com/NVIDIA/gdrcopy) kernel drivers.

Get started

Download CUDA Toolkit 13.3 (https://developer.nvidia.com/cuda-downloads) and get started today.

Acknowledgments

Thanks to NVIDIA contributors Andy Terrel, Rob Armstrong, Jackson Marusarz, Becca Zandstein, Mridula Prakash, Daniel Rodriguez, and Georgii Evtushenko.