NVIDIA CUDA 13.3 brings new capabilities and performance optimizations to developers across the CUDA ecosystem. The launch of NVIDIA CUDA Tile programming in C++ enables high-level, tile-based kernel development that automatically manages complex low-level GPU details for optimal performance and portability. CUDA Tile programming is also now supported on Compute Capability 9.0 (NVIDIA Hopper) GPUs, in addition to all other supported GPU architectures.
We are also releasing CUDA Python 1.0, solidifying the support and stability of the CUDA Python software ecosystem, and introducing critical features like green contexts and process checkpointing.
For performance enthusiasts, the newly launched NVIDIA CompileIQ compiler auto-tuning framework delivers up to a 15% speedup on critical kernels like GEMM and attention. This release also features official C++23 support in NVCC, expanded tensor interoperability with DLPack/mdspan in CCCL 3.3, and numerous updates to the math libraries (cuBLAS, cuSPARSE, cuSOLVER) and profiling tools (Nsight Compute and Nsight Systems).
Release of CUDA Tile C++ #
With the release of CUDA 13.3, CUDA Tile support is extended to C++, enabling the large existing C++ codebase and developer base to create highly-optimized GPU tile kernels. This model automates parallelism, memory movement, asynchrony, and other low-level details, resulting in C++ code that is portable across NVIDIA GPU architectures. For more information, check out our blog post.
Release of CUDA Python 1.0 #
CUDA Python is a set of libraries that expose CUDA to the Python programming language. With the 1.0 release, we are committing to semantic versioning: breaking API changes occur only in major-version releases, minor releases add features, and patch releases contain bug fixes. Any public API scheduled for removal is first deprecated in a minor release with a clear replacement path.
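To make the versioning commitment concrete, here is a minimal sketch of how a downstream project might classify an upgrade under semantic versioning. The helper names `parse_version`, `upgrade_kind`, and `may_break` are hypothetical and are not part of CUDA Python.

```python
def parse_version(text):
    """Split 'MAJOR.MINOR.PATCH' into a tuple of ints (missing parts default to 0)."""
    parts = [int(p) for p in text.split(".")]
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts[:3])


def upgrade_kind(installed, candidate):
    """Classify an upgrade as 'major', 'minor', 'patch', or 'none'."""
    old, new = parse_version(installed), parse_version(candidate)
    for name, a, b in zip(("major", "minor", "patch"), old, new):
        if a != b:
            return name
    return "none"


def may_break(installed, candidate):
    """Under semantic versioning, only a major-version change may break public APIs."""
    return upgrade_kind(installed, candidate) == "major"


if __name__ == "__main__":
    print(upgrade_kind("1.0.0", "1.1.0"), may_break("1.0.0", "1.1.0"))
    print(upgrade_kind("1.4.2", "2.0.0"), may_break("1.4.2", "2.0.0"))
```

A check like this could gate automated dependency updates: minor and patch upgrades pass through, while a major upgrade is flagged for review.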
The following is more information on the software components included in CUDA Python 1.0.
| Library | Description | Next major version |
| --- | --- | --- |
| cuda.bindings | Low-level Python bindings to CUDA C APIs | 13.3.0 |
| cuda.core | Pythonic access to CUDA Runtime and other core functionality | 1.0.0 |
| cuda-cccl | Pythonic access to CCCL’s highly efficient and customizable parallel algorithms | 1.0.0 |
| cuda-pathfinder | Utilities for locating CUDA components installed in the user’s Python environment | 1.6 |
cuda.coop is also available in the cuda-cccl package under the _experimental namespace, which is subject to API changes. cuda.coop provides reusable block-wide and warp-wide device primitives for use within Numba CUDA kernels.
cuda.core is now stable
cuda.core provides a Pythonic interface to the CUDA runtime, including devices, streams, programs, linkers, memory resources, and graphs. Version 1.0 consolidates APIs that have been stabilizing over the previous release cycles into a single supported surface. At the same time, we added support for green contexts, CUDA checkpointing, and more.
- Green contexts: Split a GPU’s SMs into disjoint partitions, each with its own context and streams, so latency-sensitive kernels are shielded from long-running throughput kernels in the same process.
- Process checkpointing: Snapshot the full CUDA state of a running process (including device allocations, streams, and context) and restore it later. This unlocks CRIU-style workflows for GPU processes: fault-tolerant long jobs, preemption and migration on shared clusters, and fast warm-start of inference workers. Only available on Linux.
- Inter-process sharing (IPC): Share GPU memory across Python processes without copying through the host. One process allocates, and others map the same physical VRAM into their own address space. Ideal for multi-process ML serving and zero-copy producer/consumer pipelines.
The following are quick examples of how to use cuda.core APIs.
# Compile a kernel at runtime and launch it on a stream.
# `src` (CUDA C++ source string) and `args` (kernel arguments) are defined elsewhere.
from cuda.core import Device, Stream, Program, ProgramOptions, LaunchConfig, launch
dev = Device()
dev.set_current()
stream = dev.create_stream()
prog = Program(src, code_type="c++", options=ProgramOptions(arch=f"sm_{dev.arch}"))
kernel = prog.compile("cubin").get_kernel("my_kernel")
launch(stream, LaunchConfig(grid=64, block=256), kernel, *args)

# Link separately compiled objects (`obj1`, `obj2`) into one module.
from cuda.core import Linker, LinkerOptions
module = Linker(
    [obj1, obj2],
    options=LinkerOptions(arch=f"sm_{dev.arch}")
).link("cubin")

# Program options, including precompiled headers.
from cuda.core import ProgramOptions
opts = ProgramOptions(std="c++17", arch=f"sm_{dev.arch}", create_pch=True, pch_dir="/tmp/pch")

# Memory resources: pinned host memory bound to a NUMA node.
from cuda.core import DeviceMemoryResource, PinnedMemoryResource, PinnedMemoryResourceOptions, ManagedMemoryResource, ManagedMemoryResourceOptions
pinned = PinnedMemoryResource(PinnedMemoryResourceOptions(numa_id=0))

# CUDA graphs: build by stream capture, or define nodes explicitly.
from cuda.core.graph import GraphBuilder, GraphDef
gb = stream.create_graph_builder()
gb.begin_building()
graph = gb.end_building().complete()
graph.launch(stream)
gdef = GraphDef()
gdef.add_kernel_node(kernel, LaunchConfig(grid=64, block=256), args=args)

# Inter-process sharing: allocate from an IPC-enabled memory resource.
from cuda.core import DeviceMemoryResource, DeviceMemoryResourceOptions
mr = DeviceMemoryResource(dev,
    options=DeviceMemoryResourceOptions(max_size=1 << 20, ipc_enabled=True))
buffer = mr.allocate(nbytes)  # buffer is picklable and can be sent over mp.Queue

# Green contexts: reserve 16 SMs for latency-critical work.
from cuda.core import ContextOptions, SMResourceOptions
sm = dev.resources.sm
long_grp, crit_grp = sm.split(SMResourceOptions(count=(sm.sm_count - 16, 16)))[0]
ctx_crit = dev.create_context(ContextOptions(resources=[crit_grp]))
s_crit = ctx_crit.create_stream()

# Process checkpointing (Linux only).
import os
from cuda.core import checkpoint
proc = checkpoint.Process(os.getpid())
proc.lock(timeout_ms=5000)
proc.checkpoint()
proc.restore()
proc.unlock()

# Tensor maps from a strided view of an existing tensor.
from cuda.core import StridedMemoryView, TensorMapDescriptor
tmap = StridedMemoryView(tensor).as_tensor_map(box_shape=(128,))

# DLPack export from a strided view.
from cuda.core.utils import StridedMemoryView
view = StridedMemoryView(torch_tensor)
capsule = view.__dlpack__()

# System-level queries.
from cuda.core import system
print(system.num_devices, system.driver_version)

# NVML bindings.
from cuda.bindings import nvml
nvml.init()
name = nvml.device_get_name(nvml.device_get_handle_by_index_v2(0))

# nvFatbin bindings.
from cuda.bindings import nvfatbin
handle = nvfatbin.create()
CCCL Python release 1.0.0: cuda.compute
cuda.compute brings the highly tuned parallel algorithms of the CUDA Core Compute Libraries (CCCL), including sort, scan, reduce, transform, unique, histogram, top-k, and more, to Python as host-callable building blocks. Changes since the last release include:
- Python lambdas can be used as algorithm operators, reducing boilerplate for simple reductions, scans, transforms, and predicates.
- Algorithms support operators with side effects (state), enabling use cases like running accumulators and conditional transforms.
- New cuda.compute.upper_bound and cuda.compute.lower_bound APIs expose CUB’s parallel binary search to Python.
- Consolidated caching across all algorithms for faster repeated invocations.
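import cupy as cp
import numpy as np
import cuda.compute
from cuda.compute import OpKind

# int64 is used because the sum of 1..1,000,000 (500,000,500,000) overflows int32.
d_input = cp.arange(1, 1_000_001, dtype=cp.int64)
d_output = cp.empty(1, dtype=cp.int64)
h_init = np.array([0], dtype=np.int64)

# Sum reduction with a built-in operator.
cuda.compute.reduce_into(
    d_input, d_output, OpKind.PLUS, d_input.size, h_init
)

# Max reduction with a Python lambda as the operator.
cuda.compute.reduce_into(
    d_input, d_output,
    lambda a, b: a if a > b else b,
    d_input.size, h_init,
)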
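For readers less familiar with the lower-bound and upper-bound semantics mentioned in the list of changes above, the following is a host-side reference in plain Python using the standard `bisect` module. It assumes the usual C++ `std::lower_bound` and `std::upper_bound` conventions; it is a CPU sketch of the semantics, not the cuda.compute API, and the helper names are hypothetical.

```python
from bisect import bisect_left, bisect_right


def lower_bound_many(sorted_values, queries):
    """For each query, the index of the first element >= query."""
    return [bisect_left(sorted_values, q) for q in queries]


def upper_bound_many(sorted_values, queries):
    """For each query, the index of the first element > query."""
    return [bisect_right(sorted_values, q) for q in queries]


haystack = [1, 3, 3, 3, 7, 9]   # must already be sorted
needles = [0, 3, 8, 10]

lows = lower_bound_many(haystack, needles)
highs = upper_bound_many(haystack, needles)

# highs[i] - lows[i] is the number of occurrences of needles[i] in haystack.
counts = [hi - lo for lo, hi in zip(lows, highs)]
```

Each query is independent of the others, which is what makes this kind of multi-value search a natural fit for a parallel implementation.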
cuda.coop exposes CCCL’s warp-wide and block-wide cooperative primitives for use inside Numba CUDA kernels. At the moment, this module is under the _experimental namespace and may have API changes that don’t follow semantic versioning.
import numba
import numpy as np
from numba import cuda
from cuda.coop._experimental import block, warp

THREADS = 128
# Build a block-wide sum specialized for int32 and this block size.
block_sum = block.make_sum(numba.int32, THREADS)

@cuda.jit(link=block_sum.files)
def reduce_kernel(data, out):
    total = block_sum(data[cuda.threadIdx.x])
    if cuda.threadIdx.x == 0:
        out[0] = total

h_in = np.ones(THREADS, dtype=np.int32)
d_in = cuda.to_device(h_in)
d_out = cuda.device_array(1, dtype=np.int32)
reduce_kernel[1, THREADS](d_in, d_out)
assert d_out.copy_to_host()[0] == THREADS  # 128
New Numba CUDA MLIR backend
Numba CUDA MLIR is a new Numba-compatible kernel generator for Python, written from the ground up on top of MLIR and the modern NVVM toolchain. It preserves the familiar @cuda.jit programming model from Numba-CUDA while delivering lower compile latency, better diagnostics, and a cleaner path to target new GPU architectures and features as they land in the NVVM stack. Numba CUDA MLIR can be used as a drop-in replacement for numba.cuda by simply replacing the import statement:
# from numba import cuda          # before: Numba-CUDA
from numba_cuda_mlir import cuda  # after: Numba CUDA MLIR

@cuda.jit
def vector_add(a, b, out):
    i = cuda.grid(1)
    if i < out.shape[0]:
        out[i] = a[i] + b[i]
Beyond existing Numba-CUDA compatibility, Numba CUDA MLIR also features:
- Faster JIT compile. Across a suite of real kernels (vector add, softmax, Cholesky, attention, Black-Scholes, FFT, matmul), warm JIT compile times are ~1.4x faster on geomean and up to ~2x faster on individual kernels versus Numba-CUDA.
- Lower launch latency. Host-side kernel dispatch overhead drops by roughly 2-3.5x for typical kernels and up to ~17x for kernels with many scalar arguments, where argument packing previously dominated.
You can test Numba CUDA MLIR 0.3 by installing numba-cuda-mlir[cu13] from PyPI, and follow its development on GitHub.
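Since the compile-time figures above are quoted as a geometric mean, here is a short worked example of how such a summary is computed from per-kernel timings. The timings below are hypothetical placeholders chosen for illustration; they are not measurements of Numba CUDA MLIR.

```python
from statistics import geometric_mean

# Hypothetical warm JIT compile times in milliseconds: (baseline, new backend).
compile_times_ms = {
    "vector_add": (100.0, 80.0),
    "softmax": (200.0, 100.0),
    "matmul": (400.0, 320.0),
}

# Per-kernel speedup is baseline time divided by new time.
speedups = {name: old / new for name, (old, new) in compile_times_ms.items()}

# The geometric mean is the appropriate average for ratios such as speedups.
geomean_speedup = geometric_mean(list(speedups.values()))
best_kernel = max(speedups, key=speedups.get)

print(f"geomean speedup: {geomean_speedup:.2f}x, best kernel: {best_kernel}")
```

The geometric mean is used instead of the arithmetic mean so that a single outlier kernel does not dominate the summary.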
Try CUDA Python today #
Install the CUDA Python stack directly from PyPI:
pip install cuda-python cuda-cccl numba-cuda-mlir[cu13]
This pulls in cuda.bindings 13.3.0, cuda.core 1.0.0, and cuda.compute 1.0.0, along with cuda-pathfinder for library discovery.
CompileIQ launched #
CompileIQ, a new compiler auto-tuning framework for maximizing GPU kernel performance, launches with CUDA 13.3. GPU compilers apply generic optimization heuristics that are broadly effective but aren’t necessarily optimal for specific kernels. CompileIQ flips this dynamic by using evolutionary and genetic algorithms to generate specialized compiler configurations custom-tailored to each kernel.
This unlocks extra performance. For example, for critical kernels like GEMM and attention, which account for over 90% of LLM inference compute, CompileIQ delivers up to a 15% speedup on already-optimized Triton attention and CUTLASS GEMM kernels. Read more about CompileIQ, including how it works and how to use it, in this blog post.
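To give a feel for what an evolutionary search over compiler configurations looks like, the following toy sketch evolves a small set of tuning knobs against a synthetic cost function. Everything here is an assumption made for illustration: the knob names, the cost model, and the helper functions are hypothetical and do not reflect CompileIQ's actual options, API, or algorithm.

```python
import random

# Hypothetical tuning knobs; real compiler options will differ.
SEARCH_SPACE = {
    "unroll": [1, 2, 4, 8],
    "max_registers": [32, 64, 128, 255],
    "prefer_shared_memory": [False, True],
}
DEFAULT_CONFIG = {"unroll": 1, "max_registers": 255, "prefer_shared_memory": False}


def measure_runtime(config):
    """Stand-in for 'compile and time the kernel'; returns a synthetic cost."""
    cost = 10.0
    cost += abs(config["unroll"] - 4) * 0.5
    cost += abs(config["max_registers"] - 64) / 64.0
    cost -= 1.0 if config["prefer_shared_memory"] else 0.0
    return cost


def random_config(rng):
    return {knob: rng.choice(values) for knob, values in SEARCH_SPACE.items()}


def crossover(a, b, rng):
    """Take each knob from one of the two parents."""
    return {knob: rng.choice((a[knob], b[knob])) for knob in SEARCH_SPACE}


def mutate(config, rng, rate=0.2):
    """Randomly re-draw some knobs."""
    return {
        knob: rng.choice(SEARCH_SPACE[knob]) if rng.random() < rate else value
        for knob, value in config.items()
    }


def evolve(generations=20, population_size=12, seed=0):
    rng = random.Random(seed)
    # Include the default so the result is never worse than the default.
    population = [dict(DEFAULT_CONFIG)]
    population += [random_config(rng) for _ in range(population_size - 1)]
    for _ in range(generations):
        population.sort(key=measure_runtime)
        survivors = population[: population_size // 2]  # elitism: keep the best half
        children = [
            mutate(crossover(rng.choice(survivors), rng.choice(survivors), rng), rng)
            for _ in range(population_size - len(survivors))
        ]
        population = survivors + children
    return min(population, key=measure_runtime)


best = evolve()
```

In a real auto-tuner, `measure_runtime` would compile the kernel with the candidate configuration and time it on the target GPU, which is where nearly all of the search cost goes.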
Math libraries #
Core CUDA math libraries in CUDA 13.3 include several new features and notable performance improvements:
- cuSPARSE:
  - Support for CSC format in SpSV and SpSM.
  - Support for mixed precision in SpMVOp.
  - Support for mixed index type (64-bit offset, 32-bit index) CSR matrices in SpMVOp computation.
  - Improved cusparseSpMVOp_createDescr() performance by 2.5x.
  - Introduced new API SPMVOP_ALG1, which supports:
    - Updating matrix values while maintaining the same sparsity pattern.
    - Optimized buffer size.
    - Reduced preprocess overhead.
- cuBLAS:
  - CUDA green context support.
  - Performance improvement to FP4 matmuls on NVIDIA Blackwell Ultra.
  - Performance improvement to TF32 matmuls on NVIDIA Blackwell and Blackwell Ultra.
  - SYMV performance improvements for NVIDIA Hopper, Blackwell, and Blackwell Ultra.
  - Improved user experience for FP64 emulated matmuls by enforcing a fixed workspace size that is constant across the problem space.
- cuSOLVER:
  - Public 64-bit interface cusolverDnXpolar, which exposes the QDWH algorithm implementation for polar decomposition in cuSOLVERDn (available in 13.2 U1).
  - Public 64-bit interface cusolverDnXstedc, which computes the eigenvalues and, optionally, eigenvectors of a symmetric tridiagonal matrix using the divide and conquer method (available in 13.2 U1).
  - Performance improvements for cusolverDnXgeev with eigenvectors by moving the eigenvector post-processing from the host to the device.
  - cusolverDn[D,Z]syevj uses low-precision preconditioning, which typically improves the time-to-solution by 20% for mid-sized and large matrices on B200, and by even more on GPUs with a large FP32:FP64 ratio.
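To illustrate two of the cuSPARSE items above (a CSR matrix with 64-bit offsets and 32-bit column indices, and updating values while keeping the sparsity pattern), here is a small CPU reference for sparse matrix-vector multiplication in plain Python. It is a sketch of the data layout only and does not use or mirror the cuSPARSE API; `csr_spmv` is a hypothetical helper.

```python
from array import array

# CSR storage for the 3x4 matrix
#   [[10,  0,  0, 20],
#    [ 0, 30,  0,  0],
#    [40,  0, 50, 60]]
row_offsets = array("q", [0, 2, 3, 6])         # 64-bit signed row offsets
col_indices = array("i", [0, 3, 1, 0, 2, 3])   # C int column indices (32-bit on common platforms)
values = array("d", [10.0, 20.0, 30.0, 40.0, 50.0, 60.0])


def csr_spmv(row_offsets, col_indices, values, x):
    """Compute y = A @ x for a matrix A stored in CSR format."""
    y = []
    for row in range(len(row_offsets) - 1):
        start, end = row_offsets[row], row_offsets[row + 1]
        y.append(sum(values[k] * x[col_indices[k]] for k in range(start, end)))
    return y


x = [1.0, 2.0, 3.0, 4.0]
y_before = csr_spmv(row_offsets, col_indices, values, x)

# Update the values in place; offsets and indices (the sparsity pattern) are unchanged.
for k in range(len(values)):
    values[k] *= 2.0
y_after = csr_spmv(row_offsets, col_indices, values, x)
```

Wide offsets matter when the total number of nonzeros exceeds the 32-bit range, while narrow column indices keep the per-nonzero memory traffic low.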
CCCL #
CUDA 13.3 ships with CCCL 3.3. Highlights include DLPack/mdspan interoperability, a comprehensive random number distribution library, new search and segmented scan algorithms, and a flexible N-to-M transform.
Tensor interoperability
Deep learning frameworks speak in tensors, but CUDA C++ code often has to work one level lower, with raw pointers, shapes, strides, and hand-written indexing. CCCL makes it easier to preserve that tensor structure across the boundary between Python frameworks and CUDA C++. With DLPack interoperability, tensors from frameworks such as PyTorch, JAX, and CuPy can be converted into cuda::std::mdspan views with cuda::to_device_mdspan for use in C++ kernels, and cuda::std::mdspan views can be converted back to DLPack with cuda::to_dlpack_tensor.
CCCL also extends this tensor-view model inside kernels with cuda::shared_memory_mdspan. Instead of treating shared memory as a flat buffer, developers can create multi-dimensional views over shared-memory tiles, making indexing clearer and less error-prone. The shared-memory specialization also provides address-space safety checks and guarantees shared-memory load/store instructions.
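The idea behind a multi-dimensional view, as opposed to a flat buffer with hand-written indexing, can be sketched in a few lines of Python. The `StridedView` class below is a hypothetical illustration of how a shape and a set of strides map an index tuple to a flat offset; it is not the mdspan API and omits layouts, accessors, and address-space checks.

```python
class StridedView:
    """A minimal multi-dimensional view over a flat buffer (strides in elements)."""

    def __init__(self, buffer, shape, strides=None, offset=0):
        self.buffer = buffer
        self.shape = tuple(shape)
        self.strides = tuple(strides) if strides is not None else self._row_major(self.shape)
        self.offset = offset

    @staticmethod
    def _row_major(shape):
        # The last dimension is contiguous; each earlier stride is the product
        # of all later extents.
        strides, step = [], 1
        for extent in reversed(shape):
            strides.append(step)
            step *= extent
        return tuple(reversed(strides))

    def _flat_index(self, index):
        if len(index) != len(self.shape):
            raise IndexError("wrong number of indices")
        for i, extent in zip(index, self.shape):
            if not 0 <= i < extent:
                raise IndexError("index out of bounds")
        return self.offset + sum(i * s for i, s in zip(index, self.strides))

    def __getitem__(self, index):
        return self.buffer[self._flat_index(index)]

    def __setitem__(self, index, value):
        self.buffer[self._flat_index(index)] = value

    def transposed(self):
        """Return a view with reversed dimensions; no data is copied."""
        return StridedView(self.buffer, self.shape[::-1], self.strides[::-1], self.offset)


tile = StridedView(list(range(12)), shape=(3, 4))  # 3 rows, 4 columns
element = tile[1, 2]                               # row 1, column 2
```

Because the view carries its own shape and strides, operations such as a transpose only change metadata, and bounds checks live in one place instead of being repeated at every indexing site.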
Random number distributions
CCCL 3.3 adds a comprehensive set of 17 device-compatible random distributions to <cuda/std/random>, including uniform, normal, Poisson, and Bernoulli distributions, bringing libcu++ to near-parity with the C++ standard library’s <random> header. In addition, CCCL 3.3 backports the cuda::std::philox4x32 and cuda::std::philox4x64 engines from C++26 to C++17 and adds cuda::pcg64 as an extension in <cuda/random>. PCG64 is the default PRNG in NumPy and provides a good balance between quality and performance.
#include <cuda/random>
#include <cuda/std/random>

__global__ void sample_kernel() {
    cuda::pcg64 rng(threadIdx.x);  // seed each thread's generator with its thread index
    cuda::std::normal_distribution<float> dist(0.0f, 1.0f);
    float sample = dist(rng);
}
Search: cub::DeviceFind::FindIf
CCCL 3.3 adds cub::DeviceFind::FindIf, a new speed-of-light device-wide search algorithm for finding the first element that satisfies a predicate.
cub::DeviceFind::FindIf(
    d_temp, temp_bytes, input, output,
    [] __device__ (int value) {
        return value > 42;
    }, num_items);
This algorithm delivers up to 7x speedup compared to the search implementation used in CCCL 3.2 and accelerates Thrust’s search and predicate-query algorithms, including thrust::find_if, thrust::all_of, thrust::any_of, thrust::none_of, thrust::equal, thrust::mismatch, thrust::is_sorted, thrust::partition_point, and more.
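The reason one fast search primitive speeds up so many other algorithms is that they can all be phrased as a find-first query. The following host-side Python sketch shows that relationship. It assumes that "not found" is reported as the number of items, and the functions are hypothetical references, not the CUB or Thrust APIs.

```python
def find_if(items, predicate):
    """Index of the first item satisfying `predicate`, or len(items) if none does."""
    for index, item in enumerate(items):
        if predicate(item):
            return index
    return len(items)


def any_of(items, predicate):
    return find_if(items, predicate) != len(items)


def all_of(items, predicate):
    # All items pass exactly when no item fails.
    return find_if(items, lambda item: not predicate(item)) == len(items)


def none_of(items, predicate):
    return not any_of(items, predicate)


def is_sorted(items):
    # Search the adjacent pairs for the first one that is out of order.
    pairs = list(zip(items, items[1:]))
    return find_if(pairs, lambda pair: pair[1] < pair[0]) == len(pairs)


data = [3, 17, 42, 99, 7]
first_big = find_if(data, lambda value: value > 42)
```

Each derived query stops as soon as the underlying search finds a match, so an early-exit search benefits all of them.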
More new algorithms in CCCL 3.3 include:
- Segmented scan: cub::DeviceSegmentedScan provides a segmented version of a parallel scan that efficiently computes a scan operation over multiple independent segments.
- Binary search: cub::DeviceFind::LowerBound/UpperBound performs a parallel search for multiple values in an ordered sequence.
- Transform: cub::DeviceTransform now supports transforming N input sequences into M output sequences.
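As a reference for what a segmented scan and an N-to-M transform compute, here is a sequential Python sketch. The segment layout (an offsets array with one more entry than there are segments) is an assumption made for illustration, and both helpers are hypothetical; they describe the results, not the CUB interfaces.

```python
from itertools import accumulate
import operator


def segmented_inclusive_scan(values, offsets, op=operator.add):
    """Scan each segment values[offsets[i]:offsets[i + 1]] independently."""
    out = []
    for begin, end in zip(offsets, offsets[1:]):
        out.extend(accumulate(values[begin:end], op))
    return out


def transform_n_to_m(func, *inputs):
    """Apply `func` elementwise to N inputs; `func` returns an M-tuple per element."""
    rows = [func(*args) for args in zip(*inputs)]
    return [list(column) for column in zip(*rows)]


values = [1, 2, 3, 4, 5, 6]
offsets = [0, 2, 2, 6]  # three segments: [1, 2], [], [3, 4, 5, 6]
scanned = segmented_inclusive_scan(values, offsets)

# Two inputs in, two outputs out: elementwise sums and differences.
sums, diffs = transform_n_to_m(lambda a, b: (a + b, a - b), [1, 2, 3], [10, 20, 30])
```

The running total restarts at every segment boundary, and the transform reads each input element once while producing all of its outputs in a single pass.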
Compilers/NVCC #
- C++23 support: Full C++23 integration in nvcc and nvrtc empowers developers to use the latest language standard. This release modernizes the CUDA development experience, ensuring codebase consistency with modern standards while significantly improving cross-platform portability.
- Enhanced nvrtc out-of-the-box experience: By bundling standard CUDA C++ headers, NVRTC streamlines the runtime compilation process and reduces prerequisite setup. This update simplifies include-path management, enabling faster implementation of portable and robust runtime compilation workflows.
- Integrated nvprune in nvcc: The inclusion of pruning capabilities directly within the compiler allows for more efficient artifact management and simplified multi-arch deployment.
More CUDA 13.3 enhancements #
More enhancements in CUDA 13.3 are detailed in this section.
MPS partial error isolation
MPS has added support for partial error isolation. When using this feature, the CUDA driver can attribute the error to the faulting partition/client and terminate that client’s work, while other clients in other partitions that did not cause the fault won’t be terminated. For more info on how to use this feature, see the release notes.
Enable graph recapture to an existing graph
In CUDA graphs, a new API, cudaStreamBeginRecaptureToGraph(), enables you to initiate a stream capture into an existing source graph. As the graph is recaptured, any changed node parameters are applied to the corresponding existing nodes.
Default stream creation is optional in green contexts
Green contexts used in the CUDA Driver API no longer require the creation of the default (NULL) stream via the CU_GREEN_CTX_DEFAULT_STREAM flag. Creation of this stream is now optional.
NVML reports inactive remapped rows
A new NVML API, nvmlDeviceGetRemappedRows_v2, reports the number of inactive row remappings, while the old API, nvmlDeviceGetRemappedRows, now returns only the number of active row remappings.
Added mmap() support
This release extends mmap() support, providing a low-latency CPU mapping of discrete GPU memory in environments where it may be disadvantageous to install GDRCopy kernel drivers.
Get started #
Download CUDA Toolkit 13.3 and get started today.
Acknowledgments
Thanks to NVIDIA contributors Andy Terrel, Rob Armstrong, Jackson Marusarz, Becca Zandstein, Mridula Prakash, Daniel Rodriguez, and Georgii Evtushenko.