NVIDIA CUDA 13.3 Enhances GPU Development with Tile Programming in C++, Compiler Autotuning, and Python Updates

NVIDIA CUDA 13.3 brings new capabilities and performance optimizations to developers across the CUDA ecosystem. The launch of NVIDIA CUDA Tile programming in C++ enables high-level, tile-based kernel development that automatically manages complex low-level GPU details for optimal performance and portability. CUDA Tile programming is also now supported on Compute Capability 9.0 (NVIDIA Hopper) GPUs, in addition to all other supported GPU architectures.

We are also releasing CUDA Python 1.0, solidifying the support and stability of the CUDA Python software ecosystem and introducing critical features like green contexts and process checkpointing.

For performance enthusiasts, the newly launched NVIDIA CompileIQ compiler auto-tuning framework delivers up to a 15% speedup on critical kernels like GEMM and attention. This release also features official C++23 support in NVCC, expanded tensor interoperability with DLPack/mdspan in CCCL 3.3, and numerous updates to the math libraries (cuBLAS, cuSPARSE, cuSOLVER) and profiling tools (Nsight Compute and Nsight Systems).

Release of CUDA Tile C++ #

With the release of CUDA 13.3, CUDA Tile support is extended to C++, enabling the large existing C++ codebase and developer base to create highly optimized GPU tile kernels. This model automates parallelism, memory movement, asynchrony, and other low-level details, resulting in C++ code that is portable across NVIDIA GPU architectures. For more information, check out our blog post.

Release of CUDA Python 1.0 #

CUDA Python is a set of libraries that expose CUDA to the Python programming language. By providing the 1.0 release, we are committing to semantic versioning: breaking API changes occur only in major-version releases, minor releases add features, and patch releases contain bug fixes. Any public API scheduled for removal is first deprecated in a minor release with a clear replacement path.
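As a rough illustration of what this commitment means when pinning dependencies, the sketch below checks whether an upgrade stays within the same major version. It uses only the Python standard library; the helper names `parse_version` and `is_compatible_upgrade` are hypothetical and are not part of CUDA Python.

```python
def parse_version(version: str) -> tuple:
    """Split a 'MAJOR.MINOR.PATCH' string into a tuple of integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def is_compatible_upgrade(installed: str, candidate: str) -> bool:
    """Under semantic versioning, an upgrade is expected to be non-breaking
    when the major version is unchanged and the candidate is not older."""
    old, new = parse_version(installed), parse_version(candidate)
    return new[0] == old[0] and new >= old


# Hypothetical version strings, chosen only to exercise the helper.
print(is_compatible_upgrade("1.0.0", "1.2.3"))  # minor/patch bump
print(is_compatible_upgrade("1.4.0", "2.0.0"))  # major bump may break APIs
```

In practice, the same intent is usually expressed as a version range in a requirements file, for example a lower bound of 1.0 and an upper bound below 2.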

The following is more information on the software components included in CUDA Python 1.0.

`cuda.coop` is also available in the `cuda-cccl` package under the `_experimental` namespace, which is subject to API changes. `cuda.coop` provides reusable block-wide and warp-wide device primitives for use within Numba CUDA kernels.

`cuda.core` is now stable

`cuda.core` provides a Pythonic interface to the CUDA runtime, including devices, streams, programs, linkers, memory resources, and graphs. Version 1.0 consolidates APIs that have been stabilizing over the previous release cycles into a single supported surface. At the same time, we added support for green contexts, CUDA checkpointing, and more.

Green contexts: Split a GPU’s SMs into disjoint partitions, each with its own context and streams, so latency-sensitive kernels are shielded from long-running throughput kernels in the same process.
Process checkpointing: Snapshot the full CUDA state of a running process (including device allocations, streams, and context) and restore it later. This unlocks CRIU-style workflows for GPU processes: fault-tolerant long jobs, preemption and migration on shared clusters, and fast warm-start of inference workers. Available only on Linux.
Inter-process sharing (IPC): Share GPU memory across Python processes without copying through the host. One process allocates, and others map the same physical VRAM into their own address space. Ideal for multi-process ML serving and zero-copy producer/consumer pipelines.

The following are quick examples of how to use `cuda.core` APIs.

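from cuda.core import Device, Stream, Program, ProgramOptions, LaunchConfig, launch

# Select the default device and make it current.
dev = Device()
dev.set_current()

stream = dev.create_stream()

# `src` is a placeholder for a CUDA C++ source string that defines my_kernel.
prog = Program(src, code_type="c++", options=ProgramOptions(arch=f"sm_{dev.arch}"))
kernel = prog.compile("cubin").get_kernel("my_kernel")

# `args` is a placeholder for the kernel's arguments.
launch(stream, LaunchConfig(grid=64, block=256), kernel, *args)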

from cuda.core import Linker, LinkerOptions

module = Linker(
    [obj1, obj2],           
    options=LinkerOptions(arch=f"sm_{dev.arch}")
).link("cubin")

from cuda.core import ProgramOptions

opts = ProgramOptions(std="c++17", arch=f"sm_{dev.arch}", create_pch=True, pch_dir="/tmp/pch")

from cuda.core import DeviceMemoryResource, PinnedMemoryResource, PinnedMemoryResourceOptions, ManagedMemoryResource, ManagedMemoryResourceOptions

pinned = PinnedMemoryResource(PinnedMemoryResourceOptions(numa_id=0))


from cuda.core.graph import GraphBuilder, GraphDef

gb = stream.create_graph_builder()
gb.begin_building()
graph = gb.end_building().complete()
graph.launch(stream)
gdef = GraphDef()
gdef.add_kernel_node(kernel, LaunchConfig(grid=64, block=256), args=args)

from cuda.core import DeviceMemoryResource, DeviceMemoryResourceOptions

mr = DeviceMemoryResource(dev,
        options=DeviceMemoryResourceOptions(max_size=1 << 20, ipc_enabled=True))
buffer = mr.allocate(nbytes)   # buffer is picklable and can be sent over mp.Queue

from cuda.core import ContextOptions, SMResourceOptions
sm = dev.resources.sm
long_grp, crit_grp = sm.split(SMResourceOptions(count=(sm.sm_count - 16, 16)))[0]
ctx_crit = dev.create_context(ContextOptions(resources=[crit_grp]))
s_crit = ctx_crit.create_stream()

import os

from cuda.core import checkpoint

# Lock, snapshot, restore, and unlock the CUDA state of the current process.
proc = checkpoint.Process(os.getpid())
proc.lock(timeout_ms=5000)
proc.checkpoint()
proc.restore()
proc.unlock()

from cuda.core import StridedMemoryView, TensorMapDescriptor
tmap = StridedMemoryView(tensor).as_tensor_map(box_shape=(128,))

from cuda.core.utils import StridedMemoryView
view = StridedMemoryView(torch_tensor)
capsule = view.__dlpack__()

from cuda.core import system
print(system.num_devices, system.driver_version)

from cuda.bindings import nvml

nvml.init()
name = nvml.device_get_name(nvml.device_get_handle_by_index_v2(0))

from cuda.bindings import nvfatbin

handle = nvfatbin.create()

CCCL Python release 1.0.0: `cuda.compute`

`cuda.compute` brings the highly tuned parallel algorithms of the CUDA Core Compute Libraries (CCCL), such as sort, scan, reduce, transform, unique, histogram, and top-k, to Python as host-callable building blocks. Changes since the last release include:

Python lambdas can be used as algorithm operators, reducing boilerplate for simple reductions, scans, transforms, and predicates.
Algorithms support operators with side effects (state), enabling use cases like running accumulators and conditional transforms.
New `cuda.compute.upper_bound` and `cuda.compute.lower_bound` APIs expose CUB’s parallel binary search to Python.
Consolidated caching across all algorithms for faster repeated invocations.
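To illustrate only the semantics of these two searches, the sketch below uses Python's standard `bisect` module on the host. It does not call `cuda.compute`, whose exact signatures are not shown in this post, and it assumes the conventional C++ definitions: `lower_bound` returns the first position where a value could be inserted while keeping the sequence sorted, and `upper_bound` returns the last such position.

```python
from bisect import bisect_left, bisect_right

haystack = [1, 3, 3, 3, 7, 9]   # the searched sequence must be sorted
needles = [0, 3, 8, 10]         # many values are looked up at once

# lower_bound: index of the first element that is >= the needle.
lower = [bisect_left(haystack, x) for x in needles]
# upper_bound: index of the first element that is > the needle.
upper = [bisect_right(haystack, x) for x in needles]

# The difference counts how often each needle occurs in the haystack.
counts = [hi - lo for lo, hi in zip(lower, upper)]

print(lower)   # [0, 1, 5, 6]
print(upper)   # [0, 4, 5, 6]
print(counts)  # [0, 3, 0, 0]
```

Each needle is independent of the others, which is what makes the batched search a good fit for a parallel implementation.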

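import cupy as cp
import numpy as np

import cuda.compute
from cuda.compute import OpKind

# int64 avoids overflow: the sum of 1..1_000_000 exceeds the int32 range.
d_input = cp.arange(1, 1_000_001, dtype=cp.int64)
d_output = cp.empty(1, dtype=cp.int64)
h_init = np.array([0], dtype=np.int64)

# Sum reduction using a built-in operator.
cuda.compute.reduce_into(
    d_input, d_output, OpKind.PLUS, d_input.size, h_init
)

# Max reduction using a Python lambda as the operator.
cuda.compute.reduce_into(
    d_input, d_output,
    lambda a, b: a if a > b else b,
    d_input.size, h_init,
)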

`cuda.coop` exposes CCCL’s warp-wide and block-wide cooperative primitives for use inside Numba CUDA kernels. At the moment, this module is under the `_experimental` namespace and may have API changes that don’t follow semantic versioning.

import numba
import numpy as np
from numba import cuda
from cuda.coop._experimental import block

THREADS = 128
# Build a block-wide sum specialized for int32 and THREADS threads per block.
block_sum = block.make_sum(numba.int32, THREADS)

@cuda.jit(link=block_sum.files)
def reduce_kernel(data, out):
    total = block_sum(data[cuda.threadIdx.x])
    if cuda.threadIdx.x == 0:
        out[0] = total

h_in = np.ones(THREADS, dtype=np.int32)
d_in = cuda.to_device(h_in)
d_out = cuda.device_array(1, dtype=np.int32)

reduce_kernel[1, THREADS](d_in, d_out)

assert d_out.copy_to_host()[0] == THREADS  # 128

New Numba CUDA MLIR backend

Numba CUDA MLIR is a new Numba-compatible kernel generator for Python, written from the ground up on top of MLIR and the modern NVVM toolchain. It preserves the familiar `@cuda.jit` programming model from Numba-CUDA while delivering lower compile latency, better diagnostics, and a cleaner path to target new GPU architectures and features as they land in the NVVM stack. Numba CUDA MLIR can be used as a drop-in replacement for `numba.cuda` by simply replacing the import statement:

# Before: from numba import cuda
from numba_cuda_mlir import cuda

@cuda.jit
def vector_add(a, b, out):
    i = cuda.grid(1)
    if i < out.shape[0]:
        out[i] = a[i] + b[i]

Beyond existing Numba-CUDA compatibility, Numba CUDA MLIR also features:

Faster JIT compile. Across a suite of real kernels (vector add, softmax, Cholesky, attention, Black-Scholes, FFT, matmul), warm JIT compile times are ~1.4x faster on geomean and up to ~2x faster on individual kernels versus Numba-CUDA.
Lower launch latency. Host-side kernel dispatch overhead drops by roughly 2-3.5x for typical kernels and up to ~17x for kernels with many scalar arguments, where argument packing previously dominated.
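The compile-time figure above is quoted as a geometric mean, which is the usual way to average ratios such as speedups. As a small worked example, the sketch below computes one from per-kernel speedups. The kernel names and numbers are hypothetical placeholders chosen for illustration, not measurements from this release.

```python
from statistics import geometric_mean, mean

# Hypothetical per-kernel speedups (old compile time / new compile time).
speedups = {
    "kernel_a": 1.0,
    "kernel_b": 2.0,
    "kernel_c": 1.5,
    "kernel_d": 1.3,
}

values = list(speedups.values())
geomean = geometric_mean(values)   # n-th root of the product of the ratios
arith = mean(values)               # shown only for comparison

print(f"geomean speedup:    {geomean:.3f}x")
print(f"arithmetic average: {arith:.3f}x")
```

The geometric mean is never larger than the arithmetic mean, so a single outlier kernel inflates it less.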

You can test Numba CUDA MLIR 0.3 by installing `numba-cuda-mlir[cu13]` from PyPI, and follow its development on GitHub.

Try CUDA Python today #

Install the CUDA Python stack directly from PyPI:

pip install cuda-python cuda-cccl "numba-cuda-mlir[cu13]"

This pulls in `cuda.bindings` 13.3.0, `cuda.core` 1.0.0, and `cuda.compute` 1.0.0, along with `cuda-pathfinder` for library discovery.

CompileIQ launched #

CompileIQ, a new compiler auto-tuning framework for maximum performance on GPU kernels, launches with CUDA 13.3. GPU compilers apply generic optimization heuristics that are broadly effective but aren’t necessarily optimal for specific kernels. CompileIQ flips this dynamic by using evolutionary and genetic algorithms to generate specialized compiler configurations custom-tailored to each kernel.

This unlocks extra performance. For example, for critical kernels like GEMM and attention, which account for over 90% of LLM inference compute, CompileIQ delivers up to a 15% speedup on already-optimized Triton attention and CUTLASS GEMM kernels. Read more about CompileIQ, including how it works and how to use it, in this blog post.
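To give a feel for the general technique, here is a toy genetic search over a small configuration space. It is a generic sketch, not CompileIQ: the tuning knobs in `SEARCH_SPACE` are hypothetical and are not real CompileIQ or NVCC options, and `synthetic_runtime` is a made-up cost function standing in for "compile the kernel with this configuration and benchmark it".

```python
import random

# Hypothetical tuning knobs; these are NOT real CompileIQ or NVCC options.
SEARCH_SPACE = {
    "unroll_factor": [1, 2, 4, 8],
    "max_registers": [64, 96, 128, 255],
    "schedule": ["latency", "throughput"],
}


def synthetic_runtime(config):
    """Stand-in for compiling and benchmarking a kernel; lower is better."""
    cost = 10.0
    cost += abs(config["unroll_factor"] - 4) * 0.5
    cost += abs(config["max_registers"] - 128) / 64
    cost += 0.0 if config["schedule"] == "throughput" else 1.0
    return cost


def random_config(rng):
    return {knob: rng.choice(choices) for knob, choices in SEARCH_SPACE.items()}


def crossover(a, b, rng):
    """Take each knob from one of the two parents at random."""
    return {knob: rng.choice((a[knob], b[knob])) for knob in SEARCH_SPACE}


def mutate(config, rng, rate=0.2):
    """Re-sample each knob with probability `rate`."""
    return {
        knob: rng.choice(SEARCH_SPACE[knob]) if rng.random() < rate else value
        for knob, value in config.items()
    }


def evolve(generations=20, population_size=12, seed=0):
    rng = random.Random(seed)
    population = [random_config(rng) for _ in range(population_size)]
    for _ in range(generations):
        population.sort(key=synthetic_runtime)
        parents = population[: population_size // 2]  # keep the fittest half
        children = [
            mutate(crossover(*rng.sample(parents, 2), rng), rng)
            for _ in range(population_size - len(parents))
        ]
        population = parents + children
    return min(population, key=synthetic_runtime)


best = evolve()
print(best, synthetic_runtime(best))
```

Because the fittest half survives each generation unchanged, the best configuration found never gets worse as the search proceeds. A real tuner would replace the cost function with measured kernel runtimes.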

Math libraries #

The core CUDA math libraries in CUDA 13.3 gain several new features and notable performance improvements, including:

cuSPARSE:
Support for CSC format in SpSV and SpSM.
Support for mixed precision in SpMVOp.
Support for mixed index type (64-bit offset, 32-bit index) CSR matrix in SpMVOp computation.
Improved `cusparseSpMVOp_createDescr()` performance by 2.5x.
Introduced new API SPMVOP_ALG1, which supports:

Updating matrix values while maintaining the same sparsity pattern.
Optimized buffer size.
Reduced preprocess overhead.
cuBLAS:
CUDA green context support.
Performance improvement to FP4 matmuls on NVIDIA Blackwell Ultra.
Performance improvement to TF32 matmuls on NVIDIA Blackwell and Blackwell Ultra.
SYMV performance improvements for NVIDIA Hopper, Blackwell, and Blackwell Ultra.
Improved user experience for FP64 emulated matmuls by enforcing a fixed workspace size that is constant across the problem space.
cuSOLVER:
Public 64-bit interface `cusolverDnXpolar`, which exposes the QDWH algorithm implementation for polar decomposition in cuSOLVERDn (available in 13.2 U1).
Public 64-bit interface `cusolverDnXstedc`, which computes the eigenvalues and, optionally, eigenvectors of a symmetric tridiagonal matrix using the divide and conquer method (available in 13.2 U1).
Performance improvements for `cusolverDnXgeev` with eigenvectors by moving the eigenvector post-processing from the host to the device.
`cusolverDn[D,Z]syevj` uses low-precision preconditioning, which typically improves the time-to-solution by 20% for mid-sized and large matrices on B200, and by even more on GPUs with a large FP32:FP64 ratio.

CCCL #

CUDA 13.3 ships with CCCL 3.3. Highlights include DLPack/mdspan interoperability, a comprehensive random number distribution library, new search and segmented scan algorithms, and a flexible N-to-M transform.

Tensor interoperability

Deep learning frameworks speak in tensors, but CUDA C++ code often has to work one level lower, with raw pointers, shapes, strides, and hand-written indexing. CCCL makes it easier to preserve that tensor structure across the boundary between Python frameworks and CUDA C++. With DLPack interoperability, tensors from frameworks such as PyTorch, JAX, and CuPy can be converted into `cuda::std::mdspan` views with `cuda::to_device_mdspan` for use in C++ kernels, and `cuda::std::mdspan` views can be converted back to DLPack with `cuda::to_dlpack_tensor`.

CCCL also extends this tensor-view model inside kernels with cuda::shared_memory_mdspan. Instead of treating shared memory as a flat buffer, developers can create multi-dimensional views over shared-memory tiles, making indexing clearer and less error-prone. The shared-memory specialization also provides address-space safety checks and guarantees shared-memory load/store instructions.
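The idea of a multi-dimensional view over a flat tile can be sketched in plain Python. The `TileView` class below is a hypothetical illustration of row-major index mapping with bounds checks. It is not the CCCL API and says nothing about how `cuda::shared_memory_mdspan` is implemented.

```python
class TileView:
    """Row-major 2D view over a flat buffer (illustrative only)."""

    def __init__(self, buffer, rows, cols):
        if len(buffer) < rows * cols:
            raise ValueError("buffer too small for the requested extents")
        self.buffer, self.rows, self.cols = buffer, rows, cols

    def _offset(self, i, j):
        # Reject out-of-range indices instead of silently aliasing memory.
        if not (0 <= i < self.rows and 0 <= j < self.cols):
            raise IndexError((i, j))
        return i * self.cols + j  # row-major strides are (cols, 1)

    def __getitem__(self, idx):
        return self.buffer[self._offset(*idx)]

    def __setitem__(self, idx, value):
        self.buffer[self._offset(*idx)] = value


flat = list(range(12))                 # stands in for a flat tile buffer
tile = TileView(flat, rows=3, cols=4)

tile[1, 2] = -1                        # writes flat[1 * 4 + 2]
print(tile[2, 3], flat[6])             # 11 -1
```

Writing `tile[i, j]` instead of `flat[i * cols + j]` keeps the stride arithmetic in one place, which is the readability benefit the paragraph above describes.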

Random number distributions

CCCL 3.3 adds a comprehensive set of device-compatible random distributions to `<cuda/std/random>`, bringing libcu++ to near-parity with the C++ standard library’s `<random>` header. The set comprises 17 distributions, covering uniform, normal, Poisson, and Bernoulli variants. In addition, CCCL 3.3 backports the `cuda::std::philox4x32` and `cuda::std::philox4x64` engines from C++26 to C++17 and adds `cuda::pcg64` as an extension in `<cuda/random>`. PCG64 is the default PRNG in NumPy and provides a good balance between quality and performance.

#include <cuda/random>
#include <cuda/std/random>

__global__ void sample_kernel() {
    cuda::pcg64 rng(threadIdx.x);
    cuda::std::normal_distribution<float> dist(0.0f, 1.0f);
    float sample = dist(rng);
}

Search: `cub::DeviceFind::FindIf`

CCCL 3.3 adds cub::DeviceFind::FindIf, a new speed-of-light device-wide search algorithm for finding the first element that satisfies a predicate.

cub::DeviceFind::FindIf(
  d_temp, temp_bytes, input, output, 
  [] __device__ (int value) {
    return value > 42;
  }, num_items);

This algorithm delivers up to 7x speedup compared to the search implementation used in CCCL 3.2 and accelerates Thrust’s search and predicate-query algorithms, including `thrust::find_if`, `thrust::all_of`, `thrust::any_of`, `thrust::none_of`, `thrust::equal`, `thrust::mismatch`, `thrust::is_sorted`, `thrust::partition_point`, and more.
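To see why one fast find-first primitive can speed up several predicate queries, consider this sequential Python sketch. It mirrors the semantics only and assumes the C++ convention of returning the end position (here, `len(items)`) when nothing matches. The function names echo the Thrust algorithms but are plain Python helpers, not bindings.

```python
def find_if(items, pred):
    """Index of the first element satisfying pred, or len(items) if none."""
    for index, value in enumerate(items):
        if pred(value):
            return index
    return len(items)


def any_of(items, pred):
    return find_if(items, pred) != len(items)


def all_of(items, pred):
    # All elements satisfy pred exactly when no counterexample is found.
    return find_if(items, lambda value: not pred(value)) == len(items)


def none_of(items, pred):
    return not any_of(items, pred)


data = [3, 17, 42, 99, 7]
print(find_if(data, lambda v: v > 42))   # 3
print(any_of(data, lambda v: v > 42))    # True
print(all_of(data, lambda v: v > 0))     # True
print(none_of(data, lambda v: v < 0))    # True
```

A device-wide version evaluates the predicate in parallel and can stop early once a match is known, but the results are defined the same way.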

More new algorithms in CCCL 3.3 include:

Segmented scan: `cub::DeviceSegmentedScan` provides a segmented version of a parallel scan that efficiently computes a scan operation over multiple independent segments.
Binary search: `cub::DeviceFind::LowerBound`/`UpperBound` performs a parallel search for multiple values in an ordered sequence.
Transform: `cub::DeviceTransform` now supports transforming N input sequences into M output sequences.
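As a reference for what a segmented scan computes, the following sequential Python sketch runs an inclusive scan independently over each segment. It assumes segments are described by an offsets array in which segment k covers `values[offsets[k]:offsets[k + 1]]`. It illustrates the semantics only and is not the CUB interface.

```python
from itertools import accumulate
import operator


def segmented_inclusive_scan(values, offsets, op=operator.add):
    """Scan each segment values[offsets[k]:offsets[k + 1]] independently."""
    out = []
    for begin, end in zip(offsets, offsets[1:]):
        # The running total restarts at every segment boundary.
        out.extend(accumulate(values[begin:end], op))
    return out


values = [1, 2, 3, 4, 5, 6]
offsets = [0, 2, 2, 6]   # three segments; the middle one is empty

print(segmented_inclusive_scan(values, offsets))       # [1, 3, 3, 7, 12, 18]
print(segmented_inclusive_scan(values, offsets, max))  # [1, 2, 3, 4, 5, 6]
```

Since no value crosses a segment boundary, the segments can be processed concurrently, which is what a device-wide implementation exploits.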

Compilers/NVCC #

C++23 support: Full C++23 integration in `nvcc` and `nvrtc` empowers developers to use the latest language standard. This release modernizes the CUDA development experience, ensuring codebase consistency with modern standards while significantly improving cross-platform portability.
Enhanced `nvrtc` out-of-the-box experience: By bundling standard CUDA C++ headers, NVRTC streamlines the runtime compilation process and reduces prerequisite setup. This update simplifies include-path management, enabling faster implementation of portable and robust runtime compilation workflows.
Integrated nvprune in `nvcc`: The inclusion of pruning capabilities directly within the compiler allows for more efficient artifact management and simplified multi-arch deployment.

More CUDA 13.3 enhancements #

More enhancements in CUDA 13.3 are detailed in this section.

MPS partial error isolation

MPS has added support for partial error isolation. When using this feature, the CUDA driver can attribute the error to the faulting partition/client and terminate that client’s work, while other clients in other partitions that did not cause the fault won’t be terminated. For more info on how to use this feature, see the release notes.

Enable graph recapture to an existing graph

In CUDA graphs, a new API, cudaStreamBeginRecaptureToGraph(), lets you initiate a stream capture into an existing source graph. As the graph is recaptured, any changed node parameters are applied to the corresponding existing nodes.

Default stream creation is optional in green contexts

Green contexts used in the CUDA Driver API no longer require the creation of the default (NULL) stream via the `CU_GREEN_CTX_DEFAULT_STREAM` flag. Creation of this stream is now optional.

NVML reports inactive remapped rows

A new NVML API, `nvmlDeviceGetRemappedRows_v2`, reports the number of inactive row remappings, while the old API, `nvmlDeviceGetRemappedRows`, now returns only the number of active row remappings.

Added `mmap()` support

This release extends `mmap()` support, providing a low-latency CPU mapping of discrete GPU memory in environments where it may be disadvantageous to install GDRCopy kernel drivers.

Get started #

Download CUDA Toolkit 13.3 and get started today.

Acknowledgments

Thanks to NVIDIA contributors Andy Terrel, Rob Armstrong, Jackson Marusarz, Becca Zandstein, Mridula Prakash, Daniel Rodriguez, and Georgii Evtushenko.

Source: developer.nvidia.com (original article)