CCCL Runtime: A Modern C++ Runtime for CUDA

NVIDIA released CCCL runtime, a modern C++ runtime for CUDA, as part of CUDA 13.2. The new APIs provide safer and more convenient abstractions for stream management, memory allocation, and kernel launches, leveraging modern C++ features and lessons from 20 years of CUDA evolution.

The NVIDIA CUDA Core Compute Libraries (CCCL, https://github.com/NVIDIA/cccl) provides delightful and efficient abstractions for CUDA developers in C++ and Python. It features:

- Parallel algorithms – Host-launched algorithms, including sort, scan, and reduce, that remove the need to write custom kernels for common operations
- Cooperative algorithms – Device-side algorithms, such as block-wide or warp-wide reductions and scans, that simplify custom kernel development
- Language-idiomatic CUDA abstractions – Fundamental abstractions for CUDA-specific operations, including memory allocation, resource management, and hardware features

This post introduces a new group of functionality in CCCL: modernized C++ abstractions for fundamental CUDA programming model concepts that make CUDA C++ development safer and more convenient.

What is CCCL runtime?

NVIDIA CCCL runtime is a new set of idiomatic C++ APIs, available starting in CUDA 13.2, that implement core CUDA functionality: stream management, memory allocation, kernel launches, and more.

The familiar NVIDIA CUDA runtime was originally developed as a convenience layer on top of the CUDA driver API. The new CCCL runtime aims to be an alternative with the same goal, but with an updated design aligned with modern C++. Figure 1 shows the relationship between the three CUDA API surfaces mentioned above: the CUDA driver API, the CUDA runtime, and the CCCL runtime.

CCCL runtime is a collection of headers within CCCL, such as