CCCL Runtime: A Modern C++ Runtime for CUDA

NVIDIA released CCCL Runtime, a modern C++ runtime for CUDA, as part of CUDA 13.2. The new APIs provide safer and more convenient abstractions for stream management, memory allocation, and kernel launches, leveraging modern C++ features and lessons from 20 years of CUDA evolution.

The NVIDIA CUDA Core Compute Libraries CCCL https://github.com/NVIDIA/cccl provides delightful and efficient abstractions for CUDA developers in C++ and Python. It features: Parallel algorithms – Host-launched algorithms including sort, scan and reduce that remove the need to write custom kernels for common operations Cooperative algorithms – Device-side algorithms such as block-wide or warp-wide reductions or scans that simplify custom kernel development Language idiomatic CUDA abstractions – Fundamental abstractions for CUDA-specific operations including memory allocation, resource management, and hardware features This post introduces a new group of functionality in CCCL that provides modernized C++ abstractions for fundamental CUDA programming model concepts that make CUDA C++ development safer and more convenient. What is CCCL runtime? NVIDIA CCCL runtime is a new set of idiomatic C++ APIs available starting in CUDA 13.2 that implement core CUDA functionality: stream management, memory allocation, kernel launches, and more. The familiar NVIDIA CUDA runtime was originally developed as a convenience layer on top of the CUDA driver API. The new CCCL runtime aims to be an alternative with the same goal, but with an updated design aligned with modern C++. Figure 1, below, shows the relationship between the three CUDA API surfaces mentioned above: CCCL runtime is a collection of headers within CCCL, such as <cuda/stream , <cuda/buffer , and <cuda/launch . It leverages modern C++ features to provide more convenient and robust abstractions than what was possible within the C source compatibility constraints of the traditional CUDA runtime API. We also took the opportunity to incorporate lessons learned over 20 years of CUDA evolution into the API design. Even with all these changes, CCCL runtime provides compatibility helpers that let developers adopt it incrementally without rewriting surrounding code that uses the CUDA runtime API. As CUDA programs grow more complex, with multiple libraries sharing devices, streams, and memory, the need for APIs that compose cleanly and make dependencies explicit becomes more pressing. That is the space CCCL runtime is designed to fill. The code Here is the classic vectorAdd example implemented with the new CCCL runtime APIs. If you’ve written CUDA before, the overall structure will be familiar: Focus on what’s different. Don’t try to understand everything at once, the rest of this post will walk through this example to explain the semantics and design choices behind CCCL runtime. include <cuda/buffer include <cuda/devices include <cuda/launch include <cuda/memory pool include <cuda/std/span include <cuda/stream struct kernel { template <typename Config device void operator Config config, cuda::std::span<const int A, cuda::std::span<const int B, cuda::std::span<int C { auto tid = cuda::gpu thread.rank cuda::grid, config ; if tid < A.size C tid = A tid + B tid ; } }; int main { // 1. Devices and streams cuda::device ref device = cuda::devices 0 ; cuda::stream stream{device}; // 2. Memory allocation auto pool = cuda::device default memory pool device ; int num elements = 1000; auto A = cuda::make buffer<int stream, pool, num elements, 1 ; auto B = cuda::make buffer<int stream, pool, num elements, 2 ; auto C = cuda::make buffer<int stream, pool, num elements, cuda::no init ; // 3. Kernel launch constexpr int threads per block = 256; auto config = cuda::distribute<threads per block num elements ; cuda::launch stream, config, kernel{}, A, B, C ; // Make the CPU thread wait for the GPU work to finish. stream.sync ; return 0; } The example can be broken down into the following three main sections: 1. Devices and streams Consider the creation of a stream using the CUDA Runtime API as the following code snippet shows. cudaStream t stream; cudaStreamCreate &stream ; // associated with whichever device happens to be "current" Note this creates a stream, but the stream is associated with whichever device is current when cudaStreamCreate is called. Based on this call alone, you don’t know which device the stream is associated with. Contrast that with using CCCL runtime API as illustrated by the code snippet that follows. cuda::device ref device = cuda::devices 0 ; cuda::stream stream{device}; The above code snippet shows how to create a stream on a specific device. The first line illustrates a core design principle: CCCL runtime uses dedicated types instead of raw identifiers. A device is a device ref , not a plain integer; a stream is an object, not an opaque pointer. Strong typing across the API helps catch mistakes at compile time rather than chasing them at runtime. The second line illustrates another principle: making dependencies explicit. In both CCCL runtime and the CUDA runtime API, a stream is associated with a device. The difference is how. Here, the cuda::stream constructor takes the device as an explicit argument whereas with the CUDA runtime API the stream is associated with whichever device is active when the stream is created. Explicit dependencies enable local reasoning. You can read a function and understand what it does without tracking the global state. They also improve composability: When multiple libraries are used, none of them need to save and restore implicit state across calls to avoid interfering with each other. A related consequence is that CCCL runtime doesn’t expose the default stream. Managing the meaning of the default stream requires tracking the current device, which is exactly the kind of implicit state we are moving away from. While a default stream from the CUDA runtime API can still be wrapped into CCCL runtime types, its usage is discouraged; anything involving the default stream should be handled through the CUDA runtime API directly. With no default stream in the API, the notion of a “blocking stream” no longer applies, so all CCCL runtime streams are created as non-blocking. Resource ownership: Owning types and refs Following the example of std::string and std::string view , many CUDA objects have two types in CCCL runtime: an owning type and a non-owning type with a ref suffix; cuda::stream owns the underlying cudaStream t handle and destroys it in its destructor. The cuda::stream ref holds the handle without managing its lifetime and is trivially copyable. The ref types are essential for composability with existing code. If a stream handle’s lifetime is managed elsewhere, cudaStream t implicitly converts to cuda::stream ref , and the raw handle can be retrieved with .get . To transfer ownership, cuda::stream::from native handle wraps a raw handle into the owning type, and .release relinquishes ownership back. void stream type example cudaStream t handle { cuda::stream ref non owning{handle}; assert handle == non owning.get ; cuda::stream owning = cuda::stream::from native handle handle ; assert handle == owning.get ; assert handle == owning.release ; } The same pattern applies to events, memory pools, and other CUDA objects: cuda::device ref has no owning counterpart because there is no device state to own. 2. Memory allocation auto pool = cuda::device default memory pool device ; auto A = cuda::make buffer<int stream, pool, num elements, 1 ; auto B = cuda::make buffer<int stream, pool, num elements, 2 ; auto C = cuda::make buffer<int stream, pool, num elements, cuda::no init ; The next section demonstrates asynchronously allocating and initializing device memory. Here we see the next design principle: APIs are asynchronous by default. Rather than distinguishing synchronous and asynchronous variants by name, CCCL runtime uses a simple convention: If an API takes a stream as its first argument, it operates in stream order. We don’t plan to provide synchronous counterparts for APIs that have both variants in the CUDA runtime API. Memory allocation is where this matters most in practice. Stream-ordered memory management via memory pools has been available since CUDA 11.2 explained here https://developer.nvidia.com/blog/using-cuda-stream-ordered-memory-allocator-part-1/ , and CUDA 13.0 expanded it to managed and host memory. Memory pooling and less frequent synchronization points are in most cases essential to reach maximum performance, and stream-ordered memory management composes naturally with the rest of the asynchronous programming model. To convey those guidelines, CCCL runtime makes memory pools and stream-ordered allocation the default. On older CUDA versions and platforms, where newer memory pool types are not yet supported, we provide non-stream-ordered allocation as a fallback, but plan to remove it once pool support is universal. In the snippet above, we first query the default memory pool for a given device, passing it as an explicit argument rather than relying on cudaMallocAsync ‘s implicit device selection. The example uses the default pool which should be preferred where possible, but CCCL runtime also allows creating separate pool objects when different pool settings are needed. The pool reference is then used to create three buffers using the new cuda::make buffer. It takes a stream as its first argument to signal stream-ordered operation. Each buffer submits three operations to that stream: allocation from the specified pool, initialization, and eventually deallocation when the buffer goes out of scope. Initialization is mandatory unless explicitly opted out with cuda::no init , as with buffer C which will be overwritten by the kernel. Uninitialized device memory is a common source of hard-to-diagnose bugs, so we chose to require an explicit opt-out rather than making it the silent default. Input buffers A and B have all elements initialized to 1 and 2, respectively. Buffers support additional initialization modes as well, for example from another buffer or a range. Buffer lifetime and deallocation The stream passed to make buffer is stored inside the buffer and used for deallocation when the buffer is destroyed. This means the buffer should generally hold the stream that corresponds to its usage, so that computation is properly ordered with deallocation. It is possible to change the stream later with .set stream or manually trigger destruction on a specific stream with .destroy , but the default behavior is designed to do the right thing in the common case. { auto pool = cuda::device default memory pool device ; // Equivalent to cudaMallocFromPoolAsync on the stream, possibly along with initialization pushed into the stream as well. Saves the stream for future deallocation auto buffer = cuda::make buffer allocation stream, pool, / ... / ; // buffer usage... } // Closing bracket will call cudaFreeAsync on allocation stream, there is also buffer.destroy which stream to keep the behavior explicit 3. Kernel launch struct kernel { template <typename Config device void operator Config config, cuda::std::span<const int A, cuda::std::span<const int B, cuda::std::span<int C { auto tid = cuda::gpu thread.rank cuda::grid, config ; if tid < A.size C tid = A tid + B tid ; } }; // ... constexpr int threads per block = 256; auto config = cuda::distribute<threads per block num elements ; cuda::launch stream, config, kernel{}, A, B, C ; The final section demonstrates configuring and launching the kernel on the GPU with cuda::launch . cuda::launch takes three groups of arguments: - The stream to run on - A configuration object that encodes the thread hierarchy block and grid sizes along with other launch options. Here, cuda::distribute creates a configuration that launches at least num elements threads grouped into blocks of threads per block . This replaces the common pattern many CUDA developers are familiar with of N + block size - 1 / block size - The kernel and its arguments Compile-time configuration flow The most novel aspect of cuda::launch is how it moves compile-time information from the host launch site into device code through the type system. For example, notice how the block size is provided as a template argument to cuda::distribute , which means it is encoded in the configuration object’s type. When the kernel accepts that configuration as its first argument, cuda::launch passes it through automatically. Inside the kernel, this static information is available when we compute the rank of the calling thread inside the grid: auto tid = cuda::gpu thread.rank cuda::grid, config ; Because the block size is known at compile time, the rank calculation can use only the x dimension and skip the runtime block-size query entirely. This is a simple example, but the mechanism generalizes. The CCCL documentation shows further cases where configuration-embedded information is used to specialize device code. Sometimes kernel implementation makes assumptions about the exact shape of the grid and/or block. Compile time information in the configuration object allows kernel authors to implement checks to ensure alignment of the kernel and the call site in those cases. template <typename Config global void kernel Config conf { // Make sure the block is one dimensional with 256 threads static assert cuda::gpu thread.static dims cuda::block, conf .x == 256 ; static assert cuda::gpu thread.static dims cuda::block, conf .y == 1 ; static assert cuda::gpu thread.static dims cuda::block, conf .z == 1 ; } Kernel functors You may have noticed the kernel is a struct with a device operator rather than a global function. While cuda::launch supports existing global functions, we also introduced kernel functors: types with a device - annotated call operator. The practical advantage is that template arguments are deduced automatically, whereas global functions used with cuda::launch require explicit instantiation. template <typename T global void kernel function T input { // body ... } struct kernel functor { template <typename T device void operator T input { // body ... } }; // explicit template instantiation is required with a global function cuda::launch stream, config, kernel function<int , 42 ; // deduction from arguments for a functor with device call operator cuda::launch stream, config, kernel functor{}, 42 ; This is what makes the compile-time configuration flow work. The config template parameter is deduced from the configuration object passed by cuda::launch . Kernel functors also cover device lambdas and have additional features described in the CCCL documentation https://nvidia.github.io/cccl/unstable/libcudacxx/runtime/launch.html launch . Automatic argument transformation cuda::buffer owns its underlying allocation, but CUDA kernels can only accept trivially copyable arguments. When a buffer is passed to cuda::launch , it is automatically transformed to cuda::std::span . There is no need to manually construct the span or extract a raw pointer. The kernel signature reflects how the data is actually used on the device side. What’s next This post covered the core ideas behind CCCL runtime: explicit dependencies, strong typing, asynchronous-by-default APIs, and clean interoperability with existing CUDA code. But a walkthrough of one example can only show so much. The CCCL documentation has more detailed coverage of each API, including additional buffer initialization modes https://nvidia.github.io/cccl/unstable/libcudacxx/runtime/buffer.html construction , event management https://nvidia.github.io/cccl/unstable/libcudacxx/runtime/event.html events , data movement https://nvidia.github.io/cccl/unstable/libcudacxx/runtime/algorithm.html algorithm , and advanced kernel launch features like dynamic shared memory https://nvidia.github.io/cccl/unstable/libcudacxx/runtime/launch.html id5 and other launch attributes https://nvidia.github.io/cccl/unstable/libcudacxx/runtime/launch.html launch-options . The CCCL runtime APIs are available in CCCL 3.2 and newer, which ships with CUDA Toolkit 13.2 and newer. See the CCCL documentation for detailed per-API availability. We’d love to hear your feedback as you try it out.