{"slug": "cccl-runtime-a-modern-c-runtime-for-cuda", "title": "CCCL Runtime: A Modern C++ Runtime for CUDA", "summary": "NVIDIA released CCCL Runtime, a modern C++ runtime for CUDA, as part of CUDA 13.2. The new APIs provide safer and more convenient abstractions for stream management, memory allocation, and kernel launches, leveraging modern C++ features and lessons from 20 years of CUDA evolution.", "body_md": "The [NVIDIA CUDA Core Compute Libraries (CCCL)](https://github.com/NVIDIA/cccl) provides delightful and efficient abstractions for CUDA developers in C++ and Python. It features:\n\n- **Parallel algorithms**: Host-launched algorithms including sort, scan, and reduce that remove the need to write custom kernels for common operations\n- **Cooperative algorithms**: Device-side algorithms such as block-wide or warp-wide reductions or scans that simplify custom kernel development\n- **Language idiomatic CUDA abstractions**: Fundamental abstractions for CUDA-specific operations including memory allocation, resource management, and hardware features\n\nThis post introduces a new group of functionality in CCCL: modernized C++ abstractions for fundamental CUDA programming model concepts that make CUDA C++ development safer and more convenient.\n\n## What is CCCL runtime?\n\nNVIDIA CCCL runtime is a new set of idiomatic C++ APIs available starting in CUDA 13.2 that implement core CUDA functionality: stream management, memory allocation, kernel launches, and more.\n\nThe familiar NVIDIA CUDA runtime was originally developed as a convenience layer on top of the CUDA driver API. The new CCCL runtime aims to be an alternative with the same goal, but with an updated design aligned with modern C++. 
Figure 1 shows the relationship between the three CUDA API surfaces mentioned above: the CUDA driver API, the CUDA runtime, and CCCL runtime.\n\nCCCL runtime is a collection of headers within CCCL, such as `<cuda/stream>`, `<cuda/buffer>`, and `<cuda/launch>`. It leverages modern C++ features to provide more convenient and robust abstractions than what was possible within the C source compatibility constraints of the traditional CUDA runtime API.\n\nWe also took the opportunity to incorporate lessons learned over 20 years of CUDA evolution into the API design. Even with all these changes, CCCL runtime provides compatibility helpers that let developers adopt it incrementally without rewriting surrounding code that uses the CUDA runtime API.\n\nAs CUDA programs grow more complex, with multiple libraries sharing devices, streams, and memory, the need for APIs that compose cleanly and make dependencies explicit becomes more pressing. That is the space CCCL runtime is designed to fill.\n\n## The code\n\nHere is the classic `vectorAdd` example implemented with the new CCCL runtime APIs. If you’ve written CUDA before, the overall structure will be familiar. Focus on what’s different. 
Don’t try to understand everything at once; the rest of this post walks through this example to explain the semantics and design choices behind CCCL runtime.\n\n```\n#include <cuda/buffer>\n#include <cuda/devices>\n#include <cuda/launch>\n#include <cuda/memory_pool>\n#include <cuda/std/span>\n#include <cuda/stream>\n\nstruct kernel {\n  template <typename Config>\n  __device__ void operator()(Config config,\n                             cuda::std::span<const int> A,\n                             cuda::std::span<const int> B,\n                             cuda::std::span<int> C) {\n    auto tid = cuda::gpu_thread.rank(cuda::grid, config);\n    if (tid < A.size())\n      C[tid] = A[tid] + B[tid];\n  }\n};\n\nint main() {\n  // 1. Devices and streams\n  cuda::device_ref device = cuda::devices[0];\n  cuda::stream stream{device};\n\n  // 2. Memory allocation\n  auto pool = cuda::device_default_memory_pool(device);\n\n  int num_elements = 1000;\n  auto A = cuda::make_buffer<int>(stream, pool, num_elements, 1);\n  auto B = cuda::make_buffer<int>(stream, pool, num_elements, 2);\n  auto C = cuda::make_buffer<int>(stream, pool, num_elements, cuda::no_init);\n\n  // 3. Kernel launch\n  constexpr int threads_per_block = 256;\n  auto config = cuda::distribute<threads_per_block>(num_elements);\n\n  cuda::launch(stream, config, kernel{}, A, B, C);\n\n  // Make the CPU thread wait for the GPU work to finish.\n  stream.sync();\n  return 0;\n}\n```\n\nThe example can be broken down into the following three main sections:\n\n## 1.) Devices and streams\n\nConsider how a stream is created with the CUDA runtime API:\n\n```\ncudaStream_t stream;\ncudaStreamCreate(&stream); // associated with whichever device happens to be \"current\"\n```\n\nThis creates a stream, but the stream is associated with whichever device is current when `cudaStreamCreate` is called. Based on this call alone, you don’t know which device the stream is associated with.\n\nContrast that with the CCCL runtime API:\n\n```\ncuda::device_ref device = cuda::devices[0];\ncuda::stream stream{device};\n```\n\nThis snippet creates a stream on a specific device. The first line illustrates a core design principle: CCCL runtime uses dedicated types instead of raw identifiers. 
A device is a `device_ref`, not a plain integer; a stream is an object, not an opaque pointer. Strong typing across the API helps catch mistakes at compile time rather than chasing them at runtime.\n\nThe second line illustrates another principle: making dependencies explicit. In both CCCL runtime and the CUDA runtime API, a stream is associated with a device. The difference is how. Here, the `cuda::stream` constructor takes the device as an explicit argument, whereas with the CUDA runtime API the stream is associated with whichever device is current when the stream is created.\n\nExplicit dependencies enable local reasoning. You can read a function and understand what it does without tracking global state. They also improve composability: when multiple libraries are used, none of them needs to save and restore implicit state across calls to avoid interfering with the others.\n\nA related consequence is that CCCL runtime doesn’t expose the default stream. Managing the meaning of the default stream requires tracking the current device, which is exactly the kind of implicit state we are moving away from. While a default stream from the CUDA runtime API can still be wrapped into CCCL runtime types, its usage is discouraged; anything involving the default stream should be handled through the CUDA runtime API directly. With no default stream in the API, the notion of a “blocking stream” no longer applies, so all CCCL runtime streams are created as non-blocking.\n\n### Resource ownership: Owning types and refs\n\nFollowing the example of `std::string` and `std::string_view`, many CUDA objects have two types in CCCL runtime: an owning type and a non-owning type with a `_ref` suffix. `cuda::stream` owns the underlying `cudaStream_t` handle and destroys it in its destructor. `cuda::stream_ref` holds the handle without managing its lifetime and is trivially copyable.\n\nThe `_ref` types are essential for composability with existing code. 
If a stream handle’s lifetime is managed elsewhere, `cudaStream_t` implicitly converts to `cuda::stream_ref`, and the raw handle can be retrieved with `.get()`. To transfer ownership, `cuda::stream::from_native_handle` wraps a raw handle into the owning type, and `.release()` relinquishes ownership back.\n\n```\nvoid stream_type_example(cudaStream_t handle) {\n  // Non-owning view: holds the handle without managing its lifetime.\n  cuda::stream_ref non_owning{handle};\n  assert(handle == non_owning.get());\n\n  // Owning wrapper: from_native_handle transfers ownership of the handle.\n  cuda::stream owning = cuda::stream::from_native_handle(handle);\n  assert(handle == owning.get());\n  // release() relinquishes ownership and returns the raw handle.\n  assert(handle == owning.release());\n}\n```\n\nThe same pattern applies to events, memory pools, and other CUDA objects. `cuda::device_ref` has no owning counterpart because there is no device state to own.\n\n## 2.) Memory allocation\n\n```\nauto pool = cuda::device_default_memory_pool(device);\n\nauto A = cuda::make_buffer<int>(stream, pool, num_elements, 1);\nauto B = cuda::make_buffer<int>(stream, pool, num_elements, 2);\nauto C = cuda::make_buffer<int>(stream, pool, num_elements, cuda::no_init);\n```\n\nThe second section of the example demonstrates asynchronous allocation and initialization of device memory. Here we see the next design principle: APIs are asynchronous by default. Rather than distinguishing synchronous and asynchronous variants by name, CCCL runtime uses a simple convention: if an API takes a stream as its first argument, it operates in stream order. We don’t plan to provide synchronous counterparts for APIs that have both variants in the CUDA runtime API.\n\nMemory allocation is where this matters most in practice. Stream-ordered memory management via memory pools has been available since CUDA 11.2 (explained [here](https://developer.nvidia.com/blog/using-cuda-stream-ordered-memory-allocator-part-1/)), and CUDA 13.0 expanded it to managed and host memory. 
Memory pooling and less frequent synchronization points are in most cases essential to reach maximum performance, and stream-ordered memory management composes naturally with the rest of the asynchronous programming model. To reflect that guidance, CCCL runtime makes memory pools and stream-ordered allocation the default. On older CUDA versions and platforms, where newer memory pool types are not yet supported, we provide non-stream-ordered allocation as a fallback, but plan to remove it once pool support is universal.\n\nIn the snippet above, we first query the default memory pool for a given device, passing the device as an explicit argument rather than relying on `cudaMallocAsync`’s implicit device selection. The example uses the default pool, which should be preferred where possible, but CCCL runtime also allows creating separate pool objects when different pool settings are needed.\n\nThe pool reference is then used to create three buffers using the new `cuda::make_buffer`. It takes a stream as its first argument to signal stream-ordered operation. Each buffer submits three operations to that stream: allocation from the specified pool, initialization, and eventually deallocation when the buffer goes out of scope.\n\nInitialization is mandatory unless explicitly opted out with `cuda::no_init`, as with buffer C, which will be overwritten by the kernel. Uninitialized device memory is a common source of hard-to-diagnose bugs, so we chose to require an explicit opt-out rather than making it the silent default. Input buffers A and B have all elements initialized to 1 and 2, respectively. Buffers support additional initialization modes as well, for example from another buffer or a range.\n\n### Buffer lifetime and deallocation\n\nThe stream passed to `make_buffer` is stored inside the buffer and used for deallocation when the buffer is destroyed. 
This means the buffer should generally hold the stream that corresponds to its usage, so that computation is properly ordered with deallocation. It is possible to change the stream later with `.set_stream()` or manually trigger destruction on a specific stream with `.destroy()`, but the default behavior is designed to do the right thing in the common case.\n\n```\n{\n  auto pool = cuda::device_default_memory_pool(device);\n  // Equivalent to cudaMallocFromPoolAsync on the stream, possibly with\n  // initialization pushed into the stream as well. The stream is saved\n  // for future deallocation.\n  auto buffer = cuda::make_buffer(allocation_stream, pool, /*... */);\n\n  // buffer usage...\n}\n// The closing brace calls cudaFreeAsync on allocation_stream. To keep the\n// behavior explicit, buffer.destroy(which_stream) is also available.\n```\n\n## 3.) Kernel launch\n\n```\nstruct kernel {\n  template <typename Config>\n  __device__ void operator()(Config config,\n                             cuda::std::span<const int> A,\n                             cuda::std::span<const int> B,\n                             cuda::std::span<int> C) {\n    auto tid = cuda::gpu_thread.rank(cuda::grid, config);\n    if (tid < A.size())\n      C[tid] = A[tid] + B[tid];\n  }\n};\n\n// ...\n\nconstexpr int threads_per_block = 256;\nauto config = cuda::distribute<threads_per_block>(num_elements);\n\ncuda::launch(stream, config, kernel{}, A, B, C);\n```\n\nThe final section demonstrates configuring and launching the kernel on the GPU with `cuda::launch`.\n\n`cuda::launch` takes three groups of arguments:\n\n- The stream to run on\n- A configuration object that encodes the thread hierarchy (block and grid sizes) along with other launch options. Here, `cuda::distribute` creates a configuration that launches at least `num_elements` threads grouped into blocks of `threads_per_block`. 
This replaces the familiar `(N + block_size - 1) / block_size` pattern for computing the number of blocks.\n- The kernel and its arguments\n\n## Compile-time configuration flow\n\nThe most novel aspect of `cuda::launch` is how it moves compile-time information from the host launch site into device code through the type system. For example, notice how the block size is provided as a template argument to `cuda::distribute`, which means it is encoded in the configuration object’s type.\n\nWhen the kernel accepts that configuration as its first argument, `cuda::launch` passes it through automatically. Inside the kernel, this static information is available when we compute the rank of the calling thread inside the grid:\n\n```\nauto tid = cuda::gpu_thread.rank(cuda::grid, config);\n```\n\nBecause the block size is known at compile time, the rank calculation can use only the `x` dimension and skip the runtime block-size query entirely. This is a simple example, but the mechanism generalizes. The CCCL documentation shows further cases where configuration-embedded information is used to specialize device code.\n\nSometimes a kernel implementation makes assumptions about the exact shape of the grid and/or block. Compile-time information in the configuration object lets kernel authors implement checks that the kernel and the call site agree in those cases.\n\n```\ntemplate <typename Config>\n__global__ void kernel(Config conf) {\n    // Make sure the block is one dimensional with 256 threads\n    static_assert(cuda::gpu_thread.static_dims(cuda::block, conf).x == 256);\n    static_assert(cuda::gpu_thread.static_dims(cuda::block, conf).y == 1);\n    static_assert(cuda::gpu_thread.static_dims(cuda::block, conf).z == 1);\n}\n```\n\n## Kernel functors\n\nYou may have noticed the kernel is a struct with a `__device__ operator()` rather than a `__global__` function. 
While `cuda::launch` supports existing `__global__` functions, we also introduced kernel functors: types with a `__device__`-annotated call operator. The practical advantage is that template arguments are deduced automatically, whereas `__global__` functions used with `cuda::launch` require explicit instantiation.\n\n```\ntemplate <typename T>\n__global__ void kernel_function(T input) {\n  // body ...\n}\n\nstruct kernel_functor {\n  template <typename T>\n  __device__ void operator()(T input) {\n    // body ...\n  }\n};\n\n// explicit template instantiation is required with a __global__ function\ncuda::launch(stream, config, kernel_function<int>, 42);\n// deduction from arguments for a functor with __device__ call operator\ncuda::launch(stream, config, kernel_functor{}, 42);\n```\n\nThis is what makes the compile-time configuration flow work. The `Config` template parameter is deduced from the configuration object passed by `cuda::launch`. Kernel functors also cover device lambdas and have additional features described in the [CCCL documentation](https://nvidia.github.io/cccl/unstable/libcudacxx/runtime/launch.html#launch).\n\n## Automatic argument transformation\n\n`cuda::buffer` owns its underlying allocation, but CUDA kernels can only accept trivially copyable arguments. When a buffer is passed to `cuda::launch`, it is automatically transformed to `cuda::std::span`. There is no need to manually construct the span or extract a raw pointer. The kernel signature reflects how the data is actually used on the device side.\n\n## What’s next\n\nThis post covered the core ideas behind CCCL runtime: explicit dependencies, strong typing, asynchronous-by-default APIs, and clean interoperability with existing CUDA code. 
But a walkthrough of one example can only show so much.\n\nThe CCCL documentation has more detailed coverage of each API, including [additional buffer initialization modes](https://nvidia.github.io/cccl/unstable/libcudacxx/runtime/buffer.html#construction), [event management](https://nvidia.github.io/cccl/unstable/libcudacxx/runtime/event.html#events), [data movement](https://nvidia.github.io/cccl/unstable/libcudacxx/runtime/algorithm.html#algorithm), and advanced kernel launch features like [dynamic shared memory](https://nvidia.github.io/cccl/unstable/libcudacxx/runtime/launch.html#id5) and [other launch attributes](https://nvidia.github.io/cccl/unstable/libcudacxx/runtime/launch.html#launch-options). The CCCL runtime APIs are available in CCCL 3.2 and newer, which ships with CUDA Toolkit 13.2 and newer. See the CCCL documentation for detailed per-API availability. We’d love to hear your feedback as you try it out.", "url": "https://wpnews.pro/news/cccl-runtime-a-modern-c-runtime-for-cuda", "canonical_source": "https://developer.nvidia.com/blog/cccl-runtime-a-modern-c-runtime-for-cuda/", "published_at": "2026-06-22 16:00:00+00:00", "updated_at": "2026-06-24 00:16:44.899256+00:00", "lang": "en", "topics": ["developer-tools", "ai-infrastructure", "ai-tools"], "entities": ["NVIDIA", "CUDA", "CCCL Runtime", "CUDA Core Compute Libraries", "CUDA 13.2"], "alternates": {"html": "https://wpnews.pro/news/cccl-runtime-a-modern-c-runtime-for-cuda", "markdown": "https://wpnews.pro/news/cccl-runtime-a-modern-c-runtime-for-cuda.md", "text": "https://wpnews.pro/news/cccl-runtime-a-modern-c-runtime-for-cuda.txt", "jsonld": "https://wpnews.pro/news/cccl-runtime-a-modern-c-runtime-for-cuda.jsonld"}}