{"slug": "what-happens-when-you-run-a-cuda-kernel", "title": "What happens when you run a CUDA kernel?", "summary": "NVIDIA's CUDA compiler pipeline transforms a simple vector addition kernel from PTX virtual assembly to SASS machine code through multiple compilation stages, including LLVM-based cicc and ptxas, before executing on an RTX 4090 GPU via tens of millions of CPU instructions, device files, ioctls, and memory-mapped doorbell registers.", "body_md": "# What happens when you run a CUDA kernel\n\n*Les Raisons des Forces Mouvantes* [(1615).](https://commons.wikimedia.org/wiki/File:Les_raisons_des_forces_mouuantes_auec_diuerses_machines_tant_vtilles_que_plaisantes_aus_quelles_sont_adioints_plusieurs_desseings_de_grotes_et_fontaines_%281615%29_%2814740673966%29.jpg)\n\nHere’s a simple CUDA program. It adds two vectors.\n\n``` cuda\n#include <cstdio>\n#include <cstdlib>\n\n__global__ void vadd(const float* a, const float* b, float* c, int n) {\n    int i = blockIdx.x * blockDim.x + threadIdx.x;\n    if (i < n) c[i] = a[i] + b[i];\n}\n\nint main() {\n    int n = 1 << 20;                 // a million floats (1,048,576)\n    size_t bytes = n * sizeof(float);\n\n    float *a = (float*)malloc(bytes), *b = (float*)malloc(bytes),\n          *c = (float*)malloc(bytes);\n    for (int i = 0; i < n; i++) a[i] = b[i] = 1.0f;\n\n    float *da, *db, *dc;\n    cudaMalloc(&da, bytes);\n    cudaMalloc(&db, bytes);\n    cudaMalloc(&dc, bytes);\n    cudaMemcpy(da, a, bytes, cudaMemcpyHostToDevice);\n    cudaMemcpy(db, b, bytes, cudaMemcpyHostToDevice);\n\n    vadd<<<4096, 256>>>(da, db, dc, n);   // 4096 * 256 = n threads, one per float\n\n    cudaMemcpy(c, dc, bytes, cudaMemcpyDeviceToHost);\n    printf(\"c[0]=%f c[n-1]=%f\\n\", c[0], c[n-1]);\n}\n```\n\nCompiled for an RTX 4090, and launched, it does correctly work out that $1 + 1 = 2$, a million times (I didn’t check all of them).\n\n``` bash\n$ nvcc -arch=sm_89 -o vadd vadd.cu && ./vadd\nc[0]=2.000000 c[n-1]=2.000000\n```\n\nTelling you that involved tens of millions of CPU 
instructions, a couple of\ndevice files, nine hundred ioctls, and one memory-mapped doorbell register. In\nthis post, we’ll follow this one kernel from the code down to the warps, and\nback up to the answer.\n\n(An aside: this post is an instance of the ‘legibility transition’ that\nagents have engendered. There really is very little about computers you can’t\nfind out with curiosity and (machine-enhanced) persistence. There is an interesting\ndiscussion of the implications of legibility for what AI can help us to know\n[here](https://resobscura.substack.com/p/ai-legibility-archives-future-of-research).)\n\n## Compiling our program with `nvcc`\n\nWe ought to start with how to turn this CUDA program into something that the device can actually read. To do that we need a compiler. Really, we need many compilers.\n\n`nvcc` is a driver program that runs several other compilers and combines their\noutput. If you pass `--keep`, it leaves the whole pipeline on disk for you to\nread:\n\n``` bash\n$ nvcc --keep -arch=sm_89 -o vadd vadd.cu && ls\n...\nvadd.ptx            # device code as PTX        (from cicc)\nvadd.sm_89.cubin    # device code as SASS       (from ptxas)\nvadd.fatbin         # cubin + PTX, bundled      (from fatbinary)\nvadd.cudafe1.stub.c # host launch stub + kernel registration\nvadd.o              # final host object, fatbin embedded\n...\n```\n\nThe host code goes to your host compiler. The device code (`vadd`) takes more\nsteps: `cicc`, an [LLVM](https://en.wikipedia.org/wiki/LLVM)-based compiler,\nturns it into\n[PTX](https://developer.nvidia.com/blog/understanding-ptx-the-assembly-language-of-cuda-gpu-computing/),\nand then `ptxas` turns the PTX into\n[SASS](https://modal.com/gpu-glossary/device-software/streaming-assembler).\n\nPTX is a virtual\n[ISA](https://en.wikipedia.org/wiki/Instruction_set_architecture). It has\ninfinitely many typed registers, and no notion of how many of them the hardware\nactually has. 
Here is the (elided) body of `vadd` in PTX:\n\n``` bash\n$ cat vadd.ptx\n...\nmad.lo.s32      %r1, %r3, %r4, %r5;        // set register r1 to ctaid*ntid + tid\nsetp.ge.s32     %p1, %r1, %r2;             // set predicate p1 if i >= n\n@%p1 bra        $L__BB0_2;                 // if out of bounds, skip to exit\ncvta.to.global.u64  %rd4, %rd1;            // convert generic pointer %rd1 to a global address, store in %rd4\nmul.wide.s32    %rd5, %r1, 4;              // multiply r1 by 4, store the result in %rd5\nadd.s64         %rd6, %rd4, %rd5;          // add %rd4, %rd5, result in %rd6\nld.global.f32   %f2, [%rd6];               // load a[i] into %f2\n...\nadd.f32         %f3, %f2, %f1;             // add %f1 and %f2, result in %f3\nst.global.f32   [%rd10], %f3;              // store c[i] = ... in global memory\n```\n\nThe virtual registers look like `%rd1`–`%rd10`, `%f1`–`%f3`. (The prefix is the\ntype: `%r` is a 32-bit integer, `%rd` a 64-bit one, `%f` a 32-bit float, `%p` a\none-bit predicate.)\n\nPTX is more ‘longhand’ than you might expect. For example, forming one address\nin `%rd6` takes three PTX instructions. This happens because PTX is device\nagnostic.\n\n## Why three?\n\nCUDA pointers are “generic” by default, meaning they could name global, shared,\nor local memory. `cvta.to.global` converts the generic pointer to a\nglobal-space address (in effect asserting that it lives in the global window),\nso a cheaper `ld.global` can be used later. `mul.wide.s32` then turns\nthe index `i` into a byte offset by multiplying by 4 (`sizeof(float)`) and\nwidening 32→64 bits in one step. `add.s64` adds that to the base pointer.\n\nNext, `ptxas` transforms our PTX, which is device agnostic, into the SASS for\nyour architecture, which isn’t. 
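\n\nAs a worked instance of the three-instruction address computation described above, here is the same arithmetic written out. This is only a sketch: the base address and the index $i = 5$ are hypothetical values chosen for illustration, not numbers from a real run, and $\\operatorname{sext}_{64}$ is shorthand introduced here for sign-extension to 64 bits.\n\n$$\n\\begin{aligned}\n\\text{offset}(i) &= \\operatorname{sext}_{64}(i) \\times 4 && \\text{(\\texttt{mul.wide.s32})} \\\\\n\\text{addr}(i) &= \\text{base}_{\\text{global}} + \\text{offset}(i) && \\text{(\\texttt{add.s64})} \\\\\n\\text{addr}(5) &= \\texttt{0x7f0000000000} + 20 = \\texttt{0x7f0000000014}\n\\end{aligned}\n$$\n\n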
The SASS that `ptxas` emits looks different:\n\n``` bash\n$ cuobjdump -sass vadd\n/*0000*/  MOV R1, c[0x0][0x28] ;                      // set up the stack pointer (ABI; unused here)\n/*0010*/  S2R R6, SR_CTAID.X ;                        // R6 = blockIdx.x\n/*0020*/  S2R R3, SR_TID.X ;                          // R3 = threadIdx.x\n/*0030*/  IMAD R6, R6, c[0x0][0x0], R3 ;              // i = ctaid*ntid + tid\n/*0040*/  ISETP.GE.AND P0, PT, R6, c[0x0][0x178], PT ;// P0 = (i >= n)\n/*0050*/  @P0 EXIT ;                                  // if so, exit\n/*0060*/  MOV R7, 0x4 ;                               // load literal 4 (sizeof(float)) into R7 as multiplier\n/*0070*/  ULDC.64 UR4, c[0x0][0x118] ;                // uniform load of a driver-provided system value\n/*0080*/  IMAD.WIDE R4, R6, R7, c[0x0][0x168] ;       // &b[i]\n/*0090*/  IMAD.WIDE R2, R6, R7, c[0x0][0x160] ;       // &a[i]\n/*00a0*/  LDG.E R4, [R4.64] ;                         // b[i]\n/*00b0*/  LDG.E R3, [R2.64] ;                         // a[i]\n/*00c0*/  IMAD.WIDE R6, R6, R7, c[0x0][0x170] ;       // &c[i]\n/*00d0*/  FADD R9, R4, R3 ;                           // a[i] + b[i]\n/*00e0*/  STG.E [R6.64], R9 ;                         // c[i] = ...\n/*00f0*/  EXIT ;\n```\n\n## What the S2R lines are doing\n\n`S2R` is “special register to register”: it copies a *special* register the\nhardware maintains per thread — here `SR_CTAID.X` (the block’s index,\n`blockIdx.x`) and `SR_TID.X` (the thread’s index within the block, `threadIdx.x`)\n— into an ordinary register so `IMAD` can do arithmetic on it.\n\nTen-odd virtual registers have collapsed onto seven real ones. (`ncu` reports\n`launch__registers_per_thread = 16`. The disassembly only names up to `R9`,\nbut the allocator reserves a few more for the ABI and alignment.) Each\n`mul.wide` plus `add` sequence has fused into a single `IMAD.WIDE`. 
The\n`cvta` conversions are gone, absorbed into the addressing.\n\nThe `c[0x0][…]` operands refer to **constant bank 0**, a small, driver-managed\nregion. It holds the kernel’s arguments — the pointers `a`, `b`, `c` and the\nsize `n` — along with the launch geometry. Filling the bank is the job of a\nstructure called the QMD that the driver hands the GPU at launch, which we’ll\ncome to once the launch itself reaches the card.\n\n## Why the arguments sit in constant bank 0, and where\n\nThey’re in constant memory because this is a *broadcast* read: every thread in\nthe grid needs the identical pointers, and the constant cache is able to serve\nall 32 lanes in one shot. The layout is fixed — `0x160`, `0x168`, `0x170` are\nthe pointers `a`, `b`, `c`, and `0x178` is `n`, with the launch geometry\nalongside them at `0x0` (`blockDim.x`). Bank 0 also holds ABI parameters such\nas `c[0x0][0x28]`, the stack base that `MOV R1, c[0x0][0x28]` loads at entry.\nWe’ll see these same offsets again when the host stub packs the arguments for\nlaunch.\n\nThe ‘cubin’ file holding this SASS is an\n[ELF](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format) file — the\nsame object-file container Linux uses for ordinary executables and shared\nlibraries. (`cuobjdump -elf` shows a symbol table, a `.text.vadd` section holding the\nmachine code, plus CUDA-specific sections like `.nv.callgraph`.) The `fatbinary`\nexecutable bundles the cubin together with the\nPTX into a single ‘fatbin’, and `cuobjdump` on the result reveals that the\nfatbin embedded in our binary contains *both*:\n\n``` bash\n$ cuobjdump vadd\n...\nFatbin elf code:  arch = sm_89        # the SASS we just read\nFatbin ptx code:  arch = sm_89  compressed   # the PTX, shipped too\n```\n\nThe SASS is what actually runs on this 4090, but the PTX rides along as a forward-compatibility fallback. 
If you then take this binary to a GPU whose architecture the cubin doesn’t cover, the driver can JIT the PTX into fresh SASS at load time.\n\nFinally, that fatbin is nested in the host executable, where `readelf -S` finds\nit occupying its own sections:\n\n``` bash\n$ readelf -S vadd\n...\n[18] .nv_fatbin        PROGBITS   ...\n[19] __nv_module_id    PROGBITS   ...\n[29] .nvFatBinSegment  PROGBITS   ...\n...\n```\n\nThe `vadd` binary that `nvcc` spits out is a single executable containing host\ncode, a complete ELF object containing the Ada SASS, and a copy of the PTX.\nBecause PTX is verbose plain text, `nvcc` compresses it by default to keep the\nbinary size small; the driver will only decompress and JIT-compile it if the\nbinary is run on an architecture that the pre-compiled SASS doesn’t cover.\n\n## How the host triggers the GPU\n\nThe compiled GPU machine code is now sitting inert inside the `.nv_fatbin`\nsection of our `./vadd` executable. When the program is launched on the host, it\nhas to bridge two worlds: the host CPU, and the GPU sitting across the PCIe\nbus.\n\nTo set up a host binary that knows how to cross the bridge, the frontend\ncompiler (`cudafe++`) inserts a hidden constructor into your code, running\nbefore the `main` function starts. Its job is to register our embedded\nfatbinary with the CUDA runtime and record a mapping that the runtime will\nlater use: associating the host-side function pointer `vadd` with the compiled\ndevice kernel’s mangled name in the fatbin.\n\nWhen the compiler encounters `vadd<<<4096, 256>>>(da, db, dc, n)`, it replaces\nthat high-level expression with a generated host launch stub. This stub packs\nour kernel arguments into a buffer in host memory. 
The pointers `da`, `db`,\n`dc` and the integer `n` are placed at byte offsets `0`, `8`, `16`, and `24`\n(these land at the constant bank offsets `0x160`, `0x168`, `0x170`,\nand `0x178` that we saw our SASS machine code reading from constant bank 0\nearlier, each one being `0x160` plus the buffer offset):\n\n``` c\n// from vadd.cudafe1.stub.c\nvoid __device_stub__Z4vaddPKfS0_Pfi(const float *__par0, const float *__par1,\n                                     float *__par2, int __par3) {\n    __cudaLaunchPrologue(4);\n    __cudaSetupArgSimple(__par0,  0UL);   // arg buffer offset 0\n    __cudaSetupArgSimple(__par1,  8UL);   // offset 8\n    __cudaSetupArgSimple(__par2, 16UL);   // offset 16\n    __cudaSetupArgSimple(__par3, 24UL);   // offset 24\n    __cudaLaunch((char*)(void(*)(const float*, const float*, float*, int))vadd);\n}\n```\n\nOnce the arguments are packed, the stub calls `__cudaLaunch`, passing it\nthe memory address of the host-side dummy `vadd` function. Because this host\nfunction is just an empty shell on the CPU, its host memory address serves as a\nlookup key. The runtime queries its registration table with this address to\nfind the corresponding device-side symbol name, and then crosses the boundary\ninto the closed-source user-mode driver (`libcuda.so.1`; the user-mode part of\nthe driver comes with the GPU’s kernel driver, not with the CUDA toolkit: the\n`libcuda.so.1` in the `strace` output below resolves to\n`libcuda.so.590.48.01`, the driver release on this machine) 
to initiate the\nlaunch of that kernel.\n\nThe runtime opens this driver dynamically on the first GPU call in our program,\nwhich we can catch using `strace`:\n\n``` bash\n$ strace -f -e trace=openat ./vadd\n...\nopenat(..., \"/lib/x86_64-linux-gnu/libcuda.so.1\", O_RDONLY|O_CLOEXEC) = 3\n...\n```\n\nWhen this first call is performed, a ‘context’ is created, containing all the\ninfrastructure the driver needs to talk to the device, including the *channel*\nthrough which the CPU speaks to the GPU. We’ll talk more about that in the next\nsection.\n\nAt this stage, the compiled machine code still hasn’t reached the GPU. Since\nCUDA 12.2, module loading is lazy by default: the driver defers uploading a\nkernel’s SASS cubin to the card’s memory until the very first time that\nspecific kernel is actually launched. (This is controlled by\n`CUDA_MODULE_LOADING`. It shipped opt-in in CUDA 11.7 and defaulted to `EAGER`\nfor years; the 12.x series flipped the default to `LAZY`, which can be\noverridden if you want loading costs paid up front.)\n\nUnderneath `libcuda` sits the kernel-mode driver, `nvidia.ko`, which `libcuda`\nreaches by invoking `ioctl` on device files. When `cuLaunchKernel` finally\nneeds to put work on the GPU, it becomes a conversation with that kernel\nmodule. What follows is the mechanics of that conversation.\n\n## Getting it onto the GPU\n\nA GPU does not take function calls like a CPU does. There is no entry point to\njump to, and no stack to push arguments onto from the CPU. The GPU sits across\na PCIe bus and reads a stream of driver commands out of host memory. 
Everything\n`cuLaunchKernel` does past this point is in service of getting one fully formed\nlaunch command into that stream, and then telling the GPU it has done so.\n\nThe first thing that needs to be done is loading the GPU code onto the device.\nThe first time you run `vadd`, the driver copies across the kernel’s code: it\nallocates a buffer and copies the SASS in.\n\nOnce the code is on the GPU, the CPU needs to get the GPU to read it and start\nexecuting it. It does so via a complex dance, across host and device memory.\nBoth the host and the GPU can map regions of each other’s memory spaces, but\naccesses across the PCIe bus pay a penalty. To achieve a kernel launch, both\nwrite to various structures, living across both spaces. These structures\nmake up the *channel* — the work queue that runs the GPU’s operations.\n\nTwo important structures of this kind live in host RAM: the **pushbuffer**\nand the **GPFIFO**, representing between them the list of work the GPU has to\nperform.\n\nThe **pushbuffer** is a region of memory into which the driver writes commands\nto the GPU, called *methods*. A method is a register address and a value in\nthe GPU’s native command encoding — the pair defines what action the GPU should\nperform.\n\nThe **GPFIFO** is a ring buffer of pointers, used by the GPU & CPU to\ncoordinate what the GPU still needs to read, and what it’s read already. Each\nentry in the GPFIFO is made up of two 32-bit words, describing a span of the\npushbuffer as `(base, length)`. (In this case, base is a GPU virtual address\npointing to host memory.)\n\nThe GPU continually walks the GPFIFO to find work. Between the driver and the\nGPU, two cursors need to be maintained: `GP_GET` (how far the GPU has\nconsumed), and `GP_PUT` (how far the driver has produced). Both cursors live\nin USERD, a small per-channel structure that here sits in device memory. 
To\nlaunch a kernel, the driver fills a pushbuffer span with the relevant methods,\npoints a GPFIFO entry at it, and advances `GP_PUT`. Once the GPU consumes the\nentry, it advances `GP_GET`.\n\nWhere the different pieces live.\n\nOur launch is triggered by a burst of methods: first\n[`SET_INLINE_QMD_ADDRESS_A/B`](https://github.com/NVIDIA/open-gpu-kernel-modules/blob/590.48.01/src/common/sdk/nvidia/inc/class/clc6c0.h#L403-L409),\nfollowed by a run of\n[`LOAD_INLINE_QMD_DATA`](https://github.com/NVIDIA/open-gpu-kernel-modules/blob/590.48.01/src/common/sdk/nvidia/inc/class/clc6c0.h#L409-L410).\n(How I know it’s this method, given that `libcuda` is closed source: see\nthe [appendix](#appendix-how-to-look-inside-the-launch).) These methods serve to stream an object called the “Queue Meta Data” (**QMD**) into the pushbuffer.\n\nThe QMD is the launch descriptor for a compute grid. It holds the grid and\nblock dimensions — our 4096 and 256, from the `.cu` code — the registers per\nthread and shared memory it needs, and two addresses: the program’s start (the\nSASS the first launch loaded into GPU memory) and the constant bank holding the\nkernel’s arguments. That bank is where the arguments the host stub packed land:\nthe driver copies them in and records the bank’s address in the QMD. The QMD\ntells the GPU where the SASS is, how to turn that SASS into a parallel program,\nand where to signal its completion of that program.\n\nEverything is now in place for the GPU to start running. The problem is that\nthe GPU’s **host engine** (the part of the GPU’s control logic that interfaces\nwith the host) hasn’t acted: it doesn’t watch the cursor on modern\ncards. Older GPUs did: they [snooped USERD](https://github.com/NVIDIA/open-gpu-kernel-modules/blob/590.48.01/src/common/unix/nvidia-push/src/nvidia-push.c#L421-L438),\nso writing `GP_PUT` was enough. 
Turing and later don’t, so the change to `GP_PUT` just sits there until something tells the\nengine to look.\n\nIt is told to look through the **doorbell**. The GPU maps a small window of its\nregisters into the process, and one of them is the doorbell; the driver writes\nthe channel’s *work-submit token* to it. The token tells it which channel has\nnew work.\n\nWhen its doorbell gets rung, the host engine reads the updated `GP_PUT`,\nfollows the new GPFIFO entry to the pushbuffer span, and pulls the methods out\nof it by DMA. When it reaches the compute method carrying our QMD, it hands\nthat descriptor to the “compute work distributor”, about which more shortly.\n\nFrom the CPU’s side the launch is done: `cuLaunchKernel` returned the moment\nthe doorbell was rung. The call was asynchronous, so control returns to the\nprogram and the CPU runs on while the GPU works; we pick the host side back up\nonce the kernel has run.\n\nIt’s time for the GPU to start doing its job.\n\n## Instruction by instruction\n\nThe host engine hands the QMD to the **compute work distributor** (sometimes\nstill called the GigaThread Engine). There is\none of these on the whole GPU. There is one linear list of SASS instructions in\nVRAM, and the compute work distributor, together with the QMD, is the first step in telling\nthe hardware how to make that linear list of thread instructions into a\nmassively parallel program across all the **Streaming Multiprocessors** (SMs).\n\nIn our journey down the stack, our compute work distributor now has a QMD\ndescribing 4096 blocks of 256 threads. The card we are targeting is a GeForce\nRTX 4090 chip with **128 SMs** (NVIDIA’s AD102-300-A1 SKU disables 16 of the physical 144 SMs on the full\ndie to maximize manufacturing yield, as detailed in the [NVIDIA Ada GPU Architecture whitepaper](https://images.nvidia.com/aem-dam/Solutions/geforce/ada/nvidia-ada-gpu-architecture.pdf)). 
The distributor’s task is to keep all 128\nsaturated with work.\n\nThe compiled machine code sits as a single linear sequence in global memory.\nEach SM contains its own local Instruction Cache (I-cache), and every active\nwarp on the GPU maintains its own private [Program\nCounter](https://en.wikipedia.org/wiki/Program_counter) (PC). (Since Volta, the model goes finer still — each *thread* carries its own\nprogram counter and call stack ([Independent Thread\nScheduling](https://docs.nvidia.com/cuda/volta-tuning-guide/index.html#independent-thread-scheduling)),\nletting threads in a warp diverge and reconverge freely. Issue is still\nper-warp, though: each cycle the scheduler picks one warp and issues to the\nlanes currently at a common PC.) Schedulers on the\nSM then fetch instructions from that linear sequence independently, allowing\ndifferent warps to execute the same SASS code at different speeds, or down\ndifferent branch paths.\n\nOne instruction stream in VRAM, cached locally per SM. An SM keeps up to\n48 warps resident (the grid), but its four schedulers issue at most one\ninstruction each per cycle. 
Here nearly every warp is parked on the `LDG.E` load\n(orange) and only one slot is issuing the `FADD` (green).\n\nThe hardware constraints of our SMs set the number of blocks that can run at\nthe same time (`cudaGetDeviceProperties` tells you this information):\n\n```\n  +------------------------------------------------------------+\n  |                   AD102 SM Resource Caps                   |\n  +------------------------------------------------------------+\n  |  Max Active Threads/SM |  1,536 threads (48 warps)         |\n  |  Register File/SM      |  65,536 32-bit registers (256 KB) |\n  |  Shared Memory/SM      |  100 KB                           |\n  +------------------------------------------------------------+\n```\n\nOur launch configuration specifies blocks of **256 threads (8 warps)**, and\n`ptxas` reserved **16 registers per thread**.\n\n- **Register capacity**: Each block needs $256 \\times 16 = 4{,}096$ registers. On registers alone, an SM could fit $65{,}536 / 4{,}096 = 16$ resident blocks.\n- **Thread capacity**: The hardware caps each SM at 1,536 active threads. Divided by our block size, this yields $1{,}536 / 256 = 6$ resident blocks.\n\nBecause thread capacity is the tighter bottleneck, each SM holds at most **6\nblocks (48 warps) at once**.\n\nThe distributor assigns these 6 resident blocks to an SM. Each SM is divided\ninto **four processing blocks (sub-partitions)**. Each sub-partition is a\nself-contained execution pipeline.\n\nThe SM distributes our 48 resident warps evenly across these four\nsub-partitions, so when the SM is full each warp scheduler has **12 active warps**\n($48 / 4 = 12$) to manage. Every cycle, a warp scheduler\nevaluates its 12 candidates, selects one *eligible* warp, and dispatches its\nnext instruction across the 32 physical lanes of its execution slice.\n\n### What does it mean for a warp to be *eligible*?\n\nA GPU decides when an instruction is ready to run differently from a CPU. 
A\nmodern out-of-order CPU discovers dependencies [dynamically at\nruntime](https://en.wikipedia.org/wiki/Tomasulo%27s_algorithm), with [reorder\nbuffers](https://en.wikipedia.org/wiki/Re-order_buffer) and [rename\nlogic](https://en.wikipedia.org/wiki/Register_renaming) spending silicon on\nextracting parallelism from a single thread. A GPU doesn’t need that: it hides\nlatency by keeping many warps resident and switching between them when they\nstall. With parallelism the order of the day, too much heavyweight dependency\nmachinery is the wrong use of silicon. So the hardware leans on the compiler to\nschedule everything whose timing it can predict, falling back to lightweight\nhardware scoreboards for whatever it can’t.\n\nEvery 128-bit SASS instruction carries a packed control-code payload written by\n`ptxas`. (The clearest public reconstructions are the Citadel microbenchmarking\npapers ([Jia et al., “Dissecting the NVIDIA Volta GPU Architecture via\nMicrobenchmarking”](https://arxiv.org/abs/1804.06826)) and [these\nmaxas control-code notes](https://github.com/NervanaSystems/maxas/wiki/Control-Codes)\nfor Maxwell.) These scheduling control bits dictate hardware timing directly and\ncontain three key directives:\n\n1. **A static stall count**: For fixed-latency instructions (like standard integer or floating-point maths) the compiler knows exactly when the ALUs will write back. It encodes a precise cycle count telling the scheduler exactly how long to park this warp before issuing its very next instruction.\n2. **A yield hint**: A single bit telling the scheduler whether this warp should yield its scheduling priority. If the compiler knows this warp is about to hit a bottleneck, it sets this hint to let the scheduler prioritize other active warps on the next clock cycle.\n3. **Dependency-barrier indices**: For variable-latency operations whose duration cannot be predicted at compile time (most notably global memory loads, `LDG`, and special functions, `MUFU`) the hardware provides **six physical scoreboard barriers (numbered 0 to 5)** per warp.\n\n## Why you won't see these bits in the disassembly\n\nWhen you disassemble a binary using NVIDIA’s standard `nvdisasm` tool, the raw\ncontrol codes are hidden by default; the tool strips them away to show you\nstandard, clean SASS mnemonics. However, they are stored directly alongside the\ninstructions. If you inspect the raw binary using `cuobjdump -sass` and look\nclosely at the hexadecimal instruction comments (e.g., `/* 0x... */`), you will\nsee the packed, raw hex words that house these control bits.\n\nWhat we know about their exact layout comes from the microbenchmarking community’s reverse-engineering efforts. 
Although the bit fields have shifted and evolved between Maxwell, Volta, Ampere, and Ada Lovelace, the core architectural concept remains identical: compile-time static scheduling metadata is packed directly into the instruction stream to keep the SM hardware as simple and power-efficient as possible.\n\nWhen we run `cuobjdump -sass` on our `vadd`, each instruction comes with its raw\n128-bit encoding as two 64-bit words, and the *second* word of each pair\ncarries the control payload:\n\n``` bash\n$ cuobjdump -sass vadd                              # control payload\n/*00a0*/  LDG.E R4, [R4.64]                       /* 0x000ea8000c1e1900 */\n/*00b0*/  LDG.E R3, [R2.64]                       /* 0x000ea2000c1e1900 */\n/*00c0*/  IMAD.WIDE R6, R6, R7, c[0x0][0x170]     /* 0x000fe200078e0207 */\n/*00d0*/  FADD R9, R4, R3                         /* 0x004fca0000000000 */\n/*00e0*/  STG.E [R6.64], R9                       /* 0x000fe2000c101904 */\n```\n\nPulling out the control payloads (their bit layout is in the\n[appendix](#decoding-the-sass-control-words)), you can see the schedule that\n`ptxas` wrote, with each directive in action:\n\n| instruction | stall | yield | sets | waits-on |\n|---|---|---|---|---|\n| `LDG.E` | 4 | yes | `B2` | — |\n| `LDG.E` | 1 | yes | `B2` | — |\n| `IMAD.WIDE` | 1 | yes | — | — |\n| `FADD` | 5 | no | — | `B2` |\n| `STG.E` | 1 | yes | — | — |\n\nThe two loads leverage directive 3, each “set”-ing the **same** scoreboard\nbarrier, `B2`. 
The `FADD`, the first instruction that needs the loaded `R4` and\n`R3`, carries a wait on `B2`: until both loads have returned and the barrier\nclears, the warp is **ineligible**, and the scheduler skips it for one of the\nother eleven warps in the sub-partition.\n\nThe `FADD`→`STG` hand-off is directive 1. A floating-point add has a fixed\nlatency, so there is no barrier: `FADD` just carries `stall=5`, which parks the\nwarp for the few cycles it takes for `R9` to land before `STG` reads it.\n\nThe yield bit, directive 2, toggles on & off across the sequence, as the compiler nudges scheduling priority around the operations that are about to wait.\n\nEach cycle the scheduler reads each warp’s six-bit barrier state and small stall counter, and makes the eligibility decision for that warp. This is how a GPU hides latency with close to zero hardware-scheduling overhead.\n\n### Loading the data\n\nWhen a warp scheduler does find an eligible warp and issues the `LDG.E` loads,\nwe can follow the hardware requests down the memory hierarchy. Each of the 32\nthreads in the warp computes an address. Because our threads access consecutive\nelements of `float` arrays (each 4 bytes), the warp requests a contiguous block\nof 128 bytes ($32 \\times 4$ bytes).\n\nThe SM’s load/store unit detects this consecutive access pattern and performs\n**request coalescing**. It merges the 32 per-thread 4-byte requests into four\n32-byte sector requests. Fetches are in units of 32 bytes, so this is perfect\n— if the reads were not consecutive & coalesced like this we’d end up loading\nmore data than we needed.\n\nThe coalesced requests first check the SM’s local L1 Data Cache. If they miss,\nthey are routed through a high-bandwidth crossbar interconnect that links all\n128 SMs to the distributed slices of the 72 MB L2 Cache. 
If the requests miss\nin the L2 cache as well, they descend further to the memory controllers and\ntravel across the memory bus to the physical GDDR6X VRAM chips. (The RTX 4090 uses GDDR6X memory rather than the High-Bandwidth Memory\n(HBM) found in datacenter-class GPUs like the A100 or H100.) The\n`STG.E` store that writes `c[i]` at the end of the kernel follows the exact\nsame path in reverse (in principle anyway; we’ll see later that `c[i]`\nnever hits VRAM).\n\nIf we run our compiled kernel under the NVIDIA Nsight Compute profiler (`ncu`),\nwe can get some telling metrics:\n\n``` bash\n$ ncu --metrics \\\n    launch__grid_size,launch__block_size,launch__registers_per_thread,\\\n    launch__waves_per_multiprocessor,sm__warps_active.avg.pct_of_peak,\\\n    smsp__issue_active.avg.pct_of_peak,dram__throughput.avg.pct_of_peak,\\\n    gpu__time_duration.sum \\\n    ./vadd\n...\n----------------------------------------------------------\nMetric Name                                   Unit   Value\n----------------------------------------------------------\nlaunch__grid_size                                    4,096\nlaunch__block_size                                     256\nlaunch__registers_per_thread                            16\nlaunch__waves_per_multiprocessor                      5.33\nsm__warps_active.avg.pct_of_peak              %      82.77\nsmsp__issue_active.avg.pct_of_peak            %       5.17\ndram__throughput.avg.pct_of_peak              %      79.65\ngpu__time_duration.sum                        us     10.78\n----------------------------------------------------------\n```\n\nOn average, 82.77% of the available warp slots were active over the run. The warp schedulers were issuing instructions only 5.17% of the time. 
The DRAM was running at 79.65% of its peak throughput.\n\nThe kernel has an extremely low **arithmetic intensity**: it\nperforms exactly one floating-point addition (`FADD`) and a tiny amount of\npointer arithmetic for every 12 bytes of data it transfers (two 4-byte loads\nand one 4-byte store).\n\nSo the 10.78 µs just comes down to how fast the DRAM bus can feed the\nkernel its inputs, here about four-fifths of peak. (Only the two inputs cross the bus, not the full 12 MB. `ncu` shows\n8.4 MB read from DRAM and essentially nothing written: the 4 MB output `c` fits\nin the 72 MB L2 and has not been flushed to DRAM by the time the later device-to-host copy\nreads it back. The four-fifths-of-peak figure is the read side:\n$8.4\\,\\text{MB} / 10.78\\,\\mu\\text{s} \\approx 780\\,\\text{GB/s}$.)\n\n## Back to the CPU\n\nThe result is now sitting in the GPU’s L2 cache. The CPU is what runs our terminal, so it needs to get the result in order to show it to us. We return to its view of events.\n\nThe launch returned control to the CPU the moment the doorbell rang. So the GPU\nneeds to tell the CPU it’s done. When the last of our 4096 blocks retires, the\nGPU does so by posting a completion semaphore the QMD carried (the [fence\nfields](#reading-device-memory--qmd-layout) at words 23–24).\n\nThe device-to-host `cudaMemcpy(c, dc, …)` copy sits behind the kernel on\nthe default stream, so the GPU’s copy engine (which performs the transfer) is\ngated on the semaphore. (A pinned-memory `cudaMemcpyAsync` would skip the wait\nand let the host run ahead.) Once the value appears the GPU performs the DMA.\nBecause `c` is still sitting dirty in the 72 MB L2 — the `STG.E` stores never\nhad to spill it to DRAM — the engine’s reads are served straight from L2, and\nthe data crosses PCIe without a DRAM round trip.\n\nOnce the copy finishes, it posts its *own* semaphore, which the host was\nwaiting on in `cudaMemcpy`. 
`cudaMemcpy` completes on the host, `c` is ordinary\nhost memory again, and `printf` loads `c[0]` and `c[n-1]` out of RAM, formats\nthem into a string, and hands them to a `write` syscall on stdout.\n\n## The whole path\n\nThe kernel source went through `cicc` to PTX and through `ptxas` to SASS, which\n`fatbinary` packed with a fallback copy of the PTX into a cubin-bearing fatbin\nthat the linker welded into an ordinary Linux executable. A constructor\nregistered that fatbin before `main`, mapping a host stub to a mangled device\nname. The first launch lazily uploaded the cubin to the GPU. `cuLaunchKernel`\nbuilt a QMD from the launch configuration, wrote it into a pushbuffer as GPU\nmethods, advanced `GP_PUT`, and rang a doorbell with a single MMIO store, at\nwhich point the GPU’s host engine fetched the work and handed the QMD to the\ncompute work distributor. The distributor spread 4096 blocks across 128 SMs at\nfull occupancy, four warp schedulers per SM issued 128-bit instructions whose\nstall counts the compiler had written, and a coalesced memory path pulled the\ninputs through DRAM at four-fifths of peak bandwidth to compute, in each of a\nmillion lanes, a single sum. A completion semaphore and a copy engine then\ncarried that result back across the bus to where `printf` was waiting, and we\nlearnt that:\n\n```\nc[0]=2.000000 c[n-1]=2.000000\n```\n\n## Appendix: how to look inside the launch\n\nClaude & I used a lot of different tricks to see the different parts of the\nkernel launch happen here. Some of it comes from painstakingly reading the\n[open kernel modules](https://github.com/nvidia/open-gpu-kernel-modules).\n\nA few claims in this post can’t be read off the open source, because `libcuda` is closed source. 
To figure them out, there are a few useful diagnostic hooks.\n\n### An interposition hook\n\nDriver method writes never go through a syscall (the driver writes them\nstraight into a write-combined buffer it has already mapped), so to find them\nyou need to read the memory. We used an `LD_PRELOAD` shim that wraps `mmap`,\nrecords every region the driver maps from the `/dev/nvidia*` files, and exposes\na function a test program calls just after the launch returns to dump them:\n\n``` c\n#define _GNU_SOURCE\n#include <stdio.h>\n#include <stdlib.h>\n#include <dlfcn.h>\n#include <sys/mman.h>\n#include <unistd.h>\n#include <string.h>\n\n// Pointer to the real mmap, resolved lazily via dlsym\nstatic void* (*orig_mmap)(void*, size_t, int, int, int, off_t) = NULL;\n\n// Store captured channel mappings\nstruct Map {\n    void* addr;\n    size_t length;\n    off_t offset;\n    char path[256];\n} maps[128];\nstatic int map_count = 0;\n\nvoid* mmap(void* addr, size_t length, int prot, int flags, int fd, off_t offset) {\n    if (!orig_mmap) {\n        orig_mmap = dlsym(RTLD_NEXT, \"mmap\");\n    }\n    void* ret = orig_mmap(addr, length, prot, flags, fd, offset);\n    if (ret != MAP_FAILED && fd != -1 && map_count < 128) {\n        char proclink[256];\n        char path[256];\n        snprintf(proclink, sizeof(proclink), \"/proc/self/fd/%d\", fd);\n        ssize_t len = readlink(proclink, path, sizeof(path) - 1);\n        if (len != -1) {\n            path[len] = '\\0';\n            // We care about NVIDIA device files\n            if (strstr(path, \"/dev/nvidia\")) {\n                maps[map_count].addr = ret;\n                maps[map_count].length = length;\n                maps[map_count].offset = offset;\n                strcpy(maps[map_count].path, path);\n                map_count++;\n            }\n        }\n    }\n    return ret;\n}\n\n// Expose a function to dump memory ranges holding the pushbuffer\nvoid dump_pushbuffer() {\n    printf(\"\\n=== [Shim] Dump of Mapped Pushbuffers ===\\n\");\n    
for (int i = 0; i < map_count; i++) {\n        // Only scan mappings of at least one 4 KiB page\n        if (maps[i].length >= 0x1000) {\n            unsigned int* ptr = (unsigned int*)maps[i].addr;\n            printf(\"Mapping %d: %s, at %p (%zu bytes), offset 0x%lx\\n\",\n                   i, maps[i].path, maps[i].addr, maps[i].length, (unsigned long)maps[i].offset);\n\n            // Walk the words looking for a method-header burst\n            for (size_t j = 0; j < maps[i].length / 4; j++) {\n                unsigned int word   = ptr[j];\n                unsigned int opcode = (word >> 29) & 0x7;     // 1 = INC\n                unsigned int count  = (word >> 16) & 0x1FFF;  // payload words\n                unsigned int method = (word & 0xFFF) << 2;    // register offset\n\n                // 0x318 is SET_INLINE_QMD_ADDRESS_A, the start of the inline burst\n                if (opcode == 1 && method == 0x318) {\n                    printf(\"  [+] Method burst at word %zu: header = 0x%08X\\n\", j, word);\n                    printf(\"      INC, count %u, offset 0x%04X\\n\", count, method);\n                    for (unsigned int k = 1; k <= count && (j + k) < (maps[i].length / 4); k++) {\n                        printf(\"      word %02u: 0x%08X\\n\", k, ptr[j + k]);\n                    }\n                }\n            }\n        }\n    }\n}\n```\n\nCompile it into a shared library:\n\n``` bash\n$ gcc -shared -fPIC -o shim.so shim.c -ldl\n```\n\nThen call `dump_pushbuffer()` from the test program just after the kernel\nlaunch, and run it with the shim preloaded so this `mmap` runs in place of\nlibc’s:\n\n``` bash\n$ LD_PRELOAD=./shim.so ./vadd\n```\n\nThe driver maps a write-combined buffer for the channel; the dump walks it and prints the launch’s method burst, which we then need to decode.\n\n### Decoding the pushbuffer command stream\n\nA pushbuffer method is a header word followed by data words. 
The header packs\nfour fields (defined as `NVC46F_DMA_INCR_*` macros in `clc46f.h`):\n\n- **bits 31:29**: opcode. `0x1` is an increasing-method write (`INC_METHOD`/`INCR_OPCODE_VALUE`), `0x3` is a non-increasing-method write (`NON_INC_METHOD`), and `0x4` is an immediate-data write (`IMMD_DATA_METHOD`).\n- **bits 28:16**: count, the number of payload words (`NVC46F_DMA_INCR_COUNT`).\n- **bits 15:13**: subchannel index, which routes the commands to a specialized backend engine context (`NVC46F_DMA_INCR_SUBCHANNEL`).\n- **bits 11:0**: the method’s register offset, divided by four (`NVC46F_DMA_INCR_ADDRESS`); the shim shifts it left by two to recover the byte offset.\n\nThere are two launch paths that seem relevant here. The methods are defined per\ncompute class in `src/common/sdk/nvidia/inc/class/`: `clc3c0.h` (Volta),\n`clc5c0.h` (Turing), `clc6c0.h`/`clc7c0.h` (Ampere), `clc9c0.h` (Ada),\n`clcbc0.h` (Hopper), `clcdc0.h` (Blackwell). The Ada header (`clc9c0.h`) is a\n29-line stub that only defines the class number `0xC9C0` and inherits the\nAmpere method set, so the definitions we actually read live in the Ampere\nheaders:\n\n- `0x0318`: `SET_INLINE_QMD_ADDRESS_A` (defined as `NVC6C0_SET_INLINE_QMD_ADDRESS_A` in the Ampere header, inherited unchanged by Ada), which opens an inline-QMD burst streamed straight into the pushbuffer via `LOAD_INLINE_QMD_DATA(i)` (offset `0x0320 + i * 4`).\n- `0x02b4`: `SEND_PCAS_A`, the out-of-line path, which carries only a pointer to a QMD that lives elsewhere in VRAM.\n\nFrom the dump, we can figure out which one ends up in the pushbuffer. The dump\nshows the inline path: one increasing-method burst, count 66, opening at\n`SET_INLINE_QMD_ADDRESS_A`. 
The 66 words are the two address words\n(`SET_INLINE_QMD_ADDRESS_A`/`_B`, `0x0318`/`0x031c`) followed by 64\n`LOAD_INLINE_QMD_DATA` words (`0x0320` onward): a 256-byte QMD carried inline.\nWithin it, word 12 is `0x1000` and word 18 is `0x100`: the 4096 and 256 of\n`vadd<<<4096, 256>>>`.\n\n### Reading device memory & QMD Layout\n\nThe Queue Meta Data (QMD) structure is represented as a multi-word layout, with\nfields defined as multi-word (MW) bits spanning 32-bit boundaries inside\n`src/common/sdk/nvidia/inc/class/cla0c0qmd.h`. The QMD stores several\naddress-like fields, but they aren’t all the same kind of value:\n\n- `PROGRAM_OFFSET`, `MW(287:256)` (Word 8), is a **32-bit** entry-point offset relative to the channel’s code base, not a 64-bit pointer.\n- `CONSTANT_BUFFER_ADDR_LOWER(i)`/`ADDR_UPPER(i)`, `MW(959+i*64:928+i*64)` (e.g. Constant Bank 0, which holds our arguments, sits in Words 29–30).\n- `RELEASE0_ADDRESS_LOWER/UPPER`, `MW(767:736)` (Words 23–24), used for fences/semaphores.\n- `CIRCULAR_QUEUE_ADDR_LOWER/UPPER`, `MW(319:288)` (Words 9–10).\n\nThese point into device memory the CPU can’t read directly: a plain load faults,\nand both `cudaMemcpy` and `cuMemcpyDtoH` reject the address.\n\nSo we need to read it with the GPU. A small kernel copies 512 bytes from a raw pointer into a buffer the host can fetch:\n\n``` cuda\n__global__ void peek(const unsigned char* src, unsigned char* dst) {\n    for (int i = blockIdx.x * blockDim.x + threadIdx.x;\n         i < 512;\n         i += blockDim.x * gridDim.x) {\n        dst[i] = src[i];\n    }\n}\n```\n\nPointed at each of the QMD fields, exactly one returns all 512 bytes of the SASS. 
If we run a memory-scanning shim to look for valid GPU virtual addresses inside the QMD, we see a match at Word 48:\n\n```\nqmd[48] -> 0x74167b272300   512 / 512 bytes match\n```\n\nWhy is SASS matched at Word 48 (`qmd[48]`) when the driver’s program field is\n`PROGRAM_OFFSET` at Word 8?\n\nWord 8 holds only a 32-bit offset set by the driver, whereas words 48/49 are the\nhardware-owned `HW_ONLY_INNER_GET` (`MW(1566:1536)`) and `HW_ONLY_INNER_PUT`\n(`MW(1598:1568)`) fields. In the post-launch dump those words hold a full 64-bit\nGPU virtual address, and dereferencing the word-48 value returns the kernel\nSASS. The simplest reading is that the scheduler resolves the program offset\ninto these scheduler-owned fields at launch.\n\n### Decoding the driver’s ioctls\n\nThe command stream needs to be read from the memory, but `libcuda` sets up its\nmemory and GPU objects the ordinary way: by running `ioctl` (see Michael\nKerrisk, [The Linux Programming Interface](https://www.amazon.com/Linux-Programming-Interface-System-Handbook/dp/1593272200),\nChapters 4 and 15) on the driver’s device files.\n\n`strace` on the\none-kernel program records 948 of them, almost all on two file descriptors,\n`/dev/nvidiactl` and `/dev/nvidia-uvm` (most are one-time setup; a steady launch loop makes far fewer):\n\n``` bash\n$ strace -f -e trace=ioctl ./vadd\n...\nioctl(8, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x900), ...)   # /dev/nvidiactl\nioctl(8, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2b, 0x30),  ...)   # /dev/nvidiactl\nioctl(9, ...)                                                  
# /dev/nvidia-uvm\n...\n```\n\nThe magic byte `0x46` is `'F'`, the NVIDIA resource manager’s ioctl magic (the ‘magic’ byte is a value every NVIDIA ioctl carries as a sanity check;\nsee [the Linux kernel\ndocumentation](https://kernel.org/doc/html/v5.4/process/magic-number.html)).\nThe command numbers decode against the open kernel modules’\n[nv_escape.h](https://github.com/NVIDIA/open-gpu-kernel-modules/blob/590.48.01/src/nvidia/arch/nvalloc/unix/include/nv_escape.h#L27-L31):\n`0x2A` is `NV_ESC_RM_CONTROL` and `0x2B` is `NV_ESC_RM_ALLOC`.\n\n### Decoding the SASS control words\n\nThe stall counts, barriers and yield bits from [the eligibility\nsection](#what-does-it-mean-for-a-warp-to-be-eligible) come from a 21-bit control\nfield `ptxas` packs into the top of each instruction’s second 64-bit word, which\n`cuobjdump -sass` prints next to the mnemonic:\n\n```\n 20    17 16       11 10   8 7    5 4 3    0\n┌────────┬───────────┬──────┬──────┬─┬──────┐\n│ reuse  │ wait mask │ read │write │Y│stall │\n│  (4)   │    (6)    │ barr │ barr │ │ (4)  │\n└────────┴───────────┴──────┴──────┴─┴──────┘\n```\n\nThe two 3-bit indices name the scoreboard barriers the instruction sets, the\n6-bit mask selects the barriers it waits on, `Y` is the yield bit, and `stall` is\nthe static cycle count. 
The layout is undocumented and reconstructed from\nmicrobenchmarking. The clearest public reconstructions are the Citadel microbenchmarking\npapers ([Jia et al., “Dissecting the NVIDIA Volta GPU Architecture via\nMicrobenchmarking”](https://arxiv.org/abs/1804.06826)) and [these\nmaxas control-code notes](https://github.com/NervanaSystems/maxas/wiki/Control-Codes)\nfor Maxwell.\n\n### NVCC host registration callbacks\n\nIf you want to see the exact code the compiler generates to register your GPU code at startup, compiling with `nvcc --keep` lets you inspect `vadd.cudafe1.stub.c`.\n\nThe start-of-process registration is handled by an automatically generated constructor:\n\n```\n// from vadd.cudafe1.stub.c\nstatic void __sti____cudaRegisterAll(void) __attribute__((__constructor__));\n\nstatic void __nv_cudaEntityRegisterCallback(void **__T4) {\n    __cudaRegisterEntry(__T4, (void(*)(const float*, const float*, float*, int))vadd,\n                        _Z4vaddPKfS0_Pfi, -1);\n}\n\nstatic void __sti____cudaRegisterAll(void) {\n    __cudaRegisterBinary(__nv_cudaEntityRegisterCallback);\n}\n```\n\nThe `__attribute__((__constructor__))` annotation makes the compiler emit `__sti____cudaRegisterAll` as a constructor, which the process’s startup code runs before `main` starts. It registers our device binary with the CUDA runtime and schedules the callback. 
When executed, `__cudaRegisterEntry` maps the host function pointer `vadd` to the mangled device entry point `_Z4vaddPKfS0_Pfi`, building the hash table that `cudaLaunchKernel` queries at launch time.\n\nLast modified: 29 Jun 2026", "url": "https://wpnews.pro/news/what-happens-when-you-run-a-cuda-kernel", "canonical_source": "https://fergusfinn.com/blog/what-happens-when-you-run-a-gpu-kernel/", "published_at": "2026-06-29 13:11:08+00:00", "updated_at": "2026-06-29 13:20:28.284071+00:00", "lang": "en", "topics": ["machine-learning"], "entities": ["NVIDIA", "CUDA", "RTX 4090", "LLVM", "PTX", "SASS"], "alternates": {"html": "https://wpnews.pro/news/what-happens-when-you-run-a-cuda-kernel", "markdown": "https://wpnews.pro/news/what-happens-when-you-run-a-cuda-kernel.md", "text": "https://wpnews.pro/news/what-happens-when-you-run-a-cuda-kernel.txt", "jsonld": "https://wpnews.pro/news/what-happens-when-you-run-a-cuda-kernel.jsonld"}}