{"slug": "cuda-oxide-a-speed-of-light-gemm-in-pure-rust-companion-notes-for-the-stream", "title": "cuda-oxide: a speed-of-light GEMM in pure Rust (companion notes for the stream)", "summary": "A developer built a matrix-multiply kernel in pure Rust with cuda-oxide that achieves 58% of NVIDIA's hand-tuned library performance on a Blackwell GPU. The kernel, gemm_sol_clc_multicast_4_stage_pipeline, is a few hundred lines long and incorporates eight distinct ideas to solve specific bottlenecks. The project demonstrates high-performance GPU computing using Rust.", "body_md": "This walkthrough reads **one** GPU kernel: `gemm_sol_clc_multicast_4_stage_pipeline`,\na matrix-multiply written in pure Rust that reaches 58% of the throughput of NVIDIA's hand-tuned\nlibrary on a Blackwell GPU. It is a few hundred lines, and packed into it are\nabout eight distinct ideas, each one solving a specific bottleneck.\n\nThe plan: understand the problem, meet the one piece of silicon that does the actual math (the tensor core), look at the kernel from the top, then walk it part by part. 
Every part teaches one idea: what it is, the bottleneck it removes, how it shows up in this exact kernel, and the Rust that expresses it.\n\nThe method is the same throughout: **find what is stalling, fix exactly that,\nrepeat.**\n\n- [The Problem: Multiplying Two Big Matrices](https://gist.github.com/starred.atom#the-problem-multiplying-two-big-matrices)\n- [A Kernel Is Just a Rust Function](https://gist.github.com/starred.atom#a-kernel-is-just-a-rust-function)\n- [The Hardware: Threads, Warps, Blocks, SMs, Clusters](https://gist.github.com/starred.atom#the-hardware-threads-warps-blocks-sms-clusters)\n- [The Engine: Tensor Cores and the 8x8 Brick](https://gist.github.com/starred.atom#the-engine-tensor-cores-and-the-8x8-brick)\n- [The Kernel in One Picture](https://gist.github.com/starred.atom#the-kernel-in-one-picture)\n- [Walking the Kernel, One Idea at a Time](https://gist.github.com/starred.atom#walking-the-kernel-one-idea-at-a-time)\n- [The Epilogue: Getting the Answer Out](https://gist.github.com/starred.atom#the-epilogue-getting-the-answer-out)\n- [The Rust Toolbox](https://gist.github.com/starred.atom#the-rust-toolbox)\n- [The Payoff](https://gist.github.com/starred.atom#the-payoff)\n- [Key Takeaways](https://gist.github.com/starred.atom#key-takeaways)\n- [Reproducing](https://gist.github.com/starred.atom#reproducing)\n- [Source Material](https://gist.github.com/starred.atom#source-material)\n\nEverything in this kernel is one operation: multiply two matrices. Take two\n4096-by-4096 grids of numbers, `A` and `B`, and produce a third, `C = A times B`.\n\n```\n        K (=4096)            N (=4096)              N (=4096)\n     ┌───────────┐      ┌───────────────┐      ┌───────────────┐\n   M │     A     │  ×  K│       B       │  =  M│       C       │\n(4096│  (M × K)  │      │    (K × N)    │ (4096│    (M × N)    │\n)    └───────────┘      └───────────────┘ )    └───────────────┘\n\n  A is M rows by K columns.   B is K rows by N columns.\n  C is M rows by N columns.   
For us, M = K = N = 4096.\n```\n\n`M`, `N`, and `K` are the three sizes. `K` is the shared one: it lines A's\ncolumns up with B's rows, and it is the dimension that disappears in the product.\n\n**How is one cell of C computed?** A single number `C[i, j]` is the dot product\nof row `i` of A with column `j` of B: multiply element by element, add it all up.\n\n```\n   C[i, j]  =  (row i of A)  ·  (column j of B)\n\n            =  A[i,0]·B[0,j] + A[i,1]·B[1,j] + ... + A[i,4095]·B[4095,j]\n               └──────────────── sum over all K = 4096 terms ────────────┘\n```\n\n**The scale, so you feel why this is a GPU job.** C has `M × N` = about 16.7\nmillion cells. Each cell is a sum of `K` = 4096 multiply-adds. That is roughly\n68.7 billion multiply-adds (about 137 billion floating-point operations) for one\nproduct, and real workloads do thousands of these per second.\n\nTwo facts from this picture drive every decision in the kernel:\n\n- **C is enormous** (16.7M cells), so the work must be split up. That is *grid tiling*.\n- **Each cell sums over all of K** (4096 terms), and K is far too long to hold in fast memory at once, so it must be walked. That is the *K-loop*.\n\nGroundwork before the big kernel: what GPU code looks like in cuda-oxide. Here is the simplest possible kernel, adding two vectors.\n\n```rust\n#[kernel]\npub fn vecadd(a: &[f32], b: &[f32], mut c: DisjointSlice<f32>) {\n    let idx = thread::index_1d();\n    let idx_raw = idx.get();\n    if let Some(c_elem) = c.get_mut(idx) {\n        *c_elem = a[idx_raw] + b[idx_raw];\n    }\n}\n```\n\nA normal Rust function with a `#[kernel]` attribute. The backend turns it into\nGPU assembly. Inputs are real slices. 
The output is a `DisjointSlice`, and that\none type is what makes the write safe:\n\n```\n  CUDA C++ raw `float* c`             cuda-oxide `DisjointSlice<f32>`\n  ─────────────────────               ──────────────────────────────\n  thread 3 ─┐                         thread 3 ─► [ ][ ][✓][ ][ ]\n  thread 5 ─┴─► [ ][ ][✗][ ]          each thread gets a unique,\n            two threads, one cell     bounds-checked cell, proven\n            (the race compiles)       by the type system\n```\n\nTo write through a `DisjointSlice` you need a `ThreadIndex`, which only the\nhardware-register function `index_1d()` can mint, and which is `!Copy`. You\ncannot photocopy your proof of uniqueness, so two threads cannot get a write\nticket to the same cell. The data race and the out-of-bounds write are\nunrepresentable in safe code. The big kernel uses this same type as its output.\n\nThe kernel talks in terms of \"warps,\" \"blocks,\" and \"clusters.\" Five words of GPU vocabulary make it readable. They nest, smallest to largest, and each level up adds one new power: first more workers, then a shared scratchpad, then a scratchpad shared across blocks.\n\n```\n   thread     one worker, with its own registers\n     │\n     │  32 of them run the SAME instruction in lockstep\n     ▼\n   WARP       32 threads; the GPU hands out work one warp at a time\n     │\n     │  several warps placed on ONE physical core\n     ▼\n   BLOCK      a team of warps on one SM, sharing a fast on-chip scratchpad\n   (a \"CTA\")\n     │\n     │  a few blocks on neighbouring SMs, allowed to see each other's memory\n     ▼\n   CLUSTER    blocks that can read and write each other's scratchpad\n     │\n     ▼\n   GRID       every block of the whole kernel launch\n```\n\n- **Thread**: the smallest worker, one lane with its own registers.\n- **Warp = 32 threads.** They run one instruction together, in lockstep (the GPU's \"SIMT\" model). Issue \"load\" on a warp and all 32 lanes load at once. 
Because work is handed out one warp at a time, this kernel will assign jobs per warp, not per thread.\n- **Block** (also called a **CTA**, cooperative thread array): a group of warps that runs on one SM and shares a small, fast on-chip scratchpad called **shared memory** (up to 228 KB on Blackwell). Threads in a block can synchronize and pass data through it. A block is the unit that owns one 128x128 output tile in the matmul.\n- **SM (streaming multiprocessor)**: the physical core that runs a block. Each SM holds the CUDA cores, the tensor core, and the copy engine (all introduced next), plus the shared-memory scratchpad. A Blackwell B200 has 148 of them.\n- **Cluster**: a few blocks placed on neighbouring SMs that are allowed to read and write each other's shared memory (called distributed shared memory). This is new on Hopper and Blackwell, and it is the reason this kernel can make two blocks cooperate as one.\n- **Grid**: every block of one kernel launch, spread across all the SMs.\n\nMemory follows the same ladder: each **thread** has registers, each **block**\nhas shared memory, and the whole **grid** shares global memory (the big, slow\nHBM where A, B, and C actually live). Fast and tiny at the top, huge and slow at\nthe bottom. Most of the kernel's cleverness is keeping data high on that ladder.\n\nBefore reading the kernel, you have to know the one piece of silicon doing the multiply, because every optimization in it is about feeding that piece correctly. A modern GPU has two kinds of math unit:\n\n- **CUDA cores**: the ordinary, general-purpose lanes. They do the loads, the stores, the index math, the format conversions. Everything except the big matrix multiply.\n- **Tensor cores**: a dedicated matrix-multiply engine. This is the only part fast enough to make the roughly 68.7 billion multiply-adds (about 137 billion floating-point operations) of one product tractable. 
Each SM has one.\n\n**The fundamental unit the tensor core works in is an 8x8 tile, the \"core\nmatrix.\"** Eight rows by eight columns of fp16 numbers: that is 64 numbers, and\nin fp16 it is exactly **128 bytes**. Think of it as a single brick. Every larger\nmultiply the tensor core does is built out of these bricks.\n\n```\n   A 128×128 tile, the way the tensor core sees it: a wall of 8×8 bricks.\n\n      8    8    8           8\n    ┌────┬────┬────┬ ··· ┬────┐\n  8 │ ▦  │ ▦  │ ▦  │     │ ▦  │   each ▦ is one 8×8 core matrix:\n    ├────┼────┼────┼ ··· ┼────┤     64 fp16 numbers = 128 bytes\n  8 │ ▦  │ ▦  │ ▦  │     │ ▦  │     = the tensor core's atom\n    ├────┼────┼────┼ ··· ┼────┤\n     ...                            16 bricks across × 16 bricks down\n    └────┴────┴────┴ ··· ┴────┘     = 256 bricks make one 128×128 tile\n```\n\nYou do not place bricks by hand. You hand the tensor core a small 32-bit\n*descriptor* that says \"operands are fp16, accumulate in fp32, the shape is this\nbig,\" and it does the rest. In CUDA you assemble that descriptor's bits with\nmagic shifts. In cuda-oxide it is a `const fn` builder that folds to one constant\nat compile time, with compile-checked enums instead of shifts:\n\n```rust\nlet idesc = Tcgen05InstructionDescriptor::builder()\n    .shape(Tcgen05MmaShape::M256_N128)          // output shape (more on 256 later)\n    .element_type(Tcgen05ElementType::F16)      // inputs are fp16\n    .accumulator_type(Tcgen05AccumulatorType::F32)  // sum in fp32\n    .build()\n    .raw();\n```\n\n**One detail that matters for Idea 3.** The tensor core does not read a brick in\none gulp. It reads it row by row, and the row stride it expects is **hardwired to\n16 bytes** (8 fp16 numbers). It assumes the 8 rows of a brick sit 16 bytes apart.\nRemember that number. A whole optimization exists just to satisfy it.\n\nWith that vocabulary in hand, here is the whole kernel from the top before we\nzoom in. 
Three things to take away from this picture: it runs **blocks in\npairs**, each block is a **crew of 6 warps** with fixed jobs, and data flows\ndown a fixed assembly line.\n\n```\n   #[cluster_launch(2, 1, 1)]   →   blocks run in pairs (\"CTA pairs\"),\n                                     one block per SM, two SMs cooperating.\n\n   ┌───────────────── one CTA pair: leader (SM 0) + follower (SM 1) ─────────────────┐\n   │                                                                                  │\n   │  Each block = 6 warps (192 threads). Jobs are fixed:                             │\n   │                                                                                  │\n   │   Warp 4  LOADER     ─ streams A and B from global memory into shared memory     │\n   │                        (TMA copies, broadcast to both blocks).   [CUDA cores]    │\n   │                              │ \"buffer ready\"      ▲ \"buffer free\"               │\n   │                              ▼                      │                            │\n   │   Warp 5  MULTIPLIER ─ leader issues ONE paired MMA spanning both blocks;        │\n   │                        follower issues none.      [drives the TENSOR core]       │\n   │                              │ \"tile done\"                                       │\n   │                              ▼                                                   │\n   │   Warps 0-3 STORE    ─ read the result, convert f32→bf16, write to global.       
│\n   │                                                                  [CUDA cores]    │\n   └──────────────────────────────────────────────────────────────────────────────────┘\n\n   The assembly line for the data:\n\n     global memory  ──TMA, multicast──►  shared memory (4-stage ring buffer)\n                                              │  tensor core reads bricks\n                                              ▼\n                                          tensor memory (the fp32 accumulator)\n                                              │  epilogue drains it\n                                              ▼\n                                          shared → global  (the answer)\n```\n\nNow we walk it. Each idea below is one piece of this picture.\n\n**The bottleneck.** C has 16.7 million cells. No single group of threads can\ncompute all of them.\n\n**The idea.** Cut C into small 128-by-128 tiles and give each tile to one block\nof threads (a \"CTA\"). For a 4096 matrix that is 32 tiles across and 32 down:\n1024 tiles, spread across the whole GPU.\n\n```\n        N ─────────────────────────►\n     ┌──────┬──────┬──────┬─── ... ──┐\n   M │ tile │ tile │ tile │          │   Each block owns ONE 128×128\n   │ │(0,0) │(0,1) │(0,2) │          │   tile of C, and reads:\n   │ ├──────┼──────┼──────┤          │     • a 128-row BAND of A\n   ▼ │ tile │ tile │      │          │     • a 128-col BAND of B\n     │(1,0) │ ···  │      │          │\n     └──────┴──────┴──────┴── ... ───┘\n                 C (M × N)\n```\n\nA block computes its tile coordinates from its block id, reads the matching band\nof A and band of B, and produces one 128x128 chunk of C. Two payoffs:\n**parallelism** (many tiles keep every SM busy) and **reuse** (a whole tile\nhits the same A and B bands repeatedly, so you load them into fast memory once).\n\n**In the kernel.** The output is a `DisjointSlice<u32>` (each `u32` packs two\nbf16 results). 
Each block writes a different, non-overlapping region of it, so\nthe writes provably cannot race:\n\n```rust\npub unsafe fn gemm_sol_clc_multicast_4_stage_pipeline(\n    a_tma: *const TmaDescriptor,    // A and B come pre-described for the\n    b_tma: *const TmaDescriptor,    // copy engine (see Idea 3)\n    mut out: DisjointSlice<u32>,    // the output tile surface\n    n: i32, k: i32, tiles_m: u32, _tiles_n: u32,\n) {\n    let first_tile_m = tile_idx % tiles_m;   // which tile this block owns\n    let first_tile_n = tile_idx / tiles_m;\n```\n\n**The bottleneck.** A block owns a 128x128 tile, but each cell still sums over\nall `K` = 4096 terms. The A and B bands are 4096 long and do not fit in fast\nmemory at once.\n\n**The idea.** Walk K in chunks. Load a 64-wide slice of the bands, multiply,\nadd the partial result into the tile, repeat until K is exhausted.\n\n```\n   The band, walked in chunks of 64 (4096 / 64 = 64 iterations):\n\n   A band:  [ chunk0 ][ chunk1 ][ chunk2 ] ... [ chunk63 ]\n   B band:  [ chunk0 ][ chunk1 ][ chunk2 ] ... [ chunk63 ]\n                 │         │         │              │\n   tile  =   A0×B0  +   A1×B1  +   A2×B2  + ... +  A63×B63\n\n   First chunk:  tile  = A0 × B0      (overwrite)\n   Every later:  tile += Ak × Bk      (accumulate)\n   After chunk 63: the tile is the finished answer.\n```\n\nThe partial sums never leave the tensor core's private fp32 accumulator (tensor memory) until the tile is complete. That is the inner loop of the whole algorithm.\n\n**In the kernel.** `k_iters = k / 64`, and `accumulate` is false only on the\nvery first multiply, true forever after:\n\n```rust\nlet k_iters = k / 64;            // 4096 / 64 = 64 chunks\nwhile k_idx < k_iters {\n    // ... 
get this chunk's buffers, wait for data ...\n    let accumulate = k_idx > 0 || j > 0;   // overwrite once, then add\n    tcgen05_mma_f16_cg2(tmem_addr + offset, a_desc, b_desc, idesc, accumulate);\n    k_idx += 1;\n}\n```\n\n(Each 64-wide chunk is actually 4 back-to-back MMA instructions, since one instruction digests 16 of the 64 K-elements; `j` in the snippet is presumably the index of the MMA within the chunk. The brick wall from earlier is being multiplied 16 columns at a time.)\n\n**The bottleneck.** The loader copies each K-chunk into shared memory, and the\ntensor core reads its bricks from there. But a plain, natural row-by-row copy\nlays the data out wrong for the brick reader.\n\nThis is the picture that explains swizzling. A natural copy of a\n64-wide chunk puts each row 128 bytes after the last (64 fp16 numbers per row).\nBut the tensor core's brick reader steps in **16-byte** rows (it wants 8-wide\nbricks, remember). So after the first row of each brick, it reads garbage.\n\n```\n   What a plain copy produces        What the tensor core's brick\n   (rows 128 bytes apart):           reader expects (rows 16 bytes apart):\n\n   row 0 ─► byte 0                    row 0 ─► byte 0\n   row 1 ─► byte 128                  row 1 ─► byte 16\n   row 2 ─► byte 256                  row 2 ─► byte 32\n        the reader steps 16 bytes,         every row lands exactly where\n        so it lands inside row 0's         the hardwired 16-byte stride\n        data and multiplies garbage        points: correct bricks\n```\n\nHow do you get the bricks laid out the way the reader wants? Two options:\n\n1. **Chop the copy into eight skinny strips**, each one already brick-shaped. This works, but it is eight transfers where you wanted one. Slow.\n2. **Swizzle.** Ask the copy engine (TMA) to *scramble the bytes into brick order as it transfers them*. One big copy lands already brick-aligned. 
That is \"swizzling\": a fixed, hardware-understood reshuffle applied in flight.\n\n```\n   Option 1: 8 skinny strips per chunk     Option 2: 1 swizzled copy\n   ┌─┐┌─┐┌─┐┌─┐┌─┐┌─┐┌─┐┌─┐                 ┌────────────────────────┐\n   │ ││ ││ ││ ││ ││ ││ ││ │   ──────────►   │ scrambled into brick   │\n   └─┘└─┘└─┘└─┘└─┘└─┘└─┘└─┘                 │ order by the TMA engine│\n   8× transfer overhead                     └────────────────────────┘\n                                             1× transfer, same layout\n```\n\n**There is a second reason the scramble is shaped the way it is: bank\nconflicts.** Shared memory is split into 32 \"banks,\" and they work like 32\nsupermarket checkout lanes. 32 threads reading 32 different banks all get served\nin one cycle. But if many threads hit the *same* bank, they queue and get served\none at a time.\n\n```\n   No swizzle: a brick's rows all          Swizzle: the reshuffle spreads\n   land in the SAME bank → threads         those rows across different banks\n   queue at one checkout lane              → all served in 1 cycle\n\n      lane ▒ ▒ ▒ ▒ ▒ ▒ ▒ ▒                    lane ▒ . . . . . . .\n           (one bank, serialized)                  . ▒ . . . . . .\n                                                    . . ▒ . . . . .\n```\n\nThe swizzle does both jobs at once: it lands the bricks in the layout the tensor\ncore's 16-byte stride expects, *and* it spreads each brick's bytes across the 32\nbanks so reads never collide.\n\n**In the kernel.** Swizzling is one named mode, threaded into the shared-memory\ndescriptor that tells the tensor core where its bricks are. No change to the\nloop, the math, or the epilogue:\n\n```rust\nconst SWIZZLE_128B: u8 = 2;     // a hardware-known reshuffle pattern\nlet a_desc = build_smem_descriptor(smem_a_base + off, LBO_BYTES, SBO_BYTES, SWIZZLE_128B);\n```\n\n**The bottleneck.** Inside the K-loop, the naive rhythm is: load a chunk, wait,\nmultiply, wait. 
The tensor core sits idle during every load, and the copy engine\nsits idle during every multiply. They take turns when they could overlap.\n\n**The idea.** Keep several shared-memory buffers (a ring buffer of \"stages\") so\nthe loader can fill the next buffers while the tensor core chews on the current\none. This kernel keeps **four** stages, deep enough that the loader runs three\nchunks ahead.\n\n```\n   1 buffer (serial):                 4 buffers (overlap):\n     [load 0]                           [load 0][load 1][load 2][load 3]\n            [calc 0]                            [calc 0][calc 1][calc 2]…\n            [load 1]                     loader stays 3 chunks ahead;\n                   [calc 1]              the tensor core never waits for a load\n```\n\nWhy four and not two? Because of the next two ideas: clusters add a per-chunk handshake between blocks, and a deep buffer is what gives the loader enough runway to hide that handshake. Shallow pipelines stall on it.\n\n**In the kernel.** The stage for each K-iteration is just the low two bits of a\nrunning counter, and which buffers/barriers to use is one *exhaustive* `match`:\n\n```rust\nlet stage = global_k & 3;            // 0,1,2,3 rotating\n// `..` below elides each stage's two barriers (excerpt, not complete code)\nlet (smem_a, smem_b, tma_bar, mma_bar) = match stage {\n    0 => ( &raw mut SMEM_A0, &raw mut SMEM_B0, .. ),\n    1 => ( &raw mut SMEM_A1, &raw mut SMEM_B1, .. ),\n    2 => ( &raw mut SMEM_A2, &raw mut SMEM_B2, .. ),\n    _ => ( &raw mut SMEM_A3, &raw mut SMEM_B3, .. ),   // wildcard makes it total\n};\n```\n\n**The bottleneck.** Here is the subtle one. If a single team of threads issues\n\"load, then multiply, then load,\" it is *serial no matter how many buffers you\ngive it.* The 4-stage pipeline from Idea 4 does nothing on its own. The buffers\nwere never the problem. Having one worker do both jobs was.\n\n**The idea.** Warps are the lever here (recall: 32 threads, scheduled as one\nunit). 
This kernel runs **6 warps per block (192 threads)** and gives **each\nwarp a single dedicated job**, like a kitchen with a prep cook, a line cook, and\na plating crew instead of one person doing all three.\n\n```\n   The block as a specialized crew: 6 warps, 32 threads each.\n\n   Warp 0 ─┐\n   Warp 1  │   STORE CREW (epilogue, 128 threads): once a tile is\n   Warp 2  │   finished, read it out of tensor memory, convert\n   Warp 3 ─┘   f32 → bf16, write it to global.        ← CUDA cores (STORE)\n\n   Warp 4      LOADER (producer): does nothing but issue TMA copies,\n               streaming A and B chunks into the buffers.   ← CUDA cores (LOAD)\n\n   Warp 5      MULTIPLIER (consumer): does nothing but issue MMA\n               instructions to the tensor core.   ← drives the TENSOR core (MATH)\n```\n\nThis is the answer to \"how many warps, and which ones do the CUDA-core work?\"\nSix warps. The **tensor core** does only the multiply-add, and only warp 5 talks\nto it. **Everything else is CUDA-core work**: warp 4 does the loads, warps 0-3\ndo the format conversion and the stores. Load and store are the CUDA cores;\nmultiply is the tensor core; and now they all run at once.\n\nThe loader (warp 4) and the multiplier (warp 5) hand buffers back and forth\nthrough lightweight signals (mbarriers): \"stage ready\" from loader to\nmultiplier, \"stage free\" back the other way. 
No lockstep, no `sync_threads` in\nthe K-loop.\n\n```\n   One team (serial):                 Specialized crew (overlapped):\n   load, compute, load, compute…      Warp 4: load→load→load→…  (only loads)\n                                       │ \"ready\"   ▲ \"free\"\n                                       ▼           │\n                                      Warp 5: wait→MMA→wait→MMA (only MMAs)\n```\n\n**In the kernel.** The roles are named constants, and the body splits into\nblocks gated on `warp_id`:\n\n```rust\nconst TMA_WARP: u32 = 4;\nconst MMA_WARP: u32 = 5;\nif warp_id == TMA_WARP { /* producer: only issues loads          */ }\nif warp_id == MMA_WARP { /* consumer: only issues paired MMAs     */ }\nif warp_id < 4         { /* epilogue: tensor memory → bf16 → global */ }\n```\n\nPipelining (Idea 4) and warp specialization (Idea 5) are one optimization with two halves: the buffers are the racetrack, the specialized warps are the cars. Neither does anything without the other, which is why this kernel has both.\n\n**The bottleneck.** If you launch one fresh block per tile, you pay launch\noverhead 1024 times, and the final step of each tile (writing it out) leaves the\ntensor core idle between tiles.\n\n**The idea.** Launch one long-lived (\"persistent\") block per SM that loops:\nfinish a tile, then grab the next tile to compute. How do blocks agree on who\ngets which tile? A shared counter in global memory works but becomes a traffic\njam (every block hammering one address). 
This kernel instead lets the **hardware\nscheduler** hand out tiles (Cluster Launch Control, CLC), with no global-memory\ntraffic at all.\n\n```\n   Global atomic counter (jam):       On-chip scheduler (CLC):\n   blk ─┐    ┌──────────────┐         blk ─► \"give me a tile\"\n   blk ─┼───►│ counter [73] │              ◄─ \"take tile 73\"\n   blk ─┘    └──────────────┘         tens of cycles, no global traffic\n   contention grows with tile count\n```\n\n**In the kernel.** The loader warp's body is two loops, one nested in the other.\nThe block first computes its *home* tile (the one its `blockIdx` points at) by\nrunning the K-loop of Idea 2: 64 iterations (`k_iters = K/64`), each streaming\none 64-wide K-slab into the pipeline (a 128x64 piece of A and a 64x64 piece of B,\nper block). That inner loop is the loader's entire contribution to one output\ntile. Then an outer `loop` asks the scheduler for *another* tile and runs the\nsame 64-step K-loop again, until CLC reports no work is left. One block, many\ntiles.\n\n```rust\n// HOME tile: stream all 64 K-chunks (this inner while IS the K-loop of Idea 2)\nwhile k_idx < k_iters { /* TMA-load one 64-wide K-slab of A and B */ }\n\nloop {                                                      // then steal more tiles\n    clc_try_cancel_multicast(resp_ptr, &raw mut CLC_BAR);   // ask for a tile\n    if clc_query_is_canceled(resp_lo, resp_hi) == 0 { break; }  // none left → done\n    let tile_idx = clc_query_get_first_ctaid_x(resp_lo, resp_hi) / 2;\n    while k_idx < k_iters { /* the SAME 64-step K-loop, for the stolen tile */ }\n}\n```\n\nThe loader does not preload all 64 chunks at once: with only four pipeline\nstages, the `mma_bar` wait makes it pause until the multiplier frees a stage, so\nit keeps at most four chunks in flight (about three ahead of the one being\nmultiplied) and no further. 
A running tile counter keeps the\nstage rotation and barrier parity continuous across tile boundaries, so the\npipeline does not hiccup at the seam between one tile and the next.\n\nThe block also runs *two* accumulator slots in tensor memory, so the store crew\ncan be draining tile N while the multiplier already starts tile N+1. The tensor\ncore never idles waiting for a write-out.\n\n**Bonus lever (large matrices only).** Persistent blocks also let you choose the *order* tiles are computed. Sweeping a small band of columns before advancing rows keeps the A and B strips that neighbors share resident in the on-chip L2 cache, turning slow DRAM re-reads into cache hits. At 4096-cubed the data already fits and it is a no-op; at 16384-cubed this single reordering is worth nearly +90%. Same math, same output, just a better visit order.\n\n**The bottleneck.** Notice the `#[cluster_launch(2, 1, 1)]` on the function: it\ngroups blocks into **clusters of two** that run together and can see each other's\nshared memory. The two blocks in a cluster work on neighboring output tiles that\nneed overlapping data. Each one loading that shared data separately is redundant\nmemory traffic.\n\n**The idea.** Have one block load the shared operand once and let the hardware\n**broadcast** (multicast) it to both blocks in the cluster. One trip to memory,\nfanned out on-chip.\n\n```\n   Without multicast:                 With multicast:\n   block 0 ─► loads its own copy       GMEM ─► block 0 ─► block 1\n   block 1 ─► loads its own copy       one load, hardware fans it to both\n```\n\nA warning from history: naively widening the cluster and broadcasting *added*\noverhead, because the broadcast forces a per-chunk handshake across the cluster,\nand a shallow pipeline had no slack to hide it. That is exactly why this kernel\npairs a **narrow** cluster (just two blocks) with the **deep** 4-stage pipeline\nof Idea 4. 
The depth pays for the handshake.\n\n**In the kernel.** The load is a multicast bulk-copy; the cluster size lives in\nthe `#[cluster_launch(...)]` attribute:\n\n```rust\n#[cluster_launch(2, 1, 1)]       // clusters of 2 blocks\n// ...\ncp_async_bulk_tensor_2d_g2s_multicast_cg2(smem_a_ptr, a_tma, k, m, aliased_bar, self_mask);\ncp_async_bulk_tensor_2d_g2s_multicast_cg2(smem_b_ptr, b_tma, k, n, aliased_bar, self_mask);\n```\n\n**The bottleneck.** Two blocks in a cluster, each driving its own tensor core,\neach issuing its own MMA and its own synchronization. Can the pair share more?\n\n**The idea.** Pair the two blocks so that a **single tensor-core instruction\nspans both**. With `cta_group::2`, the leader block issues one MMA whose shape is\n`M256_N128`: it reads operands from both blocks' shared memory and writes results\ninto both blocks' tensor memory. One instruction computes a 256-row output tile\n(two stacked 128-row tiles). The follower block just loads and stores; it never\nissues an MMA.\n\n```\n   Two 128-row tiles stacked into one 256-row tile, computed by ONE MMA:\n\n   block 0 (leader)   ─ owns rows   0..127 ─┐\n                                            ├─► one M256_N128 paired MMA\n   block 1 (follower) ─ owns rows 128..255 ─┘    reads both blocks' SMEM,\n                                                 writes both blocks' TMEM\n```\n\nFor that paired MMA to fire, both blocks' \"load done\" signals have to land on\nthe *same* barrier. The trick: mask one bit of the barrier address so both\nblocks point at the leader's barrier. 
That makes the cross-block handshake\nnearly free.\n\n```\n   block 0's barrier addr ─┐\n                           ├─► (addr & 0xFEFFFFF8) ─► leader's barrier\n   block 1's barrier addr ─┘\n   The mask clears the bit that distinguishes the two blocks, so their\n   separate \"I'm loaded\" signals merge into one the leader waits on.\n```\n\n**In the kernel.** The dangerous pointer arithmetic is isolated behind one named\nconstant and one clearly named local, instead of a magic number buried in the\nhot loop; the paired MMA is a single `cta_group::2` call:\n\n```rust\n// Clears bit 24 (the bit that tells the two blocks apart) and the low three\n// bits, which are zero anyway for an 8-byte-aligned barrier.\nconst PEER_BIT_MASK: u32 = 0xFEFFFFF8;\nlet aliased_bar = ((tma_bar_mut as u32) & PEER_BIT_MASK) as *mut Barrier;\n// ... leader only:\ntcgen05_mma_f16_cg2(tmem_addr + offset, a_desc, b_desc, idesc, accumulate);  // spans both blocks\n```\n\nThis is what turns the cluster from a cost into a win: the pair shares both the load (multicast, Idea 7) and the multiply (paired MMA, Idea 8).\n\nWhen the K-loop finishes, the answer for the tile lives in the tensor core's private fp32 accumulator (tensor memory). It is not in a form you can write to global memory yet. 
The store crew (warps 0-3) drains it:\n\n```\n   tensor memory (fp32)  ──read──►  registers  ──convert──►  bf16\n        (CuSimd group)                              │\n                                                    ▼\n                              shared memory  ──coalesced store──►  global memory\n                              (staged with stmatrix)               (the DisjointSlice)\n```\n\n1. **Read** the accumulator out of tensor memory into a register group (`CuSimd`).\n2. **Convert** each fp32 pair to a packed bf16 pair (`cvt_f32x2_bf16x2`).\n3. **Stage** the bf16 results into shared memory in a tidy layout (`stmatrix`).\n4. **Store** that shared buffer to global memory, the only writable surface, the `DisjointSlice` output.\n\nBecause the block runs two accumulator slots, this drain happens for tile N while the multiplier is already grinding tile N+1. The store latency hides behind the next tile's math.\n\n```rust\nlet regs_a = tcgen05_ld_16x256b_pure(tmem_addr + ...);  // read fp32 from TMEM\ntcgen05_load_wait();\nlet p0 = cvt_f32x2_bf16x2(regs_a[0], regs_a[1]);        // f32 → bf16\nstmatrix_m8n8_x2(smem_addr, p0, p1);                    // stage in shared\n// ... then a coalesced copy from SMEM_OUT to the global `out` slice\n```\n\nThree abstractions that are not GPU ideas, but are how Rust keeps the kernel's resource use correct.\n\n**Type-state: a lifecycle the compiler checks.** Tensor memory and barriers must\nbe allocated, used, then freed in order, or you get a leak or silent corruption.\ncuda-oxide encodes that order in the type:\n\n```\n   TmemUninit ──alloc()──► TmemReady ──dealloc()──► TmemDeallocated\n                              │                          │\n                         address()                  (no methods)\n                                                  address() here?\n                                                  ✗ COMPILE ERROR\n```\n\n`alloc` consumes the `Uninit` handle and returns `Ready`. 
`address` only exists on `Ready`. Use-before-alloc and use-after-free do not compile. Barriers get the\nsame treatment, with a \"kind\" so the load barrier and the compute barrier are\ndifferent types.\n\n**CuSimd: register groups as one type.** When the tensor core hands back its\nresult, it arrives as a group of registers. cuda-oxide models that as one\nindexable value, not a pile of named fields and a giant switch:\n\n```\n   CUDA C++:  float r0,r1,…r31;        cuda-oxide:  CuSimd<f32, 32>\n              switch (i) { … }                      regs[i]   // Index trait\n```\n\n**Newtypes: no same-width mix-ups.** A tensor-map is a `TmaDescriptor`, not a\n`void*`. A barrier token is a `BarrierToken`, not a `u64`. Zero-cost, but a wrong\nargument becomes a compile error instead of a silent swap.\n\nThis one kernel runs at about **868 TFLOPS, 58% of a live cublasLtMatmul FP16\nbaseline** at 4096-cubed on a B200. A few hundred lines of Rust, reaching well\npast half of a library NVIDIA has tuned for years.\n\nThe interesting part is how the ideas combine. Two of them do **nothing on their\nown**: the 4-stage pipeline (Idea 4) is dead weight without warp specialization\n(Idea 5) to put a second worker on it, and clustering + multicast (Idea 7)\nactually *loses* performance until the deep pipeline and CTA pairs (Idea 8) hide\nand amortize its handshake. That is why a speed-of-light kernel is not eight\nindependent tricks bolted together. It is eight ideas chosen so each one pays for\nthe next one's cost.\n\n**The problem is simple to state.** A (M×K) times B (K×N) equals C (M×N). Every cell of C is a dot product over K. 
C is huge and K is long; that is the entire source of the difficulty.\n\n**The tensor core works in 8x8 bricks.** Everything else in the kernel is about feeding that brick engine: laying bricks out right (swizzle), keeping it fed (pipeline + warp specialization), and sharing its inputs (multicast, pairs).\n\n**CUDA cores load and store; the tensor core multiplies.** In this kernel that is six warps: warp 4 loads, warps 0-3 store, warp 5 drives the tensor core, all at once.\n\n**Clusters turn two blocks into one machine.** Multicast shares the load, the paired MMA shares the multiply, and barrier aliasing makes the handshake between them nearly free.\n\n**Ideas pay for each other.** The pipeline needs specialized warps; the cluster needs the deep pipeline. The wins are in the combinations, not the parts.\n\n**The Rust you know carries over.** `std::sync::atomic`, exhaustive `match`, `loop { ... break }`, the `Index` trait, type-state, RAII, and `const fn` builders that fold to one constant so the readable form is the fast form.\n\n```\n# Run the kernels: correctness tests + benchmarks\ncargo oxide run gemm_sol\n\n# Watch the device-code pipeline (MIR -> dialect-mir -> LLVM -> PTX)\ncargo oxide pipeline gemm_sol\n```\n\nThe kernel lives in `crates/rustc-codegen-cuda/examples/gemm_sol/src/main.rs` as\n`gemm_sol_clc_multicast_4_stage_pipeline`, a `pub unsafe fn` inside the\n`#[cuda_module] mod kernels` block. Generated `gemm_sol.ll` and `gemm_sol.ptx`\nland next to the example. Requires sm_100+ (Blackwell) to execute; on older GPUs\nonly PTX generation is verified.\n\n- `gemm_sol` example: `crates/rustc-codegen-cuda/examples/gemm_sol/`\n- The abstractions live in the `cuda-device` crate (`DisjointSlice`, `SharedArray`, `TmemGuard`, `ManagedBarrier`, `CuSimd`, the tensor-core builders) and `cuda-host` (`#[cuda_module]`, `LaunchConfig`). 
\n- Conceptual grounding: *Modern GPU Programming for MLSys* ([https://mlc.ai/modern-gpu-programming-for-mlsys/index.html](https://mlc.ai/modern-gpu-programming-for-mlsys/index.html)), especially the tensor-core, data-layout, and advanced-GEMM chapters.\n- The repo: [https://github.com/NVlabs/cuda-oxide](https://github.com/NVlabs/cuda-oxide)", "url": "https://wpnews.pro/news/cuda-oxide-a-speed-of-light-gemm-in-pure-rust-companion-notes-for-the-stream", "canonical_source": "https://gist.github.com/nihalpasham/7c6b8b2c9e19790c218416b6ba4d9767", "published_at": "2026-06-26 16:16:16+00:00", "updated_at": "2026-06-30 03:48:42.182260+00:00", "lang": "en", "topics": ["developer-tools", "machine-learning", "large-language-models"], "entities": ["NVIDIA", "Blackwell GPU", "cuda-oxide", "Rust", "gemm_sol_clc_multicast_4_stage_pipeline"], "alternates": {"html": "https://wpnews.pro/news/cuda-oxide-a-speed-of-light-gemm-in-pure-rust-companion-notes-for-the-stream", "markdown": "https://wpnews.pro/news/cuda-oxide-a-speed-of-light-gemm-in-pure-rust-companion-notes-for-the-stream.md", "text": "https://wpnews.pro/news/cuda-oxide-a-speed-of-light-gemm-in-pure-rust-companion-notes-for-the-stream.txt", "jsonld": "https://wpnews.pro/news/cuda-oxide-a-speed-of-light-gemm-in-pure-rust-companion-notes-for-the-stream.jsonld"}}