cuda-oxide: a speed-of-light GEMM in pure Rust (companion notes for the stream)

This walkthrough reads one GPU kernel: gemm_sol_clc_multicast_4_stage_pipeline, a matrix multiply written in pure Rust that reaches 58% of the performance of NVIDIA's hand-tuned library on a Blackwell GPU. It is a few hundred lines long, and packed into it are about eight distinct ideas, each one solving a specific bottleneck.

The plan: understand the problem, meet the one piece of silicon that does the actual math (the tensor core), look at the kernel from the top, then walk it part by part. Every part teaches one idea: what it is, the bottleneck it removes, how it shows up in this exact kernel, and the Rust that expresses it. The method is the same throughout: find what is stalling, fix exactly that, repeat.
- The Problem: Multiplying Two Big Matrices
- A Kernel Is Just a Rust Function
- The Hardware: Threads, Warps, Blocks, SMs, Clusters
- The Engine: Tensor Cores and the 8x8 Brick
- The Kernel in One Picture
- Walking the Kernel, One Idea at a Time
- The Epilogue: Getting the Answer Out
- The Rust Toolbox
- The Payoff
- Key Takeaways
- Reproducing
- Source Material

The Problem: Multiplying Two Big Matrices

Everything in this kernel is one operation: multiply two matrices. Take two 4096-by-4096 grids of numbers, A and B, and produce a third, C = A times B.

           K = 4096            N = 4096              N = 4096
         ┌───────────┐     ┌───────────────┐     ┌───────────────┐
    M    │     A     │ × K │       B       │ = M │       C       │
    4096 │   M × K   │     │     K × N     │ 4096│     M × N     │
         └───────────┘     └───────────────┘     └───────────────┘

A is M rows by K columns. B is K rows by N columns. C is M rows by N columns. For us, M = K = N = 4096.

M, N, and K are the three sizes. K is the shared one: it lines A's columns up with B's rows, and it is the dimension that disappears in the product.

How is one cell of C computed? A single number $C_{i,j}$ is the dot product of row i of A with column j of B: multiply element by element, then add it all up.

$$C_{i,j} = (\text{row } i \text{ of } A) \cdot (\text{column } j \text{ of } B) = \sum_{k=0}^{K-1} A_{i,k} B_{k,j} = A_{i,0} B_{0,j} + A_{i,1} B_{1,j} + \dots + A_{i,4095} B_{4095,j}$$

The sum runs over all K = 4096 terms.

The scale, so you feel why this is a GPU job. C has M × N = about 16.7 million cells. Each cell is a sum of K = 4096 multiply-adds. That is roughly 68.7 billion multiply-adds for one product (4096 × 4096 × 4096), or about 137 billion floating-point operations if each multiply and each add is counted separately, and real workloads do thousands of these per second.

Two facts from this picture drive every decision in the kernel:

- C is enormous (16.7M cells), so the work must be split up. That is grid tiling.
- Each cell sums over all of K (4096 terms), and K is far too long to hold in fast memory at once, so it must be walked. That is the K-loop.

A Kernel Is Just a Rust Function

Groundwork before the big kernel: what GPU code looks like in cuda-oxide. Here is the simplest possible kernel, adding two vectors.

kernel pub fn vecadd a: & f32 , b: & f32 , mut c: DisjointSlice