{"slug": "96-of-cublas-no-unsafe-what-cutile-rust-proves", "title": "96% of cuBLAS, no `unsafe`: what cuTile Rust proves", "summary": "NVIDIA researchers introduced cuTile Rust, a tile-based DSL that extends Rust's ownership and borrowing rules across the GPU launch boundary, enabling memory-safe GPU kernels without sacrificing performance. On an NVIDIA B200, cuTile Rust achieves roughly 96% of cuBLAS throughput on GEMM and 7 TB/s on memory-bound operations, while the companion Qwen3 inference engine Grout reaches 171 tokens/s for Qwen3-4B on an RTX 5090.", "body_md": "GPU programming usually asks Rust developers to surrender the borrow checker at the launch boundary: references collapse into raw pointers, and aliasing, synchronization, and stream lifetimes become hand-managed invariants. A new NVIDIA Labs paper argues that trade is unnecessary.\n\ncuTile Rust is a tile-based DSL that carries Rust's ownership and borrowing rules across the host-to-GPU launch boundary, not just through host code. Introduced in \"Fearless Concurrency on the GPU\" (arXiv:2606.15991), submitted by NVIDIA researchers Melih Elibol, Jared Roesch, Isaac Gelado, Eric Buehler, and Michael Garland, it lets you author the kernel itself in idiomatic, memory-safe Rust rather than wrapping hand-written unsafe CUDA.\n\nThe mechanism is type construction, not a runtime lock. Before launch, mutable output tensors are partitioned into provably disjoint tiles; each tile program then receives an exclusive `&mut` view of its slice, while inputs arrive as shared `&` references. Because the partitions cannot overlap, the kernel is single-threaded in its semantics and data-race-free by construction, yet still compiles to massively parallel GPU code. 
As Melih Elibol put it, \"each tile program gets an exclusive &mut view of its memory, plus the inputs as shared references\" (source: [users.rust-lang.org](https://users.rust-lang.org/t/fearless-concurrency-on-the-gpu-safe-gpu-kernels-in-rust/140790)). Explicit unchecked types remain available for local opt-out when you need lower-level control.\n\nThe safety story would be academic if it cost throughput, but the reported numbers say otherwise. On an NVIDIA B200, cuTile Rust reaches 7 TB/s on memory-bound element-wise operations and 2 PFlop/s on GEMM, roughly 96% of cuBLAS and within measurement noise of cuTile Python. End to end, the companion Qwen3 inference engine Grout reaches 171 generated tokens/s for Qwen3-4B on an RTX 5090 and 82 tokens/s for Qwen3-32B on a B200 in batch-1 decode. Those are the authors' own measurements on specific hardware, and independent reproduction is not yet established, but they frame the central claim this article unpacks: safe Rust kernels without a measured performance penalty.\n\nBefore any of that lands on your hardware, the crate sets a firm floor. cuTile Rust targets NVIDIA GPUs with compute capability sm_80 or higher (Ampere, Hopper, and Blackwell), which excludes Turing, Volta (V100), and earlier architectures. It builds on CUDA 13.3, Rust 1.89+, and Linux, tested on Ubuntu 24.04; Windows and macOS are unsupported, and no AMD/ROCm or Metal backend exists as of June 2026. 
CUDA 13.x needs driver ≥580 for minor-version compatibility, and CUDA 13.3 GA corresponds to Linux driver ≥610.43.02.\n\n| Requirement | Minimum |\n|---|---|\n| GPU compute capability | sm_80+ (Ampere/Hopper/Blackwell) |\n| CUDA toolkit | 13.3 |\n| Linux driver | ≥610.43.02 (≥580 for 13.x minor-compat) |\n| Rust | 1.89+ |\n| OS | Linux (Ubuntu 24.04 tested) |\n| Tile IR toolchain | CMake 3.20+, C++17, Python 3.6+ |\n\nThe Tile IR toolchain itself (`cuda-tile-translate` and `tileiras`, which compile MLIR-based Tile IR bytecode into cubins) expects CMake 3.20+, C++17, and Python 3.6+. Confirm your driver and GPU first; everything below assumes the floor is met.\n\nWriting a cuTile Rust kernel means declaring a `#[cutile::module]` block, annotating the function with `#[cutile::entry()]`, and bringing the prelude into scope with `use cutile::prelude::*`. The macro rewrites that function into a GPU kernel and auto-generates the host-side launcher that partitions tensors, so you write no hand-rolled dispatch code. The canonical element-wise add reads like ordinary Rust:\n\n``` rust\n#[cutile::module]\nmod kernel {\n  #[cutile::entry()]\n  fn add<const B: i32>(\n    z: &mut Tensor<f32, {[B]}>,  // exclusive write\n    x: &Tensor<f32, {[-1]}>,     // shared read\n    y: &Tensor<f32, {[-1]}>,     // shared read\n  ) {\n    let tx = load_tile_like(x, z);\n    let ty = load_tile_like(y, z);\n    z.store(tx + ty);\n  }\n}\n```\n\nThe signature is the contract. Mutable outputs are typed `&mut Tensor<f32, {[B]}>`; shared inputs are `&Tensor<f32, {[-1]}>`. The const-generic shape parameter encodes the tile size at the type level, so the borrow checker sees one exclusive writer and many immutable readers per tile.\n\nOn the host the recipe is short: create your tensors, call `.partition([128])` on the mutable output before launch, then run `kernel::add(z, x, y).sync()?` for blocking execution. 
The generated launcher holds the operands while GPU work is in flight, and ownership of the tensors returns to you only after `.sync()` completes. Because the partitions are provably disjoint, each tile program is single-threaded in its semantics and data-race-free by construction.\n\nFor inference pipelines, cuTile Rust exposes a lazy `DeviceOp` model. Use `.sync()` for blocking dispatch, `.into_future()` (via `IntoFuture`) for async execution, and `.graph()` / `CudaGraph::scope` for CUDA graph capture and replay. The intended pattern builds a reusable layer graph once, borrows temporary buffers mutably inside each recorded op, and releases them after sync. Stream-order capture plus Rust lifetimes make buffer reuse visible to the type system, so ordering is enforced without manual annotation. Kernels JIT-compile through CUDA Tile IR, an MLIR-based intermediate representation, before reaching the GPU.\n\nThe safety idea is easy to feel out without a GPU. The illustrative Python below (executed; not the production Rust path) proves each tile's bounds once, then touches memory only through checked ranges, the same \"prove disjointness, then trust the slice\" shape cuTile Rust enforces at compile time:\n\n``` python\nfrom dataclasses import dataclass\nfrom random import Random\n\n@dataclass(frozen=True)\nclass Tile:\n    row: range\n    col: range\n    red: range\n\n    def proved(self, m: int, n: int, k: int) -> \"Tile\":\n        assert 0 <= self.row.start <= self.row.stop <= m\n        assert 0 <= self.col.start <= self.col.stop <= n\n        assert 0 <= self.red.start <= self.red.stop <= k\n        return self\n\ndef tiled_matmul(a, b, block=8):\n    m, k, n = len(a), len(a[0]), len(b[0])\n    c = [[0.0] * n for _ in range(m)]\n    proofs = 0\n    for i in range(0, m, block):\n        for j in range(0, n, block):\n            for p in range(0, k, block):\n                t = Tile(range(i, min(i + block, m)),\n                         
range(j, min(j + block, n)),\n                         range(p, min(p + block, k))).proved(m, n, k)\n                proofs += 1\n                for r in t.row:\n                    for q in t.red:\n                        arq = a[r][q]\n                        for s in t.col:\n                            c[r][s] += arq * b[q][s]\n    return c, proofs\n\ndef plain_matmul(a, b):\n    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]\n\nrng = Random(0)\nsize = 24\na = [[rng.random() for _ in range(size)] for _ in range(size)]\nb = [[rng.random() for _ in range(size)] for _ in range(size)]\ngot, proofs = tiled_matmul(a, b)\nwant = plain_matmul(a, b)\nerr = max(abs(got[i][j] - want[i][j]) for i in range(size) for j in range(size))\n\nprint(\"cuTile idea in Python: prove tile bounds once, then use only checked ranges.\")\nprint(f\"tiles proved: {proofs}; unsafe operations: 0\")\nprint(f\"max error vs reference: {err:.2e}\")\nprint(\"The 96%-of-cuBLAS claim is about Rust/CUDA performance; this shows the safety proof shape.\")\n```\n\ncuTile Rust is NVIDIA/CUDA-only today, and that constraint runs deep. There is no AMD/ROCm path, no Metal backend, and no portable WebGPU fallback: every kernel JIT-compiles through CUDA Tile IR into cubins. The compute-capability floor is hard: `sm_80` (Ampere) or newer, paired with CUDA 13.3, Rust 1.89+, and Linux. Any pre-Ampere card is excluded outright.\n\nThe surface API is explicitly early-stage. The `Tensor<f32, {[B]}>` const-generic shape syntax and the `#[cutile::module]` / `#[cutile::entry()]` macro forms can change between releases. Pin an exact version in `Cargo.toml` and commit `Cargo.lock` before this lands in CI; treat API churn as expected, not exceptional.\n\nBe precise about the headline numbers. The 96%-of-cuBLAS GEMM result (measured on a B200) and the 171 tokens/s batch-1 decode for Qwen3-4B (measured on an RTX 5090) are the authors' own measurements on specific hardware. 
An independent evaluation of the CUDA Tile *Python* stack reported 52–79% of cuBLAS for GEMM and only 53% of FlashAttention-2 throughput on RTX PRO 6000 Blackwell Server Edition, so results vary by workload and architecture. Multi-batch throughput, prefill latency, and model coverage beyond Qwen3 remain uncharacterized. Validate on your target GPU, batch distribution, and context length before you swap out a mature inference stack.\n\nIf you want to see cuTile Rust in a real decode path rather than a microbenchmark, read Grout. [Grout](https://github.com/huggingface/grout) is a cuTile-Rust Qwen3 inference engine co-authored by Eric Buehler, who also maintains mistral.rs, and it serves as the canonical production call-site pattern. Study how it structures lazy `DeviceOp` graphs, borrows temporary buffers mutably inside `CudaGraph::scope` capture, and recovers ownership only after `.sync()`. That ordering is the intended idiom for inference pipelines, where stream-order capture plus Rust lifetimes make buffer reuse visible to the type system.\n\nThis is the contrast that matters. Candle, Burn, and mistral.rs largely FFI into or wrap hand-written, often `unsafe` kernels; cuTile Rust offers a path to author the kernels themselves in safe Rust with no measured penalty. As lead author Melih Elibol frames the guarantee, \"each tile program gets an exclusive &mut view of its memory, plus the inputs as shared references\".\n\nConcrete next step: clone Grout, run the Qwen3-4B decode path on an A100 or RTX 4090, and compare tok/s against a [vllm>=0.8.4](https://qwenlm.github.io/blog/qwen3/) baseline. For reference, the authors report 171 generated tokens/s in batch-1 decode on an RTX 5090. The size of that gap, or its absence, is the real signal, not the headline.\n\nYou need an NVIDIA GPU with compute capability sm_80 (Ampere) or higher, plus CUDA 13.3, Rust 1.89+, and Linux (tested on Ubuntu 24.04). 
That floor covers the RTX 30, 40, and 50 series, A100, H100, and B200, but excludes Volta (V100) and Turing (RTX 20 series). On the driver side, CUDA 13.3 GA corresponds to a Linux driver of at least 610.43.02.\n\nIt moves the guarantee to compile time. Mutable output tensors are partitioned on the host into provably non-overlapping tiles before dispatch, and each tile program receives an exclusive `&mut` view of its slice while inputs arrive as shared `&` references. Because the partitions cannot alias, Rust's borrow checker, which permits one mutable reference or many immutable ones, rules out conflicting writes statically. No runtime synchronization primitive is inserted; the kernel is single-threaded in its semantics yet compiles to massively parallel GPU code.\n\nNot yet. The authors describe it as early-stage, so the API surface, including the `Tensor<f32, {[B]}>` const-generic shape syntax and the macro forms, may change. It is CUDA/Linux-only (sm_80+, CUDA 13.3), and multi-batch throughput, prefill, and broader model coverage beyond Qwen3 are uncharacterized. Grout is a useful reference call site, but validate your target GPU, driver, model, batch size, and graph-capture behavior before replacing a mature stack like vLLM or SGLang.\n\nNo. cuTile Rust JIT-compiles through CUDA Tile IR, which targets NVIDIA hardware (sm_80+) only, and as of June 2026 there is no ROCm, Metal, or WebGPU backend. The portable Rust-on-GPU ecosystem (Rust GPU and `wgpu`) does reach AMD and Apple Silicon, but it takes a different, non-CUDA approach and does not carry cuTile's ownership-across-launch model.\n\nThe authors report 171 generated tokens/s for Qwen3-4B batch-1 decode on an RTX 5090 and 82 tokens/s for Qwen3-32B on a B200, characterizing both as competitive with vLLM and SGLang and near the HBM roofline for memory-bound decoding. Treat that as the authors' own measurement; independent reproduction has not been published. 
For your own baseline, Qwen recommends `vllm>=0.8.4` or `sglang>=0.4.6.post1`.", "url": "https://wpnews.pro/news/96-of-cublas-no-unsafe-what-cutile-rust-proves", "canonical_source": "https://dev.to/creeta/96-of-cublas-no-unsafe-what-cutile-rust-proves-4ldp", "published_at": "2026-06-26 21:46:04+00:00", "updated_at": "2026-06-26 22:06:08.469560+00:00", "lang": "en", "topics": ["machine-learning", "large-language-models", "ai-research", "developer-tools", "ai-infrastructure"], "entities": ["NVIDIA Labs", "Melih Elibol", "Jared Roesch", "Isaac Gelado", "Eric Buehler", "Michael Garland", "cuTile Rust", "Grout"], "alternates": {"html": "https://wpnews.pro/news/96-of-cublas-no-unsafe-what-cutile-rust-proves", "markdown": "https://wpnews.pro/news/96-of-cublas-no-unsafe-what-cutile-rust-proves.md", "text": "https://wpnews.pro/news/96-of-cublas-no-unsafe-what-cutile-rust-proves.txt", "jsonld": "https://wpnews.pro/news/96-of-cublas-no-unsafe-what-cutile-rust-proves.jsonld"}}