96% of cuBLAS, no `unsafe`: what cuTile Rust proves

NVIDIA researchers introduced cuTile Rust, a tile-based DSL that extends Rust's ownership and borrowing rules across the GPU launch boundary, enabling memory-safe GPU kernels without sacrificing performance. On an NVIDIA B200, cuTile Rust achieves roughly 96% of cuBLAS throughput on GEMM and 7 TB/s on memory-bound operations, while the companion Qwen3 inference engine Grout reaches 171 tokens/s for Qwen3-4B on an RTX 5090.

GPU programming usually asks Rust developers to surrender the borrow checker at the launch boundary: references collapse into raw pointers, and aliasing, synchronization, and stream lifetimes become hand-managed invariants. A new NVIDIA Labs paper argues that trade is unnecessary. cuTile Rust is a tile-based DSL that carries Rust's ownership and borrowing rules across the host-to-GPU launch boundary — not just through host code. Introduced in "Fearless Concurrency on the GPU" arXiv:2606.15991 , submitted by NVIDIA researchers Melih Elibol, Jared Roesch, Isaac Gelado, Eric Buehler, and Michael Garland , it lets you author the kernel itself in idiomatic, memory-safe Rust rather than wrapping hand-written unsafe CUDA. The mechanism is type construction, not a runtime lock. Before launch, mutable output tensors are partitioned into provably disjoint tiles; each tile program then receives an exclusive &mut view of its slice, while inputs arrive as shared & references . Because the partitions cannot overlap, the kernel is single-threaded in its semantics and data-race-free by construction, yet still compiles to massively parallel GPU code. As Melih Elibol put it, "each tile program gets an exclusive &mut view of its memory, plus the inputs as shared references" source: users.rust-lang.org https://users.rust-lang.org/t/fearless-concurrency-on-the-gpu-safe-gpu-kernels-in-rust/140790 . Explicit unchecked types remain available for local opt-out when you need lower-level control. The safety story would be academic if it cost throughput, but the reported numbers say otherwise. On an NVIDIA B200, cuTile Rust reaches 7 TB/s on memory-bound element-wise operations and 2 PFlop/s on GEMM — roughly 96% of cuBLAS, and within measurement noise of cuTile Python . End to end, the companion Qwen3 inference engine Grout reaches 171 generated tokens/s for Qwen3-4B on an RTX 5090 and 82 tokens/s for Qwen3-32B on a B200 in batch-1 decode . Those are the authors' own measurements on specific hardware — independent reproduction is not yet established — but they frame the central claim this article unpacks: safe Rust kernels without a measured performance penalty. Before any of that lands on your hardware, the crate sets a firm floor. cuTile Rust targets NVIDIA GPUs with compute capability sm 80 or higher — Ampere, Hopper, and Blackwell — which excludes Volta V100 and earlier . It builds on CUDA 13.3, Rust 1.89+, and Linux, tested on Ubuntu 24.04; Windows and macOS are unsupported, and no AMD/ROCm or Metal backend exists as of June 2026 . CUDA 13.x needs driver ≥580 for minor-version compatibility, and CUDA 13.3 GA corresponds to Linux driver ≥610.43.02 . | Requirement | Minimum | |---|---| | GPU compute capability | sm 80+ Ampere/Hopper/Blackwell | | CUDA toolkit | 13.3 | | Linux driver | ≥610.43.02 ≥580 for 13.x minor-compat | | Rust | 1.89+ | | OS | Linux Ubuntu 24.04 tested | | Tile IR toolchain | CMake 3.20+, C++17, Python 3.6+ | The Tile IR toolchain itself — cuda-tile-translate and tileiras , which compile MLIR-based Tile IR bytecode into cubins — expects CMake 3.20+, C++17, and Python 3.6+ . Confirm your driver and GPU first; everything below assumes the floor is met. Writing a cuTile Rust kernel means declaring a cutile::module block, annotating the function with cutile::entry , and bringing the prelude into scope with use cutile::prelude:: . The macro rewrites that function into a GPU kernel and auto-generates the host-side launcher that partitions tensors — you write no hand-rolled dispatch code . The canonical element-wise add reads like ordinary Rust: js cutile::module mod kernel { cutile::entry fn add<const B: i32 z: &mut Tensor<f32, { B } , // exclusive write x: &Tensor<f32, { -1 } , // shared read y: &Tensor<f32, { -1 } , // shared read { let tx = load tile like x, z ; let ty = load tile like y, z ; z.store tx + ty ; } } The signature is the contract. Mutable outputs are typed &mut Tensor<f32, { B } ; shared inputs are &Tensor<f32, { -1 } . The const-generic shape parameter encodes the tile size at the type level, so the borrow checker sees one exclusive writer and many immutable readers per tile . On the host the recipe is short: create your tensors, call .partition 128 on the mutable output before launch, then run kernel::add z, x, y .sync ? for blocking execution. The generated launcher holds the operands while GPU work is in flight, and ownership of the tensors returns to you only after .sync completes . Because the partitions are provably disjoint, each tile program is single-threaded in its semantics and data-race-free by construction. For inference pipelines, cuTile Rust exposes a lazy DeviceOp model. Use .sync for blocking dispatch, .into future via IntoFuture for async execution, and .graph / CudaGraph::scope for CUDA graph capture and replay . The intended pattern builds a reusable layer graph once, borrows temporary buffers mutably inside each recorded op, and releases them after sync. Stream-order capture plus Rust lifetimes make buffer reuse visible to the type system, so ordering is enforced without manual annotation. Kernels JIT-compile through CUDA Tile IR, an MLIR-based intermediate representation, before reaching the GPU . The safety idea is easy to feel out without a GPU. The illustrative Python below executed; not the production Rust path proves each tile's bounds once, then touches memory only through checked ranges — the same "prove disjointness, then trust the slice" shape cuTile Rust enforces at compile time: python from dataclasses import dataclass from random import Random @dataclass frozen=True class Tile: row: range col: range red: range def proved self, m: int, n: int, k: int - "Tile": assert 0 <= self.row.start <= self.row.stop <= m assert 0 <= self.col.start <= self.col.stop <= n assert 0 <= self.red.start <= self.red.stop <= k return self def tiled matmul a, b, block=8 : m, k, n = len a , len a 0 , len b 0 c = 0.0 n for in range m proofs = 0 for i in range 0, m, block : for j in range 0, n, block : for p in range 0, k, block : t = Tile range i, min i + block, m , range j, min j + block, n , range p, min p + block, k .proved m, n, k proofs += 1 for r in t.row: for q in t.red: arq = a r q for s in t.col: c r s += arq b q s return c, proofs def plain matmul a, b : return sum x y for x, y in zip row, col for col in zip b for row in a rng = Random 0 size = 24 a = rng.random for in range size for in range size b = rng.random for in range size for in range size got, proofs = tiled matmul a, b want = plain matmul a, b err = max abs got i j - want i j for i in range size for j in range size print "cuTile idea in Python: prove tile bounds once, then use only checked ranges." print f"tiles proved: {proofs}; unsafe operations: 0" print f"max error vs reference: {err:.2e}" print "The 96%-of-cuBLAS claim is about Rust/CUDA performance; this shows the safety proof shape." cuTile Rust is NVIDIA/CUDA-only today, and that constraint runs deep. There is no AMD/ROCm path, no Metal backend, and no portable WebGPU fallback — every kernel JIT-compiles through CUDA Tile IR into cubins . The compute-capability floor is hard: sm 80 Ampere or newer, paired with CUDA 13.3, Rust 1.89+, and Linux . Any pre-Ampere card is excluded outright. The surface API is explicitly early-stage. The Tensor<f32, { B } const-generic shape syntax and the cutile::module / cutile::entry macro forms can change between releases . Pin your dependency in Cargo.lock before this lands in CI; treat API churn as expected, not exceptional. Be precise about the headline numbers. The 96%-of-cuBLAS GEMM result and 171 tokens/s batch-1 decode for Qwen3-4B on an RTX 5090 are the authors' own measurements on specific hardware, including a B200 . An independent evaluation of the CUDA Tile Python stack reported 52–79% of cuBLAS for GEMM and only 53% of FlashAttention-2 throughput on RTX PRO 6000 Blackwell Server Edition — results that vary by workload and architecture . Multi-batch throughput, prefill latency, and model coverage beyond Qwen3 remain uncharacterized. Validate on your target GPU, batch distribution, and context length before you swap out a mature inference stack. If you want to see cuTile Rust in a real decode path rather than a microbenchmark, read Grout. Grout https://github.com/huggingface/grout is a cuTile-Rust Qwen3 inference engine co-authored by Eric Buehler, who also maintains mistral.rs, and it serves as the canonical production call-site pattern. Study how it structures lazy DeviceOp graphs, borrows temporary buffers mutably inside CudaGraph::scope capture, and recovers ownership only after .sync — that ordering is the intended idiom for inference pipelines, where stream-order capture plus Rust lifetimes make buffer reuse visible to the type system. This is the contrast that matters. Candle, Burn, and mistral.rs largely FFI into or wrap hand-written, often unsafe kernels; cuTile Rust offers a path to author the kernels themselves in safe Rust with no measured penalty. As lead author Melih Elibol frames the guarantee, "each tile program gets an exclusive &mut view of its memory, plus the inputs as shared references" . Concrete next step: clone Grout, run the Qwen3-4B decode path — the authors report 171 generated tokens/s in batch-1 decode on an RTX 5090 — on an A100 or RTX 4090, and compare tok/s against a vllm =0.8.4 https://qwenlm.github.io/blog/qwen3/ baseline . The size of that gap — or its absence — is the real signal, not the headline. You need an NVIDIA GPU with compute capability sm 80 Ampere or higher, plus CUDA 13.3, Rust 1.89+, and Linux tested on Ubuntu 24.04 . That floor covers the RTX 3000/4000/5000 series, A100, H100, and B200, but excludes Volta V100 and Turing RTX 2000 . On the driver side, CUDA 13.3 GA corresponds to a Linux driver of at least 610.43.02 . It moves the guarantee to compile time. Mutable output tensors are partitioned on the host into provably non-overlapping tiles before dispatch, and each tile program receives an exclusive &mut view of its slice while inputs arrive as shared & references . Because the partitions cannot alias, Rust's borrow checker — which permits one mutable reference or many immutable ones — rules out conflicting writes statically . No runtime synchronization primitive is inserted; the kernel is single-threaded in its semantics yet compiles to massively parallel GPU code. Not yet. The authors describe it as early-stage, so the API surface — including the Tensor<f32, { B } const-generic shape syntax and the macro forms — may change . It is CUDA/Linux-only sm 80+, CUDA 13.3 , and multi-batch throughput, prefill, and broader model coverage beyond Qwen3 are uncharacterized. Grout is a useful reference call site, but validate your target GPU, driver, model, batch size, and graph-capture behavior before replacing a mature stack like vLLM or SGLang. No. cuTile Rust JIT-compiles through CUDA Tile IR, which targets NVIDIA hardware sm 80+ only, and as of June 2026 there is no ROCm, Metal, or WebGPU backend . The portable Rust-on-GPU ecosystem — Rust GPU and wgpu — does reach AMD and Apple Silicon, but it takes a different, non-CUDA approach and does not carry cuTile's ownership-across-launch model. The authors report 171 generated tokens/s for Qwen3-4B batch-1 decode on an RTX 5090 and 82 tokens/s for Qwen3-32B on a B200, characterizing both as competitive with vLLM and SGLang and near the HBM roofline for memory-bound decoding . Treat that as the authors' own measurement — independent reproduction has not been published. For your own baseline, Qwen recommends vllm =0.8.4 or sglang =0.4.6.post1 .