96% of cuBLAS, no `unsafe`: what cuTile Rust proves

NVIDIA researchers introduced cuTile Rust, a tile-based DSL that extends Rust's ownership and borrowing rules across the GPU launch boundary, enabling memory-safe GPU kernels without sacrificing performance. On an NVIDIA B200, cuTile Rust achieves roughly 96% of cuBLAS throughput on GEMM and 7 TB/s on memory-bound operations, while the companion Qwen3 inference engine Grout reaches 171 tokens/s for Qwen3-4B on an RTX 5090.

GPU programming usually asks Rust developers to surrender the borrow checker at the launch boundary: references collapse into raw pointers, and aliasing, synchronization, and stream lifetimes become hand-managed invariants. A new NVIDIA Labs paper argues that trade is unnecessary.

cuTile Rust is a tile-based DSL that carries Rust's ownership and borrowing rules across the host-to-GPU launch boundary, not just through host code. Introduced in "Fearless Concurrency on the GPU" (arXiv:2606.15991), submitted by NVIDIA researchers Melih Elibol, Jared Roesch, Isaac Gelado, Eric Buehler, and Michael Garland, it lets you author the kernel itself in idiomatic, memory-safe Rust rather than wrapping hand-written unsafe CUDA.

The mechanism is type construction, not a runtime lock. Before launch, mutable output tensors are partitioned into provably disjoint tiles; each tile program then receives an exclusive `&mut` view of its slice, while inputs arrive as shared `&` references. Because the partitions cannot overlap, the kernel is single-threaded in its semantics and data-race-free by construction, yet still compiles to massively parallel GPU code. As Melih Elibol put it, "each tile program gets an exclusive &mut view of its memory, plus the inputs as shared references" (source: users.rust-lang.org, https://users.rust-lang.org/t/fearless-concurrency-on-the-gpu-safe-gpu-kernels-in-rust/140790). Explicit unchecked types remain available for local opt-out when you need lower-level control.

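
The disjoint-partition idea can be illustrated on the CPU with nothing but the Rust standard library. The sketch below is an analogy, not cuTile Rust code: it uses no `cutile` API, and the function name `tiled_add` and its `tile` parameter are hypothetical names chosen for illustration. It splits an output buffer into non-overlapping `&mut` chunks with `chunks_mut`, hands each chunk to its own scoped thread, and passes the inputs as shared slices, so the borrow checker rules out overlapping writes at compile time.

```rust
use std::thread;

/// CPU analogy of tile partitioning (hypothetical helper, not a cuTile API).
/// Each worker gets an exclusive `&mut` tile of the output and shared
/// read-only tiles of the inputs.
fn tiled_add(x: &[f32], y: &[f32], tile: usize) -> Vec<f32> {
    assert_eq!(x.len(), y.len(), "inputs must have equal length");
    assert!(tile > 0, "tile length must be non-zero");

    let mut z = vec![0.0_f32; x.len()];

    // Scoped threads may borrow local data; all workers are joined
    // before `scope` returns.
    thread::scope(|s| {
        // `chunks_mut` yields non-overlapping `&mut [f32]` slices of `z`.
        for (i, z_tile) in z.chunks_mut(tile).enumerate() {
            let start = i * tile;
            let end = start + z_tile.len();
            // Inputs are shared, read-only views of the same index range.
            let x_tile = &x[start..end];
            let y_tile = &y[start..end];
            s.spawn(move || {
                for ((out, a), b) in z_tile.iter_mut().zip(x_tile).zip(y_tile) {
                    *out = a + b;
                }
            });
        }
    });

    z
}
```

Calling `tiled_add(&x, &y, 2)` on two equal-length slices returns their element-wise sum. Trying to hand the same output chunk to two workers would fail to compile, which is the property the paper describes carrying across the GPU launch boundary. One thread per tile is only for clarity here; it is not how a GPU schedules tile programs.
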
The safety story would be academic if it cost throughput, but the reported numbers say otherwise. On an NVIDIA B200, cuTile Rust reaches 7 TB/s on memory-bound element-wise operations and 2 PFlop/s on GEMM: roughly 96% of cuBLAS, and within measurement noise of cuTile Python. End to end, the companion Qwen3 inference engine Grout reaches 171 generated tokens/s for Qwen3-4B on an RTX 5090 and 82 tokens/s for Qwen3-32B on a B200 in batch-1 decode. Those are the authors' own measurements on specific hardware, and independent reproduction is not yet established, but they frame the central claim this article unpacks: safe Rust kernels without a measured performance penalty.

Before any of that lands on your hardware, the crate sets a firm floor. cuTile Rust targets NVIDIA GPUs with compute capability `sm_80` or higher (Ampere, Hopper, and Blackwell), which excludes Volta V100 and earlier. It builds on CUDA 13.3, Rust 1.89+, and Linux, tested on Ubuntu 24.04; Windows and macOS are unsupported, and no AMD/ROCm or Metal backend exists as of June 2026. CUDA 13.x needs driver ≥580 for minor-version compatibility, and CUDA 13.3 GA corresponds to Linux driver ≥610.43.02.

| Requirement | Minimum |
|---|---|
| GPU compute capability | `sm_80`+ (Ampere/Hopper/Blackwell) |
| CUDA toolkit | 13.3 |
| Linux driver | ≥610.43.02 (≥580 for 13.x minor-version compatibility) |
| Rust | 1.89+ |
| OS | Linux (Ubuntu 24.04 tested) |
| Tile IR toolchain | CMake 3.20+, C++17, Python 3.6+ |

The Tile IR toolchain itself (`cuda-tile-translate` and `tileiras`, which compile MLIR-based Tile IR bytecode into cubins) expects CMake 3.20+, C++17, and Python 3.6+. Confirm your driver and GPU first; everything below assumes the floor is met.

Writing a cuTile Rust kernel means declaring a `cutile::module` block, annotating the function with `cutile::entry`, and bringing the prelude into scope with `use cutile::prelude::*;`.
The macro rewrites that function into a GPU kernel and auto-generates the host-side launcher that partitions tensors, so you write no hand-rolled dispatch code. The canonical element-wise add reads like ordinary Rust:
js cutile::module mod kernel { cutile::entry fn add