{"slug": "show-hn-cutile-rust-safe-data-race-free-gpu-kernels-in-rust", "title": "Show HN: cuTile Rust: Safe, data-race-free GPU kernels in Rust", "summary": "NVIDIA Research released cuTile Rust, a tile-based system for writing memory-safe, data-race-free GPU kernels in Rust. The project extends Rust's ownership model to GPU programming, achieving up to 92% of peak performance on NVIDIA B200 GPUs. cuTile Rust is available as an open-source research project and includes Grout, a Qwen3 inference engine built in collaboration with Hugging Face.", "body_md": "cuTile Rust (`cutile-rs`) is a tile-based system for writing memory-safe, data-race-free GPU kernels in idiomatic Rust. It extends Rust's ownership discipline across the GPU launch boundary: mutable tensors are partitioned into disjoint pieces before launch, immutable tensors are shared, and generated launchers preserve ownership while GPU work is in flight. The same model supports synchronous launches, asynchronous pipelines, and CUDA graph replay. The `#[cutile::module]` macro embeds a captured Rust AST for each kernel in the host binary; when a kernel is needed, cuTile Rust JIT-compiles that AST through CUDA Tile IR into a GPU cubin. Local opt-outs remain available when lower-level control is needed.\n\nWe are excited to release this research project as a demonstration of how GPU programming can be made available in the Rust ecosystem. The software is in an early stage and under active development: you should expect bugs, incomplete features, and API breakage as we work to improve it. 
That being said, we hope you'll be interested to try it in your work and help shape its direction by providing feedback on your experience.\n\nPlease check out [CONTRIBUTING.md](/NVlabs/cutile-rs/blob/main/CONTRIBUTING.md) if you're interested in contributing.\n\n```\nuse cutile::prelude::*;\n\n#[cutile::module]\nmod kernel {\n    use cutile::core::*;\n\n    #[cutile::entry()]\n    fn add<const B: i32>(\n        z: &mut Tensor<f32, { [B] }>,\n        x: &Tensor<f32, { [-1] }>,\n        y: &Tensor<f32, { [-1] }>,\n    ) {\n        let tx = load_tile_like(x, z);\n        let ty = load_tile_like(y, z);\n        z.store(tx + ty);\n    }\n}\n\nfn main() -> Result<(), Error> {\n    let x = api::ones::<f32>(&[1024]);\n    let y = api::ones::<f32>(&[1024]);\n    let z = api::zeros::<f32>(&[1024]).partition([128]);\n\n    let (_z, _x, _y) = kernel::add(z, x, y).sync()?;\n    Ok(())\n}\n```\n\nThe `#[cutile::module]` macro transforms `add` into a GPU kernel and generates a host-side launcher. The host code constructs lazy tensor operations, partitions the mutable output into 128-element chunks, and calls `.sync()` to JIT-compile and execute the kernel.\n\nThe kernel signature carries the access discipline into device code: `z` is the exclusive mutable output, while `x` and `y` are shared read-only inputs. The body loads input tiles matching the output partition, adds them, and stores the result. The launch grid `(8, 1, 1)` is inferred from the partition: 1024÷128 = 8 tiles.\n\n- Run a similar example via `cargo run -p cutile-examples --example saxpy`.\n- More kernels and usage examples of the host-side API can be found [here](/NVlabs/cutile-rs/blob/main/cutile-examples/examples).\n\nThe cuTile Rust paper, *Fearless Concurrency on the GPU*, is available [here](https://arxiv.org/abs/2606.15991). 
On NVIDIA B200, cuTile Rust reaches 7 TB/s for element-wise operations and 2 PFlop/s for GEMM, about 91% of peak memory bandwidth and 92% of dense `f16` peak, respectively. The GEMM result is competitive with cuBLAS, and the B200 safety-overhead microbenchmarks show that cuTile Rust adds safety without measurable runtime overhead: safe Rust persistent GEMM reaches 2.07 PFlop/s at `M=N=K=8192` (92% of the B200 dense `f16` peak), within 0.3% of the corresponding low-level Tile IR variant.\n\nThe paper also evaluates Grout, a Qwen3 inference engine built with cuTile Rust in collaboration with Hugging Face. In batch-1 Qwen3 decode, Grout reaches 171 tokens/s for Qwen3-4B on NVIDIA GeForce RTX 5090 and 82 tokens/s for Qwen3-32B on B200, performance that is competitive with the state of the art on memory-bound inference tasks as measured by our HBM roofline analysis.\n\nReproducibility artifacts for the paper evaluation are available [here](/NVlabs/cutile-rs/blob/main/cutile-benchmarks/paper). 
The paper-facing measurements were run against cuTile Rust 0.2.0, and the version of Grout used for the paper is available [here](https://github.com/huggingface/grout).\n\nIf you use cuTile Rust in research, please cite the paper:\n\n```\n@misc{elibol2026fearlessconcurrencygpu,\n  title = {Fearless Concurrency on the GPU},\n  author = {Elibol, Melih and Roesch, Jared and Gelado, Isaac and Buehler, Eric and Garland, Michael},\n  year = {2026},\n  eprint = {2606.15991},\n  archivePrefix = {arXiv},\n  primaryClass = {cs.PL},\n  url = {https://arxiv.org/abs/2606.15991}\n}\n```\n\n- [Grout](https://github.com/huggingface/grout): Qwen 3 inference engine in Rust by Hugging Face, built with cuTile Rust and useful as a reference for production kernel call sites.\n- [cuTile Python](https://github.com/nvidia/cutile-python): Python kernel programming with CUDA Tile.\n- [TileGym](https://github.com/NVIDIA/TileGym): CUDA Tile kernel examples and tuning patterns.\n- [cuda-oxide](https://github.com/NVlabs/cuda-oxide): NVlabs experimental Rust-to-CUDA compiler for writing SIMT-style GPU kernels in Rust.\n- [CUDA Tile IR documentation](https://docs.nvidia.com/cuda/tile-ir/latest/index.html): CUDA Tile IR reference documentation.\n- [CUDA documentation](https://docs.nvidia.com/cuda/): CUDA toolkit documentation.\n- [Rust NVPTX backend](https://doc.rust-lang.org/rustc/platform-support/nvptx64-nvidia-cuda.html): rustc's target support for generating PTX for NVIDIA GPUs.\n\ncuTile Rust targets tile-based kernels that lower through CUDA Tile IR, with APIs built around tensor partitions and tensor-core-oriented operations.\n\n- **NVIDIA GPU** with compute capability `sm_80` or higher (minimum supported architecture: `sm_80`).\n  - `sm_100+` is supported by CUDA 13.1+.\n  - `sm_8x` support was added in CUDA 13.2.\n  - CUDA 13.3 adds `sm_90` support, so CUDA 13.3 users now have `sm_80+` coverage.\n- **CUDA** 13.3 recommended (`sm_80+` support and CUDA Tile IR 13.3 features such as FP4 packing and block-scaled 
MMA).\n- **Rust** 1.89+\n- **Linux** (tested on Ubuntu 24.04)\n\nTo install Rust:\n\n```\ncurl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh\nrustup default stable\n```\n\nInstall CUDA 13.3 for your OS by following the official instructions:\n[https://developer.nvidia.com/cuda-downloads](https://developer.nvidia.com/cuda-downloads)\n\nSet `CUDA_TOOLKIT_PATH` to your CUDA 13.3 install directory.\n\nExample `.cargo/config.toml`:\n\n```\n[env]\nCUDA_TOOLKIT_PATH = { value = \"/usr/local/cuda-13\", relative = false }\n```\n\nRun the hello world example:\n\n```\ncargo run -p cutile-examples --example hello_world\n```\n\nIf everything works, you should see: `Hello, I am tile <0, 0, 0> in a kernel with <1, 1, 1> tiles.`\n\nWe provide a Nix flake for easy setup and development. Flakes must be enabled in your Nix configuration; if they are not already, add the following to `~/.config/nix/nix.conf`:\n\n```\nexperimental-features = nix-command flakes\n```\n\nRun a command directly:\n\n```\nnix develop -c cargo run -p cutile-examples --example saxpy\n```\n\nOr open an interactive shell:\n\n```\nnix develop\n# cutile-rs dev shell\n#  ✓ CUDA  /nix/store/...-cuda-toolkit-13.3\n#  ✓ Rust  1.90.0-nightly\n```\n\nThe flake automatically locates host NVIDIA driver libraries on both NixOS and non-NixOS systems.\n\n- cuTile IR: `cargo test --package cutile-ir`\n- cuTile Rust Compiler: `cargo test --package cutile-compiler`\n- cuTile Rust Library: `cargo test --package cutile`\n- Examples: run an individual example, for example `cargo run -p cutile-examples --example async_gemm`\n- Benchmarks: `cargo bench`\n- Everything: `./scripts/run_all.sh` (or pipe to a log file: `./scripts/run_all.sh 2>&1 | tee test_run.log`)\n\n```\ncutile                 User-facing crate for authoring and executing tile kernels\n├── cutile-macro\n├── cutile-compiler\n├── cuda-async\n└── cuda-core\n\ncutile-kernels         Reusable cuTile Rust kernels\n└── cutile\n\ncutile-macro           cuTile 
Rust proc-macro\n└── cutile-compiler\n\ncutile-compiler        Compiles cuTile Rust kernels to executables\n├── cutile-ir\n├── cuda-async\n└── cuda-core\n\ncutile-ir              Pure Rust Tile IR builder and bytecode writer\n\ncuda-async             Async CUDA execution via async Rust\n└── cuda-core\n\ncuda-core              Idiomatic safe CUDA API\n└── cuda-bindings\n\ncuda-bindings          NVIDIA CUDA bindings\n```\n\nThe `cuda-bindings` crate is licensed under NVIDIA Software License: [LICENSE-NVIDIA](/NVlabs/cutile-rs/blob/main/LICENSE-NVIDIA).\nAll other crates are licensed under the Apache License, Version 2.0 [https://www.apache.org/licenses/LICENSE-2.0](https://www.apache.org/licenses/LICENSE-2.0)", "url": "https://wpnews.pro/news/show-hn-cutile-rust-safe-data-race-free-gpu-kernels-in-rust", "canonical_source": "https://github.com/nvlabs/cutile-rs", "published_at": "2026-06-16 20:17:42+00:00", "updated_at": "2026-06-16 20:49:04.331297+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "ai-research", "ai-tools", "developer-tools"], "entities": ["NVIDIA", "cuTile Rust", "Hugging Face", "Grout", "Qwen3", "CUDA", "B200", "RTX 5090"], "alternates": {"html": "https://wpnews.pro/news/show-hn-cutile-rust-safe-data-race-free-gpu-kernels-in-rust", "markdown": "https://wpnews.pro/news/show-hn-cutile-rust-safe-data-race-free-gpu-kernels-in-rust.md", "text": "https://wpnews.pro/news/show-hn-cutile-rust-safe-data-race-free-gpu-kernels-in-rust.txt", "jsonld": "https://wpnews.pro/news/show-hn-cutile-rust-safe-data-race-free-gpu-kernels-in-rust.jsonld"}}