A Case for Tracing-Based DSL Kernel Languages

NVIDIA's C++ template-based CUTLASS library for GPU kernels suffers from compile times of around 20 seconds for a single kernel and over 17 minutes in one reported build, prompting a shift toward Python-embedded DSLs. These DSLs iterate faster because they avoid template instantiation overhead, with NVIDIA reporting up to 100x compilation speedups for Blackwell GEMM kernels using its CuTe Python DSL. Among them, a tracing-based approach, as opposed to the parsing method used by Triton and CuTe-DSL, is often the better way to embed the language.

On the architectural divide between parsing and tracing kernel DSLs, and what tends to go wrong in each.

For years, CUDA was the only language for writing NVIDIA GPU kernels, but since Triton (https://github.com/triton-lang/triton) appeared, a wave of Pythonic DSLs has followed: CuTe-DSL (https://docs.nvidia.com/cutlass/latest/media/docs/pythonDSL/overview.html), cuTile (https://docs.nvidia.com/cuda/cutile-python/), Pallas (https://github.com/jax-ml/jax/blob/main/docs/pallas/design/design.md), Gluon (https://github.com/triton-lang/triton/tree/main/python/triton/experimental/gluon), Warp (https://github.com/NVIDIA/warp), and the more recent TileLang (https://github.com/tile-ai/tilelang), used in DeepSeek’s DeepGEMM. Most of these systems are embedded in Python and share the same goal: lowering a tile-oriented program into PTX or LLVM IR.

The question is how to embed the DSL into Python. Triton and CuTe-DSL parse the source AST. Pallas runs the function under abstract values and traces the resulting operations. PyTorch’s torch.compile intercepts CPython bytecode rather than source, but that is still parsing, just against a smaller, already-desugared grammar; the same trade-offs apply. Most DSLs follow Triton’s lead and use parsing. This essay takes the alternative and argues that a tracing-based approach is often preferable.

CUDA and Templates

A CUDA kernel directly specifies the code executed by each thread.
A textbook fused-softmax kernel in CUDA looks roughly like this:

The element type T and the block size BLOCK_SIZE must be known at compile time, because shared memory is statically sized and the compiler must specialise loop bounds to vectorise the body. Hence any expansion of the supported configuration space multiplies the number of instantiations. Three element types and four block sizes already imply twelve instantiations, and the responsibility for dispatching among them rests with the caller. Keep adding templates and generalisations to CUDA, and one eventually arrives at a heavily templated, CUTLASS-like state.

CUTLASS: Building Blocks for CUDA Kernels

CUTLASS is what C++ template metaprogramming looks like when adopted as the way to write GPU kernels. Consider the declaration of its principal Gemm class, the entry point most users first encounter, from include/cutlass/gemm/device/gemm.h (https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/gemm/device/gemm.h):

A fragment of the canonical Hopper warp-specialized GEMM example shows how a user composes a kernel from nested CollectiveBuilders, each a template that pulls in dozens of further instantiations (examples/48_hopper_warp_specialized_gemm/48_hopper_warp_specialized_gemm.cu, lines 83–142):

Shape<_128, _128, _32> denotes a type rather than a value, and the compiler must instantiate every dependent template once per distinct shape. This results in long compile times.

Compile Time: The Cost of Templates

We compiled 48_hopper_warp_specialized_gemm, a single CuTe-based GEMM file of roughly 500 lines, with -c and no benchmark harness, inside an nvidia/cuda:12.5.0-devel-ubuntu22.04 container, invoking nvcc -std=c++17 -arch=sm_90a. The steady-state nvcc time, averaged over two warm runs on a consumer laptop, was ~20.5 s to compile a single kernel for a single architecture. A full CUTLASS build targets several architectures, and the cost multiplies accordingly.
NVIDIA’s own bug tracker records 17m22s for two Ampere i16832gemm_s8 kernels (issue 1042) and approximately two minutes for a 30-line CuTe-DSL kernel (issue 2677). The NVIDIA developer blog post introducing the CuTe Python DSL in November 2025 frames its principal contribution as “up to two orders of magnitude reduced” compile times relative to C++ CUTLASS, with a quoted “~100x compilation speedup” for Blackwell GEMM and “30-50x” for flash attention. C++ templates compile too slowly for the iteration speed kernel authors need.

Triton: DSL Embedded Into Python

Using Triton is straightforward: decorate a Python function with @triton.jit, mark the compile-time parameters with tl.constexpr, and write the kernel body in something close to NumPy. Triton also simplifies the programming model: the author writes the code for the tile a single thread block operates on, rather than the code for an individual thread.

Triton is a pleasure to use when it works: the program looks straightforward and does what one would expect. Integration into PyTorch is first-class, there is no build system to set up, and it is relatively hard to construct a malformed program that triggers a crash.

However, writing a reusable, generic set of libraries in Triton gets tricky.

Parsing Limitations

Suppose we want to expose a fused matmul whose epilogue (the activation, scaling, or fusion applied to the accumulator after the inner product) can be supplied by the caller as a Python callable:

In a tracing-based framework this is one line: the user hands in a Python callable, and the trace records whatever operations the callable performs. In Triton this is tricky to achieve, and almost every other limitation in this section is a variation of the same underlying constraint.

What the Ecosystem Actually Does

Before we look at why, it helps to see how production Triton libraries handle this in practice.
Across Liger-Kernel (https://github.com/linkedin/Liger-Kernel), FlagGems (https://github.com/flagos-ai/FlagGems), Quack (https://github.com/Dao-AILab/quack), and DeepGEMM (https://github.com/deepseek-ai/DeepGEMM), the answer is consistent: enumerate variants statically, dispatch by enum or string tag, never accept a runtime callable. Quack’s GEMM signature is representative:

Why Lambdas Fail

If we try to pass a callable in Triton, say matmul(A, B, activation=lambda x: tl.where(x > 0, x, 0)), it fails at compile time:

Following the error’s advice and wrapping the lambda in triton.jit fails earlier, during construction of the JITFunction object:

The constructor calls inspect.getsourcelines(fn) and expects the returned source to contain a def <name> line, matched via a module-level regex at jit.py:27. It uses the def to compute indentation and dedent the body before handing it to the AST walker; a lambda’s source line contains no def, so the construction step fails.

The supported workaround is to lift the activation into a named @triton.jit def and pass that through the constexpr argument, which does work. But that is exactly what an enum entry already refers to, and the ecosystem’s convergence on enums reflects this constraint: once the supported shape is “any caller-defined @triton.jit-decorated function passed by name,” the library author may as well enumerate the small set they intend to support, autotune over it, and present a string-typed API.

Closures

A factory-style helper that captures configuration in an enclosing scope is a common pattern in Python:

In Triton this fails unconditionally:

Triton does not implement closure capture at all; every free name in a kernel body is resolved against module globals and must already be a tl.constexpr.
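For contrast with the lambda and closure failures above, here is a minimal sketch of why a tracing frontend does not hit them. It is a toy written in plain Python, not the API of Pallas, JAX, or Triton: the names Trace, Tracer, maximum, matmul, make_scaled_relu, and trace_kernel are all hypothetical. The kernel body runs once on abstract values, and each operation appends itself to a list, so the frontend never inspects source text.

```python
# Toy tracer: every name here is hypothetical and exists only for illustration.


class Trace:
    """Accumulates recorded operations as (result, op, operands) tuples."""

    def __init__(self):
        self.ops = []

    def emit(self, op, *operands):
        result = f"v{len(self.ops)}"
        self.ops.append((result, op, operands))
        return Tracer(self, result)


class Tracer:
    """Abstract value that stands in for a tile while the kernel body runs."""

    def __init__(self, trace, name):
        self.trace = trace
        self.name = name

    def __mul__(self, other):
        return self.trace.emit("mul", self.name, _operand(other))

    def __add__(self, other):
        return self.trace.emit("add", self.name, _operand(other))


def _operand(value):
    # Tracers are referenced by name; Python scalars are baked in as constants.
    return value.name if isinstance(value, Tracer) else value


def maximum(x, y):
    return x.trace.emit("max", x.name, _operand(y))


def matmul(a, b, epilogue=lambda acc: acc):
    # The epilogue is an ordinary Python callable; calling it records its ops.
    acc = a.trace.emit("dot", a.name, b.name)
    return epilogue(acc)


def make_scaled_relu(scale):
    # Factory-style helper: `scale` is captured by an ordinary Python closure.
    def epilogue(acc):
        return maximum(acc * scale, 0.0)
    return epilogue


def trace_kernel(fn, n_args):
    """Run fn once on abstract arguments and return the recorded operations."""
    trace = Trace()
    args = [trace.emit("arg", i) for i in range(n_args)]
    fn(*args)
    return trace.ops


# A lambda epilogue and a closure-built epilogue both trace without special support.
relu_ops = trace_kernel(
    lambda a, b: matmul(a, b, epilogue=lambda x: maximum(x, 0.0)), 2)
scaled_ops = trace_kernel(
    lambda a, b: matmul(a, b, epilogue=make_scaled_relu(0.5)), 2)

for op in scaled_ops:
    print(op)
```

The captured scale ends up as a constant operand of the recorded mul, because Python itself resolved the closure before the tracer saw anything. The usual caveat applies to this sketch as to any tracer: only operations performed on Tracer values are recorded, so Python control flow that depends on runtime data is not captured.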
Higher-Order Primitives

The same constraint shows up most sharply in the region-builder APIs of tl.reduce and tl.associative_scan: even a properly @triton.jit-decorated combine function cannot be passed through a kernel argument at all. The combine must be resolved lexically in the kernel’s enclosing scope, and the standard library copes by shipping one named combine per parameter combination: _argmax_combine_tie_break_left and _argmax_combine_tie_break_fast exist as two file-scoped functions because the boolean argument they differ in cannot be threaded in at call time (triton/python/triton/language/standard.py:158-165).

This is arguably more a limitation of the specific region-builder API than of the parsing approach as such; one could imagine a Triton in which tl.reduce accepted callable arguments. But the analogous JAX primitive does accept a lambda: jax.lax.associative_scan(lambda x, y: x + y, xs).

Aren’t Those Fixable?

Each issue described above is, in principle, fixable. But the picture stays the same: supporting metaprogramming in Triton means implementing more and more of Python inside the AST walker.

CuTe-DSL: The Same Pattern, Sharper Edges

While writing CuTe-DSL kernels we encountered two such failures and filed bug reports against both. Each illustrates the pattern:

Two CuTe-DSL bugs, both from a parsing frontend: cutlass 3268 storage.