{"slug": "writing-high-performance-kernels-in-tilelang-from-gemm-to-mla", "title": "Writing High-Performance Kernels in TileLang, from GEMM to MLA", "summary": "TileLang introduces a middle-ground approach for writing high-performance GPU kernels, offering explicit control over shared memory allocation, pipeline staging, and warp partitioning through Python code while automating layout inference. The framework enables developers to write production-grade kernels like DeepSeek's MLA decode, starting from a simple GEMM example, by explicitly declaring tile-level operations rather than relying on compiler-managed decisions as in Triton or requiring the template-heavy approach of CUTLASS/CuTe.", "body_md": "If you write GPU kernels, you live somewhere on a spectrum. At one end is Triton: quick to write, but the compiler makes most of the layout and shared-memory decisions for you. At the other end is CUTLASS / CuTe: total control, at the cost of a lot of template machinery. TileLang sits in the middle. You write Python, but you say explicitly what lives in shared memory, how the pipeline is staged, and how warps split the work, and a layout inference pass fills in the rest.\n\nIn this post we'll cover the mental model, write a GEMM, and then build up to a real production kernel: DeepSeek's MLA decode, where the interesting decisions actually show up. The goal is not to be exhaustive. It's to show how you think about tiles, and where TileLang quietly does the hard parts for you. We'll finish with a more typical story from production: a kernel where the headline win wasn't speed.\n\nHere's the core of the idea.\n\n- A tile (`block_M × block_K`, say) is owned and operated on by a thread block, a warp, or a thread. You stop thinking purely at the thread-block level the way you do in Triton, and you stop hand-managing individual threads the way you do in CUDA.\n- You say what goes to shared memory (`T.alloc_shared`), what goes to registers (`T.alloc_fragment`), and what's thread-local. 
This is the biggest difference from Triton, which hides shared-memory allocation and staging inside the compiler.\n\nIf you're coming from Triton, here's the rough mapping.\n\n| | Triton | TileLang |\n|---|---|---|\n| Granularity | thread block + implicit vectorization | tile (block / warp / thread) |\n| Shared memory | managed by the compiler | explicit alloc_shared + copy |\n| Layout | the compiler decides | inferred, but you can annotate |\n| Pipelining | tl.range + compiler | explicit T.Pipelined(num_stages=) |\n| Tensor Core | tl.dot | T.gemm with a selectable warp policy |\n| Backends | NVIDIA (mainly) / AMD | NVIDIA / AMD / CPU / WebGPU / CuTeDSL, plus Ascend & MUSA forks |\n\nThe short version: if you want fine control over blocking, pipeline depth, and warp partitioning without writing CUTLASS, TileLang is the sweet spot. For simple elementwise or light fusion, Triton is still quicker to reach for.\n\n```\nconda create -n tilelang python=3.10 -y\nconda activate tilelang\npip install tilelang                 # prebuilt wheel, easiest path\n```\n\nIf you're going to touch the compiler passes, build from source instead (you'll need a local LLVM/CUDA toolchain):\n\n```\ngit clone --recursive https://github.com/tile-ai/tilelang.git\ncd tilelang && pip install -r requirements-dev.txt\npip install -e . -v --no-build-isolation\n```\n\nWe'll start with the kernel everyone starts with: `C = ReLU(A @ B)`. 
It's small, but it touches every primitive that matters — explicit buffers, parallel copy, software pipelining, the Tensor Core call, and an L2 swizzle.\n\n``` python\nimport tilelang\nimport tilelang.language as T\nimport torch\n\n@tilelang.jit\ndef matmul(M, N, K, block_M, block_N, block_K,\n           dtype=\"float16\", accum_dtype=\"float\"):\n\n    @T.prim_func\n    def matmul_relu_kernel(\n        A: T.Tensor((M, K), dtype),\n        B: T.Tensor((K, N), dtype),\n        C: T.Tensor((M, N), dtype),\n    ):\n        # grid dims: (#blocks along N, #blocks along M); 128 threads per block\n        with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M),\n                      threads=128) as (bx, by):\n\n            # Say where each tile lives, explicitly.\n            A_shared = T.alloc_shared((block_M, block_K), dtype)         # shared memory\n            B_shared = T.alloc_shared((block_K, block_N), dtype)\n            C_local  = T.alloc_fragment((block_M, block_N), accum_dtype) # register accumulator\n\n            T.use_swizzle(panel_size=4, order=\"col\")   # optional: better L2 reuse\n            T.clear(C_local)                           # zero the accumulator\n\n            for ko in T.Pipelined(T.ceildiv(K, block_K), num_stages=3):\n                T.copy(A[by * block_M, ko * block_K], A_shared)   # global -> shared\n                T.copy(B[ko * block_K, bx * block_N], B_shared)\n                T.gemm(A_shared, B_shared, C_local)               # tile-level MMA\n\n            for i, j in T.Parallel(block_M, block_N):             # fused ReLU\n                C_local[i, j] = T.max(C_local[i, j], 0)\n\n            T.copy(C_local, C[by * block_M, bx * block_N])        # write back\n\n    return matmul_relu_kernel\n\nM = N = K = 1024\nkernel = matmul(M, N, K, block_M=128, block_N=128, block_K=64)\na = torch.randn(M, K, device=\"cuda\", dtype=torch.float16)\nb = torch.randn(K, N, device=\"cuda\", dtype=torch.float16)\nc = torch.empty(M, N, 
device=\"cuda\", dtype=torch.float16)\nkernel(a, b, c)\n\ntorch.testing.assert_close(c, torch.relu(a @ b), rtol=1e-2, atol=1e-2)\nprint(\"gemm ok\")\n```\n\nHere is what each piece is doing.\n\n- `A_shared` and `B_shared` live in shared memory; `C_local` lives in registers. Accumulator in registers, operands staged through shared memory: that's the standard GEMM recipe, except here you say where each tile lives yourself.\n- `T.copy` is sugar for a parallel copy. It's a `T.Parallel`-style move, and the compiler derives a vectorized, coalesced global→shared transfer from it. When the copy sits inside `T.Pipelined`, it becomes `cp.async` automatically.\n- `T.Pipelined(extent, num_stages=N)` is a software pipeline. `num_stages=3` means triple buffering: while you compute K-tile `ko`, the loads for `ko+1` and `ko+2` are already in flight. In Triton, this is a compile flag; here it's just the loop, which is easier to reason about.\n- `T.gemm(A, B, C)` is the tile-level matmul. It also takes `transpose_A` / `transpose_B` and a `policy=T.GemmWarpPolicy.*` that controls how warps split the output tile. Hold onto that policy argument; it's the whole story when we get to MLA.\n- `T.use_swizzle` is the optional L2 swizzle, there for better L2 reuse.\n\nThe figure below maps all of this onto the hardware. It's worth reading against the code, because the labeled spots are exactly where TileLang hands you control that Triton keeps for itself.\n\n[Figure: GEMM in TileLang: you place every buffer in the hierarchy yourself. 
A_shared / B_shared sit in shared memory, C_local accumulates in registers across warps W0–W3, and the K-loop pipeline (num_stages=3) overlaps cp.async prefetches with the current gemm compute.]\n\nYou can write most kernels with a small vocabulary.\n\n- Allocation: `T.alloc_shared`, `T.alloc_fragment` (registers), `T.alloc_local`.\n- Data movement: `T.copy(src, dst)` between any two levels; `T.clear`, `T.fill`.\n- Compute: `T.gemm(...)`; `T.Parallel(d0, d1, ...)` for elementwise loops (this is the entry point for layout inference); `T.reduce_max` / `T.reduce_sum`; scalar math like `T.exp`, `T.exp2`, `T.max`, `T.infinity`.\n- Scheduling and layout: `T.Pipelined(extent, num_stages=)`, `T.use_swizzle(...)`, and `T.annotate_layout(...)` when you need a specific layout (bank-conflict avoidance, custom swizzle).\n- Dynamic shapes: `M = T.dynamic(\"m\")`, so you don't recompile per shape (it's called `T.symbolic` in some versions).\n\nTwo things you'll want often. To see what the compiler actually emitted:\n\n```\nprint(kernel.get_kernel_source())     # generated CUDA / HIP\n```\n\nAnd to time it:\n\n```\nprofiler = kernel.get_profiler(tensor_supply_type=tilelang.TensorSupplyType.Normal)\nprint(f\"latency: {profiler.do_bench()} ms\")\n```\n\n`T.print(buf)` prints a tile from inside the kernel, and the repo's `examples/plot_layout` draws the memory layout, which is handy when you're chasing a bank conflict or checking a swizzle.\n\nThe GEMM shows the mechanics. This next one shows why they matter. We'll walk through DeepSeek's MLA (Multi-Head Latent Attention) decode kernel, because it's the cleanest example of TileLang earning its keep. The TileLang reference lands at roughly FlashMLA's H100 performance (benchmarked at batch 64/128 in fp16, comfortably ahead of Triton and FlashInfer) in about 80 lines of Python. The interesting question is how, because the hard part of MLA isn't the math; it's register pressure.\n\nLet's review the loop everyone knows. 
Every FlashAttention-family kernel has the same shape. Per query block, you stream over key/value blocks and keep a running max and denominator, so the full score matrix never lands in memory:\n\n```\n# acc_s : [block_M, block_N]  scores for this KV block\n# acc_o : [block_M, dim]      output accumulator\n# scores_max starts at -inf, logsum and acc_o start at 0\nfor i in range(num_kv_blocks):\n    acc_s = Q @ K[i].T\n    m_prev       = scores_max\n    scores_max   = max(m_prev, rowmax(acc_s))\n    scores_scale = exp(m_prev - scores_max)\n    acc_o *= scores_scale                       # rescale prior output\n    acc_s  = exp(acc_s - scores_max)            # unnormalized probabilities\n    logsum = logsum * scores_scale + rowsum(acc_s)   # running denominator\n    acc_o += acc_s @ V[i]\nacc_o /= logsum                                 # normalize once at the end\n```\n\nBoth `acc_s` and `acc_o` want to stay in registers. For MHA or GQA, that's fine. For MLA, it isn't.\n\n**Here's where it gets hard.** MLA's head dimensions are big: query and key are 576 wide (a 512-wide \"nope\" part with no positional encoding, plus a 64-wide \"rope\" part), and value is 512. So `acc_o = [block_M, 512]`, and it has to stay resident in registers across the whole KV loop.\n\nNow bring in the hardware. On Hopper, the fast path is `wgmma.mma_async`, which ties 4 warps (128 threads) into one warpgroup and requires a minimum M of 64. So the smallest M one warpgroup can own is 64, which means one warpgroup would be holding a `64 × 512` accumulator. That's too big for a single warpgroup's register file: with an FP32 accumulator it is 32,768 values across 128 threads, or 256 registers per thread for `acc_o` alone, and the hardware caps a thread at 255. It spills, and performance falls off a cliff.\n\n[Figure: MLA decode in TileLang: splitting acc_o across two warpgroups. WG0 and WG1 each compute Q·K^T (policy=FullCol), exchange their score halves through S_shared, and then each compute their column slab of P·V into acc_o_L / acc_o_R. The whole bookkeeping (acc_s shape, S_shared shape, Q·K split) is derived by layout inference from the FullCol policy you annotated.]\n\n**The fix is to split the output across two warpgroups.** You can't shrink M below 64, so the only axis left is `dim`. 
Use two warpgroups: `WG0` owns `acc_o[:, :256]`, `WG1` owns `acc_o[:, 256:]`. Now each one holds a `64 × 256` accumulator, which fits. That creates a second problem, though: the `P @ V` step (with `policy=FullCol`, each warpgroup producing one column slab of the output) needs the *complete* `acc_s`, but in `Q @ K` each warpgroup only naturally computed half of it. The resolution is a shared-memory swap. During `Q @ K`, each warpgroup writes its half of `acc_s` to shared memory and reads back the other warpgroup's half, so afterward both hold the full `acc_s` and can each compute their slab of `acc_o`. The diagram above is exactly that: split the scores, swap through `S_shared`, split the output.\n\nIn CuTe you'd hand-write the layouts, the swizzles, the Tensor Core alignment, and the producer/consumer sync to pull this off. The reason it collapses to ~80 lines here is layout inference.\n\n**Let's break down what layout inference does.** You annotate intent on the `T.gemm` calls, and it propagates the constraints through the program for you:\n\n- `policy=FullCol` on `P @ V` means each warpgroup needs the full `acc_s`, so `acc_s = [block_M, block_N]`.\n- `S_shared` in `T.copy(S_shared, acc_s)` is also `[block_M, block_N]`.\n- `Q @ K`: with `FullCol`, each warpgroup's score slab is `[block_M, block_N/2]`.\n\nThe key insight is that you never write any of those shapes. You pick the warp policy and write the math; the shapes, the swizzled layouts, and the warp-specialized producer/consumer code all come out of inference.\n\n**The kernel skeleton.** In MLA decode the query splits into a \"nope\" part (`Q`, dim 512) and a \"rope\" part (`Q_pe`, dim 64), and the compressed latent serves as both K and V. So the score is a sum of two GEMMs, and the output is one more. 
The inner loop looks like this (a representative skeleton, not line-exact; see `example_mla_decode.py`):\n\n```\n# acc_s = Q_nope @ KV^T + Q_rope @ K_pe^T\nT.gemm(Q_shared,    KV_shared,   acc_s, transpose_B=True,\n       policy=T.GemmWarpPolicy.FullCol, clear_accum=True)\nT.gemm(Q_pe_shared, K_pe_shared, acc_s, transpose_B=True,\n       policy=T.GemmWarpPolicy.FullCol)\n\n# online softmax\nT.copy(scores_max, scores_max_prev)\nT.fill(scores_max, -T.infinity(accum_dtype))\nT.reduce_max(acc_s, scores_max, dim=1, clear=False)\n# ... exp, rescale acc_o by scores_scale, reduce_sum into logsum ...\n\n# acc_o += P @ V  (V is the same latent KV)\nT.copy(acc_s, acc_s_cast)\nT.gemm(acc_s_cast, KV_shared, acc_o, policy=T.GemmWarpPolicy.FullCol)\n```\n\nThe `S_shared` exchange between the two warpgroups is the part inference inserts for you, once the `FullCol` policies force `acc_s` to be full per warpgroup.\n\n**The nice part: the optimizations are one line each.** This is where TileLang pays off: the whole performance toolkit is one-liners, and the messy lowering is handled for you.\n\n- L2 swizzle: `T.use_swizzle(panel_size, order=\"row\")`.\n- Shared-memory layout: `T.annotate_layout({S_shared: T.layout.make_swizzled_layout(S_shared)})`, an XOR-style address remapping so concurrent accesses spread across banks instead of serializing.\n- Warp specialization: the producer/consumer split and its `mbarrier` sync are generated. None of it shows up in your code.\n- Pipelining: `T.Pipelined(range, num_stages)` overlaps loads with compute. More stages, more overlap, but more shared memory, so it's a knob.\n- Split-KV: a `num_split` parameter plus a combine kernel.\n\nSo the genuinely hard reasoning (register budget against the M≥64 floor, who owns what across warpgroups, the shared-memory swap) you express by choosing a policy and writing the math. Everything that would be hundreds of fragile lines in CuTe is inference and codegen. 
That's the pitch, and MLA is where it's most convincing.\n\nThe last example is one of our own production kernels at AtlasCloud, from the Wan video-generation VAE on H100/H200. It's a great illustration of the other thing TileLang is excellent at: covering a config a hand-tuned kernel can't reach, with a clean drop-in.\n\n**The setup.** We already ship a hand-tuned fused RMSNorm + SiLU kernel. It's fast, and it's compiled for the hidden dims `D ∈ {96, 192, 384}` that one model config uses. A newer config needs channel widths like `{160, 256, 320, 512, 640, 1024}`, so on that config the hand-tuned fast path can't run. We wrote a TileLang drop-in to cover exactly that gap.\n\n**The TileLang kernel.** A drop-in with the same interface (BTHWC in/out, same math, same `eps`) that supports any `C` that's a multiple of 32. Two passes, fully coalesced, FP32 accumulator:\n\n``` python\n@T.prim_func\ndef main(X:     T.Tensor((M, C), dtype),      # M = B*T*H*W rows\n         gamma: T.Tensor((C,),  dtype),\n         Y:     T.Tensor((M, C), dtype)):\n    with T.Kernel(T.ceildiv(M, BLOCK_M), threads=128) as bm:\n        X_chunk = T.alloc_shared((BLOCK_M, BLOCK_C), dtype)\n        ss      = T.alloc_fragment((BLOCK_M,), accum_dtype)   # FP32 sum-of-squares\n        # pass 1: loop over C in BLOCK_C chunks, accumulate sum of squares in FP32\n        # rinv = rsqrt(ss / C + 1e-5)\n        # pass 2: re-load X, y = silu(x * gamma * rinv), write back\n```\n\n`BLOCK_C` is 128/64/32 depending on `C`, to respect the TMA `boxDim ≤ 256` limit, and the FP32 accumulator keeps the sum of squares from overflowing in FP16. 
Dispatch keeps the hand-tuned path where it works and only falls back when it has to:\n\n``` python\n_ATLAS_SUPPORTED_D = (96, 192, 384)\n\ndef rms_silu_dispatch(x, gamma, out):\n    if x.shape[-1] in _ATLAS_SUPPORTED_D:\n        atlas_rms_norm_silu(x, gamma, out=out)        # keep the hand-tuned path\n    else:\n        tilelang_rms_silu_bthwc(x, gamma, out=out)    # cover the gap\n```\n\n**What it gained us.** All upside, and it's a true drop-in: same interface, same math, same `eps`, so it slots in behind the existing dispatch with no call-site changes.\n\n| What | Gain |\n|---|---|\n| Previously-unsupported config | 0 → 1: it runs now (the headline win) |\n| Attention-block RMSNorm vs the eager PyTorch norm it replaced | 42 μs → 20 μs (~2× faster) |\n| End-to-end VAE at production resolution (720×1280, 21 frames) | ~1.79× encode, ~1.78× decode |\n\nThe first row is the real point: TileLang let us serve a model config that previously had no fast path at all, without touching the hand-tuned kernel that already works for the other config. One drop-in, written in Python, and a whole model path went from \"throws\" to \"ships.\"\n\nThe optimizations are one-liners (`T.use_swizzle`, `T.annotate_layout`, `T.Pipelined`, warp specialization, split-KV), with the lowering handled for you.\n\nThe cool part of TileLang is that the hard reasoning stays in your head, not in boilerplate. You decide how to split work across warps, where buffers live, and how deep the pipeline runs, and then layout inference and warp specialization turn that into the register layouts, the swizzles, and the producer/consumer dance that would otherwise be hundreds of lines of CuTe. You pick a policy and write the math. 
That's the whole pitch, and it's why an 80-line MLA kernel can sit next to a hand-tuned CUTLASS one.", "url": "https://wpnews.pro/news/writing-high-performance-kernels-in-tilelang-from-gemm-to-mla", "canonical_source": "https://dev.to/atlas_cloud_ai/writing-high-performance-kernels-in-tilelang-from-gemm-to-mla-13p0", "published_at": "2026-05-26 08:50:38+00:00", "updated_at": "2026-05-26 09:04:36.349407+00:00", "lang": "en", "topics": ["ai-tools", "ai-infrastructure", "ai-chips", "machine-learning", "large-language-models"], "entities": ["TileLang", "Triton", "CUTLASS", "CuTe", "DeepSeek"], "alternates": {"html": "https://wpnews.pro/news/writing-high-performance-kernels-in-tilelang-from-gemm-to-mla", "markdown": "https://wpnews.pro/news/writing-high-performance-kernels-in-tilelang-from-gemm-to-mla.md", "text": "https://wpnews.pro/news/writing-high-performance-kernels-in-tilelang-from-gemm-to-mla.txt", "jsonld": "https://wpnews.pro/news/writing-high-performance-kernels-in-tilelang-from-gemm-to-mla.jsonld"}}