{"slug": "accelerating-llm-inference-on-amd-gpus-with-low-latency-gemms", "title": "Accelerating LLM Inference on AMD GPUs with Low-Latency GEMMs", "summary": "AMD announced a new kernel family, LDS-Pipelined Split-K GEMM, that accelerates LLM inference on AMD GPUs by optimizing decode-time GEMMs with small M and large N/K dimensions. The technique achieves a 1.64x average latency improvement over the fastest existing backend on a benchmark decode grid, addressing a key bottleneck in interactive LLM serving.", "body_md": "# Accelerating LLM Inference on AMD GPUs with Low-Latency GEMMs\n\nLarge language model inference is becoming increasingly interactive. Users expect chatbots, coding assistants, agents, and real-time copilots to respond quickly, stream tokens smoothly, and stay responsive under concurrent load. In that setting, decode-time latency is not just a backend metric. It directly affects perceived quality.\n\nIn this blog, you will explore one small but important part of that inference path: **decode-time GEMMs with small M, large N and K, BF16/FP16 inputs, optional bias, and shapes that repeat across real models**. These shapes can leave the GPU underutilized under conventional GEMM tiling, which makes them a useful target for decode-path optimization.\n\nThe main technique is **LDS-Pipelined Split-K GEMM**: the long `K` reduction is split across CTAs, further sliced across warp groups inside each CTA, and kept moving through a multi-stage LDS memory pipeline. On AMD GPUs, LDS means Local Data Share, the on-chip scratchpad memory used for fast cooperation inside a CTA.\n\nYou will also see how we implement this idea as an [AITER](https://github.com/ROCm/aiter) [FlyDSL](https://github.com/ROCm/FlyDSL) kernel family. 
FlyDSL keeps low-level ROCm™ software details such as MFMA selection, LDS layout, async copies, and synchronization explicit, while still generating shape-specialized variants for the model dimensions that appear in decode. In benchmark sweeps, this targeted decode optimization reaches a **1.64x average latency improvement** over the fastest of `HipblasLT`, `AITER Triton`, and `AITER ASM` on the `K = 7168` decode grid[[1]](#id3), and a **`1.49x` average latency improvement on additional BF16 model-shape tests.**\n\n## Why Does Decode Latency Matter for LLM Serving?\n\nLLM serving has two broad phases:\n\n- **Prefill**, where the model processes the prompt.\n- **Decode**, where the model generates output tokens one step at a time.\n\nPrefill often has a larger effective `M` because many prompt tokens can be processed together. Decode is different. Each step may only process a small number of active tokens, especially after batching, scheduling, tensor parallelism, and request-level dynamics are taken into account.\n\nThat makes decode performance important for user-facing latency:\n\n- **Time to first token** affects how quickly the system appears to respond.\n- **Time per output token** affects streaming smoothness.\n- **Inter-token latency** affects whether the interaction feels fluid.\n- **Throughput under concurrency** affects how many users can be served without hurting responsiveness.\n\nFigure 1 illustrates this interactive decode serving setting and shows where these latency concerns appear in the user-facing path.\n\n*Figure 1: Interactive LLM decode serving.*\n\nFor these workloads, shaving overhead from repeated decode GEMMs can matter at the model-serving level.\n\n## Why Do Small-`M`, Large-`K` GEMMs Underperform?\n\nIn large-model decode, GEMM often looks like:\n\n```\nC[M, N] = A[M, K] @ B[N, K]^T\n```\n\nVisually, the kernel still starts 
from the standard GEMM idea: compute a tile of `C` from a tile of `A` and a tile of `B`. The problem is that a small `M` produces too few output tiles, even though the `K` dimension can be very long.\n\nFigure 2 shows this small-`M`, large-`K` bottleneck: the output grid is narrow, while the reduction dimension still contains substantial work.\n\n*Figure 2: Small-M, large-K GEMM bottleneck.*\n\nIn the GEMM above, `M` is the number of active tokens in a decode step or micro-batch. For serving workloads, `M` is frequently small: `1`, `2`, `4`, `8`, `16`, `32`, sometimes up to `128` or `256`. At the same time, `N` and `K` are model-hidden-size dimensions and can be thousands or tens of thousands.\n\nThat shape regime is awkward for general GEMM libraries. A conventional large-tile GEMM wants enough `M x N` output tiles to keep all compute units busy. Decode GEMM often does not provide that naturally. The result is under-occupancy, poor wave utilization, and too much overhead relative to useful math.\n\nCommon GEMM optimizations such as larger CTA tiles, better memory coalescing, LDS staging, MFMA-focused scheduling, and pipelining still matter. But they do not by themselves create enough independent work when the `M x N` output grid is small. 
This is why LDS-Pipelined Split-K combines multiple forms of K parallelism instead of relying on one optimization layer.\n\n## Decode GEMM Shapes in Real Models\n\nThe motivation came directly from model shape traces, not from synthetic square GEMMs.\n\nAcross current LLMs, decode GEMM shapes repeatedly show the same pattern:\n\n| Model family | Typical decode GEMM pattern |\n|---|---|\n| DeepSeek V3 | |\n| GPT-OSS | |\n| GLM5 | |\n| Kimi K2 | many skinny decode shapes such as |\n| Llama 70B / 450B | |\n| Qwen32B | |\n\nThe important observation is not just “small `M` exists.” It is that **small-M, large-K GEMMs occur everywhere in decode paths**, and they affect end-to-end serving throughput.\n\nSo the design target is a low-overhead GEMM path with:\n\n- Small `M` and large `K`\n- Moderate-to-large `N`\n- Optional bias support\n- Low launch overhead\n- Good occupancy even when `M` is tiny\n\n## The Core Idea: LDS-Pipelined Split-K\n\nThe kernel treats `K` as splittable reduction work rather than a private serial loop inside one CTA. 
For one output tile `C[m_tile, n_tile]`, the computation is:\n\n```\nC_tile =\n    A[m_tile, K0] @ B^T[K0, n_tile]\n  + A[m_tile, K1] @ B^T[K1, n_tile]\n  + A[m_tile, K2] @ B^T[K2, n_tile]\n  + ...\n```\n\nLDS-Pipelined Split-K exposes those `K0/K1/K2/...` chunks at three levels:\n\n1. **Inter-CTA Split-K**: split the full K dimension across multiple CTAs (workgroups).\n2. **Intra-CTA K-slice splitting**: split the K tile of one CTA across multiple warp groups inside the block.\n3. **Multi-stage LDS pipeline**: pipeline K blocks through ring-buffered LDS stages while overlapping global-to-LDS copies, LDS reads, and MFMA compute.\n\nThese layers solve different problems.\n\n| Technique | Where it adds parallelism | What it helps | What it needs |\n|---|---|---|---|\n| Inter-CTA Split-K | Across CTAs (workgroups) | Better GPU occupancy when | Global accumulation and synchronization |\n| Intra-CTA K-slice splitting | Inside one CTA | Better use of warp groups for K-heavy tiles | LDS staging and local reduction |\n| Multi-stage LDS pipeline | Across time inside a CTA | Overlap global-to-LDS copies, LDS reads, and MFMA while K blocks advance | Ring-buffered LDS stages and scheduling |\n| LDS-Pipelined Split-K | All three levels | More work across the GPU, more useful work per CTA, and smoother K-block pipelining | A coordinated reduction and pipeline path |\n\nFigure 3 shows the full tiled data path for this design. A selected `C` tile is not produced by one monolithic CTA. The long `K` dimension is first broken into inter-CTA Split-K ranges, each CTA streams its assigned K blocks through the multi-stage LDS pipeline, and intra-CTA K-slice splitting lets multiple warp groups compute partial accumulations for the same output tile. 
Those local partials are reduced through LDS before the inter-CTA Split-K partials are accumulated into the final `C` tile.\n\n*Figure 3: LDS-Pipelined Split-K data path.*\n\n### Inter-CTA Split-K: More CTAs for Small-M GEMM\n\nWhen `M` is small, the normal `M x N` tile grid may not launch enough CTAs to saturate the GPU. For example, with `M = 8`, `N = 5120`, and a `128 x 128` CTA tile (an illustrative choice that matches the builder's default `TILE_M` and `TILE_N`), the output grid has only `1 x 40 = 40` tiles, far fewer than the `256` CUs of the MI355X GPU used in the benchmarks.\n\nInter-CTA Split-K expands the launch grid along K:\n\n```\ngrid = [mn_tiles, split_k]\n```\n\nEach Split-K partition computes a partial sum over a different K range. After that partition finishes its local pipeline work and intra-CTA LDS reduction, the partial result is accumulated into the same output tile.\n\nIn the launch wrapper, inter-CTA Split-K is visible as the second grid dimension:\n\n```\n# Number of CTA rows along M: ceil(m / BLOCK_M)\nbm = (m + BLOCK_M - 1) // BLOCK_M\nhgemm_kernel(C, A, B, BIAS, m, semaphore, signal).launch(\n    # x: one CTA per output tile; y: one CTA per Split-K partition\n    grid=(bm * N_BLOCKS, SPLIT_K, 1),\n    block=(BLOCK_THREADS, 1, 1),\n    stream=stream,\n)\n```\n\nThis is especially useful for decode shapes like:\n\n```\nM = 1, 2, 4, 8, 16\nN = 2560 / 2880 / 5120\nK = 2880 / 4096 / 7168\n```\n\nWithout this extra Split-K dimension, there may simply not be enough independent work.\n\n### Intra-CTA K-slice Splitting: More Warp-Group Parallelism\n\nInter-CTA Split-K increases the number of CTAs. Intra-CTA K-slice splitting increases useful work inside one CTA.\n\nThe kernel assigns multiple warp groups to different K slices of the same tile. Each group computes a partial accumulation. At the end of the CTA's K loop, those partial results are reduced through LDS before writing back.\n\nThis helps in two ways:\n\n- It increases parallelism for K-heavy tiles.\n- It controls register pressure by distributing work across warp groups.\n\n### Multi-Stage LDS Pipeline: Keep K Blocks in Flight\n\nThe third layer is temporal. 
Once a CTA owns a K range, it still has to repeatedly compute:\n\n```\nC_tile += A[m_tile, K_i] @ B^T[K_i, n_tile]\n```\n\nfor many consecutive `K_i` blocks. Instead of treating those blocks as a serial load-then-compute sequence, the kernel uses `STAGES` LDS tiles as a ring buffer. A stage that was filled earlier is consumed by LDS reads and MFMA, while another stage is reused for a future global-to-LDS copy.\n\nIn the `B_TO_LDS` path, both A and B participate in this LDS ring. The prologue first prefetches `STAGES - 1` K blocks, and the hot loop then consumes one stage while issuing the copy for a future stage:\n\n```\n# Prologue: prefetch the first STAGES - 1 K blocks into LDS stages 0 .. STAGES - 2\nfor s in range_constexpr(STAGES - 1):\n    ldg_sts_b_async(ks_begin + s * BLOCK_K, s)\n    ldg_sts_a_async(ks_begin + s * BLOCK_K, s)\n\n# Hot loop: consume current_stage while refilling write_stage\nfor bki, state in range(0, BLOCK_K_LOOPS - (STAGES - 1), 1, init=init_state):\n    k_offset = state[0]\n    current_stage = fx.Index(state[1])\n    next_stage = (current_stage + 1) % STAGES\n    write_stage = (current_stage + STAGES - 1) % STAGES\n\n    __barrier((STAGES - 2) * LDG_WAIT_COUNT)\n    ldg_sts_b_async(k_offset + (STAGES - 1) * BLOCK_K, write_stage)\n    ldg_sts_a_async(k_offset + (STAGES - 1) * BLOCK_K, write_stage)\n    c_frags_new = ldmatrix_compute_tile_streaming(current_stage, c_frags)\n    hot_loop_scheduler()\n```\n\nThe hot loop then advances the ring one K block at a time. `current_stage` is the LDS stage being consumed, and `write_stage = (current_stage + STAGES - 1) % STAGES` is the stage receiving the future K block. For example, with `STAGES = 3` and `current_stage = 0`, the loop reads stage `0` while the copy for K block `i + 2` lands in stage `(0 + 3 - 1) % 3 = 2`. The wait count intentionally leaves newer copies outstanding:\n\n```\n__barrier((STAGES - 2) * LDG_WAIT_COUNT)\n```\n\nThat means the loop waits for the current stage to be safe to read, without draining every in-flight global-to-LDS copy. 
Conceptually:\n\n```\ncurrent stage : LDS reads feed MFMA for K block i\nwrite stage   : global-to-LDS copy brings K block i + STAGES - 1\nnext stage    : becomes current in the next loop iteration\n```\n\nThe scheduler hints in `hot_loop_scheduler()` order VMEM, LDS reads, and MFMA instructions so this producer/consumer pipeline keeps moving through the CTA. This pipeline depth is separate from the two K-parallelism knobs: `SPLIT_K` adds CTAs across K, while `BLOCK_K_WARPS` splits a CTA’s K tile across warp groups.\n\nWhen `B_TO_LDS` is disabled, the pipeline is narrower: A is still staged through LDS, but B fragments are loaded directly from global memory into registers instead of joining the staged LDS ring.\n\n## Single-Launch Split-K Synchronization\n\nInter-CTA Split-K creates a correctness problem: multiple CTAs contribute to the same output tile.\n\nThis kernel uses a lightweight global synchronization protocol with two global buffers:\n\n```\nsignal[]\nsemaphore[]\n```\n\nThe flow is:\n\n1. The first Split-K partition initializes the output tile.\n   - If bias is enabled, it writes bias into `C`.\n   - Otherwise, it zeroes `C`.\n2. After initialization, it writes a `signal`.\n3. Other Split-K partitions spin-wait on that signal before accumulating.\n4. Each Split-K partition computes its partial result.\n5. The partial result is accumulated into global `C` with atomic add.\n6. A semaphore counts how many Split-K partitions have arrived.\n7. The last arriving partition resets both `signal` and `semaphore`.\n\nFigure 4 shows the same flow as a small protocol among the inter-CTA Split-K partitions:\n\n*Figure 4: Split-K synchronization protocol.*\n\nThis avoids a separate initialization kernel and keeps the entire operation inside one GEMM launch. That matters for decode, where launch overhead and small-kernel overhead are visible at the model level. 
The protocol relies on two simple correctness invariants:\n\n- **No Split-K partition accumulates into `C` before initialization is visible.** Partition 0 initializes the output tile and publishes `signal = 1`; the other partitions spin-wait on that signal before doing global atomic accumulation.\n- **Synchronization state is reset only after all Split-K partitions arrive.** Each partition increments `semaphore[]`; the last arriving partition resets both `signal[]` and `semaphore[]` for reuse.\n\nIn the implementation, the protocol stays close to the algorithm. Partition 0 initializes `C` and publishes the signal:\n\n```\nif const_expr(IS_SPLIT_K):\n    zero_c()\n\n# inside zero_c()\n# Publish signal = 1 with a global store (sc0 sc1 set the store's coherence scope)\nsignal_ptr = get_llvm_ptr(signal, signal_idx, 4)\nllvm.InlineAsmOp(\n    None,\n    [signal_ptr, arith.constant(1, type=T.i32)],\n    \"global_store_dword $0, $1, off sc0 sc1\",\n    \"v,v\",\n    has_side_effects=True,\n)\n```\n\nEvery Split-K partition later enters the barrier, increments the semaphore, and the last partition clears the state:\n\n```\n# atomicrmw add returns the value before the increment\narrive_idx = llvm.AtomicRMWOp(\n    llvm.AtomicBinOp.add,\n    semaphore_ptr,\n    arith.constant(1, type=T.i32),\n    llvm.AtomicOrdering.monotonic,\n    syncscope=\"agent\",\n    alignment=4,\n).result\n\n# The last arriver observes SPLIT_K - 1 and resets both buffers for reuse\ncond_ksl = arith.cmpi(\n    arith.CmpIPredicate.eq,\n    fx.Index(arrive_idx),\n    fx.Index(SPLIT_K - 1),\n)\ncond_ksl_if = scf.IfOp(cond_ksl, results_=[], has_else=False)\nwith ir.InsertionPoint(cond_ksl_if.then_block):\n    semaphore_[signal_idx] = arith.constant(0, type=T.i32)\n    signal_[signal_idx] = arith.constant(0, type=T.i32)\n```\n\n## LDS Reduction for Intra-CTA K-slice Splitting\n\nIntra-CTA K-slice splitting happens inside a CTA. Each K-slice warp group produces partial `C` fragments. 
Instead of immediately writing each partial to global memory, the kernel stages the partial results through LDS:\n\n```\npartial C from slice 0\npartial C from slice 1\n...\npartial C from slice BLOCK_K_WARPS - 1\n        ↓\nLDS reduction\n        ↓\nglobal store or global atomic\n```\n\nWhen inter-CTA Split-K is disabled, the CTA reduces local K-slice partials and stores the final result. When inter-CTA Split-K is enabled, the CTA first reduces its local K-slice partials, then participates in the global accumulation. The implementation makes this hierarchy explicit by giving LDS `C` storage an extra `BLOCK_K_WARPS` dimension:\n\n```\ncs_ = STensor(smem_c_ptr, dtype_, shape=(BLOCK_K_WARPS, BLOCK_M, BLOCK_N))\n```\n\nEach warp group writes its own K-slice partial into `cs_[wid_k, ...]`. The epilogue then reduces those partials before either storing the tile or participating in inter-CTA Split-K atomic accumulation.\n\n## Memory Pipeline Details\n\nFigure 5 summarizes the memory pipeline from the tiled-GEMM view. For one selected output tile, the CTA walks through a stream of K blocks:\n\n```\nC_tile =\n    A_i     @ B_i^T\n  + A_{i+1} @ B_{i+1}^T\n  + A_{i+2} @ B_{i+2}^T\n  + ...\n```\n\nThe multi-stage pipeline does not change this math. It changes **where each K block is while the loop is running**. One LDS stage feeds MFMA for the current K block, another stage can hold the next block, and another can receive a future block through global-to-LDS copy.\n\n*Figure 5: Multi-stage LDS pipeline.*\n\nThis is the memory-side reason the pipeline pairs well with Split-K. Split-K creates more K-parallel work, while the LDS ring keeps each CTA from repeatedly stalling on a simple load-then-compute sequence.\n\n## Implementation Notes: From Algorithm to FlyDSL Kernel\n\nLDS-Pipelined Split-K is not one fixed kernel. 
It is a family of specialized kernels whose best configuration depends on shape, dtype, bias, and GPU architecture.\n\nThis is where FlyDSL matters. The algorithm is expressed as a parameterized kernel generator rather than as one handwritten kernel per shape. The implementation keeps low-level pieces explicit: MFMA selection, LDS allocation, async copies, global atomics, `s_waitcnt`, barriers, and inline assembly for specific global memory operations. FlyDSL then lets the kernel specialize the tile shape, Split-K factor, memory path, and epilogue together.\n\nIn `splitk_hgemm.py`, the naming maps directly to implementation knobs: `SPLIT_K` controls inter-CTA K partitioning, `BLOCK_K_WARPS` controls intra-CTA K-slice parallelism, `STAGES` controls the depth of the LDS ring, and `B_TO_LDS` controls whether B is staged through that ring. The kernel family is parameterized directly in the builder:\n\n```python\nimport functools\n\n\n@functools.lru_cache(maxsize=1024)\ndef compile_hgemm_kernel(\n    dtype: str,\n    n: int,\n    k: int,\n    TILE_M: int = 128,\n    TILE_N: int = 128,\n    TILE_K: int = 64,\n    STAGES: int = 2,\n    SPLIT_K: int = 1,\n    BLOCK_M_WARPS: int = 2,\n    BLOCK_N_WARPS: int = 2,\n    BLOCK_K_WARPS: int = 1,\n    B_TO_LDS: bool = False,\n    HAS_BIAS: bool = False,\n):\n    # Compile-time flags derived from the tuning parameters\n    IS_SPLIT_K = SPLIT_K > 1\n    IS_SLICE_K = BLOCK_K_WARPS > 1\n```\n\nThose parameters are the tuning surface:\n\n```\nTILE_M / TILE_N / TILE_K  -> CTA tile shape\nSTAGES                    -> depth of the LDS ring buffer\nSPLIT_K                   -> global K parallelism across CTAs\nBLOCK_K_WARPS             -> intra-CTA K-slice parallelism\nB_TO_LDS                  -> whether B is staged through LDS\nHAS_BIAS                  -> fused bias path\ndtype + GPU_ARCH          -> MFMA instruction selection\n```\n\nThis gives the implementation a middle ground: more shape-specific control than a generic GEMM library call, but faster iteration than maintaining a fully hand-written assembly kernel.\n\n- **The kernel is generated as a 
family of specialized kernels.** Each shape can be JIT-compiled with the right tile, inter-CTA Split-K factor, intra-CTA K slicing, LDS policy, MFMA path, and bias path.\n- **The synchronization logic stays connected to the algorithm.** Split-K initialization, signal wait, semaphore reset, LDS reduction, and epilogue logic are written in one kernel instead of being scattered across several auxiliary launches.\n- **The compiler can specialize aggressively.** Conditions like `HAS_BIAS`, `B_TO_LDS`, `SPLIT_K > 1`, `BLOCK_K_WARPS > 1`, and architecture-specific MFMA paths become compile-time constants.\n- **Tuning moves faster than hand-written assembly iteration.** For model-serving kernels, this matters. We need to test many real model shapes, not just one benchmark shape.\n\nThat implementation strategy matters because the algorithm needs many tuned variants, not one universal kernel.\n\n## Benchmark Results\n\nWe evaluate LDS-Pipelined Split-K as a concrete BF16/FP16 GEMM kernel family on representative decode GEMM shapes. The first sweep uses `K = 7168` on an AMD Instinct™ MI355X GPU (`gfx950`) with `256` CUs, and compares four backend paths:\n\n- `AITER FLYDSL`\n- `AITER ASM`\n- `HipblasLT`\n- `AITER Triton`\n\nIn Figures 6 through 11, `AITER FLYDSL` is the FlyDSL-generated kernel path, `AITER ASM` is the tuned assembly path, `HipblasLT` is the library backend, and `AITER Triton` is the Triton-based backend.\n\nAll performance data in this section was measured on the benchmark setup described in the footnote.[[1]](#id3)\n\nThe benchmark section should be read shape by shape. The goal is to show where `AITER FLYDSL`, the generated shape-specialized kernel path, improves decode latency, where it is merely competitive, and where another backend remains the better choice.\n\nFigure 6 is the main visual speedup table. Each cell corresponds to one `(M, N, K)` shape. 
The large number is the speedup of `AITER FLYDSL` against the fastest of `HipblasLT`, `AITER Triton`, and `AITER ASM`, and the small text shows the measured latencies.\n\n*Figure 6: K = 7168 speedup table.*\n\nFor each `(M, N, K)` shape, the baseline is the fastest of `HipblasLT`, `AITER Triton`, and `AITER ASM`, and the cell value is:\n\n```\nspeedup = min(hipblaslt_latency, aiter_triton_latency, aiter_asm_latency) / aiter_flydsl_latency\n```\n\nA value above `1` means `AITER FLYDSL` is faster than every other backend for that shape.\n\nAcross these 32 decode GEMM shapes, `AITER FLYDSL` is on average about `1.64x` faster than the fastest of `HipblasLT`, `AITER Triton`, and `AITER ASM`. For the most decode-sensitive region, `M <= 8`, the average speedup is about `1.79x`, with the best observed shape reaching about `2.37x`. Some shapes remain close because backend performance depends on the exact `(M, N, K)` geometry and how much useful K-parallel work the shape exposes.\n\nCompared with the `AITER ASM` path alone, `AITER FLYDSL` is also competitive across the sweep, with an average speedup of about `1.44x` and a best observed speedup of around `1.97x`. A few shapes remain close, which is expected because the best backend depends on the exact `(M, N, K)` tile geometry and reduction balance.\n\nFigure 7 shows the fastest backend for each shape directly. This is useful as a sanity check because it answers a simpler question: which path wins this shape?\n\n*Figure 7: K = 7168 fastest-backend table.*\n\nFor readers who want to inspect the raw latency trend, Figure 8 keeps the original backend-by-backend comparison. Each panel fixes `N`, while the x-axis sweeps `M` from `1` to `128`. These curves are useful because they show where `AITER FLYDSL` wins cleanly and where `AITER ASM` remains close.\n\n*Figure 8: K = 7168 latency curves.*\n\nThe same benchmark data also includes an additional BF16 model-shape sweep beyond the regular `K = 7168` grid. 
These shapes cover projection sizes such as `N = 128`, `640`, `2112`, `2880`, and `5120`, with `K = 2048`, `2880`, `4096`, or `7168` and both bias and no-bias cases. Figure 9 keeps the same visual convention, but groups rows by `(N, K, bias)` so the reader can compare families rather than isolated points.\n\n*Figure 9: BF16 model-shape speedup table.*\n\nAcross these 48 additional model-shape GEMMs, `AITER FLYDSL` improves the average latency against the fastest of `HipblasLT`, `AITER Triton`, and `AITER ASM` by about `1.49x`. For `M <= 8`, the average speedup is about `1.60x`, with the best observed shape reaching about `2.34x`.\n\nFigure 10 shows the fastest backend for the same BF16 model-shape sweep, making it easier to see which path wins each grouped shape.\n\n*Figure 10: BF16 model-shape fastest-backend table.*\n\nFor raw latency comparison, Figure 11 fixes one model-shape family per panel and sweeps `M` from `1` to `128`.\n\n*Figure 11: BF16 model-shape latency curves.*\n\n## Summary\n\nIn this blog, you explored why decode-time GEMMs are a critical path for interactive LLM serving, especially when small `M`, large `K`, and repeated model shapes leave conventional GEMM tiling short of parallel work.\n\nYou also saw how LDS-Pipelined Split-K addresses that gap with three cooperating layers: inter-CTA Split-K for more global work, intra-CTA K-slice splitting for better warp-group utilization, and a multi-stage LDS pipeline that keeps K blocks moving through memory and compute. The single-launch signal/semaphore protocol ties these layers together without auxiliary kernels.\n\nOn the MI355X GPU, the generated FlyDSL kernel family delivers about `1.64x` improvement over the fastest of the `HipblasLT`, `AITER Triton`, and `AITER ASM` baselines on the `K = 7168` decode grid, and about `1.49x` on the broader BF16 model-shape sweep. 
These results show why decode-heavy inference stacks benefit from shape-specialized kernels that treat `M`, `K`, and `N` as first-class tuning dimensions.\n\nLDS-Pipelined Split-K is available as part of AITER. In future work, the team plans to extend this FlyDSL kernel family to additional model architectures, quantized dtypes, and mixed precision paths, and to share more practical guidance for tuning low-latency inference kernels on AMD GPUs.\n\n## Disclaimers\n\nThe information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. Any computer system has risks of security vulnerabilities that cannot be completely prevented or mitigated. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes. THIS INFORMATION IS PROVIDED “AS IS.” AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY RELIANCE, DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. 
AMD, the AMD Arrow logo, ROCm, Instinct, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies. © 2026 Advanced Micro Devices, Inc. All rights reserved", "url": "https://wpnews.pro/news/accelerating-llm-inference-on-amd-gpus-with-low-latency-gemms", "canonical_source": "https://rocm.blogs.amd.com/software-tools-optimization/accelerating-llm-inference-on-amd-gpus-with-low-latency-gemms/README.html", "published_at": "2026-06-30 19:03:29+00:00", "updated_at": "2026-06-30 19:20:40.834364+00:00", "lang": "en", "topics": ["large-language-models", "ai-infrastructure", "ai-chips"], "entities": ["AMD", "ROCm", "AITER", "FlyDSL", "HipblasLT", "Triton"], "alternates": {"html": "https://wpnews.pro/news/accelerating-llm-inference-on-amd-gpus-with-low-latency-gemms", "markdown": "https://wpnews.pro/news/accelerating-llm-inference-on-amd-gpus-with-low-latency-gemms.md", "text": "https://wpnews.pro/news/accelerating-llm-inference-on-amd-gpus-with-low-latency-gemms.txt", "jsonld": "https://wpnews.pro/news/accelerating-llm-inference-on-amd-gpus-with-low-latency-gemms.jsonld"}}