A Fast Attention Kernel for MI300X, Written in Hip, Not Assembly A team of kernel engineers developed a bf16 forward attention kernel for AMD MI300X GPUs using HIP, outperforming AMD's own AITER v3 library by up to 1.26× across various token lengths and rounding modes. The kernel achieves speedups through careful pipeline scheduling, memory placement (K in LDS, V in L1, Q and accumulators in registers), and techniques like 3Q tiling and tail KV split. The open-source code is available on GitHub. A Fast Attention Kernel for MI300X, Written in HIP, Not Assembly A deep dive into building a bf16 forward attention kernel for AMD MI300X in HIP, using instruction level control without dropping into a full assembly codebase. TL;DR - A bf16 forward attention kernel for AMD MI300X, written in HIP rather than full hand assembly. - The core technique is one-instruction asm wrappers: choose exact opcodes while leaving register allocation to the compiler. - The design plans the CDNA3 pipeline directly: eight waves, two groups, and two carefully placed barriers. - Most of the speedup comes from memory placement: K in LDS , V kept hot in L1 , Q and accumulators in registers. - 3Q tiling and the tail KV split turn a near tie into wins across the reported sweep. - Code is open-source under MIT and can be found here https://github.com/moonmath-ai/amd-kernels . - We would like to thank HotAisle https://hotaisle.xyz/ , our AMD cloud, for bare metal access and general support. A bf16 forward attention kernel for AMD MI300X gfx942 , written in HIP rather than assembly, that beats AITER v3 on every shape and every rounding mode. Geomean 1.18×, 1.15×, 1.08× RTNE, RTNA, RTZ , up to 1.26×, across an 8K to 128K sweep, and 1.37× to 1.59× versus Modular MAX. 1. What we built, and the idea behind it Attention is the fused softmax QKᵀ/√d ·V operation at the core of every transformer. As part of our kernel engineering work, we built a full BF16 forward-attention kernel for AMD MI300X CDNA3, gfx942 . In our benchmarks, it outperformed the best existing AMD implementation on MI300X. This write-up provides the technical details, lessons, and some techniques we found useful and can be reused. AITER https://github.com/ROCm/aiter is AMD's own optimized kernel library, and its v3 forward attention is written by hand in GCN assembly. Against it, our kernel wins on every shape and every rounding mode across an 8K to 128K token sweep. The geomean is 1.18×, 1.15×, 1.08× for RTNE, RTNA, and RTZ rounding, and 1.37× to 1.59× against Modular MAX. RTNE and RTNA lead by 1.11× to 1.26×. RTZ is AITER's own fastest mode, the tightest race, and there we win by a slimmer 1.03× to 1.11×. For a rigidly structured computation like attention, the way to win is to make the scheduling decisions yourself instead of leaving them to the compiler. The compiler's job is to be a good generalist, and a fused attention kernel is not a general program. It is a pipeline whose exact shape we already know, down to which instruction should issue when, what should be in flight, and where each barrier goes. Almost every section here takes one of those decisions in hand. There is one honest exception, in §3.1, where the smarter move is the opposite: hand the compiler an easier problem and let it do the overlap itself. We flag it when we get there. This blog post is structured as follows: the toolkit §2 is the handful of barely documented techniques that make instruction level control possible from HIP at all. The architecture §3 covers why eight waves, how the two halves overlap, and which matrix core instruction we use and why. The memory hierarchy §4 is what lives in registers, LDS , and L1 , and the lengths we go to to keep it there. The two advanced wins §5 turn a near tie into a clean sweep. The full results are in §6. 2. The toolkit 2.1 Templated asm functions Everything that follows rests on one technique, and it is the one we found least documented anywhere, so we start here. Every HIP kernel author hits the same dilemma when they want a specific machine instruction. Compiler intrinsics builtin amdgcn keep you in tidy C++, but the compiler then feels free to help in ways you did not ask for. It may rename your accumulator into a fresh register and add a v mov to copy it back, re-zero an operand you know is already zero, or reorder the instruction against its neighbours. Drop to raw inline assembly and you get exactly the instruction you asked for, but AMD's own guidance is blunt about the price: inline asm is "not recommended, since the compiler does not look at the semantics of the inlined instructions, and may not take care of data hazards" AMD matrix cores lab note https://gpuopen.com/learn/amd-lab-notes/amd-lab-notes-matrix-cores-readme/ . And now you are managing registers and addresses by hand too. The way through is a small pattern. Wrap exactly one instruction in a device forceinline function and describe its operands with GCC or Clang extended asm constraints. You dictate the opcode, and the compiler still allocates the registers and tracks the data flow. 1 Here is the MFMA we issue hundreds of thousands of times: // in/out tied to the SAME VGPR → no accumulator rename, no v mov copy. device forceinline void asm mfma bf16x4 t a, bf16x4 t b, fp32x4 t& c { asm volatile "v mfma f32 16x16x16 bf16 %0, %1, %2, %0" : "+v" c : "v" a , "v" b ; } The whole trick is in the constraint string. "+v" c says that c is a read-write VGPR operand, so the accumulator's input and output land in the same register and the compiler emits no copy. "v" a and "v" b are VGPR inputs it places wherever it likes. A sibling, asm mfma zero , feeds a literal 0 as the C operand and declares its result "=&v" , write only and early clobber, to start a fresh accumulator chain without even the zeroing v mov . The same pattern folds compile time constants into the instruction. Our paired LDS read takes its two offsets as template parameters and emits them as literals: template