{"slug": "why-is-pytorch-compile-so-fast-kernel-fusion", "title": "Why Is PyTorch Compile So Fast: Kernel Fusion", "summary": "PyTorch's Inductor compiler uses kernel fusion to accelerate model execution by up to 10x, grouping dependent operations into single Triton kernels to reduce memory traffic and kernel launch overhead. In a pointwise fusion example, the compiler cut memory operations from eight to four by eliminating intermediate buffers and keeping data in GPU registers. This technique, including reduction and GEMM+epilogue fusion, addresses the two primary slowdowns in GPU computation: data movement and kernel startup costs.", "body_md": "When you use PyTorch’s compiler, your model runs faster, up to 10x faster. But what’s actually happening? Without compilation, the GPU runs a separate kernel (a function that executes on the GPU) for each torch operation in your code. This creates two big slowdowns: the time spent moving data in memory, and the overhead of starting each new kernel. Every time the GPU launches a kernel, it pays an overhead cost, and every intermediate result means writing to and reading from memory.\n\nThis is where fusion comes in. PyTorch’s Inductor compiler automatically groups dependent operations together into single, efficient Triton kernels. This keeps intermediate data in registers, the fastest memory on the GPU, and cuts down on kernel launch overhead. In this article, we’ll look at a concrete example of fusion, and then outline topics for further reading. You’ll see exactly how `torch.compile` transforms your PyTorch operations into optimized GPU code.\n\nTo get the most out of this article, you should have basic familiarity with PyTorch and a general understanding of GPU programming concepts.\n\n## What Is Vertical Fusion?\n\nThink of vertical fusion as a way to “link” steps, so the output of one goes straight into the next. 
It’s called “vertical” because if you picture the computation graph, these operations stack vertically – each one depends on the result of the previous step.\n\nThis is the most common fusion pattern in deep learning because neural networks are chains of operations: normalization, then linear layers, then activation functions, and so on. The big win is eliminating intermediate results – those temporary tensors never need to be written to or read from global memory. They stay in fast registers where the GPU can reach them more quickly.\n\nLet’s dive into an example of vertical fusion, namely pointwise fusion.\n\n## Pointwise Fusion Example\n\nPointwise operations are simple math operations that work on each element independently: addition, multiplication, activation functions, and more. Let’s look at a pattern you might see in a neural network layer:\n\n*Pointwise PyTorch Example*\n\n### Unfused: Three Separate Kernels\n\nWithout fusion, Inductor creates three separate Triton kernels. Don’t worry if the Triton syntax looks intimidating. The important part isn’t memorizing the syntax, but understanding the pattern: each kernel loads data, does one operation, and writes the result.\n\n*Kernel 1: Multiply*\n\nFor succinctness, we include just the signatures of the next kernels, as they are nearly identical; see our [Git Repository](https://gist.github.com/morrison-turnansky/0cc51b498c674aa23d4718ae200e6209) for the full source code.\n\n*Kernel 2: Add*\n\n*Kernel 3: Sigmoid*\n\nAcross the three kernels you’re performing eight memory operations: reading the two inputs for multiply, reading multiply’s result and the bias for add, reading add’s result for sigmoid, and writing all three results. That’s a lot of memory traffic.\n\n### Fused: One Kernel\n\nWith fusion, `torch.compile` creates a single kernel:\n\n*Kernel 4: Fused*\n\nNotice the difference: we load all inputs once, do all three operations in a row, and store only the final result. 
The intermediate values (`tmp2` and `tmp4`) stay in registers – the fastest memory on the GPU. They never touch the slower global memory.\n\n### Benefits\n\n**Kernel launches**: 3 reduced to 1\n\n**Intermediate buffers**: 2 eliminated (multiply result and add result)\n\n**Memory bandwidth**: Reading 5 full tensors and writing 3 full tensors (8 memory operations) reduced to reading 3 tensors and writing 1 (4 memory operations) – a 50% reduction in memory traffic\n\n## Other Fusion Types\n\nPointwise fusion is just one type of vertical fusion. Inductor uses other forms of vertical fusion to keep your GPU efficient:\n\n**Reduction Fusion**: Combines reducing operations like max, mean, or sum, with the operations that happen before and after them. This is critical for operations like batch normalization.\n\n**GEMM + Epilogue Fusion**: Attaches simple math to the end of heavy matrix calculations. Instead of doing a matrix multiply, writing the result to memory, then reading it back to add bias and apply ReLU, the bias and activation happen right after the multiply in the same kernel.\n\n**Prologue Fusion**: The opposite of epilogue – preprocessing happens as data loads. For instance, normalizing input before matrix multiplication can happen on-the-fly as the data comes in.\n\nIn addition to vertical fusion, the most prominent type of fusion, Inductor also uses horizontal fusion.\n\n**Horizontal Fusion**: Runs multiple independent operations on the same input at once. 
For example, computing both `sin(x)` and `cos(x)` in a single kernel, loading `x` only once instead of twice.\n\n## Get Started: See Fusion in Your Own Code\n\nLet’s walk through a complete example using a reduction pattern.\n\n### Step 1: Create a Simple Reduction Example\n\nCreate a file called `fusion_example.py`:\n\n### Step 2: View the Generated Code\n\nRun your script with the `TORCH_LOGS` environment variable to see what Inductor generated:\n\nThis outputs the generated Triton kernels to your terminal. Look for a kernel named something like `triton_per_fused_add_mul_sum_0`. The `per` prefix marks a persistent reduction kernel (one that loads the whole reduction dimension at once rather than looping over it), and the rest of the name tells you that add, mul, and sum were all fused together.\n\n## Conclusion\n\nFusion is one of the most important optimizations that `torch.compile` does. By linking dependent operations into single kernels, it cuts down memory traffic and kernel overhead – often the main slowdowns in GPU work.\n\nTry accelerating your own code with `torch.compile`. There is no need to change your implementation: just add the `@torch.compile` decorator and let the compiler do the work.\n\n**Learn more**: PyTorch documentation at [pytorch.org/docs/stable/torch.compiler.html](http://pytorch.org/docs/stable/torch.compiler.html) has complete guides on compilation and optimization strategies. 
Reference our [Git Repository](https://gist.github.com/morrison-turnansky/0cc51b498c674aa23d4718ae200e6209) for the full source code.", "url": "https://wpnews.pro/news/why-is-pytorch-compile-so-fast-kernel-fusion", "canonical_source": "https://pytorch.org/blog/why-is-pytorch-compile-so-fast-kernel-fusion/", "published_at": "2026-05-27 19:09:42+00:00", "updated_at": "2026-05-27 19:14:37.869452+00:00", "lang": "en", "topics": ["machine-learning", "neural-networks", "ai-infrastructure", "ai-tools"], "entities": ["PyTorch", "Inductor", "Triton"], "alternates": {"html": "https://wpnews.pro/news/why-is-pytorch-compile-so-fast-kernel-fusion", "markdown": "https://wpnews.pro/news/why-is-pytorch-compile-so-fast-kernel-fusion.md", "text": "https://wpnews.pro/news/why-is-pytorch-compile-so-fast-kernel-fusion.txt", "jsonld": "https://wpnews.pro/news/why-is-pytorch-compile-so-fast-kernel-fusion.jsonld"}}