{"slug": "fused-kernels-in-llms-reducing-memory-bandwidth-bottlenecks-through-gpu-kernel", "title": "Fused Kernels in LLMs: Reducing Memory Bandwidth Bottlenecks Through GPU Kernel Fusion", "summary": "Shrijith Venkatramana, developer of git-lrc, explains how kernel fusion reduces memory bandwidth bottlenecks in LLM inference. By combining multiple GPU operations into a single kernel, intermediate data movement is minimized, significantly improving throughput. Examples include fused bias-add and activation, layer normalization, and FlashAttention, which avoids materializing large attention matrices.", "body_md": "*Hello, I'm Shrijith Venkatramana. I'm building git-lrc, an AI code reviewer that runs on every commit. Star Us to help devs discover the project. Do give it a try and share your feedback for improving the product.*\n\nEvery few months, a new LLM appears claiming to be **2× faster**, **3× cheaper**, or capable of serving **millions more tokens per second**.\n\nMany developers assume the gains come from better GPUs or smaller models.\n\nOften, the real answer is far less glamorous:\n\n**Someone removed a few trips to memory.**\n\nOne of the most important performance techniques in modern LLM inference is **kernel fusion**. It doesn't change the model architecture. It doesn't improve accuracy. 
It doesn't make the AI smarter.\n\nIt simply makes the hardware spend less time waiting and more time computing.\n\nAnd in large-scale AI systems, that can mean the difference between serving thousands of users and serving millions.\n\nLet's dig into how fused kernels work, starting from intuition and moving down to GPU-level details.\n\nWhen developers first think about neural network performance, they usually focus on FLOPS.\n\nModern GPUs advertise enormous peak compute numbers.\n\nYet many LLM operations don't come close to using that compute capacity.\n\nThe reason is that a GPU spends a surprising amount of time moving data around.\n\nImagine a simple operation:\n\n```\ny = gelu(x + bias)\n```\n\nConceptually this is tiny.\n\nBut naively, the GPU may read `x` and `bias` from global memory, write the sum back, read it again to apply GELU, and write the result once more.\n\nThe arithmetic is cheap.\n\nThe memory traffic is expensive.\n\nAs models grow into billions of parameters, memory movement becomes one of the dominant costs.\n\nBefore understanding fusion, we need to understand kernels.\n\nA GPU kernel is essentially a program launched on the GPU.\n\nFor example:\n\n```\nz = x + y\n```\n\nmight launch one kernel.\n\nThen:\n\n```\noutput = relu(z)\n```\n\nmight launch another.\n\nThen:\n\n```\noutput = output * scale\n```\n\nmight launch a third.\n\nEach kernel launch has overhead, and between launches the GPU repeatedly moves intermediate results between global memory and compute units.\n\nThose extra movements add up quickly.\n\nKernel fusion combines multiple operations into a single GPU kernel.\n\nInstead of:\n\n```\nz = x + bias\na = gelu(z)\noutput = a * scale\n```\n\nwe create one fused operation:\n\n```\noutput = scale * gelu(x + bias)\n```\n\nNow the GPU can load `x` and `bias` once, apply the add, GELU, and scale while the values stay on-chip, and write only `output` back to memory.\n\nNo intermediate tensors are stored in global memory.\n\nVisually:\n\n**Without fusion**\n\n```\nMemory → Add\n          ↓\n       Memory\n          ↓\n        GELU\n          ↓\n       Memory\n          ↓\n       Scale\n          ↓\n       Memory\n```\n\n**With fusion**\n\n```\nMemory → Add → GELU → Scale → 
Memory\n```\n\nThe computation is identical.\n\nThe data movement is dramatically reduced.\n\nModern transformers contain many opportunities for fusion.\n\nA few common examples:\n\nInstead of running these as separate steps:\n\n```\nhidden = linear(x)\nhidden += bias\nhidden = gelu(hidden)\n```\n\nthe bias addition and activation are fused.\n\nThis is common in transformer MLP blocks.\n\nLayer normalization requires computing a mean and a variance, normalizing, and then applying a learned scale and shift.\n\nNaively these can involve multiple passes through memory.\n\nOptimized kernels perform much of the work in one fused operation.\n\nAttention layers require softmax:\n\n```\nsoftmax(QK^T)\n```\n\nImplementations often fuse the steps around the softmax, such as scaling and masking, into a single kernel.\n\nThis reduces memory traffic significantly.\n\nOne of the best-known examples of fusion is Tri Dao's FlashAttention.\n\nThe traditional attention pipeline looks roughly like:\n\n```\nQK^T\n ↓\nStore matrix\n ↓\nMask\n ↓\nStore matrix\n ↓\nSoftmax\n ↓\nStore matrix\n ↓\nMultiply by V\n```\n\nThe intermediate attention matrix can be enormous.\n\nFor long contexts it becomes a major bottleneck.\n\nFlashAttention reorganizes the computation so that large intermediate matrices never need to be materialized in global memory.\n\nInstead, it processes Q, K, and V in tiles that fit in on-chip memory and computes the softmax incrementally, block by block.\n\nThe result is dramatically lower memory usage and substantially higher throughput.\n\nThis single optimization helped unlock much longer context windows for modern LLMs.\n\nLet's go one level deeper.\n\nModern GPUs have a memory hierarchy:\n\n```\nGlobal Memory (HBM)\n        ↓\nL2 Cache\n        ↓\nShared Memory\n        ↓\nRegisters\n```\n\nGlobal memory is large but relatively slow.\n\nRegisters are extremely fast but tiny.\n\nFusion attempts to keep intermediate values as close to registers as possible.\n\nInstead of:\n\n```\nCompute\n ↓\nWrite to HBM\n ↓\nRead from HBM\n ↓\nCompute\n```\n\nwe get:\n\n```\nCompute\n ↓\nRegister\n ↓\nCompute\n ↓\nRegister\n ↓\nCompute\n```\n\nThis drastically increases arithmetic intensity:\n\n```\nUseful Computation\n------------------\nBytes Moved\n```\n\nHigher arithmetic 
intensity generally means better GPU utilization.\n\nThis is why fusion often produces large speedups even when the number of mathematical operations stays exactly the same.\n\nIf fusion is so beneficial, why not fuse everything?\n\nBecause fusion introduces complexity.\n\nSeveral challenges emerge:\n\nEvery intermediate value consumes registers.\n\nUsing too many registers per thread reduces occupancy, which leaves fewer threads available to hide memory latency.\n\nA fused kernel may contain dozens of operations.\n\nGenerating optimal GPU code becomes difficult.\n\nA kernel optimized for one GPU architecture may require different strategies on another.\n\nInstead of debugging:\n\n```\nAdd\nGELU\nMultiply\n```\n\nyou debug:\n\n```\nFusedAddGeluMultiplyLayerNormKernel_v7\n```\n\nwhich is considerably less pleasant.\n\nThis is one reason compilers and kernel-generation tools have become increasingly important.\n\nThey help automate kernel generation and fusion.\n\nFused kernels are one of those optimizations that seem almost boring at first glance.\n\nNo new model architecture.\n\nNo breakthrough algorithm.\n\nNo clever prompting technique.\n\nYet they are responsible for a significant portion of the performance gains that make modern LLM systems practical.\n\nThe key insight is simple:\n\n**In large-scale AI systems, moving data is often more expensive than computing on it.**\n\nKernel fusion reduces unnecessary memory traffic, keeps data closer to the GPU's compute units, and allows the hardware to spend more time doing useful work.\n\nThe next time you hear that a new LLM stack is dramatically faster, don't just ask about quantization, caching, or model architecture.\n\nAsk:\n\n**How much of that speedup came from fused kernels?**\n\n**Question for readers:** Have you ever profiled an ML workload and discovered that memory movement, not computation, was the real bottleneck?\n\n*AI agents write code fast. They also silently remove logic, change behavior, and introduce bugs -- without telling you. You often find out in production.\n\ngit-lrc fixes this. 
It hooks into git commit and reviews every diff before it lands. 60-second setup. Completely free.*\n\nAny feedback or contributors are welcome! It's online, source-available, and ready for anyone to use.", "url": "https://wpnews.pro/news/fused-kernels-in-llms-reducing-memory-bandwidth-bottlenecks-through-gpu-kernel", "canonical_source": "https://dev.to/shrsv/fused-kernels-in-llms-reducing-memory-bandwidth-bottlenecks-through-gpu-kernel-fusion-4fkm", "published_at": "2026-06-15 18:15:42+00:00", "updated_at": "2026-06-15 18:36:52.557605+00:00", "lang": "en", "topics": ["large-language-models", "ai-infrastructure", "ai-research"], "entities": ["Shrijith Venkatramana", "git-lrc", "Tri Dao", "FlashAttention", "GPU"], "alternates": {"html": "https://wpnews.pro/news/fused-kernels-in-llms-reducing-memory-bandwidth-bottlenecks-through-gpu-kernel", "markdown": "https://wpnews.pro/news/fused-kernels-in-llms-reducing-memory-bandwidth-bottlenecks-through-gpu-kernel.md", "text": "https://wpnews.pro/news/fused-kernels-in-llms-reducing-memory-bandwidth-bottlenecks-through-gpu-kernel.txt", "jsonld": "https://wpnews.pro/news/fused-kernels-in-llms-reducing-memory-bandwidth-bottlenecks-through-gpu-kernel.jsonld"}}