# Fused Kernels in LLMs: Reducing Memory Bandwidth Bottlenecks Through GPU Kernel Fusion

> Source: <https://dev.to/shrsv/fused-kernels-in-llms-reducing-memory-bandwidth-bottlenecks-through-gpu-kernel-fusion-4fkm>
> Published: 2026-06-15 18:15:42+00:00

*Hello, I'm Shrijith Venkatramana. I'm building git-lrc, an AI code reviewer that runs on every commit. Star Us to help devs discover the project. Do give it a try and share your feedback for improving the product.*

Every few months, a new LLM appears claiming to be **2× faster**, **3× cheaper**, or capable of serving **millions more tokens per second**.

Many developers assume the gains come from better GPUs or smaller models.

Often, the real answer is far less glamorous:

**Someone removed a few trips to memory.**

One of the most important performance techniques in modern LLM inference is **kernel fusion**. It doesn't change the model architecture. It doesn't improve accuracy. It doesn't make the AI smarter.

It simply makes the hardware spend less time waiting and more time computing.

And in large-scale AI systems, that can mean the difference between serving thousands of users and serving millions.

Let's dig into how fused kernels work, starting from intuition and moving down to GPU-level details.

When developers first think about neural network performance, they usually focus on FLOPS.

Modern GPUs advertise enormous peak throughput figures, yet many LLM operations don't come close to using that compute capacity.

The reason is that a GPU spends a surprising amount of time moving data around.

Imagine a simple operation:

```
y = gelu(x + bias)
```

Conceptually this is tiny.

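But naively, the GPU may read `x` and `bias` from global memory, write the sum back out, read that sum in again to apply GELU, and then write the final result.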

The arithmetic is cheap.

The memory traffic is expensive.

As models grow into billions of parameters, memory movement becomes one of the dominant costs.
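To put rough numbers on this, the sketch below counts global-memory bytes for `y = gelu(x + bias)` run as two separate kernels versus one combined kernel. The tensor shape, the 2-byte element size, and the assumption that nothing stays in cache between kernels are illustrative choices, not measurements.

```python
def elementwise_traffic_bytes(rows, cols, bytes_per_elem=2):
    """Estimate global-memory traffic for y = gelu(x + bias).

    Assumptions (illustrative): x is rows x cols, bias has cols elements,
    and no intermediate survives in cache between kernel launches.
    """
    n = rows * cols * bytes_per_elem   # bytes in one full tensor
    b = cols * bytes_per_elem          # bytes in the bias vector

    # Two kernels: the first reads x and bias and writes z,
    # the second reads z and writes y.
    separate = (n + b + n) + (n + n)

    # One kernel: read x and bias once, write y once.
    combined = n + b + n
    return separate, combined


separate, combined = elementwise_traffic_bytes(rows=4096, cols=4096)
print(f"separate: {separate / 1e6:.1f} MB, combined: {combined / 1e6:.1f} MB")
print(f"traffic ratio: {separate / combined:.2f}x")
```

The arithmetic is the same in both cases; only the number of round trips to memory changes.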

Before understanding fusion, we need to understand kernels.

A GPU kernel is essentially a program launched on the GPU.

For example:

```
z = x + y
```

might launch one kernel.

Then:

```
output = relu(z)
```

might launch another.

Then:

```
output = output * scale
```

might launch a third.

Each kernel launch has overhead, and between launches the GPU repeatedly moves intermediate results between global memory and compute units.

Those extra movements add up quickly.
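The three launches above can be modeled in plain Python. This is a toy simulation, not GPU code: `GlobalMemory`, `launch`, and the kernel functions are hypothetical names invented here, and the counters simply tally element-level loads and stores.

```python
class GlobalMemory:
    """Toy stand-in for GPU global memory that counts element accesses."""

    def __init__(self, buffers):
        self.buffers = buffers
        self.reads = 0
        self.writes = 0

    def load(self, name, i):
        self.reads += 1
        return self.buffers[name][i]

    def store(self, name, i, value):
        self.writes += 1
        self.buffers[name][i] = value


def launch(kernel, n_threads, mem):
    """Model a kernel launch: run the kernel body once per thread index."""
    for tid in range(n_threads):
        kernel(tid, mem)


def add_kernel(tid, mem):      # z = x + y
    mem.store("z", tid, mem.load("x", tid) + mem.load("y", tid))


def relu_kernel(tid, mem):     # output = relu(z)
    mem.store("output", tid, max(0.0, mem.load("z", tid)))


def scale_kernel(tid, mem):    # output = output * scale
    mem.store("output", tid, mem.load("output", tid) * mem.load("scale", 0))


n = 4
mem = GlobalMemory({
    "x": [1.0, -2.0, 3.0, -4.0],
    "y": [0.5, 0.5, 0.5, 0.5],
    "scale": [2.0],
    "z": [0.0] * n,
    "output": [0.0] * n,
})
for kernel in (add_kernel, relu_kernel, scale_kernel):
    launch(kernel, n, mem)

print(mem.buffers["output"], "reads:", mem.reads, "writes:", mem.writes)
```

Every value in `z` and `output` is stored by one kernel only to be loaded again by the next.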

Kernel fusion combines multiple operations into a single GPU kernel.

Instead of:

```
z = x + bias
a = gelu(z)
output = a * scale
```

we create one fused operation:

```
output = scale * gelu(x + bias)
```

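Now the GPU can read each input once, apply all three operations while the value stays on-chip, and write only the final result.

No intermediate tensors are stored in global memory.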

Visually:

**Without fusion**

```
Memory → Add
          ↓
       Memory
          ↓
        GELU
          ↓
       Memory
          ↓
       Scale
          ↓
       Memory
```

**With fusion**

```
Memory → Add → GELU → Scale → Memory
```

The computation is identical.

The data movement is dramatically reduced.
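A minimal Python sketch of the two diagrams, assuming an elementwise `bias` of the same length as `x` and the exact (erf-based) GELU. The access counts are tallied by hand per element, and local variables stand in for registers; none of this is real GPU code.

```python
import math


def gelu(v):
    # Exact GELU via the error function.
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def unfused(x, bias, scale):
    """Three passes; each one stands in for a kernel that reads its inputs
    from global memory and writes its output back."""
    n = len(x)
    z = [xi + bi for xi, bi in zip(x, bias)]   # reads x, bias; writes z
    a = [gelu(zi) for zi in z]                 # reads z; writes a
    out = [scale * ai for ai in a]             # reads a; writes out
    accesses = 3 * n + 2 * n + 2 * n
    return out, accesses


def fused(x, bias, scale):
    """One pass; z and a exist only as locals, standing in for registers."""
    out = []
    for xi, bi in zip(x, bias):
        z = xi + bi
        a = gelu(z)
        out.append(scale * a)
    accesses = 3 * len(x)                      # reads x, bias; writes out
    return out, accesses


x = [-1.0, 0.0, 0.5, 2.0]
bias = [0.1, 0.1, 0.1, 0.1]
out_u, acc_u = unfused(x, bias, scale=2.0)
out_f, acc_f = fused(x, bias, scale=2.0)
print("same result:", out_u == out_f, "| accesses:", acc_u, "vs", acc_f)
```

Both functions perform the same floating-point operations in the same order, so the outputs match exactly while the fused version touches memory less than half as often.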

Modern transformers contain many opportunities for fusion.

A few common examples:

Instead of:

```
hidden = linear(x)
hidden += bias
hidden = gelu(hidden)
```

The bias addition and activation are fused.

This is common in transformer MLP blocks.
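One form this can take is applying the bias and activation to each matrix-multiply accumulator before it is stored. The sketch below is a plain-Python illustration of that idea under assumed shapes (lists of rows); `linear_bias_gelu` is a hypothetical name, not a library function.

```python
import math


def gelu(v):
    # Exact GELU via the error function.
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def linear_bias_gelu(x, weight, bias):
    """Compute gelu(x @ weight + bias), applying the bias add and the
    activation to each accumulator before it is stored.

    x:      list of rows, batch x in_features
    weight: list of rows, in_features x out_features
    bias:   list of out_features values
    """
    out = []
    for row in x:
        out_row = []
        for j in range(len(bias)):
            acc = 0.0
            for k, xk in enumerate(row):
                acc += xk * weight[k][j]
            # The pre-activation value never leaves this local variable.
            out_row.append(gelu(acc + bias[j]))
        out.append(out_row)
    return out


x = [[1.0, 2.0]]
weight = [[1.0, 0.0],
          [0.0, 1.0]]
bias = [0.0, -2.0]
hidden = linear_bias_gelu(x, weight, bias)
print(hidden)
```

The matrix product and the pre-activation values are never written out as separate tensors.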

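Layer normalization requires computing the mean and variance of each row, normalizing, and then applying a learned scale and shift.

Naively, these steps can involve multiple passes through memory.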

Optimized kernels perform much of the work in one fused operation.
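As a rough illustration, the function below normalizes one row in two sweeps: one to gather the mean and variance together (using Welford's algorithm, which is a choice made for this sketch), and one that normalizes, scales, and shifts in a single step. The names `gamma`, `beta`, and `eps` follow common convention and are assumptions here.

```python
import math


def layer_norm_row(row, gamma, beta, eps=1e-5):
    """Layer-normalize one row with statistics gathered in a single sweep."""
    mean, m2 = 0.0, 0.0
    for count, v in enumerate(row, start=1):
        delta = v - mean
        mean += delta / count
        m2 += delta * (v - mean)        # running sum of squared deviations
    var = m2 / len(row)                 # population variance
    inv_std = 1.0 / math.sqrt(var + eps)

    # Normalize, scale, and shift together instead of in three passes.
    return [(v - mean) * inv_std * g + b for v, g, b in zip(row, gamma, beta)]


row = [1.0, 2.0, 3.0, 4.0]
normed = layer_norm_row(row, gamma=[1.0] * 4, beta=[0.0] * 4)
print(normed)
```

On a GPU the row would typically be held in registers or shared memory between the two sweeps, so global memory is read once and written once.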

Attention layers require softmax:

```
softmax(QK^T)
```

Implementations often fuse the scaling, masking, and softmax steps into a single kernel.

This reduces memory traffic significantly.
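A small sketch of what folding scaling and masking into the softmax can look like for one row of attention scores. The function name, the boolean mask convention, and the requirement that at least one position is unmasked are all assumptions of this example.

```python
import math


def fused_scale_mask_softmax(scores, mask, scale):
    """Softmax over one row of scores with scaling and masking folded in,
    so no scaled or masked copy of the row is ever stored.

    mask[i] is True where position i may be attended to.
    Assumes at least one position is unmasked.
    """
    # Sweep 1: max of the scaled, unmasked scores (for numerical stability).
    row_max = max(s * scale for s, keep in zip(scores, mask) if keep)

    # Sweep 2: exponentiate; masked positions contribute exactly zero.
    exps = [math.exp(s * scale - row_max) if keep else 0.0
            for s, keep in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]


probs = fused_scale_mask_softmax(
    scores=[2.0, 4.0, 6.0],
    mask=[True, True, False],
    scale=0.5,
)
print(probs)
```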

One of the best-known examples of fusion is Tri Dao's FlashAttention.

The traditional attention pipeline looks roughly like:

```
QK^T
 ↓
Store matrix
 ↓
Mask
 ↓
Store matrix
 ↓
Softmax
 ↓
Store matrix
 ↓
Multiply by V
```

The intermediate attention matrix can be enormous.

For long contexts it becomes a major bottleneck.

FlashAttention reorganizes the computation so that large intermediate matrices never need to be materialized in global memory.

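Instead, it works through the keys and values in tiles that fit in fast on-chip memory, computes the softmax incrementally as it goes, and writes only the final output back to global memory.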

The result is dramatically lower memory usage and substantially higher throughput.

This single optimization helped unlock much longer context windows for modern LLMs.
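The core trick can be sketched for a single query vector in plain Python. This is a simplified illustration of a tiled, running softmax, not the FlashAttention implementation: it omits the usual score scaling, masking, and batching, and the function names and `tile` parameter are made up for this example.

```python
import math


def attention_reference(q, keys, values):
    """Plain attention for one query: materializes every score first."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    total = sum(w)
    dim = len(values[0])
    return [sum(wi * v[d] for wi, v in zip(w, values)) / total
            for d in range(dim)]


def attention_tiled(q, keys, values, tile=2):
    """Same result, one tile of keys/values at a time, keeping only a
    running max, a running normalizer, and a running output."""
    run_max = float("-inf")
    run_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_tile]
        new_max = max(run_max, max(scores))
        # Rescale everything accumulated under the old max.
        correction = math.exp(run_max - new_max)
        run_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_tile):
            p = math.exp(s - new_max)
            run_sum += p
            acc = [a + p * vd for a, vd in zip(acc, v)]
        run_max = new_max
    return [a / run_sum for a in acc]


q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0], [0.5, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
ref = attention_reference(q, keys, values)
tiled = attention_tiled(q, keys, values, tile=2)
print(ref, tiled)
```

The tiled version never holds more than `tile` scores at once, which is what lets the full attention matrix stay out of global memory.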

Let's go one level deeper.

Modern GPUs have a hierarchy:

```
Global Memory (HBM)
        ↓
L2 Cache
        ↓
Shared Memory
        ↓
Registers
```

Global memory is large but relatively slow.

Registers are extremely fast but tiny.

Fusion attempts to keep intermediate values as close to registers as possible.

Instead of:

```
Compute
 ↓
Write to HBM
 ↓
Read from HBM
 ↓
Compute
```

we get:

```
Compute
 ↓
Register
 ↓
Compute
 ↓
Register
 ↓
Compute
```

This drastically increases arithmetic intensity:

```
Useful Computation
------------------
Bytes Moved
```

Higher arithmetic intensity generally means better GPU utilization.

This is why fusion often produces large speedups even when the number of mathematical operations stays exactly the same.
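A back-of-the-envelope version of that ratio for the `scale * gelu(x + bias)` chain. The per-element FLOP counts (1 for the add, 8 for GELU, 1 for the multiply) and the 2-byte element size are assumptions chosen for illustration.

```python
def arithmetic_intensity(flops, elems_moved, bytes_per_elem=2):
    """FLOPs per byte of global-memory traffic."""
    return flops / (elems_moved * bytes_per_elem)


# Assumed per-element FLOP costs for add, GELU, and multiply (illustrative).
FLOPS_PER_ELEM = 1 + 8 + 1

# Elements moved per output element:
#   three kernels: (2 reads + 1 write) + (1 + 1) + (1 + 1) = 7
#   one kernel:     2 reads + 1 write                      = 3
unfused_ai = arithmetic_intensity(FLOPS_PER_ELEM, elems_moved=7)
fused_ai = arithmetic_intensity(FLOPS_PER_ELEM, elems_moved=3)
print(f"unfused: {unfused_ai:.2f} FLOP/byte, fused: {fused_ai:.2f} FLOP/byte")
```

The numerator is unchanged; the whole improvement comes from shrinking the denominator.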

If fusion is so beneficial, why not fuse everything?

Because fusion introduces complexity.

Several challenges emerge:

Every intermediate value consumes registers.

Too many registers reduce occupancy.
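The trade-off can be estimated with simple division. The defaults below (65,536 registers and a 2,048-thread limit per streaming multiprocessor) are assumed values that match some NVIDIA data-center GPUs; check the specifications for your own hardware. The model also ignores register allocation granularity, shared memory, and block-size limits.

```python
def occupancy(regs_per_thread, regs_per_sm=65536, thread_limit=2048):
    """Fraction of the thread limit that fits on one SM when registers
    are the only constraint (a deliberately simplified model)."""
    resident = min(thread_limit, regs_per_sm // regs_per_thread)
    return resident / thread_limit


for regs in (32, 64, 128):
    print(f"{regs:>3} registers/thread -> {occupancy(regs):.0%} occupancy")
```

Under these assumptions, each doubling of per-thread register use beyond 32 halves the number of threads that can be resident at once.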

A fused kernel may contain dozens of operations.

Generating optimal GPU code becomes difficult.

A kernel optimized for one GPU, tensor shape, or data type may require different strategies on another.

Instead of debugging:

```
Add
GELU
Multiply
```

you debug:

```
FusedAddGeluMultiplyLayerNormKernel_v7
```

which is considerably less pleasant.

This is one reason projects that automate kernel generation and fusion have become increasingly important.

Fused kernels are one of those optimizations that seem almost boring at first glance.

No new model architecture.

No breakthrough algorithm.

No clever prompting technique.

Yet they are responsible for a significant portion of the performance gains that make modern LLM systems practical.

The key insight is simple:

**In large-scale AI systems, moving data is often more expensive than computing on it.**

Kernel fusion reduces unnecessary memory traffic, keeps data closer to the GPU's compute units, and allows the hardware to spend more time doing useful work.

The next time you hear that a new LLM stack is dramatically faster, don't just ask about quantization, caching, or model architecture.

Ask:

**How much of that speedup came from fused kernels?**

**Question for readers:** Have you ever profiled an ML workload and discovered that memory movement—not computation—was the real bottleneck?
