# Meet mKernel: A Multi-GPU, Multi-Node Fused Kernel Library for GPU-Driven Communication

> Source: <https://www.marktechpost.com/2026/05/29/meet-mkernel-a-multi-gpu-multi-node-fused-kernel-library-for-gpu-driven-communication/>
> Published: 2026-05-29 08:43:32+00:00

GPU communication overhead is a measurable bottleneck in production AI workloads. According to data cited by the mKernel project, communication can consume **43.6% of the forward pass and 32% of end-to-end training time**. Across popular Mixture-of-Experts (MoE) models, inter-device communication can account for **up to 47% of total execution time**. Researchers from UC Berkeley’s UCCL project have released mKernel, a library of persistent CUDA kernels that fuse intra-node NVLink communication, inter-node RDMA, and compute into a single kernel.
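As a rough reading of those percentages, the sketch below computes an Amdahl-style upper bound on the speedup available if the cited communication share were hidden completely behind compute. It assumes perfect overlap with no slowdown of the compute itself, which no real system achieves, so the results are ceilings rather than predictions; the function name is illustrative and not part of mKernel.

```python
def max_speedup_if_hidden(comm_fraction: float) -> float:
    """Upper bound on speedup when `comm_fraction` of runtime is fully hidden."""
    if not 0.0 <= comm_fraction < 1.0:
        raise ValueError("comm_fraction must be in [0, 1)")
    return 1.0 / (1.0 - comm_fraction)


# Communication shares quoted in the article.
cited = {
    "forward pass": 0.436,
    "end-to-end training": 0.32,
    "MoE models (upper end)": 0.47,
}

bounds = {name: max_speedup_if_hidden(frac) for name, frac in cited.items()}

for name, bound in bounds.items():
    print(f"{name}: at most {bound:.2f}x")
```
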

**The Problem: Host-Driven Communication**

The standard model for multi-GPU communication is **host-driven**: the CPU runs the control path and calls into a library like NCCL or NVSHMEM. The library issues the collective operation — an AllReduce, an AllGather, etc. — across GPUs. Compute and communication run on separate CUDA streams and overlap at kernel boundaries.

**The research team identifies two problems with this approach**:

(1) CPUs are not scaling with GPU compute. A GB300 NVL72 rack integrates 72 Blackwell Ultra GPUs and 36 Grace CPUs, delivering 720 PFLOP/s of FP8/FP6 and 1.44 EFLOP/s of FP4 Tensor Core performance, plus 130 TB/s of all-to-all intra-rack NVLink bandwidth. At those speeds, microsecond-scale host orchestration overhead (a `cudaLaunchKernel` call, a CPU-side “all writes done” check, an inter-stream event) shows up directly as **pipeline bubbles**.

(2) Host-driven systems overlap compute and communication at coarse kernel boundaries. Finer-grained overlap at the tile or chunk level is not possible from the host side.
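To put problem (1) in numbers, the following back-of-the-envelope sketch converts the rack-level FP8 figure quoted above into the peak compute forgone while a GPU sits idle for a few microseconds. The 5 µs stall is an illustrative assumption, not a measurement from the mKernel team, and peak FLOP/s overstates what real kernels sustain.

```python
import math

RACK_FP8_FLOPS = 720e15  # 720 PFLOP/s, rack-level figure quoted in the article
GPUS_PER_RACK = 72       # Blackwell Ultra GPUs in a GB300 NVL72 rack

# Peak FP8 throughput of one GPU, derived from the rack-level number.
per_gpu_flops = RACK_FP8_FLOPS / GPUS_PER_RACK


def idle_flops(stall_us: float, gpus: int = 1) -> float:
    """Peak FLOPs forgone while `gpus` GPUs wait `stall_us` microseconds."""
    return per_gpu_flops * gpus * stall_us * 1e-6


stall_us = 5.0  # hypothetical host-side stall, chosen only for illustration
print(f"per-GPU peak: {per_gpu_flops:.3e} FLOP/s")
print(f"one GPU idle for {stall_us} us: {idle_flops(stall_us):.3e} FLOPs")
print(f"whole rack idle for {stall_us} us: {idle_flops(stall_us, GPUS_PER_RACK):.3e} FLOPs")
```
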

The alternative is **GPU-driven communication**: the GPU itself triggers transfers, with communication fused into the same kernel as the compute. Most existing fused kernel libraries operate within a single node, or a single GPU. mKernel targets the multi-node case.
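The difference between kernel-boundary overlap and tile-level overlap can be illustrated with a toy timeline model. The sketch below is not mKernel code: it assumes a single dependent compute-then-communicate operation split into equal tiles, one compute resource, one communication channel, and made-up time units. In the coarse case the collective cannot start until the producing kernel has finished; in the fused case each tile is sent as soon as it is computed.

```python
def coarse_time(tiles, t_compute, t_comm, launch_overhead=0.0):
    """Host-driven model: a compute kernel, then a separate collective kernel.

    The collective depends on the full output, so nothing overlaps, and each
    of the two kernels pays the host launch overhead.
    """
    return 2 * launch_overhead + tiles * t_compute + tiles * t_comm


def tile_pipelined_time(tiles, t_compute, t_comm, launch_overhead=0.0):
    """Fused model: one launch; each tile is sent right after it is computed."""
    compute_done = launch_overhead  # time at which the latest tile is computed
    comm_done = launch_overhead     # time at which the channel becomes free
    for _ in range(tiles):
        compute_done += t_compute
        # A tile can be sent once it exists and the channel is idle.
        comm_done = max(comm_done, compute_done) + t_comm
    return comm_done


# Illustrative numbers in arbitrary time units.
coarse = coarse_time(tiles=8, t_compute=10, t_comm=6, launch_overhead=5)
fused = tile_pipelined_time(tiles=8, t_compute=10, t_comm=6, launch_overhead=5)
print(f"coarse: {coarse}, tile-pipelined: {fused}")
```

In this model the fused schedule exposes only the communication of the last tile, which is why finer granularity helps most when per-tile compute time is at least as long as per-tile communication time.
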

**What mKernel Does**

mKernel is a library of **persistent CUDA kernels**. Each kernel fuses intra-node NVLink communication, inter-node RDMA, and dense compute into a single kernel.

**Multi-GPU + multi-node, in one kernel**: Both intra-node NVLink and inter-node RDMA live inside the same persistent kernel.

**Fine-grained intra-kernel overlap**: Compute and communication overlap at tile/chunk granularity, covering both intra-node and inter-node GPU communication.

**Persistent kernel with SM specialization**: CTAs self-assign roles: `compute`, `intra-comm`, `inter-send`, `inter-reduce`. The number of SMs dedicated to each role is tunable per shape.

**GPU-driven networking built on libibverbs**: mKernel uses GPU-initiated RDMA writes without depending on NCCL or NVSHMEM. The communication backend is written from scratch to maximize performance and support heterogeneous networking devices.
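The SM specialization described above can be pictured as each CTA mapping its own index to a role from a per-shape budget. The sketch below simulates that idea on the host in Python; the role names come from the article, but the `assign_role` helper, the index-range scheme, and the 100/16/8/8 split are hypothetical and are not taken from the mKernel source.

```python
from collections import Counter

# Role names as listed in the article; the order is an assumption.
ROLES = ("compute", "intra-comm", "inter-send", "inter-reduce")


def assign_role(block_id: int, sms_per_role: dict) -> str:
    """Map a CTA index to a role by walking cumulative SM budgets."""
    upper = 0
    for role in ROLES:
        upper += sms_per_role[role]
        if block_id < upper:
            return role
    raise ValueError(f"block_id {block_id} exceeds the configured SM budget")


# Hypothetical per-shape tuning for a 132-CTA persistent grid.
split = {"compute": 100, "intra-comm": 16, "inter-send": 8, "inter-reduce": 8}
grid = sum(split.values())

roles = [assign_role(block_id, split) for block_id in range(grid)]
print(Counter(roles))
```

In a real persistent kernel the equivalent decision would be made on the device from the block index, with each role then running its own loop for the lifetime of the kernel.
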

**The Five Fused Kernels**

| Kernel | What it fuses | Description |
|---|---|---|
| AllGather + GEMM | AllGather → GEMM | Each rank holds a shard of `A`. While ranks gather peers’ shards over NVLink/RDMA, the local GEMM consumes tiles as soon as they arrive. |
| GEMM + AllReduce | GEMM → AllReduce | Computes `C = A @ B` and reduces partial outputs across all ranks in one launch. Output tiles are pushed into the reduction tree the instant they’re produced. |
| MoE Dispatch + GEMM | All-to-All dispatch → grouped GEMM | Routes MoE tokens to their expert ranks (intra-node NVLink + inter-node all-to-all) and runs the per-expert grouped GEMM in the same kernel. Tokens are processed as soon as they land, with no staging buffer round-trip. |
| Ring Attention | Ring KV exchange → FlashAttention | Sequence-parallel attention across ranks. Each step rotates a KV chunk around the ring while the local FlashAttention consumes the previously received chunk. Compute and the ring send/recv run concurrently inside a single persistent kernel. |
| GEMM + ReduceScatter | GEMM → ReduceScatter | Computes `C = A @ B` and reduce-scatters the output. Each output tile is reduced and forwarded to its owning rank as soon as it is produced. |
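The "consume tiles as soon as they arrive" idea behind AllGather + GEMM can be sketched in plain Python. This is a single-process simulation, not the library's kernel: it assumes `A` is sharded by rows across ranks (the article does not say which dimension is sharded), and the `arrival_order` argument stands in for the nondeterministic order in which peers' shards land over NVLink or RDMA.

```python
import random


def matmul(a, b):
    """Naive dense matrix product on lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def allgather_gemm(shards, b, arrival_order):
    """Compute C = A @ B, multiplying each row shard of A as soon as it 'arrives'."""
    rows_per_shard = len(shards[0])
    c = [None] * (rows_per_shard * len(shards))
    for rank in arrival_order:
        # Rows of C owned by this shard depend only on this shard and B,
        # so they can be produced without waiting for the full gather.
        start = rank * rows_per_shard
        c[start:start + rows_per_shard] = matmul(shards[rank], b)
    return c


random.seed(0)
world_size, rows, k, n = 4, 2, 3, 2
shards = [[[random.randint(-3, 3) for _ in range(k)] for _ in range(rows)]
          for _ in range(world_size)]
b = [[random.randint(-3, 3) for _ in range(n)] for _ in range(k)]

order = list(range(world_size))
random.shuffle(order)  # shards may arrive in any order

fused_result = allgather_gemm(shards, b, order)
full_a = [row for shard in shards for row in shard]
reference = matmul(full_a, b)
print(fused_result == reference)
```

The result matches the gather-then-multiply reference regardless of arrival order, which is the property that lets the communication and the GEMM proceed concurrently.
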

**Evaluation Setup**

The research team evaluated mKernel on two 2-node × 8-H200 clusters that differ only in their inter-node fabric:

| Testbed | Nodes × GPUs | Intra-node | Inter-node transport | NIC |
|---|---|---|---|---|
| AWS EFA | 2 × 8 H200 | NVLink | AWS EFA / SRD | 16 × 200 Gb/s EFA per node |
| ConnectX-7 | 2 × 8 H200 | NVLink | InfiniBand | 8 × 400 Gb/s NVIDIA ConnectX-7 per node |
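A quick calculation from the table shows that the two testbeds have the same nominal inter-node line rate per node, so the comparison isolates the transport (SRD versus InfiniBand) rather than raw bandwidth. The sketch below uses only the NIC counts and speeds listed above; these are nominal line rates, not achieved throughput.

```python
GPUS_PER_NODE = 8

# (NICs per node, Gb/s per NIC), taken from the testbed table.
testbeds = {
    "AWS EFA": (16, 200),
    "ConnectX-7": (8, 400),
}


def aggregate_gbps(nics: int, gbps_per_nic: int) -> int:
    """Nominal inter-node line rate per node, in Gb/s."""
    return nics * gbps_per_nic


summary = {}
for name, (nics, speed) in testbeds.items():
    total = aggregate_gbps(nics, speed)
    summary[name] = {
        "node_gbps": total,
        "node_gigabytes_per_s": total / 8,   # 8 bits per byte
        "per_gpu_gbps": total / GPUS_PER_NODE,
    }
    print(name, summary[name])
```
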

mKernel was benchmarked against NCCL, Triton-distributed, Flux, Mercury, MagiAttention, Transformer-Engine, and ring-flash-attention. The team notes that further benchmarking at larger scale is still in progress.

**Backends and Requirements**

mKernel supports two networking backends:

| Backend | Macro | Transport | Where it runs |
|---|---|---|---|
| CX7 | `-DINTERNODE_BACKEND_IBVERBS` | libibverbs RC | ConnectX-7 / InfiniBand / RoCE |
| EFA | `-DINTERNODE_BACKEND_EFA` | libibverbs + efadv (SRD) | AWS p5/p5e (H200, EFA) |

Both backends share the same host-side API and the same on-GPU kernel. Only the proxy/session implementation differs (`session.h` for CX7, `session_efa.h` for EFA). Requirements: NVIDIA Hopper GPUs (default build targets `sm_90a`), CUDA 12.9, Python with PyTorch. The CX7 backend requires libibverbs development headers and libraries. The EFA backend requires AWS EFA installation with libfabric, libibverbs, efadv, and EFA headers under `EFA_HOME=/opt/amazon/efa` by default.

**Key Takeaways**

- mKernel fuses intra-node NVLink, inter-node RDMA, and compute into a single persistent CUDA kernel.
- Communication overhead accounts for up to 47% of execution time in MoE models per cited production data.
- Five kernels are included: AllGather+GEMM, GEMM+AllReduce, MoE Dispatch+GEMM, Ring Attention, and GEMM+ReduceScatter.
- GPU-initiated RDMA is implemented directly via `libibverbs`, with no NCCL or NVSHMEM dependency.
- Currently requires Hopper GPUs (`sm_90a`) and ConnectX-7 or AWS EFA networking; Blackwell support is on the roadmap.

Check out the [Technical Details](https://uccl-project.github.io/posts/mkernel/).
