Meet mKernel: A Multi-GPU, Multi-Node Fused Kernel Library for GPU-Driven Communication

Researchers from UC Berkeley's UCCL project released mKernel, a library of persistent CUDA kernels that fuse intra-node NVLink communication, inter-node RDMA, and compute into a single kernel to address GPU communication overhead. The project cites data showing communication consumes up to 43.6% of forward pass time and 47% of total execution time in Mixture-of-Experts models. mKernel replaces host-driven communication with GPU-driven networking to eliminate microsecond-scale orchestration overhead and enable finer-grained overlap between compute and communication at the tile or chunk level.
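To make the benefit of tile- or chunk-level overlap concrete, here is a toy timing model, not a measurement of mKernel. It assumes a dependent pattern (data must arrive before it can be consumed, as in AllGather followed by GEMM), back-to-back transfers, and a single compute unit. The function name `overlapped_time` and the `per_chunk_overhead_us` parameter are hypothetical, introduced only for this sketch.

```python
def overlapped_time(total_comm_us: float,
                    total_compute_us: float,
                    num_chunks: int,
                    per_chunk_overhead_us: float = 0.0) -> float:
    """Finish time when work is split into equal chunks and pipelined.

    Chunk i can be computed only after it has fully arrived and after
    chunk i-1 has finished computing. per_chunk_overhead_us is an assumed
    fixed cost paid before each compute step (for example, a host-side
    launch or synchronization in a host-driven design).
    """
    if num_chunks < 1:
        raise ValueError("num_chunks must be at least 1")
    comm = total_comm_us / num_chunks
    compute = total_compute_us / num_chunks
    arrived = 0.0        # time at which the current chunk has arrived
    compute_done = 0.0   # time at which the previous chunk finished computing
    for _ in range(num_chunks):
        arrived += comm                          # transfers run back to back
        start = max(arrived, compute_done)       # wait for data and for the compute unit
        compute_done = start + per_chunk_overhead_us + compute
    return compute_done


if __name__ == "__main__":
    # Illustrative inputs only: 100 us of communication, 100 us of compute.
    for chunks in (1, 4, 100):
        free = overlapped_time(100.0, 100.0, chunks)
        costly = overlapped_time(100.0, 100.0, chunks, per_chunk_overhead_us=5.0)
        print(f"chunks={chunks:3d}  no overhead={free:6.1f} us  5 us/chunk={costly:6.1f} us")
```

Under these assumptions, finer chunks push the total toward the larger of the two phases when per-chunk overhead is zero, while a fixed per-chunk cost quickly erases that gain. This is the intuition behind moving the control path onto the GPU.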

GPU communication overhead is a measurable bottleneck in production AI workloads. According to data cited by the mKernel project, communication can consume 43.6% of the forward pass and 32% of end-to-end training time. Across popular Mixture-of-Experts (MoE) models, inter-device communication can account for up to 47% of total execution time. Researchers from UC Berkeley’s UCCL project have released mKernel, a library of persistent CUDA kernels that fuse intra-node NVLink communication, inter-node RDMA, and compute into a single kernel.

The Problem: Host-Driven Communication

The standard model for multi-GPU communication is host-driven: the CPU runs the control path and calls into a library like NCCL or NVSHMEM. The library issues the collective operation (an AllReduce, an AllGather, etc.) across GPUs. Compute and communication run on separate CUDA streams and overlap at kernel boundaries.

The research team identifies two problems with this approach:

1. CPUs are not scaling with GPU compute. A GB300 NVL72 rack integrates 72 Blackwell Ultra GPUs and 36 Grace CPUs, delivering 720 PFLOP/s FP8/FP6, 1.44 EFLOP/s FP4 Tensor Core performance, and 130 TB/s of all-to-all intra-rack NVLink bandwidth. At those speeds, microsecond-scale host orchestration overhead (a cudaLaunchKernel call, a CPU-side “all writes done” check, an inter-stream event) shows up directly as pipeline bubbles.
2. Host-driven systems overlap compute and communication at coarse kernel boundaries. Finer-grained overlap at the tile or chunk level is not possible from the host side.

The alternative is GPU-driven communication: the GPU itself triggers transfers, with communication fused into the same kernel as the compute. Most existing fused kernel libraries operate within a single node, or a single GPU. mKernel targets the multi-node case.

What mKernel Does

mKernel is a library of persistent CUDA kernels.
Each kernel fuses intra-node NVLink communication, inter-node RDMA, and dense compute into a single kernel.

- Multi-GPU + multi-node, in one kernel: Both intra-node NVLink and inter-node RDMA live inside the same persistent kernel.
- Fine-grained intra-kernel overlap: Compute and communication overlap at tile/chunk granularity, covering both intra-node and inter-node GPU communication.
- Persistent kernel with SM specialization: CTAs self-assign roles: compute, intra-comm, inter-send, inter-reduce. The number of SMs dedicated to each role is tunable per shape.
- GPU-driven networking built on libibverbs: mKernel uses GPU-initiated RDMA writes without depending on NCCL or NVSHMEM. The communication backend is written from scratch to maximize performance and support heterogeneous networking devices.

The Five Fused Kernels

| Kernel | What it fuses | Description |
|---|---|---|
| AllGather + GEMM | AllGather → GEMM | Each rank holds a shard of A. While ranks gather peers’ shards over NVLink/RDMA, the local GEMM consumes tiles as soon as they arrive. |
| GEMM + AllReduce | GEMM → AllReduce | Computes C = A @ B and reduces partial outputs across all ranks in one launch. Output tiles are pushed into the reduction tree the instant they’re produced. |
| MoE Dispatch + GEMM | All-to-All dispatch → grouped GEMM | Routes MoE tokens to their expert ranks (intra-node NVLink + inter-node all-to-all) and runs the per-expert grouped GEMM in the same kernel. Tokens are processed as soon as they land, with no staging buffer round-trip. |
| Ring Attention | Ring KV exchange → FlashAttention | Sequence-parallel attention across ranks. Each step rotates a KV chunk around the ring while the local FlashAttention consumes the previously received chunk. Compute and the ring send/recv run concurrently inside a single persistent kernel. |
| GEMM + ReduceScatter | GEMM → ReduceScatter | Computes C = A @ B and reduce-scatters the output. Each output tile is reduced and forwarded to its owning rank as soon as it is produced. |

Evaluation Setup

The research team evaluated mKernel on two 2-node × 8-H200 clusters that differ only in their inter-node fabric:

| Testbed | Nodes × GPUs | Intra-node | Inter-node transport | NIC |
|---|---|---|---|---|
| AWS EFA | 2 × 8 H200 | NVLink | AWS EFA / SRD | 16 × 200 Gb/s EFA per node |
| ConnectX-7 | 2 × 8 H200 | NVLink | InfiniBand | 8 × 400 Gb/s NVIDIA ConnectX-7 per node |

mKernel was benchmarked against NCCL, Triton-distributed, Flux, Mercury, MagiAttention, Transformer-Engine, and ring-flash-attention. The team notes that further benchmarking at larger scale is still in progress.

Backends and Requirements

mKernel supports two networking backends:

| Backend | Macro | Transport | Where it runs |
|---|---|---|---|
| CX7 | -DINTERNODE_BACKEND_IBVERBS | libibverbs RC | ConnectX-7 / InfiniBand / RoCE |
| EFA | -DINTERNODE_BACKEND_EFA | libibverbs + efadv SRD | AWS p5/p5e H200, EFA |

Both backends share the same host-side API and the same on-GPU kernel. Only the proxy/session implementation differs (session.h for CX7, session_efa.h for EFA).

Requirements: NVIDIA Hopper GPUs (default build targets sm_90a), CUDA 12.9, Python with PyTorch. The CX7 backend requires libibverbs development headers and libraries. The EFA backend requires an AWS EFA installation with libfabric, libibverbs, efadv, and EFA headers (under EFA_HOME=/opt/amazon/efa by default).

Key Takeaways

- mKernel fuses intra-node NVLink, inter-node RDMA, and compute into a single persistent CUDA kernel.
- Communication overhead accounts for up to 47% of execution time in MoE models, per cited production data.
- Five kernels are included: AllGather+GEMM, GEMM+AllReduce, MoE Dispatch+GEMM, Ring Attention, and GEMM+ReduceScatter.
- GPU-initiated RDMA is implemented directly via libibverbs, with no NCCL or NVSHMEM dependency.
- Currently requires Hopper GPUs (sm_90a) and ConnectX-7 or AWS EFA networking; Blackwell support is on the roadmap.

Technical details: https://uccl-project.github.io/posts/mkernel/
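
The SM specialization idea, in which each CTA of a persistent kernel picks one of the four roles named in the article (compute, intra-comm, inter-send, inter-reduce), can be sketched on the host in plain Python. This is only an illustration of index-based role partitioning under assumed numbers. The function `assign_role`, the ordering of roles, and the example budget are hypothetical and are not taken from mKernel's source.

```python
from collections import Counter
from typing import Mapping

# Role names as listed in the article; the ordering here is an assumption.
ROLES = ("compute", "intra-comm", "inter-send", "inter-reduce")


def assign_role(cta_id: int, sm_budget: Mapping[str, int]) -> str:
    """Map a CTA index to a role by walking the per-role budget in order.

    In a real persistent kernel each CTA would evaluate something like this
    once from its block index; here it is simulated on the host.
    """
    if cta_id < 0:
        raise ValueError("cta_id must be non-negative")
    offset = cta_id
    for role in ROLES:
        count = sm_budget[role]
        if offset < count:
            return role
        offset -= count
    raise ValueError(f"cta_id {cta_id} exceeds the total budget")


if __name__ == "__main__":
    # Hypothetical per-shape tuning: 132 CTAs split across the four roles.
    budget = {"compute": 100, "intra-comm": 16, "inter-send": 8, "inter-reduce": 8}
    total = sum(budget.values())
    histogram = Counter(assign_role(i, budget) for i in range(total))
    for role in ROLES:
        print(f"{role:12s} {histogram[role]:3d} CTAs")
```

Changing the numbers in `budget` corresponds to the per-shape tuning the article describes: a communication-heavy shape would shift CTAs from compute toward the send and reduce roles.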