{"slug": "meet-mkernel-a-multi-gpu-multi-node-fused-kernel-library-for-gpu-driven", "title": "Meet mKernel: A Multi-GPU, Multi-Node Fused Kernel Library for GPU-Driven Communication", "summary": "Researchers from UC Berkeley's UCCL project released mKernel, a library of persistent CUDA kernels that fuse intra-node NVLink communication, inter-node RDMA, and compute into a single kernel to address GPU communication overhead. The project cites data showing communication consumes up to 43.6% of forward pass time and 47% of total execution time in Mixture-of-Experts models. mKernel replaces host-driven communication with GPU-driven networking to eliminate microsecond-scale orchestration overhead and enable finer-grained overlap between compute and communication at the tile or chunk level.", "body_md": "GPU communication overhead is a measurable bottleneck in production AI workloads. According to data cited by the mKernel project, communication can consume **43.6% of the forward pass and 32% of end-to-end training time**. Across popular Mixture-of-Experts (MoE) models, inter-device communication can account for **up to 47% of total execution time**. Researchers from UC Berkeley’s UCCL project have released mKernel, a library of persistent CUDA kernels that fuse intra-node NVLink communication, inter-node RDMA, and compute into a single kernel.\n\n**The Problem: Host-Driven Communication**\n\nThe standard model for multi-GPU communication is **host-driven**: the CPU runs the control path and calls into a library like NCCL or NVSHMEM. The library issues the collective operation — an AllReduce, an AllGather, etc. — across GPUs. Compute and communication run on separate CUDA streams and overlap at kernel boundaries.\n\n**The research team identifies two problems with this approach**:\n\n(1) CPUs are not scaling with GPU compute. 
A GB300 NVL72 rack integrates 72 Blackwell Ultra GPUs and 36 Grace CPUs, delivering 720 PFLOP/s of FP8/FP6 and 1.44 EFLOP/s of FP4 Tensor Core performance, plus 130 TB/s of all-to-all intra-rack NVLink bandwidth. At those speeds, microsecond-scale host orchestration overhead (a `cudaLaunchKernel` call, a CPU-side “all writes done” check, or an inter-stream event) shows up directly as **pipeline bubbles**.\n\n(2) Host-driven systems overlap compute and communication at coarse kernel boundaries. Finer-grained overlap at the tile or chunk level is not possible from the host side.\n\nThe alternative is **GPU-driven communication**: the GPU itself triggers transfers, with communication fused into the same kernel as the compute. Most existing fused kernel libraries operate within a single node or a single GPU. mKernel targets the multi-node case.\n\n**What mKernel Does**\n\nmKernel is a library of **persistent CUDA kernels**. Each kernel fuses intra-node NVLink communication, inter-node RDMA, and dense compute into a single kernel.\n\n**Multi-GPU + multi-node, in one kernel**: Both intra-node NVLink and inter-node RDMA live inside the same persistent kernel.\n\n**Fine-grained intra-kernel overlap**: Compute and communication overlap at tile/chunk granularity, covering both intra-node and inter-node GPU communication.\n\n**Persistent kernel with SM specialization**: CTAs self-assign roles: `compute`, `intra-comm`, `inter-send`, `inter-reduce`. The number of SMs dedicated to each role is tunable per shape.\n\n**GPU-driven networking built on libibverbs**: mKernel uses GPU-initiated RDMA writes without depending on NCCL or NVSHMEM. The communication backend is written from scratch to maximize performance and support heterogeneous networking devices.\n\n**The Five Fused Kernels**\n\n| Kernel | What it fuses | Description |\n|---|---|---|\n| AllGather + GEMM | AllGather → GEMM | Each rank holds a shard of `A`. 
While ranks gather peers’ shards over NVLink/RDMA, the local GEMM consumes tiles as soon as they arrive. |\n| GEMM + AllReduce | GEMM → AllReduce | Computes `C = A @ B` and reduces partial outputs across all ranks in one launch. Output tiles are pushed into the reduction tree the instant they’re produced. |\n| MoE Dispatch + GEMM | All-to-All dispatch → grouped GEMM | Routes MoE tokens to their expert ranks (intra-node NVLink + inter-node all-to-all) and runs the per-expert grouped GEMM in the same kernel. Tokens are processed as soon as they land, with no staging buffer round-trip. |\n| Ring Attention | Ring KV exchange → FlashAttention | Sequence-parallel attention across ranks. Each step rotates a KV chunk around the ring while the local FlashAttention consumes the previously received chunk. Compute and the ring send/recv run concurrently inside a single persistent kernel. |\n| GEMM + ReduceScatter | GEMM → ReduceScatter | Computes `C = A @ B` and reduce-scatters the output. Each output tile is reduced and forwarded to its owning rank as soon as it is produced. |\n\n**Evaluation Setup**\n\nThe research team evaluated mKernel on two 2-node × 8-H200 clusters that differ only in their inter-node fabric:\n\n| Testbed | Nodes × GPUs | Intra-node | Inter-node transport | NIC |\n|---|---|---|---|---|\n| AWS EFA | 2 × 8 H200 | NVLink | AWS EFA / SRD | 16 × 200 Gb/s EFA per node |\n| ConnectX-7 | 2 × 8 H200 | NVLink | InfiniBand | 8 × 400 Gb/s NVIDIA ConnectX-7 per node |\n\nmKernel was benchmarked against NCCL, Triton-distributed, Flux, Mercury, MagiAttention, Transformer-Engine, and ring-flash-attention. 
The team notes that further benchmarking at larger scale is still in progress.\n\n**Backends and Requirements**\n\nmKernel supports two networking backends:\n\n| Backend | Macro | Transport | Where it runs |\n|---|---|---|---|\n| CX7 | `-DINTERNODE_BACKEND_IBVERBS` | libibverbs RC | ConnectX-7 / InfiniBand / RoCE |\n| EFA | `-DINTERNODE_BACKEND_EFA` | libibverbs + efadv (SRD) | AWS p5/p5e (H200, EFA) |\n\nBoth backends share the same host-side API and the same on-GPU kernel. Only the proxy/session implementation differs (`session.h` for CX7, `session_efa.h` for EFA).\n\nRequirements: NVIDIA Hopper GPUs (default build targets `sm_90a`), CUDA 12.9, and Python with PyTorch. The CX7 backend requires libibverbs development headers and libraries. The EFA backend requires an AWS EFA installation with libfabric, libibverbs, efadv, and EFA headers, located by default under `EFA_HOME=/opt/amazon/efa`.\n\n**Key Takeaways**\n\n- mKernel fuses intra-node NVLink, inter-node RDMA, and compute into a single persistent CUDA kernel.\n- Communication overhead accounts for up to 47% of execution time in MoE models per cited production data.\n- Five kernels are included: AllGather+GEMM, GEMM+AllReduce, MoE Dispatch+GEMM, Ring Attention, and GEMM+ReduceScatter.\n- GPU-initiated RDMA is implemented directly via `libibverbs`, with no NCCL or NVSHMEM dependency.\n- Currently requires Hopper GPUs (`sm_90a`) and ConnectX-7 or AWS EFA networking; Blackwell support is on the roadmap.\n\nCheck out the [Technical Details](https://uccl-project.github.io/posts/mkernel/).", "url": "https://wpnews.pro/news/meet-mkernel-a-multi-gpu-multi-node-fused-kernel-library-for-gpu-driven", "canonical_source": "https://www.marktechpost.com/2026/05/29/meet-mkernel-a-multi-gpu-multi-node-fused-kernel-library-for-gpu-driven-communication/", "published_at": "2026-05-29 08:43:32+00:00", "updated_at": "2026-05-29 09:06:25.231997+00:00", "lang": "en", "topics": ["ai-infrastructure", "ai-chips", "ai-research", "machine-learning", "large-language-models"], "entities": ["UC Berkeley", "UCCL", "mKernel", "NCCL", "NVSHMEM", "NVLink", "RDMA", "Grace CPU"], "alternates": {"html": "https://wpnews.pro/news/meet-mkernel-a-multi-gpu-multi-node-fused-kernel-library-for-gpu-driven", "markdown": "https://wpnews.pro/news/meet-mkernel-a-multi-gpu-multi-node-fused-kernel-library-for-gpu-driven.md", "text": "https://wpnews.pro/news/meet-mkernel-a-multi-gpu-multi-node-fused-kernel-library-for-gpu-driven.txt", "jsonld": "https://wpnews.pro/news/meet-mkernel-a-multi-gpu-multi-node-fused-kernel-library-for-gpu-driven.jsonld"}}