{"slug": "infiniband-roce-and-all-that", "title": "InfiniBand, RoCE, and all that", "summary": "InfiniBand, a high-performance interconnect technology designed for Remote Direct Memory Access (RDMA), has become critical for AI training and inference workloads that require direct data movement between application buffers without CPU intervention. Despite losing to PCIe, Ethernet, and Fibre Channel in general-purpose I/O, InfiniBand's lossless fabric and credit-based flow control make it essential for GPU synchronization in distributed AI systems.", "body_md": "# InfiniBand, RoCE, and all that\n\n[Illustrated London News](https://commons.wikimedia.org/wiki/File:Opening_of_the_Pneumatic_Despatch_Mail_Service.jpg), 28 Feb. 1863.\n\nThe standard path for sending data over the network if you’re on commodity hardware works like this:\n\nThe standard path. On each host the data is copied through a kernel buffer (the orange hops), so the CPU is in the critical path on both ends.\n\nThere are variations and optimizations, but the basic shape has the kernel involved on both sides, and the data being copied into a kernel buffer. For a web server handling HTTP requests this is fine, since the per-message overhead is negligible relative to all the stuff the application wants to do with the data.\n\nThere are lots of places where this is not so fine. The ones that matter most,\nnowadays, are in AI training & inference. In training, a gradient all-reduce —\nthe step where hundreds of GPUs each combine their gradients with everyone\nelse’s before the next step can start — is a barrier that every GPU has to wait\non. 
In distributed inference, an [expert parallel\nkernel](https://fergusfinn.com/blog/anatomy-of-a-high-performance-ep-kernel/)\nsynchronizes the waiting GPUs in much the same way. (EP is just an example: all the other parallelisms, except data-parallel,\nhave the same property.)\n\nWhat these workloads need is for data to move directly from an application buffer on one machine to an application buffer on another, without either CPU touching it in the critical path, and without a memory copy. This is Remote Direct Memory Access (RDMA):\n\nRDMA. The NIC reads and writes the application buffers directly, so the kernel is bypassed and nothing is copied.\n\nBuilding hardware that provides it reliably and efficiently is the problem InfiniBand is designed to solve.\n\n## InfiniBand as the answer\n\nIn 1999, the industry agreed that RDMA was something that needed doing. Two\ncompeting proposals, Future I/O (backed by Compaq, HP, and IBM) and Next\nGeneration I/O (backed by Intel, Microsoft, and\nSun), merged into a single effort and produced the InfiniBand Trade\nAssociation. (Future I/O was a switched-fabric interconnect for host-to-host and host-to-I/O\ncommunication; see [Future I/O](https://ieeexplore.ieee.org/document/806417)\n(IEEE). Intel announced NGIO in November 1998: [Intel Introduces Next Generation\nI/O for Computing\nServers](https://www.intel.com/pressroom/archive/releases/1998/sp111198.htm).) The ambition was extraordinary. InfiniBand was not designed as\na networking technology but as a replacement for the entire server I/O stack: the\nPCI bus for device I/O, Ethernet for networking, Fibre Channel for storage. (Version 1.0 of the spec followed in 2000. The vision went further than\neven the I/O-stack framing suggests: devices would attach to the fabric as\nendpoints rather than as slots on a local bus. [Wikipedia has a good summary of\nthe early history.](https://en.wikipedia.org/wiki/InfiniBand))\n\nThe result was designed to be technically coherent from the ground up. 
The big\nidea is credit-based flow control at the link layer: a sender cannot transmit\nunless the receiver has signaled it has buffer space. This makes the fabric\ninherently lossless. Losslessness isn’t strictly required for RDMA, but it\nmakes the transport much simpler: nothing in the fast path has to recover from\na dropped packet. The programming\nmodel that grew up around this, “the verbs API”, is a\ncoherent stack built on the guarantees the fabric provides. (“Verbs” is not a formal API specification. The IBTA spec defines a set of\nabstract operations that must\nexist and behave in certain ways, without prescribing an exact interface. The\nde facto implementation is\n[libibverbs](https://github.com/linux-rdma/rdma-core), which exposes them as `ibv_post_send`, `ibv_open_device`, and so on. It was developed by the\nOpenFabrics Alliance, and its kernel-side support was merged into Linux in 2005.)\n\nInfiniBand then lost almost every battle it entered. PCIe won the device I/O bus. Ethernet held general networking. Fibre Channel held storage. The main place InfiniBand survived and thrived was high-performance computing: fluid dynamics, molecular dynamics, climate models. At that kind of scale and coupling, interconnect latency is a direct ceiling on how fast the simulation runs, and people would pay for a dedicated fabric to push that ceiling up.\n\nThe founding consortium members mostly lost interest as the empire shrank to a\nsingle niche they were not primarily in the business of serving. The company\nthat remained was\n[Mellanox](https://en.wikipedia.org/wiki/Mellanox_Technologies), founded in\n1999 specifically to build InfiniBand silicon, which ended up dominating the InfiniBand NIC\nand switch market, consolidating it further by buying rivals like the switch\nvendor [Voltaire](https://www.networkworld.com/article/2196031/mellanox-acquires-voltaire.html).\nThe people who still needed InfiniBand\nreally needed it, and they had nowhere else to go. 
That concentration\nbuilt slowly over twenty years until NVIDIA acquired Mellanox in 2020 for\n$6.9 billion, on the logic that controlling the GPU and the\ninterconnect together meant owning the full AI infrastructure stack. (At the time, this was NVIDIA’s largest acquisition. Intel had made a late attempt\nto compete with its own fabric, Omni-Path, after acquiring QLogic’s InfiniBand\nassets in 2012, and then exited the market entirely in 2019, selling the\nbusiness to Cornelis Networks. That left Mellanox as the uncontested monopolist\njust as AI training demand was beginning to make InfiniBand switches and NICs\nvery valuable hardware.)\n\nThe island quality of InfiniBand is inseparable from its technical coherence. It is a complete, purpose-built stack: its own physical layer, its own link-layer flow control, its own routing, its own transport, its own programming model. Each layer was designed against the one below it, and the result works extremely well. The cost is that every machine on an InfiniBand fabric needs InfiniBand cables, InfiniBand switches, InfiniBand NICs, and operators who understand all of it. Anywhere the benefit was not compelling enough to justify a dedicated fabric, it simply did not go.\n\n## The Allure of Ethernet\n\nThe cost of InfiniBand’s island quality was more than just the high cost of Mellanox hardware. Every server needed two sets of cables, two NICs (one Ethernet for TCP traffic, one InfiniBand for RDMA), two switch fabrics, and teams with expertise in different systems. Storage then also ran on a third fabric, Fibre Channel, with its own specialists. Large datacenters in the mid-2000s were managing three distinct networks that didn’t talk to each other, each with its own cabling, spares, and operational procedures.\n\nThe obvious solution was to consolidate everything onto Ethernet, which\neveryone already owned and understood. The problem is that standard Ethernet is\nlossy by design: when a switch buffer fills up, it drops packets. 
For TCP\ntraffic this is acceptable, because TCP detects loss at its own layer and\nretransmits. For the RDMA protocol, however, it is not. The transport was built assuming a\nlossless fabric, so its recovery is crude: a single dropped packet forces the\nconnection (a *queue pair*, in RDMA terms) to retransmit everything sent\nafter it ([go-back-N](https://en.wikipedia.org/wiki/Go-Back-N_ARQ)), and\nsustained loss tips the queue pair into an error state the application has to\ntear down and reconnect.\n\nPeople really did want to run RDMA over Ethernet though. The fix was Data\nCenter Bridging (DCB), a set of IEEE extensions\ndeveloped in the late 2000s specifically to make Ethernet lossless enough for\nstorage and RDMA traffic. ([Data Center\nBridging](https://en.wikipedia.org/wiki/Data_center_bridging) on Wikipedia is a\ngood overview; the IEEE 802.1 [DCB](https://1.ieee802.org/dcb/) task group pages\ncollect the standards and project history.) The key piece is [Priority Flow\nControl](https://en.wikipedia.org/wiki/Ethernet_flow_control) (PFC). When a\nswitch buffer begins to fill, it sends a PAUSE frame upstream on a\nper-priority basis, telling the sender to stop transmitting that traffic class\nbefore any packets are dropped. Traffic classes that don’t need lossless\nbehavior (ordinary TCP) get their own priority lane and are unaffected. Two supporting standards also helped: Enhanced Transmission Selection\n([ETS](https://en.wikipedia.org/wiki/Enhanced_Transmission_Selection))\nschedules bandwidth across priority classes, and\n[DCBX](https://www.juniper.net/documentation/us/en/software/junos/traffic-mgmt-qfx/topics/concept/fibre-channel-dcbx-understanding.html)\nhandles discovery and negotiation so that switches and NICs agree on which\nclasses exist and what rules apply to each.\n\nPFC achieves losslessness by backpressure signaling. Once a buffer fills, a\nPAUSE frame gets sent upstream to the sender. 
If that sender’s buffers are also\nfilling, it has to PAUSE its upstream neighbor in turn. The PAUSE can cascade\nbackward through the fabric. In pathological cases, these cascading pauses can\nspread congestion across links that had nothing to do with the original\nproblem. This failure mode is called a [PFC pause storm](https://www.naddod.com/blog/pfc-flow-control-technology-and-challenges-in-rocev2-network-deployment?srsltid=AfmBOooDyNl4yDdqUmNPyUBHR4LE_hOFBrRZuILSRSTdQ96YKAvlxfEg), and some designs sidestep it by giving up on losslessness\nentirely. AWS’s Elastic Fabric Adapter (EFA) does exactly that: its SRD (Scalable\nReliable Datagram) transport, exposed through a custom libfabric provider,\nsprays each flow across many paths and reorders in software, giving reliable delivery\nover ordinary lossy Ethernet with no PFC required. See [A Cloud-Optimized Transport\nProtocol for Elastic and Scalable\nHPC](https://www.amazon.science/publications/a-cloud-optimized-transport-protocol-for-elastic-and-scalable-hpc)\n(IEEE Micro, 2020).\n\nBut PFC works well enough. Once you have a lossless Ethernet substrate, the obstacle to running RDMA over it is removed.\n\n## RoCE: IB on Ethernet\n\nTo run RDMA on Ethernet, RoCE (RDMA over Converged Ethernet) just takes the InfiniBand transport layer and runs it directly over Ethernet, changing as little as possible. A NIC that speaks RoCE and a NIC that speaks InfiniBand implement the same programming model and differ only at the link layer.\n\nThe first version, RoCE v1, put the IB transport directly inside an Ethernet\nframe. 
It was not IP-routable and worked only within a single broadcast domain.\nRoCE v2 wrapped the same transport in UDP/IP (port 4791), making it routable\nacross subnets and compatible with standard\n[ECMP](https://en.wikipedia.org/wiki/Equal-cost_multi-path_routing) load\nbalancing across [fat-tree](https://dl.acm.org/doi/10.1145/1402958.1402967)\ntopologies. (Equal-Cost Multi-Path ([ECMP](https://en.wikipedia.org/wiki/Equal-cost_multi-path_routing))\nhashes flows across parallel paths, which is how a\nfat-tree distributes traffic. With RoCE v2, a flow is a standard IP 5-tuple and\nhashes normally. RoCE v1 is Ethernet-only and doesn’t hash across IP paths,\nwhich limits the topologies it can use.) RoCE v2 is what everyone uses today.\nThe lossless substrate is still required (DCB and PFC configured end to end),\nbut the traffic now looks like UDP to every piece of infrastructure between the\nendpoints, which routes easily across standard networks.\n\nThe economic argument for this at scale is straightforward. InfiniBand switches are specialty hardware sold by one company, whereas Ethernet switches are commodity products with many vendors competing on price. Ethernet also brings an operational advantage: the people who run these networks already know it, already have monitoring and tooling built around it, and don’t need a separate team to manage a parallel IB fabric.\n\nThis is why much of the hyperscale buildout converged on RoCE for AI training.\nMeta, for instance,\n[built](https://engineering.fb.com/2024/03/12/data-center-engineering/building-metas-genai-infrastructure/)\na matched pair of 24k-GPU clusters for Llama 3 (one on RoCE, one on InfiniBand)\nand ran its largest model on the RoCE side, a fair signal that Ethernet RDMA\nholds up at the top end. 
At the same link rate the two perform comparably\nenough that the choice comes down to economics and operational simplicity.\n\nThe competitive pressure this created is visible in NVIDIA’s own product line.\nHaving [acquired\nMellanox](https://nvidianews.nvidia.com/news/nvidia-to-acquire-mellanox-for-6-9-billion)\nand with it a monopoly on InfiniBand, NVIDIA simultaneously sells Spectrum-X,\nan Ethernet networking platform optimized for AI workloads, with its own\ncongestion control and adaptive routing layered on top of standard\nEthernet. (SemiAnalysis covers the competitive dynamics in detail: [Nvidia’s\nInfiniBand\nProblem](https://newsletter.semianalysis.com/p/nvidias-infiniband-problem-qmx-ai).)\n\n## Where things stand\n\nThe market has not settled on a single answer.\n\nInfiniBand remains strong in traditional HPC and in NVIDIA’s own reference\narchitectures for large GPU clusters. Where island performance is the primary\nconstraint, or NVIDIA gets to choose the design (its [DGX\nSuperPOD](https://www.nvidia.com/en-us/data-center/dgx-superpod/) reference\nclusters), InfiniBand still wins. NVIDIA\nsells both the GPU and the interconnect into these environments, and has every\nincentive to keep IB competitive.\n\nRoCE is winning at hyperscale. 
The large cloud providers and AI labs building the biggest GPU clusters have converged on Ethernet-based RDMA, driven by economics and the operational leverage of a fabric they already understand.\n\nProprietary fabrics occupy the remaining niches: Slingshot for HPE’s government\nHPC systems ([Frontier](https://www.olcf.ornl.gov/frontier/) at Oak Ridge,\n[Aurora](https://www.alcf.anl.gov/aurora) at Argonne, and\n[Isambard](https://www.bristol.ac.uk/research/centres/bristol-supercomputing/)\nat Bristol) and EFA for AWS, whose SRD transport was designed for their own\ncloud network rather than adopted from IB or RoCE. (Slingshot comes out of [Cray](https://en.wikipedia.org/wiki/Cray)’s\ninterconnect group, the team behind\n[Gemini](https://ieeexplore.ieee.org/document/5577317/) and\n[Aries](https://dl.acm.org/doi/10.5555/2388996.2389136), which HPE inherited\nwhen it acquired Cray in 2019, though\n[Slingshot](https://arxiv.org/abs/2008.08886) itself is a ground-up\nEthernet-compatible design rather than a continuation of the proprietary Aries\nfabric.)\n\nThere are now lots of different ways to do RDMA. 
Which one you pick comes down to what you can afford, what you already run, and who you’re willing to depend on.\n\nLast modified: 19 Jun 2026", "url": "https://wpnews.pro/news/infiniband-roce-and-all-that", "canonical_source": "https://fergusfinn.com/blog/infiniband-roce-rdma/", "published_at": "2026-06-19 00:00:00+00:00", "updated_at": "2026-06-19 08:06:39.999872+00:00", "lang": "en", "topics": ["ai-infrastructure"], "entities": ["InfiniBand", "RDMA", "GPU", "PCIe", "Ethernet", "Fibre Channel", "OpenFabrics Alliance", "libibverbs"], "alternates": {"html": "https://wpnews.pro/news/infiniband-roce-and-all-that", "markdown": "https://wpnews.pro/news/infiniband-roce-and-all-that.md", "text": "https://wpnews.pro/news/infiniband-roce-and-all-that.txt", "jsonld": "https://wpnews.pro/news/infiniband-roce-and-all-that.jsonld"}}