# OpenAI, Microsoft And Friends Build A Better, More Scalable Ethernet

> Source: <https://www.nextplatform.com/connect/2026/05/12/openai-microsoft-and-friends-build-a-better-more-scalable-ethernet/5239078>
> Published: 2026-05-12 17:52:33+00:00


Sometimes, to solve a particular system architecture problem, you have to invent a new technology. And sometimes, you just need to squint at the problem a little and look at what you already have and use the parts in a different way.

The latter approach is what has happened as researchers at OpenAI, Microsoft, Broadcom, AMD, and Nvidia took a hard look at how ever-embiggening bandwidth on network ports is not necessarily as valuable as having scale out networks with higher radix switches – meaning a lot more network links between devices – and also flatter networks with fewer switches. Lowering the switch count means the scale out network lashing together AI system nodes has lower latency (fewer hops across the network between any two endpoints), lower cost (which lowers the total cost of acquisition), and lower power consumption (which further lowers the total cost of ownership).

As with most great engineering ideas, when you look at it, the new approach is intuitively obvious and you have to wonder why it wasn’t always done that way. Such is the case with Multipath Reliable Connection, a new network protocol that lays down atop Ethernet switch ASICs and that borrows many of the ideas of the Ultra Ethernet specification put forth by the Ultra Ethernet Consortium, [which was founded back in July 2023](https://www.nextplatform.com/connect/2023/07/20/ethernet-consortium-shoots-for-1-million-node-clusters-that-beat-infiniband/1645131) for the express purpose of scaling Ethernet to more than 1 million endpoints as well as making it as good as the InfiniBand low latency network for AI clusters.

The MRC effort was started two years ago. OpenAI did a lot of the talking for the new MRC protocol as it was unveiled last week, but we strongly suspect that Microsoft did a lot of the work based on its extensive experience with both RoCE Ethernet and InfiniBand networks. You can read [the OpenAI blog about MRC here](https://openai.com/index/mrc-supercomputer-networking/), download [the paper the five companies released there](https://arxiv.org/pdf/2605.04333), and see [the Open Compute Project spec for the effort at this link](https://www.opencompute.org/documents/ocp-mrc-1-0-pdf).

In essence, what MRC does is stop chasing ports with higher and higher bandwidth and start using the same aggregate bandwidth of a given switch ASIC to increase the number of links between devices. I know what you’re thinking: Won’t increasing the number of ports and the number of links mean increasing the number of potential failures in those links, thereby making it more likely that absolutely synchronous work like an AI training run comes to a crashing halt? No, it won’t, if you radically increase the number of links between endpoints. If you have enough links, as it turns out, and the right protocol, the network can heal around link failures: the AI training job slows down, but there are enough ways to reroute traffic that it keeps running. And at your convenience, without having to stop the AI training job, you can repair the link.

Endpoint failures – meaning GPUs and XPUs – will still crash the training run, of course. To which we say: Why not locally snapshot checkpoints on each server node, stream them out to network storage or a shared memory appliance, keep a few spare GPUs or XPUs in the network, and restore that one failed compute engine and then resume the calculation? Perhaps this is harder than it sounds. . . . It might be better to have an out of band compute engine monitor that predicts a failure for a compute engine, freezes the training run before it crashes, takes the failing compute engine offline, loads up data on the spare compute engine, and resumes processing. Why let it crash at all?

Anyway, back to MRC. While the Ultra Ethernet protocol is a brand new protocol that starts from a blank sheet of paper to make Ethernet more like InfiniBand in terms of low latency, traffic shaping, and adaptive load balancing, the MRC protocol is much less drastic of a change and is, in fact, a superset extension of the current RDMA over Converged Ethernet (RoCE) protocol that hyperscalers, cloud builders, and supercomputing centers have been complaining about for more than a decade.

The adaptive load balancing is based on Explicit Congestion Notification, and like Ultra Ethernet, MRC supports out of order delivery of packets, packet spraying across multiple links, selective retransmission, and packet trimming to help deal with congestion.

Packet trimming is neat in that it only retransmits packets that have been dropped due to switch ASIC buffer overflows, and it does so without invoking the global ECN mechanism. (Nvidia has a good explanation of packet trimming, which was implemented in the Cumulus Linux network operating system it acquired shortly after buying Mellanox, [here](https://docs.nvidia.com/networking-ethernet-software/cumulus-linux-515/Layer-1-and-Switch-Ports/Quality-of-Service/Packet-Trimming/).) While ECN tells packet senders to slow down when they are flooding a switch or endpoint, packet trimming keeps track of the headers, drops the packet payload, and asks the network to retransmit only missing data when a packet is dropped due to congestion. Packet trimming requires acceleration and processing inside the switch ASIC or the network interface card.
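To make the trimming idea concrete, here is a minimal Python sketch. It is a toy model under stated assumptions, not how any switch ASIC or the MRC spec implements it: the `Packet` class, the `forward_with_trimming` function, and the byte-counted buffer are all hypothetical names invented for illustration. A packet whose payload does not fit in the buffer is forwarded as a header only, and the receiver uses those trimmed headers to ask for exactly the missing data.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    seq: int                 # sequence number carried in the header
    payload: bytes           # data carried by the packet
    trimmed: bool = False    # True once a switch has cut off the payload


def forward_with_trimming(packets, buffer_bytes):
    """Toy switch: forward payloads while the buffer has room, otherwise
    forward only the header so the receiver learns what went missing."""
    used = 0
    forwarded = []
    for pkt in packets:
        if used + len(pkt.payload) <= buffer_bytes:
            used += len(pkt.payload)
            forwarded.append(pkt)
        else:
            # Buffer overflow: keep the header, drop the payload.
            forwarded.append(Packet(seq=pkt.seq, payload=b"", trimmed=True))
    return forwarded


def sequences_to_retransmit(received):
    """Receiver side: request only the packets that arrived trimmed."""
    return [pkt.seq for pkt in received if pkt.trimmed]


if __name__ == "__main__":
    burst = [Packet(seq=i, payload=b"x" * 1000) for i in range(4)]
    arrived = forward_with_trimming(burst, buffer_bytes=2500)
    print(sequences_to_retransmit(arrived))
```

The point of the sketch is that the sender never has to wait for a timeout to discover a loss; the surviving header tells the receiver which sequence numbers to ask for.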

The new MRC protocol is paired with IPv6 segment routing (SRv6), which is used to route packets in a static fashion around the network and, ironically, the dynamic routing mechanisms in the protocol stack were turned off. The combination of adaptive load balancing and static routing across eight links per endpoint is relatively easy to do, and it means adaptive routing is not necessary because links and ports are not so scarce and there are eight different ways to get to any endpoint.
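A rough way to picture spraying over eight statically routed paths is the Python sketch below. It is a deliberate simplification: the `PathSprayer` class is a hypothetical name, and it uses plain round robin over the paths that are currently up, whereas MRC biases its choices with ECN feedback. What it does show is that the sender, not the switch, decides which path each packet takes, and that a dead path is simply skipped.

```python
from collections import Counter


class PathSprayer:
    """Toy sender-side sprayer: round robin over paths that are up."""

    def __init__(self, num_paths=8):
        self.healthy = [True] * num_paths
        self.next_path = 0

    def mark_failed(self, path):
        self.healthy[path] = False

    def mark_repaired(self, path):
        self.healthy[path] = True

    def pick_path(self):
        """Return the next healthy path, skipping any that are down."""
        total = len(self.healthy)
        for _ in range(total):
            path = self.next_path
            self.next_path = (self.next_path + 1) % total
            if self.healthy[path]:
                return path
        raise RuntimeError("no healthy paths left to this endpoint")


if __name__ == "__main__":
    sprayer = PathSprayer(num_paths=8)
    sprayer.mark_failed(3)   # one of the eight links goes down
    usage = Counter(sprayer.pick_path() for _ in range(14))
    print(sorted(usage.items()))   # path 3 carries nothing, the rest share the load
```

Because the routes themselves never change, repairing the link is just a matter of calling `mark_repaired` and letting the sprayer start using that path again.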

This is accomplished by parallelizing the data plane, and this is the topology magic that makes MRC really powerful. This is one of those cases where two pictures are literally worth a thousand words, so let’s compare and contrast how an AI cluster is built today using 51.2 Tb/sec switches and how you do it with MRC.

For the past decade or so, if you wanted to build an AI cluster, you used a three-tier network – leaf switches in the rack, spine switches linking them together to make pods, and superspine switches for linking the pods together. This is how you can lash together an AI supercomputer with traditional RoCE Ethernet using 800 Gb/sec ports on the 51.2 Tb/sec switches, which have 64 ports each:

Each pod has 1,024 GPUs or XPUs, each with an 800 Gb/sec network interface. Those 1,024 compute engines feed into 32 Tier 0 top of rack leaf switches (32 compute engines per leaf), which are cross-coupled with 32 Tier 1 spine switches. Each Tier 1 spine switch has 32 ports pointing up to the Tier 2 superspine switches and 32 ports pointing down to the Tier 0 leaf switches. There are 1,024 Tier 2 superspines that interlink 64 pods together, providing connectivity for a total of 65,536 GPUs or XPUs. There is a total of 5,120 64-port 51.2 Tb/sec switches in this three tier network, and they present a single Clos topology data plane.
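The switch and endpoint counts for this three tier design follow from standard fat tree arithmetic, which can be sketched in a few lines of Python. The function name `three_tier_fat_tree` is made up for illustration, and the sketch assumes a non-blocking design where every leaf and spine switch splits its ports evenly between uplinks and downlinks.

```python
def three_tier_fat_tree(radix):
    """Size a non-blocking three tier Clos built from identical switches
    that each have `radix` ports, half facing down and half facing up."""
    half = radix // 2
    pods = radix                      # each superspine spends one port per pod
    leaves = pods * half              # Tier 0: `half` leaf switches per pod
    spines = pods * half              # Tier 1: `half` spine switches per pod
    superspines = half * half         # Tier 2
    endpoints_per_pod = half * half   # `half` leaves times `half` downlinks
    return {
        "pods": pods,
        "endpoints_per_pod": endpoints_per_pod,
        "endpoints": pods * endpoints_per_pod,
        "leaves": leaves,
        "spines": spines,
        "superspines": superspines,
        "switches": leaves + spines + superspines,
    }


if __name__ == "__main__":
    # 51.2 Tb/sec ASIC carved into 64 ports at 800 Gb/sec
    for name, value in three_tier_fat_tree(64).items():
        print(f"{name}: {value:,}")
```

With a radix of 64 this gives 65,536 endpoints and 5,120 switches, split into 2,048 leaves, 2,048 spines, and 1,024 superspines.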

If you want to do more GPUs than this, you have to wait for 102.4 Tb/sec switches or you have to add a fourth tier to the network. This will add cost as well as latency. With a three tier network, any GPU or XPU can be as much as five to seven switch hops away.

Now, shift to a higher radix point of view. Instead of craving 800 Gb/sec ports, split that 51.2 Tb/sec switch ASIC so it supports 512 ports running at 100 Gb/sec. And then instead of having one Clos data plane, have eight unique Clos data planes (eight unique paths between any two devices in the network). How many devices can you connect in a two tier network? Look at this:

With the same bandwidth going into each GPU or XPU (spread over eight planes instead of one), you can lash together 131,072 compute engines in a two tier network. Crazy, right? Each pod in the Ethernet MRC cluster has 256 XPUs, and they feed up to a Tier 0 leaf complex with 512 switches per plane, for a total of 4,096 switches. The Tier 1 spine layer has 256 switches per plane, for a total of 2,048 switches. Add it up, you need 6,144 switches to do the MRC network. And no GPU or XPU is more than three hops away from another.
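The same kind of back-of-the-envelope arithmetic reproduces the two tier numbers. This Python sketch assumes a non-blocking leaf and spine design in each plane, with every endpoint attaching one 100 Gb/sec link to each plane; the function name `two_tier_multiplane` and its parameters are hypothetical.

```python
def two_tier_multiplane(asic_gbps, port_gbps, planes):
    """Size a two tier leaf/spine network replicated across several
    independent planes, each built from the same switch ASIC."""
    radix = asic_gbps // port_gbps    # ports per switch at this port speed
    half = radix // 2                 # leaf ports facing down (and up)
    leaves_per_plane = radix          # each spine port reaches a distinct leaf
    spines_per_plane = half           # each leaf uplink reaches a distinct spine
    return {
        "radix": radix,
        "endpoints": leaves_per_plane * half,
        "leaves": planes * leaves_per_plane,
        "spines": planes * spines_per_plane,
        "switches": planes * (leaves_per_plane + spines_per_plane),
        "gbps_per_endpoint": planes * port_gbps,
    }


if __name__ == "__main__":
    # 51.2 Tb/sec ASIC carved into 100 Gb/sec ports, eight planes
    for name, value in two_tier_multiplane(51_200, 100, 8).items():
        print(f"{name}: {value:,}")
```

Splitting the ASIC into 512 ports is what does the work here: the endpoint count of a two tier network grows with the square of the radix, so going from 64 to 512 ports more than makes up for dropping a tier.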

That is 20 percent more switches to get double the compute engine capacity with the same bandwidth per compute engine. This seems like a fair trade.

There is a point in the MRC paper where it says “for full-bisection bandwidth, we require 2/3s of the optics and 3/5s the number of switches compared to a three tier network.” That is not for the comparison above, with 65,536 compute engines in a three tier network versus 131,072 compute engines in a two tier network. Those ratios are only true for the same number of compute engines in each cluster.
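Both switch ratios can be checked with a short Python sketch. It assumes eight planes of non-blocking two tier networks built from 512-port switches and takes the 5,120-switch figure for the three tier network from the text above; the function name `two_tier_switches` is hypothetical, and the sketch only covers switches, not optics.

```python
from fractions import Fraction

THREE_TIER_SWITCHES = 5_120   # three tier network with 65,536 endpoints


def two_tier_switches(endpoints, radix=512, planes=8):
    """Switches needed for `planes` non-blocking two tier networks in
    which every endpoint has one link into each plane."""
    half = radix // 2
    leaves = -(-endpoints // half)            # ceiling division
    spines = -(-(leaves * half) // radix)     # enough ports for all leaf uplinks
    return planes * (leaves + spines)


if __name__ == "__main__":
    same_scale = two_tier_switches(65_536)
    double_scale = two_tier_switches(131_072)
    print(same_scale, Fraction(same_scale, THREE_TIER_SWITCHES))
    print(double_scale, Fraction(double_scale, THREE_TIER_SWITCHES))
```

At the same 65,536 endpoints the two tier design needs 3,072 switches, which is the 3/5 ratio from the paper. At 131,072 endpoints it needs 6,144 switches, or 6/5 of the three tier count.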

By the way, the three tier RoCE Ethernet network with 65,536 endpoints has a total of 196,608 links between all of the switches and compute engines, and the two tier design has an incredible 1,179,648 links. But a two tier network with only 65,536 compute engines would have 1,048,576 links. You do copper DAC cables within the racks and optical links to the spines and superspines (if you need that layer). Given that DACs and optical links are not free, it could turn out that even though you save on the switch budget, or are able to double the scale with an MRC network while only spending 20 percent more on the switch budget, you get killed on the link budget.

But, ironically, that may not matter, and here is why: Nothing, but nothing, in the whole wide world is more expensive than a GPU or XPU compute engine that can’t do work because a link fails, and if a link fails in “normal” Clos switch architectures, the whole AI supercomputer stops and you have to go back to a checkpoint and start over. So tens of thousands to more than a hundred thousand compute engines are just sitting there.

But with MRC, if you lose one of the eight links going into a GPU or XPU, you only lose 12.5 percent of that 800 Gb/sec bandwidth – and the AI training job keeps running. And you can go change that dead link and it will come back online; the switches will activate that link and start giving you the bandwidth back into that particular endpoint.
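The degradation math is simple enough to write down in Python. This sketch assumes eight 100 Gb/sec links per endpoint, as described above, and that the surviving links keep running at full rate; both function names are hypothetical.

```python
def endpoint_bandwidth_gbps(links=8, link_gbps=100, failed=0):
    """Bandwidth left into one endpoint after `failed` links go down."""
    if not 0 <= failed <= links:
        raise ValueError("failed must be between 0 and the number of links")
    return (links - failed) * link_gbps


def bandwidth_lost_fraction(links=8, failed=0):
    """Share of the endpoint's bandwidth lost to `failed` dead links."""
    if not 0 <= failed <= links:
        raise ValueError("failed must be between 0 and the number of links")
    return failed / links


if __name__ == "__main__":
    for failed in range(3):
        left = endpoint_bandwidth_gbps(failed=failed)
        lost = bandwidth_lost_fraction(failed=failed)
        print(f"{failed} failed: {left} Gb/sec left, {lost:.1%} lost")
```

One dead link leaves 700 Gb/sec into the endpoint, and even two dead links only cost a quarter of the bandwidth, which is a slowdown rather than a crash.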

You can even reboot a Tier 1 switch in an MRC network, and the system heals around it, like this:

In its blog post, OpenAI said that aside from the increased scale of the AI cluster, “MRC’s adaptive packet spraying load-balances well enough that we see essentially no congestion in the core of the network. This greatly reduces variation in throughput between flows during synchronous training, where eliminating outliers is central to performance. It also means that when multiple jobs share the cluster, they do not impact one another’s performance.”

OpenAI has run MRC on clusters at the Oracle Stargate datacenter in Abilene, Texas as well as the Microsoft Azure AI datacenter in Fairwater, Wisconsin. The MRC protocol was implemented in Nvidia ConnectX-8 SmartNICs, AMD “Pollara” and “Vulcano” DPUs, and Broadcom Thor Ultra SmartNICs. The SRv6 static routing was implemented on Nvidia Spectrum 4 and Spectrum 5 switches running both the Cumulus Linux and SONiC network operating systems as well as on Arista Networks switches based on the EOS variant of Linux running atop Broadcom Tomahawk 5 ASICs.
