
rdmatop: Cross-Provider Htop for RDMA Traffic

The UCCL team released rdmatop, a real-time terminal UI that monitors RDMA traffic across any Linux device including NVIDIA ConnectX, AWS EFA, and Broadcom NICs. The tool reads RDMA netlink to provide per-device throughput, per-process queue pair mapping, and Tx/Rx visibility, addressing the lack of cross-provider monitoring in existing tools like ibtop. Case studies show it can quickly diagnose NCCL falling back to TCP sockets and other performance bottlenecks.


By: Chang-Ning Tsai and the UCCL Team — June 15, 2026

RDMA is the backbone of multi-node LLM training and inference, yet most of us run it blind: when throughput is half what it should be, it is hard to see which NIC is hot, which is idle, or whether the bottleneck is on transmit or receive. We built rdmatop, "htop, but for RDMA traffic": a real-time TUI that monitors any Linux RDMA device (NVIDIA ConnectX, AWS EFA, Broadcom) through RDMA netlink. We then walk through real NCCL and NVSHMEM cases where a per-NIC, per-process view made the problem obvious at a glance.

Introduction

If you run InfiniBand fabrics, you have probably used ibtop—a small but invaluable tool that reads InfiniBand hardware performance counters (via the UMAD interface) and organizes bandwidth and traffic by job or host. It answers the everyday operational question:

who is using the fabric, and how much?

The trouble is that the RDMA world is no longer just InfiniBand. GPU clusters today run RDMA over an expanding set of providers (NVIDIA/Mellanox ConnectX with RoCE and InfiniBand, AWS EFA, Broadcom Thor/bnxt, AMD Pensando/Pollara), each with its own NIC, counter definitions, and quirks. An InfiniBand-only tool like ibtop cannot see any of these, and writing a separate monitor per vendor does not scale. What practitioners actually need is a provider-agnostic view of RDMA traffic.

That is exactly what rdmatop provides. Instead of per-vendor counters, it reads RDMA netlink, the same interface behind the rdma statistic command, so it works on any Linux RDMA device, and it maps queue pairs (QPs) back to the processes that own them. The result is a live terminal dashboard of per-device throughput (Gb/s, packets/s, drops), RDMA read/write counters, retransmissions, and, crucially, which process is driving each device. That per-NIC, per-process, Tx-vs-Rx visibility is what turns “the job is slow” into “GPU 0’s traffic is all landing on a single NIC.”
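As a rough illustration of the two ideas in that paragraph, the sketch below turns two samples of a byte counter into Gb/s and groups queue pairs by owning process. It is not rdmatop's implementation: it does not talk to netlink, and the JSON field names (`ifname`, `pid`, `comm`) are assumed to resemble what iproute2's `rdma -j resource show qp` prints. The device names and PIDs in the sample are hypothetical.

```python
import json
from collections import Counter


def gbps(prev_bytes: int, cur_bytes: int, interval_s: float) -> float:
    """Convert two samples of a monotonically increasing byte counter to Gb/s."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    # A counter that went backwards (reset) is treated as zero traffic.
    delta = max(cur_bytes - prev_bytes, 0)
    return delta * 8 / interval_s / 1e9


def qps_by_process(qp_json: str) -> Counter:
    """Count queue pairs per (device, pid, command) from a JSON QP listing."""
    counts = Counter()
    for qp in json.loads(qp_json):
        # Assumption: kernel-owned QPs carry no "pid" field, so skip them.
        if "pid" not in qp:
            continue
        counts[(qp["ifname"], qp["pid"], qp.get("comm", "?"))] += 1
    return counts


# Hypothetical sample shaped like a JSON QP listing; not real tool output.
SAMPLE_QP_JSON = """[
  {"ifname": "dev0", "lqpn": 1, "type": "GSI", "comm": "[kernel]"},
  {"ifname": "dev0", "lqpn": 7, "type": "RC", "pid": 4242, "comm": "python"},
  {"ifname": "dev0", "lqpn": 8, "type": "RC", "pid": 4242, "comm": "python"},
  {"ifname": "dev1", "lqpn": 9, "type": "RC", "pid": 5151, "comm": "ib_write_bw"}
]"""

if __name__ == "__main__":
    print(f"{gbps(0, 1_250_000_000, 1.0):.2f} Gb/s")
    for (dev, pid, comm), n in sorted(qps_by_process(SAMPLE_QP_JSON).items()):
        print(f"{dev} pid={pid} comm={comm} qps={n}")
```

On a real node the JSON string would come from the rdma tool or from netlink directly; the grouping step is the same either way.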

Installation

rdmatop is a single static binary with no daemon and no cluster to stand up; you can have it running in under a minute. On Ubuntu, install it from our PPA:

sudo add-apt-repository ppa:crazyguitar/rdmatop
sudo apt update
sudo apt install rdmatop

Or, on any platform with a Rust toolchain, install it straight from crates.io:

cargo install rdmatop

Then run rdmatop on any node with RDMA devices and the live per-NIC view comes up right away. The case studies below are the kind of problem it makes obvious at a glance.

Case Study 1: AWS Already Has an EFA Exporter—So Why a TUI?

AWS does provide an example: its distributed-training repo (for EKS and SageMaker HyperPod) documents an EFA node exporter that scrapes EFA traffic into Prometheus and Grafana for fleet-wide dashboards. Deploying that exporter is not always convenient, though—on a Slurm cluster, or any node you simply SSH into, there is usually no Prometheus/Grafana stack, and bringing one up just to inspect a single host is a lot of moving parts for a quick look.

That is the gap rdmatop fills: a single binary, no cluster and no Grafana, showing live per-NIC, per-process Tx/Rx rates the moment you run it on the node. The case studies below show what that immediacy buys.

Case Study 2: NCCL Silently Falling Back to TCP Sockets

NCCL is the default collective library for distributed training and inference, and on EFA it should move data over RDMA through the libfabric (OFI) plugin. If that plugin is mislinked or misconfigured, NCCL silently falls back to kernel TCP sockets—RDMA disabled—and collective throughput can crater by up to an order of magnitude (~10×). The job still runs and converges; it is just far slower.

The only clue is one line in the NCCL_DEBUG=INFO output, which names Socket on the fallback path and Libfabric when the OFI plugin is in use:

NCCL INFO Using network Socket

NCCL INFO Using network Libfabric

In a multi-node training run or a hosted inference service, nobody is watching initialization logs, and the log volume buries that one line (see uccl#734). rdmatop surfaces the fallback instantly: on sockets, the EFA NICs show near-zero RDMA traffic even while the GPUs are clearly communicating. Flat RDMA counters mean you are not on RDMA, with no log archaeology required.
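For cases where the logs are all you have, a small scanner can pull that one line out of the noise. This is a minimal sketch that assumes the log contains the "NCCL INFO Using network <name>" line quoted above; the helper names and the sample log prefix are illustrative and are not part of NCCL or rdmatop.

```python
import re

# Matches the transport announcement quoted in the article.
NETWORK_LINE = re.compile(r"NCCL INFO Using network (\S+)")


def nccl_transports(log_text: str) -> set[str]:
    """Return every transport name announced in an NCCL_DEBUG=INFO log."""
    return set(NETWORK_LINE.findall(log_text))


def fell_back_to_sockets(log_text: str) -> bool:
    """True if any rank announced the kernel TCP socket transport."""
    return "Socket" in nccl_transports(log_text)


if __name__ == "__main__":
    # Hypothetical log excerpt; the host and rank prefix is made up.
    sample = (
        "node-a:101:101 [0] NCCL INFO Bootstrap complete\n"
        "node-a:101:101 [0] NCCL INFO Using network Socket\n"
    )
    print(nccl_transports(sample), fell_back_to_sockets(sample))
```

A check like this fits in a job prolog or CI step, while the live RDMA counters remain the quicker signal on a running node.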

Case Study 3: NVSHMEM ≤ 3.5.21 Silently Used Only One of Many EFA NICs

AWS GPU instances ship with multiple EFA NICs per node so each GPU can drive more network bandwidth, but for a long time NVSHMEM [1] could not use them all.

In NVSHMEM 3.5.21 and earlier, the libfabric transport bound each GPU to a single EFA NIC, capping its point-to-point throughput at one NIC’s bandwidth and leaving the rest of an expensive multi-NIC system idle. Workloads looked mysteriously slow, with no hint why at the application level.

Figure 1: NVSHMEM 3.5.21—only a few EFA NICs carry traffic; the rest sit at 0.00.

An RDMA monitor makes this unambiguous: rdmatop shows one EFA NIC pinned near line rate while its siblings sit at zero, with no theory and no guesswork. (For the full single- vs. multi-NIC write-up, see the NVSHMEM Multi-NIC notes [2].)
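The same "one hot NIC, idle siblings" pattern can be checked programmatically once per-NIC rates are in hand. The sketch below assumes you already have a mapping from NIC name to Gb/s (for example, copied from a monitor); the NIC names and the 0.01 Gb/s idle floor are arbitrary illustrative choices.

```python
def idle_nics(rates_gbps: dict[str, float], floor_gbps: float = 0.01) -> list[str]:
    """Return the NICs whose rate is below the idle floor, sorted by name."""
    return sorted(name for name, rate in rates_gbps.items() if rate < floor_gbps)


def looks_single_rail(rates_gbps: dict[str, float], floor_gbps: float = 0.01) -> bool:
    """True when exactly one NIC carries traffic and at least one sibling is idle."""
    active = [name for name, rate in rates_gbps.items() if rate >= floor_gbps]
    return len(active) == 1 and len(rates_gbps) > 1


if __name__ == "__main__":
    # Hypothetical snapshot: four NICs, only one carrying traffic.
    snapshot = {"nic0": 96.4, "nic1": 0.0, "nic2": 0.0, "nic3": 0.0}
    print(idle_nics(snapshot), looks_single_rail(snapshot))
```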

Case Study 4: Multi-Rail Was Added, But Throughput Did Not Scale

NVSHMEM 3.6.5 added round-robin NIC selection so a single GPU could spray traffic across all its EFA NICs. With four NICs we expected throughput to scale roughly linearly, but all-to-all refused to, sometimes coming out slower than a single NIC.

rdmatop on the destination node made the cause obvious: transmit (Tx) traffic spread evenly across all NICs, but receive (Rx) traffic funneled onto one NIC. Round-robin balanced sends but not receives, because every sender picked the same remote NIC for a given destination, and that lone receive-side hotspot capped the job.

Figure 2: NVSHMEM 3.6.5 multi-rail—Tx spreads across all NICs, but Rx funnels onto a few.

The fix in NVIDIA/nvshmem#76 spreads remote-NIC selection per sender, so receives land on different NICs and throughput scales as expected; the PR has the details and benchmarks.
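A toy model shows why choosing the remote NIC from the destination alone creates a receive hotspot, and why adding a per-sender offset removes it. This is not NVSHMEM's code and not the logic in the PR: the modulo rules below are hypothetical stand-ins chosen only to reproduce the Rx pattern described above.

```python
def rx_per_nic(num_senders: int, num_nics: int, dest_pe: int,
               per_sender_offset: bool) -> list[int]:
    """Count how many senders land on each receive-side NIC of one destination.

    per_sender_offset=False models every sender choosing the same remote NIC
    for a given destination; True models spreading the choice per sender.
    """
    rx = [0] * num_nics
    for sender in range(num_senders):
        offset = sender if per_sender_offset else 0
        # Hypothetical selection rule, for illustration only.
        rx[(dest_pe + offset) % num_nics] += 1
    return rx


if __name__ == "__main__":
    print("same remote NIC  :", rx_per_nic(8, 4, dest_pe=5, per_sender_offset=False))
    print("per-sender spread:", rx_per_nic(8, 4, dest_pe=5, per_sender_offset=True))
```

In the first case all eight senders converge on one receive NIC regardless of how well their own transmits are balanced, which matches the Tx-even, Rx-funneled picture on the destination node.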

Case Study 5: Try It Yourself with the Bundled Examples

You do not need a broken cluster to see what rdmatop shows. The repo ships ready-to-run examples that generate RDMA traffic across the frameworks people actually use (ib and rdma verbs microbenchmarks, ucx, nccl, nvshmem, nixl, and pplx), plus deployment recipes for real clusters.

On Kubernetes, run rdmatop as a DaemonSet so every GPU node is covered, then attach to any pod’s TUI:

kubectl apply -f examples/kubernetes/daemonset.yaml
kubectl exec -it <rdmatop-pod> -- rdmatop

The DaemonSet runs with hostNetwork, hostPID, and the NET_ADMIN capability so it can read host RDMA devices and map queue pairs to the processes that own them.

On Slurm, submit your job, then open an interactive shell on one of its allocated nodes and watch the traffic live:

srun --jobid=$JOBID --overlap --pty bash   # hop onto a running job's node
rdmatop

Beyond debugging, this is how you tell whether a workload is compute-bound or communication-bound. In prefill–decode (PD) disaggregation, for example, the KV cache streams over RDMA from prefill to decode GPUs: if rdmatop shows those NICs saturated, the transfer is your bottleneck; if they sit near idle while the GPUs stay busy, the network is not what is holding you back. It is the fastest way to learn both the tool and your workload before you need it in production.
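That rule of thumb can be written down as a utilization check against the NIC's line rate. The sketch below is only a heuristic: the 90% and 10% thresholds and the label strings are assumptions chosen for illustration, and the observed rate and line rate are values you would supply yourself.

```python
def link_state(rate_gbps: float, line_rate_gbps: float,
               saturated_at: float = 0.9, idle_below: float = 0.1) -> str:
    """Label a NIC by how close its observed rate is to line rate."""
    if line_rate_gbps <= 0:
        raise ValueError("line_rate_gbps must be positive")
    utilization = rate_gbps / line_rate_gbps
    if utilization >= saturated_at:
        return "network-bound"   # the transfer is likely the bottleneck
    if utilization < idle_below:
        return "network-idle"    # look at compute or scheduling instead
    return "mixed"


if __name__ == "__main__":
    # Hypothetical readings against a 100 Gb/s link.
    for rate in (95.0, 50.0, 2.0):
        print(rate, link_state(rate, 100.0))
```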

Conclusion

The debugging cases above share a theme: the hardware was capable, but it was being used wrong—idle because traffic fell back to TCP, capped to a single rail, or funneled onto one NIC—and each failure was effectively invisible at the application layer. The job ran; it was simply slow. Each took real investigation to track down.

A per-NIC, per-process, Tx-vs-Rx monitor collapses that investigation into a glance. As RDMA fans out across EFA, ConnectX, Broadcom, and AMD, a tool that reads every one of them through a single vendor-neutral interface becomes essential rather than nice-to-have. That is the gap rdmatop is built to fill: htop, but for RDMA traffic. We welcome issues and contributions.
