AllReduce Stalls Are Network Stalls. Most Tools See Neither.

A developer has built an open-source agent that correlates NCCL AllReduce stalls with TCP retransmits on the same host, revealing network bottlenecks that GPU dashboards miss. The tool attaches uprobes to NCCL collective APIs and tracepoints on TCP and the scheduler, then joins the two data layers on host, PID, and timestamp at query time. A single SQL query can identify that a slow AllReduce on rank 5 coincided with three TCP retransmits on that rank's NIC, exposing network stalls that NVML and DCGM would report as a busy GPU.

A slow AllReduce on rank 5 lines up against TCP retransmits on rank 5’s NIC, four ms before the collective completes.

When a multi-node training job slows down on AllReduce, both ends of the evidence are below GPU-counter dashboards: the libnccl call surface (which rank initiated, when, with what arguments) and the kernel TCP path (which connection retransmitted, by how much, on whose NIC). The agent ships uprobes on the NCCL public API and tracepoints on TCP and the scheduler. The two layers join on (host, pid, timestamp) at query time.

On the GPU side, an AllReduce in flight looks like the GPU is busy. Compute kernels are queued behind the collective. The utilization counter reports high. But the collective is waiting for peer ranks; the SMs are not doing useful arithmetic. NVML (https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceQueries.html) sees a busy device. DCGM sees a busy device. The training step time goes up. The dashboard does not change.

The NCCL public API is small and well-named. The agent attaches uprobes on ncclAllReduce, ncclAllGather, ncclReduceScatter, ncclBcast, ncclSend, and ncclRecv, plus the lifecycle hooks ncclCommInitRank, ncclCommInitAll, and ncclCommDestroy. At the entry of each collective, the probe stashes the rank, communicator pointer, datatype, reduce-op, count, and stream. At the return, it folds the captured timestamp into a duration and emits one event with rank, nranks, and a communicator-id hash attached.

The communicator-id hash is the full 128-byte ncclUniqueId folded with splitmix64, not just the first 8 bytes. Distinct communicators that happen to share the NCCL magic-and-version header (very common) get distinct ids in the trace.

On the same host, the agent attaches to tcp:tcp_retransmit_skb and the scheduler tracepoints. A retransmit on an inter-node connection is the most common cause of a slow AllReduce that has nothing to do with the GPU. The trace records the retransmit timestamp, the saddr/daddr, and the sequence number.
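To make the communicator-id hashing concrete, here is a minimal sketch of folding a 128-byte buffer into 64 bits with splitmix64. The splitmix64 constants are the standard ones, but the fold order (XOR each little-endian 8-byte word into the state, then mix) and the names `fold_unique_id`, `id_a`, and `id_b` are assumptions made for illustration; the agent's actual fold may differ.

```python
import struct

MASK64 = (1 << 64) - 1


def splitmix64(x: int) -> int:
    """One splitmix64 step: add the increment, then run the mixing finalizer."""
    z = (x + 0x9E3779B97F4A7C15) & MASK64
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
    return z ^ (z >> 31)


def fold_unique_id(uid: bytes) -> int:
    """Fold all 128 bytes of an ncclUniqueId-sized buffer into a 64-bit hash.

    Assumed scheme (hypothetical): XOR each 8-byte word into the running
    state and pass the result through splitmix64.
    """
    if len(uid) != 128:
        raise ValueError("expected a 128-byte id")
    h = 0
    for (word,) in struct.iter_unpack("<Q", uid):
        h = splitmix64(h ^ word)
    return h


# Two synthetic ids that share the same first 8 bytes (standing in for a
# common header) and differ only in the final byte.
header = bytes(range(8))
id_a = header + bytes(120)
id_b = header + bytes(119) + b"\x01"

hash_a = fold_unique_id(id_a)
hash_b = fold_unique_id(id_b)
print(f"{hash_a:016x} {hash_b:016x}")
```

Hashing only the first 8 bytes would give these two ids the same value. Because each splitmix64 step is a bijection on 64-bit values, two ids that differ in a single word always fold to different hashes under this scheme.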
Joining the retransmit events against the libnccl AllReduce-in-flight events on (cgroup id, time window) returns the TCP-side reason for a slow collective.

    -- find slow ncclAllReduce calls and any TCP retransmits inside their window
    WITH slow_collectives AS (
      SELECT timestamp_ns, duration_ns, rank, nranks, comm_id_hash, pid
      FROM nccl_events
      WHERE op = 'ALL_REDUCE' AND duration_ns > 50000000  -- 50ms
    )
    SELECT s.rank, s.duration_ns/1e6 AS ms,
           COUNT(t.timestamp_ns) AS retransmits_in_window
    FROM slow_collectives s
    LEFT JOIN tcp_events t
      ON t.timestamp_ns BETWEEN s.timestamp_ns AND s.timestamp_ns + s.duration_ns
     AND t.event = 'tcp_retransmit_skb'
    GROUP BY s.rank, s.duration_ns, s.timestamp_ns
    ORDER BY ms DESC
    LIMIT 20;

That query returns “rank 5’s AllReduce took 187 ms and saw 3 TCP retransmits during its window”. Two layers, one join, one answer.

    # 1. install
    curl -fsSL https://github.com/ingero-io/ingero/releases/latest/download/install.sh | sh

    # 2. start a workload using NCCL on this host (PyTorch DDP, vLLM TP, etc.)

    # 3. capture for the duration of one training epoch or one inference window
    ingero trace --duration 2m --out /tmp/nccl.db

    # 4. inspect collectives
    ingero query /tmp/nccl.db \
      "SELECT op, rank, nranks, duration_ns/1e6 AS ms FROM nccl_events ORDER BY duration_ns DESC LIMIT 20"

    # 5. check whether slow collectives line up with TCP retransmits
    ingero query /tmp/nccl.db \
      "SELECT COUNT(*) FROM tcp_events WHERE event = 'tcp_retransmit_skb'"

A clean run shows zero retransmits and AllReduce durations clustered near each other. A bad rail or a noisy NIC shows up as one rank with higher AllReduce p99 and a non-zero retransmit count in the same window.

Multi-node GPU performance is bottlenecked on the network more often than on compute. The reason that fact does not show up clearly is that most observability tools draw a line between “GPU monitoring” (counters) and “network monitoring” (a different team’s dashboard). At the kernel level there is no such line.
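The "one rank with higher p99 and a non-zero retransmit count" check can be sketched end to end on synthetic data. The example below assumes the trace can be read as a SQLite database and borrows the table and column names from the query in this article; the in-memory schema, the synthetic rows, the nearest-rank `p99` helper, and the 2x-median threshold are all illustrative assumptions, not the agent's behaviour.

```python
import math
import sqlite3
import statistics

MS = 1_000_000  # nanoseconds per millisecond

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE nccl_events (timestamp_ns INTEGER, duration_ns INTEGER,
                          op TEXT, rank INTEGER, nranks INTEGER);
CREATE TABLE tcp_events (timestamp_ns INTEGER, event TEXT);
""")

# Synthetic run: 8 ranks, 20 AllReduce calls each at roughly 10 ms.
# Rank 5 has a single 187 ms call at step 7.
rows = []
for rank in range(8):
    for step in range(20):
        duration = 10 * MS + rank * 1000
        if rank == 5 and step == 7:
            duration = 187 * MS
        rows.append((step * 200 * MS, duration, "ALL_REDUCE", rank, 8))
conn.executemany("INSERT INTO nccl_events VALUES (?, ?, ?, ?, ?)", rows)

# Three retransmits that fall inside rank 5's slow window only.
slow_start = 7 * 200 * MS
conn.executemany(
    "INSERT INTO tcp_events VALUES (?, ?)",
    [(slow_start + off * MS, "tcp_retransmit_skb") for off in (20, 90, 183)],
)


def p99(values):
    """Nearest-rank 99th percentile."""
    ordered = sorted(values)
    return ordered[math.ceil(0.99 * len(ordered)) - 1]


durations = {}
for rank, dur in conn.execute(
    "SELECT rank, duration_ns FROM nccl_events WHERE op = 'ALL_REDUCE'"
):
    durations.setdefault(rank, []).append(dur)
p99_by_rank = {rank: p99(vals) for rank, vals in durations.items()}

# Count retransmits landing inside any AllReduce window, per rank.
retransmits_by_rank = dict(conn.execute("""
    SELECT n.rank, COUNT(t.timestamp_ns)
    FROM nccl_events n
    LEFT JOIN tcp_events t
      ON t.timestamp_ns BETWEEN n.timestamp_ns AND n.timestamp_ns + n.duration_ns
     AND t.event = 'tcp_retransmit_skb'
    WHERE n.op = 'ALL_REDUCE'
    GROUP BY n.rank
"""))

# Flag ranks whose p99 is far above the typical rank and that saw retransmits.
baseline = statistics.median(p99_by_rank.values())
suspects = sorted(
    rank for rank in p99_by_rank
    if p99_by_rank[rank] > 2 * baseline and retransmits_by_rank[rank] > 0
)
print(suspects)
```

Like the query in the article, this join matches on the time window alone. On a host running several ranks, a real analysis would also want to match on host and pid so that one rank's retransmits are not attributed to a neighbour's collective.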
libnccl calls and tcp_retransmit_skb events live in the same trace database and join on the same timestamp.

Ingero – open-source eBPF agent for GPU debugging. One binary, zero deps, <2% overhead. Apache 2.0 + GPL-2.0.

GitHub ⭐ https://github.com/ingero-io/ingero · Open an issue if you are running multi-node training or distributed inference and want one agent that catches both the libnccl call surface and the kernel TCP path.