A slow AllReduce on rank 5 lines up with TCP retransmits on rank 5’s NIC, 4 ms before the collective completes.
When a multi-node training job slows down on AllReduce, both ends of the evidence sit below what GPU-counter dashboards can show: the libnccl call surface (which rank initiated, when, with what arguments) and the kernel TCP path (which connection retransmitted, by how much, on whose NIC). The agent ships uprobes on the NCCL public API and tracepoints on TCP and the scheduler. The two layers join on (host, pid, timestamp) at query time.
On the GPU side, an AllReduce in flight looks like a busy GPU. Compute kernels are queued behind the collective, so the utilization counter reads high. But the collective is waiting on peer ranks, and the SMs are not doing useful arithmetic. NVML sees a busy device. DCGM sees a busy device. The training step time goes up. The dashboard does not change.
The NCCL public API is small and well-named. The agent attaches uprobes on ncclAllReduce, ncclAllGather, ncclReduceScatter, ncclBcast, ncclSend, and ncclRecv, plus the lifecycle hooks (ncclCommInitRank, ncclCommInitAll, ncclCommDestroy). At the entry of each collective, the probe stashes the rank, communicator pointer, datatype, reduce-op, count, and stream. At the return, it folds the captured timestamp into a duration and emits one event with rank, nranks, and a communicator-id hash attached.
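The entry/return pairing described above can be modeled in a few lines of userspace Python. This is only a sketch of the bookkeeping, not the agent's eBPF code: the `CallTracker` class, the `(pid, tid)` key, and the event field names are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Pending:
    """Arguments stashed at function entry (hypothetical subset)."""
    start_ns: int
    op: str
    comm_ptr: int
    count: int


class CallTracker:
    """Userspace model of an entry-probe / return-probe pair."""

    def __init__(self):
        self.pending = {}  # (pid, tid) -> Pending; keying by thread is an assumption
        self.events = []

    def on_entry(self, pid, tid, ts_ns, op, comm_ptr, count):
        # Entry probe: remember when the call started and what it was asked to do.
        self.pending[(pid, tid)] = Pending(ts_ns, op, comm_ptr, count)

    def on_return(self, pid, tid, ts_ns):
        # Return probe: pair with the stashed entry and emit one event.
        p = self.pending.pop((pid, tid), None)
        if p is None:
            return None  # return seen without an entry, e.g. probe attached mid-call
        event = {
            "pid": pid,
            "op": p.op,
            "comm_ptr": p.comm_ptr,
            "count": p.count,
            "timestamp_ns": p.start_ns,
            "duration_ns": ts_ns - p.start_ns,
        }
        self.events.append(event)
        return event


tracker = CallTracker()
tracker.on_entry(pid=4242, tid=4250, ts_ns=1_000_000,
                 op="ALL_REDUCE", comm_ptr=0x7F00DEAD, count=1 << 20)
event = tracker.on_return(pid=4242, tid=4250, ts_ns=188_000_000)
print(event["op"], event["duration_ns"] / 1e6, "ms")
```

Popping the pending entry on return keeps the map from growing, and a return with no matching entry is dropped rather than reported with a bogus duration.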
The communicator-id hash is the full 128-byte ncclUniqueId folded with splitmix64, not just the first 8 bytes. Distinct communicators that happen to share the NCCL magic-and-version header (very common) get distinct ids in the trace.
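To see why hashing all 128 bytes matters, here is a minimal sketch of one way to fold an id through splitmix64. The mixing constants are the standard splitmix64 ones; the chaining scheme (XOR each 8-byte little-endian word into the running state, then mix) and the function name `comm_id_hash` are assumptions, and the agent's actual fold may differ.

```python
MASK64 = (1 << 64) - 1
UNIQUE_ID_BYTES = 128  # size of ncclUniqueId, as stated above


def splitmix64(x: int) -> int:
    """One splitmix64 step: add the golden-ratio increment, then mix."""
    z = (x + 0x9E3779B97F4A7C15) & MASK64
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
    return z ^ (z >> 31)


def comm_id_hash(raw: bytes) -> int:
    """Fold every 8-byte word of the id into a single 64-bit hash."""
    if len(raw) != UNIQUE_ID_BYTES:
        raise ValueError(f"expected {UNIQUE_ID_BYTES} bytes, got {len(raw)}")
    h = 0
    for i in range(0, UNIQUE_ID_BYTES, 8):
        word = int.from_bytes(raw[i:i + 8], "little")
        h = splitmix64(h ^ word)
    return h


# Two made-up ids that share their first 8 bytes and differ later on.
header = bytes(range(8))
id_a = header + bytes(120)
id_b = header + bytes([1]) + bytes(119)

print(id_a[:8] == id_b[:8])                      # a first-8-bytes key would collide
print(comm_id_hash(id_a) != comm_id_hash(id_b))  # the full fold tells them apart
```

Because each splitmix64 step is a bijection on 64-bit values, two ids that differ in a single word can never collide under this particular chaining; ids that differ in several words can collide only with the usual 64-bit hash probability.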
On the same host, the agent attaches to tcp:tcp_retransmit_skb and the scheduler tracepoints. A retransmit on an inter-node connection is the most common cause of a slow AllReduce that has nothing to do with the GPU. The trace records the retransmit timestamp, the saddr/daddr, and the sequence number. Joining that against the libnccl AllReduce-in-flight events on (cgroup_id, time-window) returns the TCP-side reason for a slow collective.
-- find slow ncclAllReduce calls and any TCP retransmits inside their window
-- note: this joins on the time window only; add a pid or cgroup predicate
-- if the trace holds more than one job per host
WITH slow_collectives AS (
    SELECT timestamp_ns, duration_ns, rank, nranks, comm_id_hash, pid
    FROM nccl_events
    WHERE op = 'ALL_REDUCE'
      AND duration_ns > 50000000  -- > 50ms
)
SELECT s.rank,
       s.duration_ns/1e6 AS ms,
       COUNT(t.timestamp_ns) AS retransmits_in_window
FROM slow_collectives s
LEFT JOIN tcp_events t
       ON t.timestamp_ns BETWEEN s.timestamp_ns
                             AND s.timestamp_ns + s.duration_ns
      AND t.event = 'tcp_retransmit_skb'
GROUP BY s.rank, s.duration_ns, s.timestamp_ns
ORDER BY ms DESC
LIMIT 20;
That query returns “rank 5’s AllReduce took 187 ms and saw 3 TCP retransmits during its window”. Two layers, one join, one answer.
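To try the join without a cluster, the sketch below runs the same query against an in-memory SQLite database filled with toy rows. It assumes the trace database is SQLite-compatible; the table and column names are taken from the query above, while the column types and all of the data are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: names follow the query in this post, types are assumed.
conn.executescript("""
CREATE TABLE nccl_events (
    timestamp_ns INTEGER, duration_ns INTEGER, op TEXT,
    rank INTEGER, nranks INTEGER, comm_id_hash INTEGER, pid INTEGER
);
CREATE TABLE tcp_events (timestamp_ns INTEGER, event TEXT);
""")

MS = 1_000_000  # nanoseconds per millisecond
conn.executemany(
    "INSERT INTO nccl_events VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1_000 * MS, 187 * MS, "ALL_REDUCE", 5, 8, 111, 4242),  # slow, with retransmits
        (2_000 * MS, 60 * MS, "ALL_REDUCE", 2, 8, 111, 4243),   # slow, clean network
        (1_000 * MS, 12 * MS, "ALL_REDUCE", 1, 8, 111, 4244),   # under the 50 ms cutoff
    ],
)
conn.executemany(
    "INSERT INTO tcp_events VALUES (?, ?)",
    [
        (1_050 * MS, "tcp_retransmit_skb"),
        (1_120 * MS, "tcp_retransmit_skb"),
        (1_183 * MS, "tcp_retransmit_skb"),  # 4 ms before rank 5 completes
        (3_000 * MS, "tcp_retransmit_skb"),  # outside every slow window
    ],
)

QUERY = """
WITH slow_collectives AS (
    SELECT timestamp_ns, duration_ns, rank, nranks, comm_id_hash, pid
    FROM nccl_events
    WHERE op = 'ALL_REDUCE' AND duration_ns > 50000000
)
SELECT s.rank, s.duration_ns/1e6 AS ms,
       COUNT(t.timestamp_ns) AS retransmits_in_window
FROM slow_collectives s
LEFT JOIN tcp_events t
       ON t.timestamp_ns BETWEEN s.timestamp_ns AND s.timestamp_ns + s.duration_ns
      AND t.event = 'tcp_retransmit_skb'
GROUP BY s.rank, s.duration_ns, s.timestamp_ns
ORDER BY ms DESC
LIMIT 20
"""

rows = conn.execute(QUERY).fetchall()
for rank, ms, retransmits in rows:
    print(f"rank {rank}: {ms:.0f} ms, {retransmits} retransmits in window")
```

With this toy data the query returns rank 5 at 187 ms with 3 retransmits and rank 2 at 60 ms with none. The LEFT JOIN is what keeps rank 2 in the result: a slow collective with zero retransmits is still worth seeing, because it points away from the network.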
curl -fsSL https://github.com/ingero-io/ingero/releases/latest/download/install.sh | sh
ingero trace --duration 2m --out /tmp/nccl.db
ingero query /tmp/nccl.db \
  "SELECT op, rank, nranks, duration_ns/1e6 AS ms
   FROM nccl_events ORDER BY duration_ns DESC LIMIT 20"
ingero query /tmp/nccl.db \
  "SELECT COUNT(*) FROM tcp_events
   WHERE event = 'tcp_retransmit_skb'"
A clean run shows zero retransmits and AllReduce durations clustered near each other. A bad rail or a noisy NIC shows up as one rank with higher AllReduce p99 and a non-zero retransmit count in the same window.
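The "one rank stands out" check can be sketched as a per-rank p99 compared against the median across ranks. Everything here is an assumption made for illustration: the nearest-rank percentile, the 2x ratio, the `suspect_ranks` helper, and the idea that retransmit counts have already been attributed to ranks (for example by the window join).

```python
from statistics import median


def p99(samples):
    """Nearest-rank 99th percentile, using integer arithmetic for the index."""
    ordered = sorted(samples)
    idx = (99 * len(ordered) + 99) // 100 - 1  # ceil(0.99 * n) - 1
    return ordered[idx]


def suspect_ranks(durations_ms, retransmits, ratio=2.0):
    """Ranks whose p99 is well above the fleet median AND that saw retransmits."""
    per_rank = {rank: p99(d) for rank, d in durations_ms.items()}
    baseline = median(per_rank.values())
    return sorted(
        rank for rank, value in per_rank.items()
        if value > ratio * baseline and retransmits.get(rank, 0) > 0
    )


# Toy data: eight ranks, 100 AllReduce durations each (milliseconds).
durations = {rank: [20.0] * 99 + [22.0] for rank in range(8)}
durations[5] = [20.0] * 97 + [150.0, 170.0, 187.0]  # the noisy rank
retransmit_counts = {5: 3}  # hypothetical per-rank retransmit totals

print(suspect_ranks(durations, retransmit_counts))
```

Requiring both conditions keeps the check conservative: a rank that is slow without retransmits, or a host that retransmits without slowing its collectives, is left for a separate look.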
Multi-node GPU performance is bottlenecked on the network more often than on compute. The reason that fact does not show up clearly is that most observability tools draw a line between “GPU monitoring” (counters) and “network monitoring” (a different team’s dashboard). At the kernel level there is no such line. libnccl calls and tcp_retransmit_skb events live in the same trace database and join on the same timestamp.
Ingero – open-source eBPF agent for GPU debugging. One binary, zero deps, <2% overhead. Apache 2.0 + GPL-2.0. GitHub ⭐ · Open an issue if you are running multi-node training or distributed inference and want one agent that catches both the libnccl call surface and the kernel TCP path.