[ARTICLE · art-15306] src=dev.to pub= topic=ai-infrastructure verified=true sentiment=· neutral

AllReduce Stalls Are Network Stalls. Most Tools See Neither.

A developer has built an open-source agent that correlates NCCL AllReduce stalls with TCP retransmits on the same host, revealing network bottlenecks that GPU dashboards miss. The tool attaches uprobes to NCCL collective APIs and tracepoints on TCP and the scheduler, then joins the two data layers on host, PID, and timestamp at query time. A single SQL query can identify that a slow AllReduce on rank 5 coincided with three TCP retransmits on that rank's NIC, exposing network stalls that NVML and DCGM would report as a busy GPU.

3 min read · published May 27, 2026

A slow AllReduce on rank 5 lines up against TCP retransmits on rank 5’s NIC, four ms before the collective completes.

When a multi-node training job slows down on AllReduce, both ends of the evidence sit below what GPU-counter dashboards can see: the libnccl call surface (which rank initiated, when, with what arguments) and the kernel TCP path (which connection retransmitted, by how much, on whose NIC). The agent ships uprobes on the NCCL public API and tracepoints on TCP and the scheduler. The two layers join on (host, pid, timestamp) at query time.

On the GPU side, an in-flight AllReduce makes the GPU look busy. Compute kernels are queued behind the collective, so the utilization counter reads high. But the collective is waiting for peer ranks, and the SMs are not doing useful arithmetic. NVML sees a busy device. DCGM sees a busy device. The training step time goes up. The dashboard does not change.

The NCCL public API is small and well-named. The agent attaches uprobes on ncclAllReduce, ncclAllGather, ncclReduceScatter, ncclBcast, ncclSend, and ncclRecv, plus the lifecycle hooks (ncclCommInitRank, ncclCommInitAll, ncclCommDestroy). At the entry of each collective, the probe stashes the rank, communicator pointer, datatype, reduce-op, count, and stream. At the return, it folds the captured timestamp into a duration and emits one event with rank, nranks, and a communicator-id hash attached.
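The entry/return pairing described above can be modeled in a few lines of Python. This is only a sketch of the bookkeeping, not the agent's eBPF code: the CallTracker class, the (pid, tid) key, and the field names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class NcclEvent:
    op: str
    rank: int
    count: int
    start_ns: int
    duration_ns: int


class CallTracker:
    """Pairs entry and return probe hits on one host (hypothetical model)."""

    def __init__(self):
        # In-flight calls keyed by (pid, tid). Assumption: a thread has at
        # most one probed call outstanding at a time.
        self._inflight = {}

    def on_entry(self, pid, tid, op, rank, count, now_ns):
        # Entry probe: stash the arguments and the entry timestamp.
        self._inflight[(pid, tid)] = (op, rank, count, now_ns)

    def on_return(self, pid, tid, now_ns):
        # Return probe: turn the stashed timestamp into a duration.
        entry = self._inflight.pop((pid, tid), None)
        if entry is None:
            return None  # return without a matching entry, nothing to emit
        op, rank, count, start_ns = entry
        return NcclEvent(op, rank, count, start_ns, now_ns - start_ns)


tracker = CallTracker()
tracker.on_entry(pid=4242, tid=4250, op="ALL_REDUCE", rank=5,
                 count=1 << 20, now_ns=1_000_000_000)
event = tracker.on_return(pid=4242, tid=4250, now_ns=1_187_000_000)
print(event.op, event.rank, event.duration_ns / 1e6, "ms")
```

Keying on the calling thread is one plausible way to match a return to its entry; the article does not say which key the agent uses.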

The communicator-id hash is the full 128-byte ncclUniqueId folded with splitmix64, not just the first 8 bytes. Distinct communicators that happen to share the NCCL magic-and-version header (very common) get distinct ids in the trace.
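To see why hashing all 128 bytes matters, here is a minimal sketch in Python. The splitmix64 step uses the standard published constants, but the way the 16 words are chained together is an assumption; the article does not spell out the exact fold the agent uses, and comm_id_hash is a hypothetical name.

```python
import struct

MASK64 = (1 << 64) - 1


def splitmix64_mix(z):
    """One splitmix64 step: add the golden-ratio increment, then mix."""
    z = (z + 0x9E3779B97F4A7C15) & MASK64
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
    return z ^ (z >> 31)


def comm_id_hash(unique_id: bytes) -> int:
    """Fold a 128-byte id into 64 bits (assumed chaining scheme)."""
    if len(unique_id) != 128:
        raise ValueError("expected a 128-byte id")
    state = 0
    # Read the id as sixteen little-endian 64-bit words and chain them.
    for (word,) in struct.iter_unpack("<Q", unique_id):
        state = splitmix64_mix(state ^ word)
    return state


# Two synthetic ids that share their first 8 bytes but differ later on.
header = b"\x01" * 8
id_a = header + bytes(120)
id_b = header + bytes(119) + b"\x01"

print(id_a[:8] == id_b[:8])                        # same leading bytes
print(comm_id_hash(id_a) != comm_id_hash(id_b))    # still distinct hashes
```

A hash over only the first 8 bytes would map both synthetic ids to the same value, which is the collision the article says the full fold avoids.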

On the same host, the agent attaches to tcp:tcp_retransmit_skb and the scheduler tracepoints. A retransmit on an inter-node connection is the most common cause of a slow AllReduce that has nothing to do with the GPU. The trace records the retransmit timestamp, the saddr/daddr, and the sequence number. Joining that against the libnccl AllReduce-in-flight events on (cgroup_id, time-window) returns the TCP-side reason for a slow collective.

-- find slow ncclAllReduce calls and any TCP retransmits inside their window
WITH slow_collectives AS (
  SELECT timestamp_ns, duration_ns, rank, nranks, comm_id_hash, pid
    FROM nccl_events
   WHERE op = 'ALL_REDUCE'
     AND duration_ns > 50000000   -- > 50ms
)
SELECT s.rank, s.duration_ns/1e6 AS ms,
       COUNT(t.timestamp_ns) AS retransmits_in_window
  FROM slow_collectives s
  LEFT JOIN tcp_events t
    ON t.timestamp_ns BETWEEN s.timestamp_ns
                         AND s.timestamp_ns + s.duration_ns
   AND t.event = 'tcp_retransmit_skb'
 GROUP BY s.rank, s.duration_ns, s.timestamp_ns
 ORDER BY ms DESC
 LIMIT 20;

That query returns “rank 5’s AllReduce took 187 ms and saw 3 TCP retransmits during its window”. Two layers, one join, one answer.
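The join can be tried without a cluster by running the same query against an in-memory SQLite database. Everything below is synthetic: the table schemas contain only the columns the query references, and the rows are made up to mirror the rank 5 example, so treat it as a sketch of the query's behavior rather than the agent's actual schema.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE nccl_events (timestamp_ns INTEGER, duration_ns INTEGER, op TEXT,
                          rank INTEGER, nranks INTEGER,
                          comm_id_hash INTEGER, pid INTEGER);
CREATE TABLE tcp_events (timestamp_ns INTEGER, event TEXT);
""")

# Synthetic collectives: rank 5 is slow, rank 3 is slow-ish, rank 4 is fine.
db.executemany("INSERT INTO nccl_events VALUES (?, ?, ?, ?, ?, ?, ?)", [
    (1_000_000_000, 187_000_000, "ALL_REDUCE", 5, 8, 1, 4242),
    (1_000_000_000, 12_000_000, "ALL_REDUCE", 4, 8, 1, 4241),
    (2_000_000_000, 60_000_000, "ALL_REDUCE", 3, 8, 1, 4240),
])

# Synthetic retransmits: three inside rank 5's window, one outside any window.
db.executemany("INSERT INTO tcp_events VALUES (?, ?)", [
    (1_050_000_000, "tcp_retransmit_skb"),
    (1_120_000_000, "tcp_retransmit_skb"),
    (1_183_000_000, "tcp_retransmit_skb"),
    (3_000_000_000, "tcp_retransmit_skb"),
])

rows = db.execute("""
WITH slow_collectives AS (
  SELECT timestamp_ns, duration_ns, rank, nranks, comm_id_hash, pid
    FROM nccl_events
   WHERE op = 'ALL_REDUCE'
     AND duration_ns > 50000000
)
SELECT s.rank, s.duration_ns/1e6 AS ms,
       COUNT(t.timestamp_ns) AS retransmits_in_window
  FROM slow_collectives s
  LEFT JOIN tcp_events t
    ON t.timestamp_ns BETWEEN s.timestamp_ns
                         AND s.timestamp_ns + s.duration_ns
   AND t.event = 'tcp_retransmit_skb'
 GROUP BY s.rank, s.duration_ns, s.timestamp_ns
 ORDER BY ms DESC
 LIMIT 20
""").fetchall()

for rank, ms, retransmits in rows:
    print(f"rank {rank}: {ms} ms, {retransmits} retransmits")
# rank 5: 187.0 ms, 3 retransmits
# rank 3: 60.0 ms, 0 retransmits
```

Because the join is a LEFT JOIN and the count is taken over a column from tcp_events, a slow collective with no retransmits still appears, with a count of zero. That is what separates a network stall from a slow collective with some other cause.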

curl -fsSL https://github.com/ingero-io/ingero/releases/latest/download/install.sh | sh

ingero trace --duration 2m --out /tmp/nccl.db

ingero query /tmp/nccl.db \
  "SELECT op, rank, nranks, duration_ns/1e6 AS ms
     FROM nccl_events ORDER BY duration_ns DESC LIMIT 20"

ingero query /tmp/nccl.db \
  "SELECT COUNT(*) FROM tcp_events
     WHERE event = 'tcp_retransmit_skb'"

A clean run shows zero retransmits and AllReduce durations clustered near each other. A bad rail or a noisy NIC shows up as one rank with higher AllReduce p99 and a non-zero retransmit count in the same window.
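The "one rank stands out" pattern can be checked mechanically once the durations are exported. The sketch below assumes the durations arrive as (rank, ms) pairs and the retransmit counts as a per-rank dictionary. The suspect_ranks helper and its 2x threshold are hypothetical choices for illustration, not something the agent provides.

```python
from collections import defaultdict
from statistics import median


def p99(values):
    """Nearest-rank 99th percentile of a non-empty list."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("need at least one sample")
    # Integer ceiling of 0.99 * n, converted to a zero-based index.
    idx = (99 * len(ordered) + 99) // 100 - 1
    return ordered[idx]


def suspect_ranks(durations_ms, retransmits, ratio=2.0):
    """Ranks whose p99 exceeds ratio x the median p99 and saw retransmits."""
    per_rank = defaultdict(list)
    for rank, ms in durations_ms:
        per_rank[rank].append(ms)
    p99s = {rank: p99(vals) for rank, vals in per_rank.items()}
    baseline = median(p99s.values())
    return sorted(rank for rank, value in p99s.items()
                  if value > ratio * baseline and retransmits.get(rank, 0) > 0)


# Synthetic run: four ranks, 100 calls each, rank 2 has two slow outliers.
samples = []
for rank in range(4):
    slow = 2 if rank == 2 else 0
    samples += [(rank, 10.0)] * (100 - slow) + [(rank, 187.0)] * slow

print(suspect_ranks(samples, retransmits={2: 3}))  # [2]
```

Requiring both signals keeps a rank with a high p99 but no retransmits, or retransmits but normal latency, off the list.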

Multi-node GPU performance is bottlenecked on the network more often than on compute. The reason that fact does not show up clearly is that most observability tools draw a line between “GPU monitoring” (counters) and “network monitoring” (a different team’s dashboard). At the kernel level there is no such line. libnccl calls and tcp_retransmit_skb events live in the same trace database and join on the same timestamp.

Ingero – open-source eBPF agent for GPU debugging. One binary, zero deps, <2% overhead. Apache 2.0 + GPL-2.0. Open an issue on GitHub if you are running multi-node training or distributed inference and want one agent that catches both the libnccl call surface and the kernel TCP path.
