nvidia-smi Reports 97% Utilization While the GPU Sits Idle

A developer found that `nvidia-smi` reported 97% GPU utilization on an H100 cluster while actual training throughput was less than half of expected benchmarks. Tracing via eBPF revealed the GPU was idle 51.7% of the time due to CPU scheduling contention, with the training process off-CPU for 62 seconds out of 120. The gap between reported utilization and actual compute efficiency can waste over $2.5 million annually on a 100-GPU H100 cluster.

A GPU shows 97% utilization in `nvidia-smi`, but training throughput is a fraction of what benchmarks promise. The GPU is not computing; it is waiting. Data loading workers are starving the training loop because CPU contention, I/O bottlenecks, or scheduling delays prevent data from arriving fast enough. Tracing the full host-to-GPU pipeline via eBPF uprobes reveals exactly where the bubble is.

We investigated a case where GPU utilization numbers looked healthy but training was slow, revealing a gap between metric dashboards and actual compute efficiency.

An H100 costs $3.50/hour. PyTorch Lightning reports 200 samples/sec, but the model card says the same architecture should hit 600 samples/sec on this hardware. Running `nvidia-smi`:

```
+---------------------------------------------+
| GPU  Name        | GPU-Util | Memory-Usage  |
|==================+==========+===============|
|   0  H100 SXM    |   97%    | 62000MiB/80GB |
+---------------------------------------------+
```

97% utilization. The GPU must be working hard, right?

Wrong. That number means "[the GPU had at least one kernel running](https://docs.nvidia.com/deploy/nvidia-smi/index.html) 97% of the time." It doesn't distinguish between a GPU whose compute cores are saturated and one that is merely occupied.

The GPU is "utilized" the way a restaurant is "full" when one person sits at every table but nobody is eating. The kitchen (the compute cores) is idle.

This isn't hypothetical. A 100-GPU H100 cluster at 60% effective utilization (despite `nvidia-smi` reporting 95%+) wastes $1.4 million per year in capital alone.
Add electricity, cooling, and engineering time spent debugging performance, and the number climbs past $2.5M. [75% of organizations](https://scailium.com/insights/gpu-utilization-enterprise-ai-crisis) report GPU utilization below 70% at peak. The gap between what `nvidia-smi` reports and actual compute efficiency is where millions of dollars disappear.

`nvidia-smi` samples GPU state once per second. It reports a binary: "was a kernel running?" It has zero visibility into what the host is doing: CPU scheduling, data loading, memory pressure, I/O contention. These are all host-side problems causing GPU-side underutilization. `nvidia-smi` only sees the GPU side.

The tracer traces both sides: CUDA APIs (what the GPU is doing) and host kernel events (what the CPU is doing), then builds causal chains connecting them.

```bash
$ ingero explain --since 120s

System Context: CPU: 94.2% | Memory: 78.1% | Load: 12.3 (8 cores) | Swap: 0 MB

Causal Chains (last 2 min):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
HIGH  CPU scheduling contention → CUDA throughput drop
  Root:         14,504 context switches on training process (PID 3821)
                Process off-CPU 62 of 120 seconds (51.7% of wall clock)
  Effect:       cudaStreamSync p99 inflated 1,028x (7µs → 7.2ms)
                CUDA op throughput dropped 47% from peak (1,200 → 640 ops/sec)
  Contributing: 4 DataLoader workers + 3 background processes competing for 8 cores
  Fix:          pin training to dedicated cores: taskset -c 0-3 python3 train.py
                set DataLoader persistent_workers=True
                nice -n 19 background jobs
```

There it is. The training process was off-CPU for 51.7% of the time. The GPU was waiting, not computing. `nvidia-smi` saw kernels queued and reported "97% utilized," but actual compute throughput was half of what it should be.

Using the MCP server:

Engineer: "Which processes caused the most scheduling contention in the last 2 minutes?"
```sql
SELECT pn.name AS process,
       COUNT(*) AS context_switches,
       SUM(duration_ns)/1e9 AS total_off_cpu_sec,
       MAX(duration_ns)/1e6 AS worst_stall_ms
FROM events e
JOIN process_names pn ON e.pid = pn.pid
WHERE op = 'sched_switch'
  AND timestamp > (SELECT MAX(timestamp) - 120000000000 FROM events)
GROUP BY pn.name
ORDER BY total_off_cpu_sec DESC
LIMIT 10;
```

```
process              | switches | off_cpu_sec | worst_stall_ms
---------------------|----------|-------------|----------------
python3 train.py     | 14,504   | 62.0        | 790.3
pt_data_worker:0     | 8,217    | 31.4        | 609.1
pt_data_worker:1     | 7,932    | 29.8        | 642.7
pt_data_worker:2     | 8,104    | 30.1        | 611.3
pt_data_worker:3     | 7,889    | 28.9        | 587.6
prometheus-node-exp  | 3,201    | 8.7         | 45.2
fluent-bit           | 2,890    | 7.1         | 38.9
```

The training process and all 4 DataLoader workers are fighting for CPU. And the worst single stall is 790ms, almost a full second where the training loop was frozen while the GPU sat idle. Background monitoring agents (Prometheus node exporter, Fluent Bit) are stealing another 15+ seconds of CPU time.

```sql
SELECT (timestamp / 10000000000) * 10 AS window_sec,
       COUNT(CASE WHEN op = 'sched_switch' THEN 1 END) AS ctx_switches,
       COUNT(CASE WHEN op = 'cudaStreamSync' THEN 1 END) AS sync_calls,
       AVG(CASE WHEN op = 'cudaStreamSync' THEN duration_ns END)/1000 AS sync_avg_us
FROM events
WHERE timestamp > (SELECT MAX(timestamp) - 120000000000 FROM events)
GROUP BY window_sec
ORDER BY window_sec;
```

```
window_sec | ctx_switches | sync_calls | sync_avg_us
-----------|--------------|------------|------------
0          | 342          | 89         | 52          ← baseline
10         | 1,205        | 91         | 180         ← contention starts
20         | 2,847        | 78         | 890         ← throughput drops
30         | 3,102        | 64         | 1,420       ← GPU starving
40         | 2,956        | 61         | 2,100       ← worst period
50         | 1,834        | 72         | 780         ← partial recovery
```

At the 40-second mark, context switches hit 3,000/10s and cudaStreamSync average latency is 40x baseline. The GPU is doing 30% fewer sync calls, not because it's working harder, but because it has nothing to sync on. The pipeline is empty.
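
The "40x baseline" and "30% fewer sync calls" figures can be checked directly against the windowed query output. The following is a small sketch that copies the six rows from that output and compares each window with the first one; the function name `compare_to_baseline` and the dictionary keys are illustrative choices, not part of the tracer.

```python
# Rows copied from the windowed query output in the text:
# (window_sec, ctx_switches, sync_calls, sync_avg_us)
WINDOWS = [
    (0, 342, 89, 52),
    (10, 1205, 91, 180),
    (20, 2847, 78, 890),
    (30, 3102, 64, 1420),
    (40, 2956, 61, 2100),
    (50, 1834, 72, 780),
]


def compare_to_baseline(windows, baseline_index=0):
    """Return latency ratio and sync-call drop of every window vs. the baseline window."""
    _, _, base_calls, base_latency_us = windows[baseline_index]
    report = []
    for window_sec, _ctx_switches, sync_calls, sync_avg_us in windows:
        report.append({
            "window_sec": window_sec,
            # How many times slower the average cudaStreamSync is than at baseline.
            "latency_ratio": sync_avg_us / base_latency_us,
            # Percentage of sync calls that disappeared relative to baseline.
            "sync_call_drop_pct": 100.0 * (base_calls - sync_calls) / base_calls,
        })
    return report


report = compare_to_baseline(WINDOWS)
worst = max(report, key=lambda row: row["latency_ratio"])
print(
    f"worst window: t={worst['window_sec']}s, "
    f"latency {worst['latency_ratio']:.1f}x baseline, "
    f"{worst['sync_call_drop_pct']:.0f}% fewer sync calls"
)
```

For the 40-second window this gives a latency ratio of about 40.4x and roughly 31% fewer sync calls, which is where the rounded numbers in the paragraph come from.
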
With `--stack` enabled, the tracer captures exactly which Python function was on-CPU when the stall happened:

```
Top cudaStreamSync callers during contention window (t=20-50s):
  train.py:142 → cudaStreamSync | 89 calls | avg 1.8ms | max 7.2ms
    ↳ loss.backward()
  train.py:145 → cudaStreamSync | 34 calls | avg 2.1ms | max 4.9ms
    ↳ optimizer.step()

Top sched_switch victims:
  train.py:138 → DataLoader.__next__() | preempted 4,201 times
    ↳ waiting for batch from workers
```

The training loop at line 138 is blocked waiting for the next batch. The DataLoader workers themselves are being preempted. The fix is clear.

After applying fixes 1-3, the same training run:

GPU underutilization is a trillion-dollar infrastructure problem hiding behind a misleading metric. Every ML team has hit this wall: training that should take 4 hours takes 12, and nobody can explain why because all the dashboards say the GPU is "fine."

The problem is always on the host side: CPU scheduling, data loading, memory pressure, I/O contention. These are Linux kernel events. The only way to see them alongside CUDA behavior is to trace both layers simultaneously.

This is what the tracer does: eBPF uprobes on the CUDA libraries plus kernel tracepoints on the scheduler, memory subsystem, and I/O stack. No code changes, no SDK integration, <2% overhead. Production-safe.

No GPU needed to see the pattern:

```bash
# 1. Build
git clone https://github.com/ingero-io/ingero.git
cd ingero && make build

# 2. Try the demos
./bin/ingero demo cpu-contention      # CPU scheduling delays causing GPU stalls
./bin/ingero demo memcpy-bottleneck   # Data transfer dominating wall-clock time
```

For real GPU tracing:

```bash
sudo ./bin/ingero trace --stack --duration 120s
# ... run the training job ...
./bin/ingero explain --since 120s
```

GitHub (give us a star): [github.com/ingero-io/ingero](https://github.com/ingero-io/ingero). No NVIDIA SDK, no code changes, production-safe by design.
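
Returning to the fixes listed in the `ingero explain` output: the first two (`taskset -c 0-3` and `persistent_workers=True`) can also be applied from inside the training script. The following is a minimal sketch, assuming Linux (where Python exposes `os.sched_setaffinity`); the helper `pin_to_cores` and the `LOADER_KWARGS` dictionary are hypothetical names introduced for illustration, and the DataLoader call is left as a comment so the snippet runs without PyTorch installed.

```python
import os


def pin_to_cores(cores, pid=0):
    """Restrict a process (0 means the caller) to the given CPU cores. Linux only."""
    allowed = os.sched_getaffinity(pid)   # cores the process may use right now
    target = set(cores) & allowed         # never request a core we are not allowed to use
    if not target:
        raise ValueError(
            f"none of {sorted(cores)} is in the allowed set {sorted(allowed)}"
        )
    os.sched_setaffinity(pid, target)
    return os.sched_getaffinity(pid)


# Keyword arguments intended for torch.utils.data.DataLoader.
# num_workers=4 mirrors the four workers seen in the trace; persistent_workers=True
# keeps worker processes alive between epochs instead of re-spawning them.
LOADER_KWARGS = {"num_workers": 4, "persistent_workers": True}


if __name__ == "__main__":
    # Assumption for this demo: use the first four cores this process is allowed on,
    # the in-process analogue of `taskset -c 0-3` on an unrestricted 8-core host.
    first_four = sorted(os.sched_getaffinity(0))[:4]
    print("training process pinned to cores:", sorted(pin_to_cores(first_four)))
    # loader = torch.utils.data.DataLoader(dataset, **LOADER_KWARGS)
```

Child processes inherit the affinity mask on Linux, so pinning before the DataLoader starts its workers confines them to the same cores, as `taskset` would. Whether the workers should instead get their own cores depends on the workload and is worth measuring rather than assuming.
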
We believe that your GPU utilization metrics can be misleading, and we'd love to help. Drop us an issue on GitHub and we will gladly dive into it together.

Ingero is free & open source software licensed under Apache 2.0 (user-space) + GPL-2.0/BSD-3 (eBPF kernel-space). One binary, zero dependencies, <2% overhead.