nvidia-smi Reports 97% Utilization While the GPU Sits Idle

wpnews.pro

A GPU shows 97% utilization in nvidia-smi, but training throughput is a fraction of what benchmarks promise. The GPU is not computing; it is waiting. Data workers are starving the training loop because CPU contention, I/O bottlenecks, or scheduling delays prevent data from arriving fast enough. Tracing the full host-to-GPU pipeline via eBPF uprobes reveals exactly where the bubble is. We investigated a case where GPU utilization numbers looked healthy but training was slow, revealing a gap between metric dashboards and actual compute efficiency.

An H100 costs $3.50/hour. PyTorch Lightning reports 200 samples/sec, but the model card says the same architecture should hit 600 samples/sec on this hardware.

Running nvidia-smi:

+------------------+----------+---------------+
| GPU  Name        | GPU-Util | Memory-Usage  |
|==================+==========+===============|
|   0  H100 SXM    |    97%   | 62000MiB/80GB |
+------------------+----------+---------------+

97% utilization. The GPU must be working hard, right?

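Wrong. That number means "the GPU had at least one kernel running 97% of the time." It doesn't distinguish between a GPU whose compute cores are saturated and one where a single small kernel is resident while most of the cores sit idle.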

The GPU is "utilized" the way a restaurant is "full" when one person sits at every table but nobody is eating. The kitchen (the compute cores) is idle.
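To make the distinction concrete, here is a toy calculation. It is not how nvidia-smi is implemented: it assumes a made-up timeline of non-overlapping kernels, each occupying only 8 of the 132 streaming multiprocessors (SMs) on an H100 SXM, and compares the time-based figure with one weighted by how many SMs were actually busy. The function names are hypothetical.

```python
# Toy model: each kernel is (start_ms, end_ms, sms_used). The timeline and the
# 8-SM kernels are invented for illustration; they are not measurements.
TOTAL_SMS = 132  # SM count of an H100 SXM


def time_utilization(kernels, window_ms):
    """Fraction of the window in which at least one kernel was running.

    Assumes the kernels do not overlap in time.
    """
    return sum(end - start for start, end, _ in kernels) / window_ms


def sm_weighted_utilization(kernels, window_ms, total_sms=TOTAL_SMS):
    """Fraction of the available SM-time that was actually occupied."""
    busy_sm_ms = sum((end - start) * sms for start, end, sms in kernels)
    return busy_sm_ms / (window_ms * total_sms)


# Ten small kernels, each running 97 ms out of every 100 ms on 8 SMs.
kernels = [(i * 100.0, i * 100.0 + 97.0, 8) for i in range(10)]

print(f"time-based:  {time_utilization(kernels, 1000.0):.0%}")
print(f"SM-weighted: {sm_weighted_utilization(kernels, 1000.0):.1%}")
```

Under these assumptions the time-based number is 97% while under 6% of the SM-time is in use: every table is occupied, but almost nobody is eating.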

This isn't hypothetical. A 100-GPU H100 cluster at 60% effective utilization (despite nvidia-smi reporting 95%+) wastes $1.4 million per year in capital alone. Add electricity, cooling, and engineering time debugging performance, and the number climbs past $2.5M.
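The arithmetic behind a figure like this can be sketched in a few lines. The cost basis for the $1.4 million is not stated here, so this sketch uses the $3.50/hour rate quoted earlier as a stand-in. That rate, the 24x7 usage, and the function names are assumptions for illustration.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760, assuming the cluster is paid for around the clock


def wasted_per_year(gpus, hourly_rate, effective_util):
    """Dollars per year spent on GPU time that does no useful work."""
    return gpus * hourly_rate * HOURS_PER_YEAR * (1.0 - effective_util)


def implied_hourly_rate(waste, gpus, effective_util):
    """Per-GPU hourly rate that would produce a given yearly waste figure."""
    return waste / (gpus * HOURS_PER_YEAR * (1.0 - effective_util))


# Assumption: 100 GPUs at $3.50/hour each, 60% effective utilization.
print(f"${wasted_per_year(100, 3.50, 0.60):,.0f} wasted per year")

# Rate that would be needed to arrive at the $1.4M quoted above.
print(f"${implied_hourly_rate(1_400_000, 100, 0.60):.2f} per GPU-hour")
```

At $3.50/hour the waste comes to roughly $1.23 million per year; the $1.4 million figure corresponds to about $4.00 per GPU-hour, so it presumably rests on a slightly different cost basis.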

75% of organizations report GPU utilization below 70% at peak. The gap between what nvidia-smi reports and actual compute efficiency is where millions of dollars disappear.

nvidia-smi samples GPU state once per second. It reports a binary: "was a kernel running?" It has zero visibility into what happens on the host: CPU scheduling, data loading, memory pressure, and I/O contention.

These are all host-side problems causing GPU-side underutilization. nvidia-smi only sees the GPU side.

Ingero, the tracer used in this investigation, follows both sides: CUDA APIs (what the GPU is doing) and host kernel events (what the CPU is doing), then builds causal chains connecting them.

$ ingero explain --since 120s

System Context:
  CPU: 94.2% | Memory: 78.1% | Load: 12.3 (8 cores) | Swap: 0 MB

Causal Chains (last 2 min):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
[HIGH] CPU scheduling contention → CUDA throughput drop
  Root: 14,504 context switches on training process (PID 3821)
        Process off-CPU 62 of 120 seconds (51.7% of wall clock)
  Effect: cudaStreamSync p99 inflated 1,028x (7µs → 7.2ms)
          CUDA op throughput dropped 47% from peak (1,200 → 640 ops/sec)
  Contributing: 4 DataLoader workers + 3 background processes competing for 8 cores
  Fix: pin training to dedicated cores: taskset -c 0-3 python3 train.py
       set DataLoader persistent_workers=True
       nice -n 19 background jobs

There it is. The training process was off-CPU for 51.7% of the time. The GPU was waiting, not computing. nvidia-smi saw kernels queued and reported "97% utilized," but actual compute throughput was half of what it should be.
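The core-pinning fix suggested in the output can also be applied from inside Python instead of through taskset. What follows is a minimal sketch using os.sched_setaffinity, which is Linux-only. The core numbers mirror the 8-core host above but are assumptions, and the helper names are hypothetical.

```python
import os

# Assumption: an 8-core host, training on cores 0-3, data workers on 4-7.
TRAIN_CORES = {0, 1, 2, 3}
WORKER_CORES = (4, 5, 6, 7)


def pin_current_process(cores):
    """Restrict the calling process to the given CPU cores (Linux only).

    Returns the affinity mask that is in effect afterwards.
    """
    os.sched_setaffinity(0, set(cores))  # pid 0 means "this process"
    return os.sched_getaffinity(0)


def core_for_worker(worker_id):
    """Pick one core per data-loading worker, round-robin."""
    return WORKER_CORES[worker_id % len(WORKER_CORES)]


def worker_init_fn(worker_id):
    """Hook meant to run once inside each data-loading worker process."""
    return pin_current_process({core_for_worker(worker_id)})


# In train.py, before building the data pipeline (not executed here):
#   pin_current_process(TRAIN_CORES)
```

With PyTorch, the hook would be passed as DataLoader(..., num_workers=4, persistent_workers=True, worker_init_fn=worker_init_fn). Keeping workers alive between epochs avoids re-spawning them, and the hook keeps them off the cores reserved for the training process.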

Using the MCP server:

Engineer: "Which processes caused the most scheduling contention in the last 2 minutes?"

SELECT
  pn.name as process,
  COUNT(*) as context_switches,
  SUM(duration_ns)/1e9 as total_off_cpu_sec,
  MAX(duration_ns)/1e6 as worst_stall_ms
FROM events e
JOIN process_names pn ON e.pid = pn.pid
WHERE op = 'sched_switch' AND timestamp > (SELECT MAX(timestamp) - 120000000000 FROM events)
GROUP BY pn.name
ORDER BY total_off_cpu_sec DESC
LIMIT 10;

process              | switches | off_cpu_sec | worst_stall_ms
---------------------|----------|-------------|----------------
python3 (train.py)   | 14,504   | 62.0        | 790.3
pt_data_worker:0     | 8,217    | 31.4        | 609.1
pt_data_worker:1     | 7,932    | 29.8        | 642.7
pt_data_worker:2     | 8,104    | 30.1        | 611.3
pt_data_worker:3     | 7,889    | 28.9        | 587.6
prometheus-node-exp  | 3,201    | 8.7         | 45.2
fluent-bit           | 2,890    | 7.1         | 38.9

The training process and all 4 DataLoader workers are fighting for CPU. The worst single stall is 790 ms: almost a full second in which the training loop was frozen while the GPU sat idle.

Background monitoring agents (Prometheus node exporter, Fluent Bit) compete for the same cores, adding roughly 6,000 context switches and 15+ seconds of off-CPU time of their own.
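The contention query can be tried without a trace on hand. This is a minimal sketch that assumes the events sit in an SQLite database with two tables shaped like the ones the query references. The column types and the toy rows are assumptions, not the tool's actual schema.

```python
import sqlite3

# The query from above, with only the whitespace changed.
QUERY = """
SELECT pn.name AS process,
       COUNT(*) AS context_switches,
       SUM(duration_ns) / 1e9 AS total_off_cpu_sec,
       MAX(duration_ns) / 1e6 AS worst_stall_ms
FROM events e
JOIN process_names pn ON e.pid = pn.pid
WHERE op = 'sched_switch'
  AND timestamp > (SELECT MAX(timestamp) - 120000000000 FROM events)
GROUP BY pn.name
ORDER BY total_off_cpu_sec DESC
LIMIT 10;
"""


def top_contenders(conn):
    """Return (process, switches, off_cpu_sec, worst_stall_ms) rows."""
    return conn.execute(QUERY).fetchall()


# Hypothetical schema and toy data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE process_names (pid INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE events (pid INTEGER, op TEXT, timestamp INTEGER, duration_ns INTEGER);
""")
conn.executemany("INSERT INTO process_names VALUES (?, ?)",
                 [(1, "train"), (2, "worker")])
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", [
    (1, "sched_switch", 200_000_000_000, 2_000_000_000),
    (1, "sched_switch", 210_000_000_000, 1_000_000_000),
    (2, "sched_switch", 205_000_000_000, 500_000_000),
    (1, "sched_switch", 50_000_000_000, 9_000_000_000),  # older than 120 s: ignored
    (1, "cudaStreamSync", 209_000_000_000, 7_000),       # not a context switch
])

for row in top_contenders(conn):
    print(row)
```

Note that the 120-second window is measured back from the newest event in the table, not from the current time, so the query gives the same answer whenever it is run against a saved trace.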

SELECT
  (timestamp / 10000000000) * 10 as window_sec,
  COUNT(CASE WHEN op = 'sched_switch' THEN 1 END) as ctx_switches,
  COUNT(CASE WHEN op = 'cudaStreamSync' THEN 1 END) as sync_calls,
  AVG(CASE WHEN op = 'cudaStreamSync' THEN duration_ns END)/1000 as sync_avg_us
FROM events
WHERE timestamp > (SELECT MAX(timestamp) - 120000000000 FROM events)
GROUP BY window_sec
ORDER BY window_sec;

window_sec | ctx_switches | sync_calls | sync_avg_us
-----------|-------------|------------|------------
0          | 342         | 89         | 52          ← baseline
10         | 1,205       | 91         | 180         ← contention starts
20         | 2,847       | 78         | 890         ← throughput drops
30         | 3,102       | 64         | 1,420       ← GPU starving
40         | 2,956       | 61         | 2,100       ← worst period
50         | 1,834       | 72         | 780         ← partial recovery

At the 40-second mark, context switches reach nearly 3,000 per 10-second window and average cudaStreamSync latency is 40x baseline. The training process issues about 30% fewer sync calls, not because the GPU is working harder, but because it has nothing to sync on. The pipeline is empty.
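The window_sec column in the output starts at 0, which suggests the windows are measured from the start of the trace. The same 10-second bucketing can be sketched in plain Python, assuming events arrive as (timestamp_ns, op, duration_ns) tuples. That tuple shape and the function name are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

WINDOW_NS = 10 * 1_000_000_000  # 10 seconds in nanoseconds


def bucket_events(events):
    """Group events into 10 s windows measured from the earliest timestamp.

    Returns {window_sec: (ctx_switches, sync_calls, sync_avg_us)}; sync_avg_us
    is None for a window that contains no cudaStreamSync calls.
    """
    events = list(events)
    if not events:
        return {}
    t0 = min(ts for ts, _, _ in events)
    switches = defaultdict(int)
    sync_durations = defaultdict(list)
    for ts, op, duration_ns in events:
        window = (ts - t0) // WINDOW_NS * 10  # window start, in seconds
        if op == "sched_switch":
            switches[window] += 1
        elif op == "cudaStreamSync":
            sync_durations[window].append(duration_ns)
    result = {}
    for window in sorted(set(switches) | set(sync_durations)):
        durations = sync_durations.get(window, [])
        avg_us = sum(durations) / len(durations) / 1000 if durations else None
        result[window] = (switches.get(window, 0), len(durations), avg_us)
    return result
```

Dividing a window's sync_avg_us by the baseline window's value gives the slowdown factor quoted above (2,100 / 52 is roughly 40).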

With --stack enabled, the tracer captures exactly which Python function was on-CPU when the stall happened:

Top cudaStreamSync callers during contention window (t=20-50s):
  train.py:142  → cudaStreamSync | 89 calls | avg 1.8ms | max 7.2ms
    ↳ loss.backward()
  train.py:145  → cudaStreamSync | 34 calls | avg 2.1ms | max 4.9ms
    ↳ optimizer.step()

Top sched_switch victims:
  train.py:138  → DataLoader.__next__() | preempted 4,201 times
    ↳ waiting for batch from workers

The training loop at line 138 is blocked waiting for the next batch. The DataLoader workers themselves are being preempted. The fix is clear.
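Even without a tracer, a rough version of this signal is visible from inside the training loop: time how long each call for the next batch blocks. Below is a minimal sketch using only the standard library. The timed_batches wrapper and the slow_loader stand-in are hypothetical names, and the wrapper measures host-side wait time only, not GPU activity.

```python
import time


def timed_batches(loader, stats):
    """Yield batches from `loader`, recording how long each fetch blocked.

    Adds the wait time to stats["wait_s"] and counts batches in
    stats["batches"].
    """
    it = iter(loader)
    while True:
        start = time.perf_counter()
        try:
            batch = next(it)
        except StopIteration:
            return
        stats["wait_s"] = stats.get("wait_s", 0.0) + time.perf_counter() - start
        stats["batches"] = stats.get("batches", 0) + 1
        yield batch


def slow_loader(n, delay_s):
    """Stand-in for a starved data pipeline: each batch arrives late."""
    for i in range(n):
        time.sleep(delay_s)
        yield i


stats = {}
t0 = time.perf_counter()
for batch in timed_batches(slow_loader(5, 0.01), stats):
    pass  # forward pass, backward pass, and optimizer step would go here
total_s = time.perf_counter() - t0

print(f"waited {stats['wait_s']:.3f}s of {total_s:.3f}s for data")
```

If the wait time is a large share of the loop's wall-clock time, the loop is data-starved regardless of what the utilization dashboard reports.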

After applying fixes 1-3, the same training run:

GPU underutilization is a trillion-dollar infrastructure problem hiding behind a misleading metric. Every ML team has hit this wall: training that should take 4 hours takes 12, and nobody can explain why because all the dashboards say the GPU is "fine."

In cases like this one, the problem is on the host side: CPU scheduling, data loading, memory pressure, I/O contention. These are Linux kernel events. The only way to see them alongside CUDA behavior is to trace both layers simultaneously.

This is what the tracer does: eBPF uprobes on the CUDA libraries plus kernel tracepoints on the scheduler, memory subsystem, and I/O stack. No code changes, no SDK integration, <2% overhead. Production-safe.

No GPU needed to see the pattern:

git clone https://github.com/ingero-io/ingero.git
cd ingero && make build

./bin/ingero demo cpu-contention     # CPU scheduling delays causing GPU stalls
./bin/ingero demo memcpy-bottleneck  # Data transfer dominating wall-clock time

For real GPU tracing:

sudo ./bin/ingero trace --stack --duration 120s
./bin/ingero explain --since 120s

GitHub (give us a star!): github.com/ingero-io/ingero. No NVIDIA SDK, no code changes, production-safe by design.

We believe that your GPU utilization metrics can be misleading, and we'd love to help! Drop us an issue on GitHub and we will gladly dive into it together.

Ingero is free & open source software licensed under Apache 2.0 (user-space) + GPL-2.0/BSD-3 (eBPF kernel-space). One binary, zero dependencies, <2% overhead.

source & further reading

dev.to (original article)
