# nvidia-smi Reports 97% Utilization While the GPU Sits Idle

> Source: <https://dev.to/ingero/nvidia-smi-reports-97-utilization-while-the-gpu-sits-idle-20j4>
> Published: 2026-06-12 14:30:00+00:00

A GPU shows 97% utilization in `nvidia-smi`

, but training throughput is a fraction of what benchmarks promise. The GPU is not computing; it is waiting. Data loading workers are starving the training loop because CPU contention, I/O bottlenecks, or scheduling delays prevent data from arriving fast enough. Tracing the full host-to-GPU pipeline via eBPF uprobes reveals exactly where the bubble is. We investigated a case where GPU utilization numbers looked healthy but training was slow, revealing a gap between metric dashboards and actual compute efficiency.

An H100 costs $3.50/hour. PyTorch Lightning reports 200 samples/sec, but the model card says the same architecture should hit 600 samples/sec on this hardware.

Running `nvidia-smi`

:

```
+-------------------------------------------+
| GPU  Name        | GPU-Util | Memory-Usage |
|==================+==========+==============|
|   0  H100 SXM    |    97%  | 62000MiB/80GB |
+-------------------------------------------+
```

97% utilization. The GPU must be working hard, right?

Wrong. That number means "[the GPU had at least one kernel running](https://docs.nvidia.com/deploy/nvidia-smi/index.html) 97% of the time." It doesn't distinguish between:

The GPU is "utilized" the way a restaurant is "full" when one person sits at every table but nobody is eating. The kitchen (the compute cores) is idle.

This isn't hypothetical. A 100-GPU H100 cluster at 60% effective utilization (despite nvidia-smi reporting 95%+) wastes **$1.4 million per year** in capital alone. Add electricity, cooling, and engineering time debugging performance, and the number climbs past $2.5M.

[75% of organizations](https://scailium.com/insights/gpu-utilization-enterprise-ai-crisis) report GPU utilization below 70% at peak. The gap between what `nvidia-smi`

reports and actual compute efficiency is where millions of dollars disappear.

`nvidia-smi`

samples GPU state once per second. It reports a binary: "was a kernel running?" It has zero visibility into:

These are all *host-side* problems causing *GPU-side* underutilization. nvidia-smi only sees the GPU side.

The tracer traces both sides: CUDA APIs (what the GPU is doing) and host kernel events (what the CPU is doing), then builds causal chains connecting them.

``` bash
$ ingero explain --since 120s

System Context:
  CPU: 94.2% | Memory: 78.1% | Load: 12.3 (8 cores) | Swap: 0 MB

Causal Chains (last 2 min):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
[HIGH] CPU scheduling contention → CUDA throughput drop
  Root: 14,504 context switches on training process (PID 3821)
        Process off-CPU 62 of 120 seconds (51.7% of wall clock)
  Effect: cudaStreamSync p99 inflated 1,028x (7µs → 7.2ms)
          CUDA op throughput dropped 47% from peak (1,200 → 640 ops/sec)
  Contributing: 4 DataLoader workers + 3 background processes competing for 8 cores
  Fix: pin training to dedicated cores: taskset -c 0-3 python3 train.py
       set DataLoader persistent_workers=True
       nice -n 19 background jobs
```

There it is. The training process was **off-CPU for 51.7% of the time**. The GPU was waiting, not computing. nvidia-smi saw kernels queued and reported "97% utilized," but actual compute throughput was half of what it should be.

Using the MCP server:

**Engineer**: "Which processes caused the most scheduling contention in the last 2 minutes?"

```
SELECT
  pn.name as process,
  COUNT(*) as context_switches,
  SUM(duration_ns)/1e9 as total_off_cpu_sec,
  MAX(duration_ns)/1e6 as worst_stall_ms
FROM events e
JOIN process_names pn ON e.pid = pn.pid
WHERE op = 'sched_switch' AND timestamp > (SELECT MAX(timestamp) - 120000000000 FROM events)
GROUP BY pn.name
ORDER BY total_off_cpu_sec DESC
LIMIT 10;
process              | switches | off_cpu_sec | worst_stall_ms
---------------------|----------|-------------|----------------
python3 (train.py)   | 14,504   | 62.0        | 790.3
pt_data_worker:0     | 8,217    | 31.4        | 609.1
pt_data_worker:1     | 7,932    | 29.8        | 642.7
pt_data_worker:2     | 8,104    | 30.1        | 611.3
pt_data_worker:3     | 7,889    | 28.9        | 587.6
prometheus-node-exp  | 3,201    | 8.7         | 45.2
fluent-bit           | 2,890    | 7.1         | 38.9
```

The training process and all 4 DataLoader workers are fighting for CPU. And the worst single stall is **790ms**, that's almost a full second where the training loop was frozen while the GPU sat idle.

Background monitoring agents (Prometheus node exporter, Fluent Bit) are stealing another 15+ seconds of CPU time.

```
SELECT
  (timestamp / 10000000000) * 10 as window_sec,
  COUNT(CASE WHEN op = 'sched_switch' THEN 1 END) as ctx_switches,
  COUNT(CASE WHEN op = 'cudaStreamSync' THEN 1 END) as sync_calls,
  AVG(CASE WHEN op = 'cudaStreamSync' THEN duration_ns END)/1000 as sync_avg_us
FROM events
WHERE timestamp > (SELECT MAX(timestamp) - 120000000000 FROM events)
GROUP BY window_sec
ORDER BY window_sec;
window_sec | ctx_switches | sync_calls | sync_avg_us
-----------|-------------|------------|------------
0          | 342         | 89         | 52          ← baseline
10         | 1,205       | 91         | 180         ← contention starts
20         | 2,847       | 78         | 890         ← throughput drops
30         | 3,102       | 64         | 1,420       ← GPU starving
40         | 2,956       | 61         | 2,100       ← worst period
50         | 1,834       | 72         | 780         ← partial recovery
```

At the 40-second mark, context switches hit 3,000/10s and `cudaStreamSync`

average latency is 40x baseline. The GPU is doing 30% fewer sync calls, not because it's working harder, but because it has nothing to sync on. The pipeline is empty.

With `--stack`

enabled, The tracer captures exactly which Python function was on-CPU when the stall happened:

```
Top cudaStreamSync callers during contention window (t=20-50s):
  train.py:142  → cudaStreamSync | 89 calls | avg 1.8ms | max 7.2ms
    ↳ loss.backward()
  train.py:145  → cudaStreamSync | 34 calls | avg 2.1ms | max 4.9ms
    ↳ optimizer.step()

Top sched_switch victims:
  train.py:138  → DataLoader.__next__() | preempted 4,201 times
    ↳ waiting for batch from workers
```

The training loop at line 138 is blocked waiting for the next batch. The DataLoader workers themselves are being preempted. The fix is clear.

After applying fixes 1-3, the same training run:

GPU underutilization is a trillion-dollar infrastructure problem hiding behind a misleading metric. Every ML team has hit this wall: training that should take 4 hours takes 12, and nobody can explain why because all the dashboards say the GPU is "fine."

The problem is always on the host side: CPU scheduling, data loading, memory pressure, I/O contention. These are Linux kernel events. The only way to see them alongside CUDA behavior is to trace both layers simultaneously.

This is what the tracer does: eBPF uprobes on the CUDA libraries plus kernel tracepoints on the scheduler, memory subsystem, and I/O stack. No code changes, no SDK integration, <2% overhead. Production-safe.

No GPU needed to see the pattern:

```
# 1. Build
git clone https://github.com/ingero-io/ingero.git
cd ingero && make build

# 2. Try the demos
./bin/ingero demo cpu-contention     # CPU scheduling delays causing GPU stalls
./bin/ingero demo memcpy-bottleneck  # Data transfer dominating wall-clock time
```

For real GPU tracing:

```
sudo ./bin/ingero trace --stack --duration 120s
# ... run the training job ...
./bin/ingero explain --since 120s
```

**GitHub (give us a star!):** [github.com/ingero-io/ingero](https://github.com/ingero-io/ingero). No NVIDIA SDK, no code changes, production-safe by design.

We believe that your GPU utilization metrics can be misleading, and we'd love to help! ** Drop us an issue on GitHub** and we will gladly dive into it together.

*Ingero is free & open source software licensed under Apache 2.0 (user-space) + GPL-2.0/BSD-3 (eBPF kernel-space). One binary, zero dependencies, <2% overhead.*
