{"slug": "nvidia-smi-reports-97-utilization-while-the-gpu-sits-idle", "title": "nvidia-smi Reports 97% Utilization While the GPU Sits Idle", "summary": "A developer found that `nvidia-smi` reported 97% GPU utilization on an H100 cluster while actual training throughput was less than half of expected benchmarks. Tracing via eBPF revealed the GPU was idle 51.7% of the time due to CPU scheduling contention, with the training process off-CPU for 62 seconds out of 120. The gap between reported utilization and actual compute efficiency can waste over $2.5 million annually on a 100-GPU H100 cluster.", "body_md": "A GPU shows 97% utilization in `nvidia-smi`, but training throughput is a fraction of what benchmarks promise. The GPU is not computing; it is waiting. Data loading workers are starving the training loop because CPU contention, I/O bottlenecks, or scheduling delays prevent data from arriving fast enough. Tracing the full host-to-GPU pipeline via eBPF uprobes reveals exactly where the bubble is. We investigated a case where GPU utilization numbers looked healthy but training was slow, revealing a gap between metric dashboards and actual compute efficiency.\n\nAn H100 costs $3.50/hour. PyTorch Lightning reports 200 samples/sec, but the model card says the same architecture should hit 600 samples/sec on this hardware.\n\nRunning `nvidia-smi`:\n\n```\n+-------------------------------------------+\n| GPU  Name        | GPU-Util | Memory-Usage |\n|==================+==========+==============|\n|   0  H100 SXM    |    97%  | 62000MiB/80GB |\n+-------------------------------------------+\n```\n\n97% utilization. The GPU must be working hard, right?\n\nWrong. That number means \"[the GPU had at least one kernel running](https://docs.nvidia.com/deploy/nvidia-smi/index.html) 97% of the time.\" It doesn't distinguish between:\n\nThe GPU is \"utilized\" the way a restaurant is \"full\" when one person sits at every table but nobody is eating. 
The kitchen (the compute cores) is idle.\n\nThis isn't hypothetical. A 100-GPU H100 cluster at 60% effective utilization (despite nvidia-smi reporting 95%+) wastes **$1.4 million per year** in capital alone. Add electricity, cooling, and engineering time debugging performance, and the number climbs past $2.5M.\n\n[75% of organizations](https://scailium.com/insights/gpu-utilization-enterprise-ai-crisis) report GPU utilization below 70% at peak. The gap between what `nvidia-smi` reports and actual compute efficiency is where millions of dollars disappear.\n\n`nvidia-smi` samples GPU state once per second. It reports a binary: \"was a kernel running?\" It has zero visibility into:\n\nThese are all *host-side* problems causing *GPU-side* underutilization. nvidia-smi only sees the GPU side.\n\nThe tracer traces both sides: CUDA APIs (what the GPU is doing) and host kernel events (what the CPU is doing), then builds causal chains connecting them.\n\n``` bash\n$ ingero explain --since 120s\n\nSystem Context:\n  CPU: 94.2% | Memory: 78.1% | Load: 12.3 (8 cores) | Swap: 0 MB\n\nCausal Chains (last 2 min):\n━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\n[HIGH] CPU scheduling contention → CUDA throughput drop\n  Root: 14,504 context switches on training process (PID 3821)\n        Process off-CPU 62 of 120 seconds (51.7% of wall clock)\n  Effect: cudaStreamSync p99 inflated 1,028x (7µs → 7.2ms)\n          CUDA op throughput dropped 47% from peak (1,200 → 640 ops/sec)\n  Contributing: 4 DataLoader workers + 3 background processes competing for 8 cores\n  Fix: pin training to dedicated cores: taskset -c 0-3 python3 train.py\n       set DataLoader persistent_workers=True\n       nice -n 19 background jobs\n```\n\nThere it is. The training process was **off-CPU for 51.7% of the time**. The GPU was waiting, not computing. 
nvidia-smi saw kernels queued and reported \"97% utilized,\" but actual compute throughput was half of what it should be.\n\nUsing the MCP server:\n\n**Engineer**: \"Which processes caused the most scheduling contention in the last 2 minutes?\"\n\n```\nSELECT\n  pn.name as process,\n  COUNT(*) as context_switches,\n  SUM(duration_ns)/1e9 as total_off_cpu_sec,\n  MAX(duration_ns)/1e6 as worst_stall_ms\nFROM events e\nJOIN process_names pn ON e.pid = pn.pid\nWHERE op = 'sched_switch' AND timestamp > (SELECT MAX(timestamp) - 120000000000 FROM events)\nGROUP BY pn.name\nORDER BY total_off_cpu_sec DESC\nLIMIT 10;\nprocess              | switches | off_cpu_sec | worst_stall_ms\n---------------------|----------|-------------|----------------\npython3 (train.py)   | 14,504   | 62.0        | 790.3\npt_data_worker:0     | 8,217    | 31.4        | 609.1\npt_data_worker:1     | 7,932    | 29.8        | 642.7\npt_data_worker:2     | 8,104    | 30.1        | 611.3\npt_data_worker:3     | 7,889    | 28.9        | 587.6\nprometheus-node-exp  | 3,201    | 8.7         | 45.2\nfluent-bit           | 2,890    | 7.1         | 38.9\n```\n\nThe training process and all 4 DataLoader workers are fighting for CPU. 
And the worst single stall is **790ms**: almost a full second during which the training loop was frozen while the GPU sat idle.\n\nBackground monitoring agents (Prometheus node exporter, Fluent Bit) are stealing another 15+ seconds of CPU time.\n\n```\nSELECT\n  (timestamp / 10000000000) * 10 as window_sec,\n  COUNT(CASE WHEN op = 'sched_switch' THEN 1 END) as ctx_switches,\n  COUNT(CASE WHEN op = 'cudaStreamSync' THEN 1 END) as sync_calls,\n  AVG(CASE WHEN op = 'cudaStreamSync' THEN duration_ns END)/1000 as sync_avg_us\nFROM events\nWHERE timestamp > (SELECT MAX(timestamp) - 120000000000 FROM events)\nGROUP BY window_sec\nORDER BY window_sec;\nwindow_sec | ctx_switches | sync_calls | sync_avg_us\n-----------|-------------|------------|------------\n0          | 342         | 89         | 52          ← baseline\n10         | 1,205       | 91         | 180         ← contention starts\n20         | 2,847       | 78         | 890         ← throughput drops\n30         | 3,102       | 64         | 1,420       ← GPU starving\n40         | 2,956       | 61         | 2,100       ← worst period\n50         | 1,834       | 72         | 780         ← partial recovery\n```\n\nAt the 40-second mark, context switches reach roughly 3,000 per 10-second window and the average `cudaStreamSync` latency is 40x baseline. The GPU is doing 30% fewer sync calls, not because it's working harder, but because it has nothing to sync on. 
The pipeline is empty.\n\nWith `--stack` enabled, the tracer captures exactly which Python function was on-CPU when the stall happened:\n\n```\nTop cudaStreamSync callers during contention window (t=20-50s):\n  train.py:142  → cudaStreamSync | 89 calls | avg 1.8ms | max 7.2ms\n    ↳ loss.backward()\n  train.py:145  → cudaStreamSync | 34 calls | avg 2.1ms | max 4.9ms\n    ↳ optimizer.step()\n\nTop sched_switch victims:\n  train.py:138  → DataLoader.__next__() | preempted 4,201 times\n    ↳ waiting for batch from workers\n```\n\nThe training loop at line 138 is blocked waiting for the next batch. The DataLoader workers themselves are being preempted. The fix is clear.\n\nAfter applying fixes 1-3, the same training run:\n\nGPU underutilization is a trillion-dollar infrastructure problem hiding behind a misleading metric. Every ML team has hit this wall: training that should take 4 hours takes 12, and nobody can explain why because all the dashboards say the GPU is \"fine.\"\n\nThe problem is always on the host side: CPU scheduling, data loading, memory pressure, I/O contention. These are Linux kernel events. The only way to see them alongside CUDA behavior is to trace both layers simultaneously.\n\nThis is what the tracer does: eBPF uprobes on the CUDA libraries plus kernel tracepoints on the scheduler, memory subsystem, and I/O stack. No code changes, no SDK integration, <2% overhead. Production-safe.\n\nNo GPU needed to see the pattern:\n\n```\n# 1. Build\ngit clone https://github.com/ingero-io/ingero.git\ncd ingero && make build\n\n# 2. Try the demos\n./bin/ingero demo cpu-contention     # CPU scheduling delays causing GPU stalls\n./bin/ingero demo memcpy-bottleneck  # Data transfer dominating wall-clock time\n```\n\nFor real GPU tracing:\n\n```\nsudo ./bin/ingero trace --stack --duration 120s\n# ... 
run the training job ...\n./bin/ingero explain --since 120s\n```\n\n**GitHub (give us a star!):** [github.com/ingero-io/ingero](https://github.com/ingero-io/ingero). No NVIDIA SDK, no code changes, production-safe by design.\n\nWe believe that your GPU utilization metrics can be misleading, and we'd love to help! **Drop us an issue on GitHub** and we will gladly dive into it together.\n\n*Ingero is free & open source software licensed under Apache 2.0 (user-space) + GPL-2.0/BSD-3 (eBPF kernel-space). One binary, zero dependencies, <2% overhead.*", "url": "https://wpnews.pro/news/nvidia-smi-reports-97-utilization-while-the-gpu-sits-idle", "canonical_source": "https://dev.to/ingero/nvidia-smi-reports-97-utilization-while-the-gpu-sits-idle-20j4", "published_at": "2026-06-12 14:30:00+00:00", "updated_at": "2026-06-12 14:41:11.391017+00:00", "lang": "en", "topics": ["machine-learning", "ai-infrastructure", "ai-chips", "mlops", "artificial-intelligence"], "entities": ["nvidia-smi", "H100", "PyTorch Lightning", "NVIDIA"], "alternates": {"html": "https://wpnews.pro/news/nvidia-smi-reports-97-utilization-while-the-gpu-sits-idle", "markdown": "https://wpnews.pro/news/nvidia-smi-reports-97-utilization-while-the-gpu-sits-idle.md", "text": "https://wpnews.pro/news/nvidia-smi-reports-97-utilization-while-the-gpu-sits-idle.txt", "jsonld": "https://wpnews.pro/news/nvidia-smi-reports-97-utilization-while-the-gpu-sits-idle.jsonld"}}