GPU Incident at 3am: eBPF Tracing from Page to Root Cause in 60 Seconds

wpnews.pro

3am page: GPU training pipeline missed its SLA. Datadog shows 95% GPU utilization. nvidia-smi agrees. Everything looks green, but the job is 3x slower than expected. Zero tools to diagnose this. eBPF kernel tracing produces causal chains in 60 seconds: the host CPU was fighting with Data workers, starving the GPU. A taskset

fix, back to sleep, no ML engineer woken up. This is a field guide for GPU incident response using eBPF tracing to go from alert to root cause in under a minute.

GPU incident response starts with a page that makes no sense. PagerDuty fires:

[CRITICAL] GPU Training Pipeline SLA Breached
Cluster: prod-gpu-01 (8x H100)
Job: nightly-retraining-v3
Expected completion: 02:00 UTC
Current status: 47% complete at 03:12 UTC

The monitoring stack:

Datadog GPU Dashboard:

GPU Utilization:  95%  ✅
GPU Memory:       78%  ✅
GPU Temperature:  72°C ✅
Power Draw:       680W ✅

Grafana (DCGM Exporter):

dcgm_gpu_utilization:     0.95  ✅
dcgm_fb_used:             62GB  ✅
dcgm_sm_clock:            1980MHz ✅

nvidia-smi:

+-------------------------------------------+
| GPU  Name     | GPU-Util | Memory-Usage    |
|===============+==========+=================|
|   0  H100 SXM |    97%  | 62000MiB / 80GB |
+-------------------------------------------+

Every single dashboard says the GPU is fine. A breached SLA and zero signal to work with.

This is where most GPU incidents stall. The SRE has no tools that see below the GPU utilization counter. The options are:

Every GPU monitoring tool in the stack (Datadog, Grafana, DCGM, nvidia-smi) reports the same underlying metric: "did the GPU have at least one kernel scheduled?"

That metric is useless for diagnosis. It's like monitoring a restaurant by checking "is someone sitting at each table?" without knowing if anyone is eating. The kitchen (GPU compute cores) could be idle 80% of the time between courses, and the dashboard would still say "97% utilized."

The real problems that cause GPU SLA breaches are host-side:

These are all Linux kernel events. DCGM and nvidia-smi have zero visibility into them. GPU dashboards are structurally blind to the most common causes of GPU performance degradation.

The tracer is eBPF-based and captures both sides: CUDA APIs (what the GPU is doing) and host kernel events (what the CPU, scheduler, memory, and I/O subsystems are doing). It builds causal chains connecting host events to GPU latency.

It deploys as a K8s DaemonSet and runs continuously with <2% overhead. No code changes, no NVIDIA SDK, no CUPTI.

Here's what incident response looks like:

$ ingero explain --since 1h
System Context:
  CPU: 94.2% | Memory: 78.1% | Load: 12.3 (8 cores) | Swap: 0 MB

Causal Chains (last 1 hour):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

[HIGH] CPU scheduling contention → CUDA throughput drop
  Root: 14,504 context switches on training process (PID 3821)
        Process off-CPU 62 of 120 seconds (51.7% of wall clock)
  Effect: cudaStreamSync p99 inflated 1,028x (7µs → 7.2ms)
          CUDA op throughput dropped 47% from peak
  Contributing: 4 Data workers + prometheus-node-exporter
                + fluent-bit competing for 8 cores
  Fix: pin training to dedicated cores: taskset -c 0-5 python3 train.py
       set Data persistent_workers=True
       nice -n 19 monitoring agents

There it is. The training process was off-CPU 51.7% of the time. The GPU was waiting for data, not computing. Monitoring agents (Prometheus node exporter, Fluent Bit) were stealing CPU from the training pipeline.

nvidia-smi said 97% because kernels were queued, but the pipeline was running at half speed.

$ ingero explain --per-process --since 1h
Process Breakdown (last 1 hour):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

python3 (train.py) PID 3821:
  cudaStreamSync | 12,403 calls | p50=1.2ms | p99=7.2ms
  cudaMalloc     |    206 calls | p50=65µs  | p99=2.1ms
  cuLaunchKernel | 17,509 calls | p50=12µs  | p99=890µs
  ⚠ Off-CPU: 62.0s / 120s (51.7%)
  ⚠ Context switches: 14,504

pt_data_worker:0 PID 3822:
  ⚠ Off-CPU: 31.4s / 120s (26.2%)
  ⚠ Worst stall: 609ms

prometheus-node-exporter PID 1205:
  ⚠ Context switches: 3,201
  ⚠ CPU stolen: 8.7s

The training process and all 4 Data workers are fighting for CPU with the monitoring stack. The worst single scheduling stall is 609ms, over half a second where a data worker was frozen while the GPU sat idle.

$ kubectl exec -it gpu-training-pod -- taskset -c 0-5 python3 train.py

$ kubectl exec -it monitoring-pod -- nice -n 19 prometheus-node-exporter

Or better yet, add to the DaemonSet config:

resources:
  limits:
    cpu: "6"
  requests:
    cpu: "6"

After the fix:

Ingero includes an MCP (Model Context Protocol) server that lets AI assistants investigate GPU incidents. Teams using Claude, Cursor, or any MCP-compatible tool, the AI can query Ingero directly:

SRE asks Claude: "Why is the GPU training pipeline slow on prod-gpu-01?"

Claude queries the MCP server and responds:

I found a HIGH severity causal chain on prod-gpu-01. The training process (PID 3821) was off-CPU for 51.7% of wall-clock time due to CPU scheduling contention. 14,504 context switches occurred, primarily caused by 4 Data workers and monitoring agents (prometheus-node-exporter, fluent-bit) competing for 8 CPU cores.

The GPU was effectively starved. cudaStreamSync p99 inflated from 7µs to 7.2ms (1,028x), and CUDA throughput dropped 47%.

Recommended fix: Pin the training process to dedicated cores with taskset -c 0-5

, and set nice -n 19

for monitoring agents.

This turns a 2-hour debugging session into a 30-second conversation.

The tracer deploys like any other observability agent in a K8s stack:

helm install ingero ./deploy/helm/ingero \
  --set prometheus.enabled=true \
  --set otlp.enabled=true

sudo ./bin/ingero trace --stack --prometheus :9090

What this provides:

It slots into an existing monitoring stack. No rip-and-replace.

SignalDCGM / nvidia-smiIngeroGPU utilization %Yes (misleading)Yes (with causal context)Per-CUDA-call latencyNoYes (p50/p95/p99 for every API call)CPU scheduling delaysNoYes (sched_switch tracepoints)Data worker stallsNoYes (per-process off-CPU time)Memory pressure → GPU impactNoYes (mm_page_alloc + CUDA correlation)Disk I/O → GPU stallsNoYes (block_rq + CUDA correlation)Network → distributed trainingNoYes (tcp_retransmit + CUDA correlation)Root cause chainNoYes (automated causal chains with fix recommendations)Python source line attributionNoYes (CPython frame extraction with -stack)

For SREs managing GPU infrastructure, the tracer answers three questions:

No need to understand CUDA or ML model architectures. The tracer translates kernel-level GPU events into actionable SRE language: root cause, impact, fix.

No GPU required to see the pattern:

git clone https://github.com/ingero-io/ingero.git
cd ingero && make build

./bin/ingero demo incident          # See a causal chain form in real-time
./bin/ingero demo cpu-contention    # CPU scheduling causing GPU stalls

For the GPU tracing:

sudo ./bin/ingero check              # Verify system compatibility

sudo ./bin/ingero trace --stack      # Start tracing (runs continuously)

./bin/ingero explain --since 5min    # See causal chains

GitHub (give us a star!): github.com/ingero-io/ingero. No NVIDIA SDK, no code changes, production-safe by design.

If you are seeing GPU incidents in your own workloads, we'd love to take a look. Drop an issue on GitHub and we will gladly dive into it together.

Ingero is free & open source software licensed under Apache 2.0 (user-space) + GPL-2.0/BSD-3 (eBPF kernel-space). One binary, zero dependencies, <2% overhead.

source & further reading

dev.to — original article Gemini 3.6 Flash: the Thinking Dial That Moves Cost 30x (Measured) What Is Agentic Test Creation and How Is It Different from AI Test Generation? The Watermelon Effect: How My AI Scored 94% in Testing But Only 22.2% in Real Use

GPU Incident at 3am: eBPF Tracing from Page to Root Cause in 60 Seconds

Run your AI side-project on zahid.host