GPU Incident at 3am: eBPF Tracing from Page to Root Cause in 60 Seconds

A GPU training pipeline at a major AI company breached its SLA despite showing 97% GPU utilization across all monitoring tools (Datadog, Grafana, nvidia-smi). Using eBPF kernel tracing, an SRE identified the root cause in under 60 seconds: the host CPU was spending 51.7% of wall-clock time context-switching between the training process and monitoring agents (Prometheus node exporter, Fluent Bit), starving the GPU of data. The fix required pinning the training process to dedicated cores with `taskset` and renicing monitoring agents—no ML engineer was woken up.

3am page: GPU training pipeline missed its SLA. Datadog shows 95% GPU utilization. nvidia-smi agrees. Everything looks green, but the job is 3x slower than expected. Zero tools to diagnose this. eBPF kernel tracing produces causal chains in 60 seconds: the host CPU was fighting with DataLoader workers, starving the GPU. A taskset fix, back to sleep, no ML engineer woken up. This is a field guide for GPU incident response using eBPF tracing to go from alert to root cause in under a minute. GPU incident response starts with a page that makes no sense. PagerDuty fires: CRITICAL GPU Training Pipeline SLA Breached Cluster: prod-gpu-01 8x H100 Job: nightly-retraining-v3 Expected completion: 02:00 UTC Current status: 47% complete at 03:12 UTC The monitoring stack: Datadog GPU Dashboard: GPU Utilization: 95% ✅ GPU Memory: 78% ✅ GPU Temperature: 72°C ✅ Power Draw: 680W ✅ Grafana DCGM Exporter : dcgm gpu utilization: 0.95 ✅ dcgm fb used: 62GB ✅ dcgm sm clock: 1980MHz ✅ nvidia-smi: +-------------------------------------------+ | GPU Name | GPU-Util | Memory-Usage | |===============+==========+=================| | 0 H100 SXM | 97% | 62000MiB / 80GB | +-------------------------------------------+ Every single dashboard says the GPU is fine. A breached SLA and zero signal to work with. This is where most GPU incidents stall. The SRE has no tools that see below the GPU utilization counter. The options are: Every GPU monitoring tool in the stack Datadog, Grafana, DCGM, nvidia-smi reports the same underlying metric: "did the GPU have at least one kernel scheduled?" That metric is useless for diagnosis. It's like monitoring a restaurant by checking "is someone sitting at each table?" without knowing if anyone is eating. The kitchen GPU compute cores could be idle 80% of the time between courses, and the dashboard would still say "97% utilized." The real problems that cause GPU SLA breaches are host-side : These are all Linux kernel events. DCGM and nvidia-smi have zero visibility into them. GPU dashboards are structurally blind to the most common causes of GPU performance degradation. The tracer is eBPF-based and captures both sides : CUDA APIs what the GPU is doing and host kernel events what the CPU, scheduler, memory, and I/O subsystems are doing . It builds causal chains connecting host events to GPU latency. It deploys as a K8s DaemonSet and runs continuously with <2% overhead. No code changes, no NVIDIA SDK, no CUPTI. Here's what incident response looks like: bash $ ingero explain --since 1h System Context: CPU: 94.2% | Memory: 78.1% | Load: 12.3 8 cores | Swap: 0 MB Causal Chains last 1 hour : ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ HIGH CPU scheduling contention → CUDA throughput drop Root: 14,504 context switches on training process PID 3821 Process off-CPU 62 of 120 seconds 51.7% of wall clock Effect: cudaStreamSync p99 inflated 1,028x 7µs → 7.2ms CUDA op throughput dropped 47% from peak Contributing: 4 DataLoader workers + prometheus-node-exporter + fluent-bit competing for 8 cores Fix: pin training to dedicated cores: taskset -c 0-5 python3 train.py set DataLoader persistent workers=True nice -n 19 monitoring agents There it is. The training process was off-CPU 51.7% of the time . The GPU was waiting for data, not computing. Monitoring agents Prometheus node exporter, Fluent Bit were stealing CPU from the training pipeline. nvidia-smi said 97% because kernels were queued, but the pipeline was running at half speed. bash $ ingero explain --per-process --since 1h Process Breakdown last 1 hour : ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ python3 train.py PID 3821: cudaStreamSync | 12,403 calls | p50=1.2ms | p99=7.2ms cudaMalloc | 206 calls | p50=65µs | p99=2.1ms cuLaunchKernel | 17,509 calls | p50=12µs | p99=890µs ⚠ Off-CPU: 62.0s / 120s 51.7% ⚠ Context switches: 14,504 pt data worker:0 PID 3822: ⚠ Off-CPU: 31.4s / 120s 26.2% ⚠ Worst stall: 609ms prometheus-node-exporter PID 1205: ⚠ Context switches: 3,201 ⚠ CPU stolen: 8.7s The training process and all 4 DataLoader workers are fighting for CPU with the monitoring stack. The worst single scheduling stall is 609ms , over half a second where a data worker was frozen while the GPU sat idle. bash Pin training to dedicated cores leave 2 for monitoring + OS $ kubectl exec -it gpu-training-pod -- taskset -c 0-5 python3 train.py Or: deprioritize monitoring agents $ kubectl exec -it monitoring-pod -- nice -n 19 prometheus-node-exporter Or better yet, add to the DaemonSet config: training pod resources: limits: cpu: "6" requests: cpu: "6" After the fix: Ingero includes an MCP Model Context Protocol server that lets AI assistants investigate GPU incidents. Teams using Claude, Cursor, or any MCP-compatible tool, the AI can query Ingero directly: SRE asks Claude : "Why is the GPU training pipeline slow on prod-gpu-01?" Claude queries the MCP server and responds: I found a HIGH severity causal chain on prod-gpu-01. The training process PID 3821 was off-CPU for 51.7% of wall-clock time due to CPU scheduling contention. 14,504 context switches occurred, primarily caused by 4 DataLoader workers and monitoring agents prometheus-node-exporter, fluent-bit competing for 8 CPU cores. The GPU was effectively starved. cudaStreamSync https://docs.nvidia.com/cuda/cuda-runtime-api/group CUDART STREAM.html p99 inflated from 7µs to 7.2ms 1,028x , and CUDA throughput dropped 47%. Recommended fix: Pin the training process to dedicated cores with taskset -c 0-5 , and set nice -n 19 for monitoring agents. This turns a 2-hour debugging session into a 30-second conversation. The tracer deploys like any other observability agent in a K8s stack: Helm install DaemonSet + RBAC helm install ingero ./deploy/helm/ingero \ --set prometheus.enabled=true \ --set otlp.enabled=true Or standalone sudo ./bin/ingero trace --stack --prometheus :9090 What this provides : It slots into an existing monitoring stack. No rip-and-replace. SignalDCGM / nvidia-smiIngeroGPU utilization %Yes misleading Yes with causal context Per-CUDA-call latencyNoYes p50/p95/p99 for every API call CPU scheduling delaysNoYes sched switch tracepoints DataLoader worker stallsNoYes per-process off-CPU time Memory pressure → GPU impactNoYes mm page alloc + CUDA correlation Disk I/O → GPU stallsNoYes block rq + CUDA correlation Network → distributed trainingNoYes tcp retransmit + CUDA correlation Root cause chainNoYes automated causal chains with fix recommendations Python source line attributionNoYes CPython frame extraction with -stack For SREs managing GPU infrastructure, the tracer answers three questions: No need to understand CUDA or ML model architectures. The tracer translates kernel-level GPU events into actionable SRE language: root cause, impact, fix. No GPU required to see the pattern: 1. Build git clone https://github.com/ingero-io/ingero.git cd ingero && make build 2. Try the demos ./bin/ingero demo incident See a causal chain form in real-time ./bin/ingero demo cpu-contention CPU scheduling causing GPU stalls For the GPU tracing: sudo ./bin/ingero check Verify system compatibility sudo ./bin/ingero trace --stack Start tracing runs continuously ./bin/ingero explain --since 5min See causal chains GitHub give us a star : github.com/ingero-io/ingero https://github.com/ingero-io/ingero . No NVIDIA SDK, no code changes, production-safe by design. If you are seeing GPU incidents in your own workloads, we'd love to take a look. Drop an issue on GitHub https://ingero.io/wp-admin/post.php?post=127&action=edit and we will gladly dive into it together. Ingero is free & open source software licensed under Apache 2.0 user-space + GPL-2.0/BSD-3 eBPF kernel-space . One binary, zero dependencies, <2% overhead.