{"slug": "gpu-incident-at-3am-ebpf-tracing-from-page-to-root-cause-in-60-seconds", "title": "GPU Incident at 3am: eBPF Tracing from Page to Root Cause in 60 Seconds", "summary": "A GPU training pipeline at a major AI company breached its SLA despite showing 97% GPU utilization across all monitoring tools (Datadog, Grafana, nvidia-smi). Using eBPF kernel tracing, an SRE identified the root cause in under 60 seconds: the host CPU was spending 51.7% of wall-clock time context-switching between the training process and monitoring agents (Prometheus node exporter, Fluent Bit), starving the GPU of data. The fix required pinning the training process to dedicated cores with `taskset` and renicing monitoring agents—no ML engineer was woken up.", "body_md": "3am page: GPU training pipeline missed its SLA. Datadog shows 95% GPU utilization. nvidia-smi agrees. Everything looks green, but the job is 3x slower than expected. Zero tools to diagnose this. eBPF kernel tracing produces causal chains in 60 seconds: the host CPU was fighting with DataLoader workers, starving the GPU. A `taskset`\n\nfix, back to sleep, no ML engineer woken up. This is a field guide for GPU incident response using eBPF tracing to go from alert to root cause in under a minute.\n\nGPU incident response starts with a page that makes no sense. PagerDuty fires:\n\n```\n[CRITICAL] GPU Training Pipeline SLA Breached\nCluster: prod-gpu-01 (8x H100)\nJob: nightly-retraining-v3\nExpected completion: 02:00 UTC\nCurrent status: 47% complete at 03:12 UTC\n```\n\nThe monitoring stack:\n\n**Datadog GPU Dashboard:**\n\n```\nGPU Utilization:  95%  ✅\nGPU Memory:       78%  ✅\nGPU Temperature:  72°C ✅\nPower Draw:       680W ✅\n```\n\n**Grafana (DCGM Exporter):**\n\n```\ndcgm_gpu_utilization:     0.95  ✅\ndcgm_fb_used:             62GB  ✅\ndcgm_sm_clock:            1980MHz ✅\n```\n\n**nvidia-smi:**\n\n```\n+-------------------------------------------+\n| GPU  Name     | GPU-Util | Memory-Usage    |\n|===============+==========+=================|\n|   0  H100 SXM |    97%  | 62000MiB / 80GB |\n+-------------------------------------------+\n```\n\nEvery single dashboard says the GPU is fine. A breached SLA and zero signal to work with.\n\nThis is where most GPU incidents stall. The SRE has no tools that see below the GPU utilization counter. The options are:\n\nEvery GPU monitoring tool in the stack (Datadog, Grafana, DCGM, nvidia-smi) reports the same underlying metric: **\"did the GPU have at least one kernel scheduled?\"**\n\nThat metric is useless for diagnosis. It's like monitoring a restaurant by checking \"is someone sitting at each table?\" without knowing if anyone is eating. The kitchen (GPU compute cores) could be idle 80% of the time between courses, and the dashboard would still say \"97% utilized.\"\n\nThe real problems that cause GPU SLA breaches are **host-side**:\n\nThese are all Linux kernel events. DCGM and nvidia-smi have zero visibility into them. GPU dashboards are structurally blind to the most common causes of GPU performance degradation.\n\nThe tracer is eBPF-based and captures **both sides**: CUDA APIs (what the GPU is doing) and host kernel events (what the CPU, scheduler, memory, and I/O subsystems are doing). It builds causal chains connecting host events to GPU latency.\n\nIt deploys as a K8s DaemonSet and runs continuously with <2% overhead. No code changes, no NVIDIA SDK, no CUPTI.\n\nHere's what incident response looks like:\n\n``` bash\n$ ingero explain --since 1h\nSystem Context:\n  CPU: 94.2% | Memory: 78.1% | Load: 12.3 (8 cores) | Swap: 0 MB\n\nCausal Chains (last 1 hour):\n━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\n\n[HIGH] CPU scheduling contention → CUDA throughput drop\n  Root: 14,504 context switches on training process (PID 3821)\n        Process off-CPU 62 of 120 seconds (51.7% of wall clock)\n  Effect: cudaStreamSync p99 inflated 1,028x (7µs → 7.2ms)\n          CUDA op throughput dropped 47% from peak\n  Contributing: 4 DataLoader workers + prometheus-node-exporter\n                + fluent-bit competing for 8 cores\n  Fix: pin training to dedicated cores: taskset -c 0-5 python3 train.py\n       set DataLoader persistent_workers=True\n       nice -n 19 monitoring agents\n```\n\nThere it is. The training process was **off-CPU 51.7% of the time**. The GPU was waiting for data, not computing. Monitoring agents (Prometheus node exporter, Fluent Bit) were stealing CPU from the training pipeline.\n\nnvidia-smi said 97% because kernels were queued, but the pipeline was running at half speed.\n\n``` bash\n$ ingero explain --per-process --since 1h\nProcess Breakdown (last 1 hour):\n━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\n\npython3 (train.py) PID 3821:\n  cudaStreamSync | 12,403 calls | p50=1.2ms | p99=7.2ms\n  cudaMalloc     |    206 calls | p50=65µs  | p99=2.1ms\n  cuLaunchKernel | 17,509 calls | p50=12µs  | p99=890µs\n  ⚠ Off-CPU: 62.0s / 120s (51.7%)\n  ⚠ Context switches: 14,504\n\npt_data_worker:0 PID 3822:\n  ⚠ Off-CPU: 31.4s / 120s (26.2%)\n  ⚠ Worst stall: 609ms\n\nprometheus-node-exporter PID 1205:\n  ⚠ Context switches: 3,201\n  ⚠ CPU stolen: 8.7s\n```\n\nThe training process and all 4 DataLoader workers are fighting for CPU with the monitoring stack. The worst single scheduling stall is **609ms**, over half a second where a data worker was frozen while the GPU sat idle.\n\n``` bash\n# Pin training to dedicated cores (leave 2 for monitoring + OS)\n$ kubectl exec -it gpu-training-pod -- taskset -c 0-5 python3 train.py\n\n# Or: deprioritize monitoring agents\n$ kubectl exec -it monitoring-pod -- nice -n 19 prometheus-node-exporter\n```\n\nOr better yet, add to the DaemonSet config:\n\n```\n# training pod\nresources:\n  limits:\n    cpu: \"6\"\n  requests:\n    cpu: \"6\"\n```\n\nAfter the fix:\n\nIngero includes an MCP (Model Context Protocol) server that lets AI assistants investigate GPU incidents. Teams using Claude, Cursor, or any MCP-compatible tool, the AI can query Ingero directly:\n\n**SRE asks Claude**: \"Why is the GPU training pipeline slow on prod-gpu-01?\"\n\nClaude queries the MCP server and responds:\n\nI found a HIGH severity causal chain on prod-gpu-01. The training process (PID 3821) was off-CPU for 51.7% of wall-clock time due to CPU scheduling contention. 14,504 context switches occurred, primarily caused by 4 DataLoader workers and monitoring agents (prometheus-node-exporter, fluent-bit) competing for 8 CPU cores.\n\nThe GPU was effectively starved. [cudaStreamSync](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html) p99 inflated from 7µs to 7.2ms (1,028x), and CUDA throughput dropped 47%.\n\nRecommended fix: Pin the training process to dedicated cores with `taskset -c 0-5`\n\n, and set `nice -n 19`\n\nfor monitoring agents.\n\nThis turns a 2-hour debugging session into a 30-second conversation.\n\nThe tracer deploys like any other observability agent in a K8s stack:\n\n```\n# Helm install (DaemonSet + RBAC)\nhelm install ingero ./deploy/helm/ingero \\\n  --set prometheus.enabled=true \\\n  --set otlp.enabled=true\n\n# Or standalone\nsudo ./bin/ingero trace --stack --prometheus :9090\n```\n\n**What this provides**:\n\nIt slots into an existing monitoring stack. No rip-and-replace.\n\nSignalDCGM / nvidia-smiIngeroGPU utilization %Yes (misleading)Yes (with causal context)Per-CUDA-call latencyNoYes (p50/p95/p99 for every API call)CPU scheduling delaysNoYes (sched_switch tracepoints)DataLoader worker stallsNoYes (per-process off-CPU time)Memory pressure → GPU impactNoYes (mm_page_alloc + CUDA correlation)Disk I/O → GPU stallsNoYes (block_rq + CUDA correlation)Network → distributed trainingNoYes (tcp_retransmit + CUDA correlation)Root cause chainNoYes (automated causal chains with fix recommendations)Python source line attributionNoYes (CPython frame extraction with -stack)\n\nFor SREs managing GPU infrastructure, the tracer answers three questions:\n\nNo need to understand CUDA or ML model architectures. The tracer translates kernel-level GPU events into actionable SRE language: root cause, impact, fix.\n\nNo GPU required to see the pattern:\n\n```\n# 1. Build\ngit clone https://github.com/ingero-io/ingero.git\ncd ingero && make build\n\n# 2. Try the demos\n./bin/ingero demo incident          # See a causal chain form in real-time\n./bin/ingero demo cpu-contention    # CPU scheduling causing GPU stalls\n```\n\nFor the GPU tracing:\n\n```\nsudo ./bin/ingero check              # Verify system compatibility\n\nsudo ./bin/ingero trace --stack      # Start tracing (runs continuously)\n\n./bin/ingero explain --since 5min    # See causal chains\n```\n\n**GitHub (give us a star!):** [github.com/ingero-io/ingero](https://github.com/ingero-io/ingero). No NVIDIA SDK, no code changes, production-safe by design.\n\nIf you are seeing GPU incidents in your own workloads, we'd love to take a look. [ Drop an issue on GitHub ](https://ingero.io/wp-admin/post.php?post=127&action=edit)and we will gladly dive into it together.\n\n*Ingero is free & open source software licensed under Apache 2.0 (user-space) + GPL-2.0/BSD-3 (eBPF kernel-space). One binary, zero dependencies, <2% overhead.*", "url": "https://wpnews.pro/news/gpu-incident-at-3am-ebpf-tracing-from-page-to-root-cause-in-60-seconds", "canonical_source": "https://dev.to/ingero/gpu-incident-at-3am-ebpf-tracing-from-page-to-root-cause-in-60-seconds-pm5", "published_at": "2026-06-05 14:30:00+00:00", "updated_at": "2026-06-05 14:42:27.534997+00:00", "lang": "en", "topics": ["machine-learning", "mlops", "ai-infrastructure", "ai-tools"], "entities": ["Datadog", "Grafana", "DCGM Exporter", "nvidia-smi", "PagerDuty", "H100", "eBPF", "taskset"], "alternates": {"html": "https://wpnews.pro/news/gpu-incident-at-3am-ebpf-tracing-from-page-to-root-cause-in-60-seconds", "markdown": "https://wpnews.pro/news/gpu-incident-at-3am-ebpf-tracing-from-page-to-root-cause-in-60-seconds.md", "text": "https://wpnews.pro/news/gpu-incident-at-3am-ebpf-tracing-from-page-to-root-cause-in-60-seconds.txt", "jsonld": "https://wpnews.pro/news/gpu-incident-at-3am-ebpf-tracing-from-page-to-root-cause-in-60-seconds.jsonld"}}