Keeping GPU Workloads NUMA-Local in Kubernetes

A 2-socket AMD EPYC server running Kubernetes requires careful NUMA alignment to avoid 3x latency penalties when GPUs access memory across CPU sockets. Platform engineers must configure BIOS settings like NPS (Nodes Per Socket) and use Kubernetes topology policies to pin GPU workloads to the same NUMA node as their local memory and PCIe root complex. Misalignment causes DMA reads to cross the interconnect, degrading inference performance in CPU-GPU pipelines where the CPU prepares batches for GPU processing.

“NUMA alignment” comes up frequently in GPU infrastructure discussions, but concepts like NUMA nodes, topology policies, and CPU pinning are often assumed rather than well understood. Getting it right is as much the platform engineer’s job as the workload owner’s.

This post isn’t a comprehensive guide to NUMA architecture. It’s a practical account of what happens when you align CPU and GPU resources on Kubernetes nodes: the levels of isolation Kubernetes offers, the gotchas, and what it takes to make it work.

My experience is on AMD EPYC hardware. Intel has analogous concepts (Sub-NUMA Clustering instead of NPS, UPI instead of Infinity Fabric), but I haven’t worked with Intel in this context, so I’ll stick to what I know.

If you’re already familiar with CPU cache hierarchies, sockets, and PCIe, skip ahead. Otherwise, expand below for the shared vocabulary used throughout the post.

CPU Socket: The physical slot on a motherboard that holds a processor. Multi-socket servers (commonly 2-socket) have multiple processors, each with its own cores and local memory.
Physical Core vs. Logical Core: A physical core is a single processing unit on the CPU die. With SMT (https://en.wikipedia.org/wiki/Simultaneous_multithreading; “hyperthreading” on Intel), each physical core presents as 2 logical cores that share the core’s execution resources and caches.
L1/L2 Cache: Small, fast caches private to each physical core (L1 is smaller and faster than L2). Two containers sharing a physical core, one logical core each, compete for the same L1/L2 space.
L3 Cache (Last-Level Cache): A larger cache shared among a group of cores. On AMD EPYC, it’s shared within a Core Complex Die (CCD, https://en.wikipedia.org/wiki/Chiplet#AMD) of typically 8 cores. Cores sharing an L3 cache can exchange data quickly through it.
Interconnect: The high-speed link between CPU sockets (Infinity Fabric, https://en.wikipedia.org/wiki/Infinity_Fabric, on AMD; UPI, https://en.wikipedia.org/wiki/Intel_Ultra_Path_Interconnect, on Intel). Accessing memory locally is faster than going cross-socket over the interconnect.
PCIe: The bus connecting CPUs to devices like GPUs and NICs. Each PCIe root complex is wired to a specific CPU socket, so a GPU is physically closer to one socket than another.
DMA: Lets devices like GPUs read from and write to system memory directly, without the CPU copying data byte-by-byte. If the data sits in memory attached to a different NUMA node than the GPU, the DMA read crosses the interconnect.

NUMA (Non-Uniform Memory Access) describes a memory architecture where the time it takes a CPU core to access memory depends on where that memory physically sits relative to the core. In a 2-socket server, each socket has its own local memory. A core on socket 0 can access memory attached to socket 0 quickly (local access), but accessing memory attached to socket 1 requires crossing the interconnect, which is slower. On AMD EPYC hardware, cross-socket memory access can incur roughly 3x the latency of local access.

NUMA doesn’t only exist across sockets, though. On AMD EPYC processors, a BIOS setting called NPS (Nodes Per Socket) controls how many NUMA domains a single socket is divided into:

The interactive diagram below shows a simplified view of a 2-socket AMD EPYC machine. Toggle between NPS modes to see how the NUMA boundaries change.
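Outside the diagram, the layout that a given NPS setting produces can be checked on a live Linux node. The following is a minimal sketch rather than anything from the original setup: it assumes the standard sysfs layout under /sys/devices/system/node, and the helper names (parse_cpulist, read_numa_nodes, nodes_spanned) are hypothetical, chosen only for illustration.

```python
from pathlib import Path


def parse_cpulist(text: str) -> list[int]:
    """Expand a kernel CPU list such as "0-3,8-11" into individual CPU IDs."""
    cpus: list[int] = []
    for part in text.strip().split(","):
        if not part:
            continue  # empty list, e.g. a memory-only NUMA node
        if "-" in part:
            start, end = part.split("-")
            cpus.extend(range(int(start), int(end) + 1))
        else:
            cpus.append(int(part))
    return cpus


def read_numa_nodes(sysfs_root: str = "/sys/devices/system/node") -> dict[int, list[int]]:
    """Map each NUMA node ID to its CPUs, as reported by Linux sysfs (assumed layout)."""
    nodes: dict[int, list[int]] = {}
    for node_dir in Path(sysfs_root).glob("node[0-9]*"):
        node_id = int(node_dir.name.removeprefix("node"))
        nodes[node_id] = parse_cpulist((node_dir / "cpulist").read_text())
    return nodes


def nodes_spanned(cpus: list[int], nodes: dict[int, list[int]]) -> set[int]:
    """Return the NUMA node IDs touched by a given set of logical CPUs."""
    wanted = set(cpus)
    return {node_id for node_id, node_cpus in nodes.items() if wanted & set(node_cpus)}


if __name__ == "__main__":
    topology = read_numa_nodes()
    for node_id in sorted(topology):
        print(f"node{node_id}: {len(topology[node_id])} CPUs")
```

On a 2-socket host, this would be expected to list two nodes in NPS1 and more in NPS2 or NPS4. A PCI device's NUMA node is typically exposed in a similar way through the numa_node file under /sys/bus/pci/devices/<address>/, so the same approach can be used to compare a GPU's node with the nodes a container's CPUs span.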
In NPS1, each socket is one NUMA node. In NPS2 and NPS4, each socket is further subdivided. The CCD and GPU placement is illustrative, not a promise about every SKU.

The key point: NUMA topology is a function of both the hardware and how it’s configured. You can’t assume a fixed number of NUMA nodes across your fleet unless you control the BIOS settings. Different SKUs have different core counts, different numbers of CCDs, and different NPS configurations, all of which change the NUMA geometry.

GPU inference often follows a CPU-GPU pipeline: the CPU prepares requests, batches them, copies input data into the right format, feeds data to the GPU over PCIe, and then handles postprocessing on the GPU’s output. The GPU does the heavy computation, but it’s the CPU that keeps the pipeline fed.

When a container’s CPUs are on a different NUMA node than its GPU, moving input data from CPU memory to GPU memory may cross a NUMA boundary. The GPU has to read data from memory attached to a farther-away CPU socket instead of memory attached to its local socket. This adds latency on the critical path.

In one inference workload, we observed more than 30% higher p99 tail latency under load for pods whose CPUs spanned both sockets compared with pods whose CPUs stayed on the same socket. For a latency-sensitive service, that is enough to matter, and it happens silently unless someone is explicitly monitoring NUMA alignment. Nothing in Kubernetes surfaces it. The pod is running, serving traffic, and looking healthy, just consistently slower than its peers.

Training workloads are affected too, though the impact profile is different. Data loading workers continuously preprocess batches on CPU and stage them for GPU consumption. Cross-NUMA data loaders contend for inter-socket bandwidth and add latency to every batch transfer.
PyTorch’s own performance tuning guide (https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html) explicitly recommends binding training processes to a single NUMA node.

For GPU workloads where the CPU is on the data path to the GPU, NUMA locality has a direct and measurable impact on performance.

Kubernetes offers several knobs for CPU isolation, each providing stronger guarantees at the cost of more constraints.

cpuManagerPolicy: static

By default, Kubernetes lets the operating system’s CPU scheduler move a container’s processes across any available core. This is efficient for overall CPU utilization, but it means your container’s threads may move across cores, invalidating caches and sharing physical cores with other containers.

Setting cpuManagerPolicy: static in the kubelet config changes this. Containers in Guaranteed QoS (https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/#guaranteed) pods (where requests == limits) with integer CPU requests get exclusive, pinned logical cores. Kubernetes won’t assign those exclusive CPUs to another container, and your processes stay put. Host daemons and kernel threads can still run there unless the platform also reserves or isolates CPUs for the OS.

The kubelet pins CPUs by constraining the container’s cpuset cgroup to the assigned CPU list. The assigned cores can be seen in cpuset.cpus on cgroup v1 and cpuset.cpus.effective on cgroup v2. On the node, the exact path depends on the cgroup driver, runtime, QoS class, and pod UID formatting. With systemd-style kubepods slices, the files are roughly located here:

KUBEPODS="