Keeping GPU Workloads NUMA-Local in Kubernetes

A 2-socket AMD EPYC server running Kubernetes requires careful NUMA alignment to avoid 3x latency penalties when GPUs access memory across CPU sockets. Platform engineers must configure BIOS settings like NPS (Nodes Per Socket) and use Kubernetes topology policies to pin GPU workloads to the same NUMA node as their local memory and PCIe root complex. Misalignment causes DMA reads to cross the interconnect, degrading inference performance in CPU-GPU pipelines where the CPU prepares batches for GPU processing.

“NUMA alignment” comes up frequently in GPU infrastructure discussions, but concepts like NUMA nodes, topology policies, and CPU pinning are often assumed rather than well understood. Getting it right is as much the platform engineer’s job as the workload owner’s.

This post isn’t a comprehensive guide to NUMA architecture. It’s a practical account of what happens when you align CPU and GPU resources on Kubernetes nodes: the levels of isolation Kubernetes offers, the gotchas, and what it takes to make it work. My experience is on AMD EPYC hardware. Intel has analogous concepts (Sub-NUMA Clustering instead of NPS, UPI instead of Infinity Fabric), but I haven’t worked with Intel in this context, so I’ll stick to what I know.

If you’re already familiar with CPU cache hierarchies, sockets, and PCIe, skip ahead. Otherwise, the shared vocabulary used throughout the post follows.

CPU Socket: The physical slot on a motherboard that holds a processor. Multi-socket servers (commonly 2-socket) have multiple processors, each with its own cores and local memory.

Physical Core vs. Logical Core: A physical core is a single processing unit on the CPU die. With SMT (https://en.wikipedia.org/wiki/Simultaneous_multithreading; “hyperthreading” on Intel), each physical core presents as 2 logical cores that share the core’s execution resources and caches.

L1/L2 Cache: Small, fast caches private to each physical core (L1 is smaller and faster than L2). Two containers sharing a physical core, one logical core each, compete for the same L1/L2 space.

L3 Cache (Last-Level Cache): A larger cache shared among a group of cores. On AMD EPYC, it’s shared within a Core Complex (CCD, https://en.wikipedia.org/wiki/Chiplet#AMD) of typically 8 cores. Cores sharing an L3 cache can exchange data quickly through it.
Interconnect: The high-speed link between CPU sockets (Infinity Fabric, https://en.wikipedia.org/wiki/Infinity_Fabric, on AMD; UPI, https://en.wikipedia.org/wiki/Intel_Ultra_Path_Interconnect, on Intel). Accessing memory locally is faster than going cross-socket over the interconnect.

PCIe: The bus connecting CPUs to devices like GPUs and NICs. Each PCIe root complex is wired to a specific CPU socket, so a GPU is physically closer to one socket than another.

DMA: Lets devices like GPUs read from and write to system memory directly, without the CPU copying data byte-by-byte. If the data sits in memory attached to a different NUMA node than the GPU, the DMA read crosses the interconnect.

NUMA (Non-Uniform Memory Access) describes a memory architecture where the time it takes a CPU core to access memory depends on where that memory physically sits relative to the core. In a 2-socket server, each socket has its own local memory. A core on socket 0 can access memory attached to socket 0 quickly (local access), but accessing memory attached to socket 1 requires crossing the interconnect, which is slower. On AMD EPYC hardware, cross-socket memory access can incur roughly 3x the latency of local access.

NUMA doesn’t only exist across sockets, though. On AMD EPYC processors, a BIOS setting called NPS (Nodes Per Socket) controls how many NUMA domains a single socket is divided into.

[Interactive diagram: a simplified view of a 2-socket AMD EPYC machine, switchable between NPS modes.]

In NPS1, each socket is one NUMA node. In NPS2 and NPS4, each socket is further subdivided into 2 and 4 NUMA nodes respectively. The CCD and GPU placement in the diagram is illustrative, not a promise about every SKU.

The key point: NUMA topology is a function of both the hardware and how it’s configured. You can’t assume a fixed number of NUMA nodes across your fleet unless you control the BIOS settings.
Different SKUs have different core counts, different numbers of CCDs, and different NPS configurations, all of which change the NUMA geometry.

GPU inference often follows a CPU-GPU pipeline: the CPU prepares requests, batches them, copies input data into the right format, feeds data to the GPU over PCIe, and then handles postprocessing on the GPU’s output. The GPU does the heavy computation, but it’s the CPU that keeps the pipeline fed.

When a container’s CPUs are on a different NUMA node than its GPU, moving input data from CPU memory to GPU memory may cross a NUMA boundary. The GPU has to read data from memory attached to a farther-away CPU socket instead of memory attached to its local socket. This adds latency on the critical path.

In one inference workload, we observed more than 30% higher p99 tail latency under load for pods whose CPUs spanned both sockets compared with pods whose CPUs stayed on the same socket. For a latency-sensitive service, that is enough to matter, and it happens silently unless someone is explicitly monitoring NUMA alignment. Nothing in Kubernetes surfaces it. The pod is running, serving traffic, and looking healthy, just consistently slower than its peers.

Training workloads are affected too, though the impact profile is different. Data loading workers continuously preprocess batches on CPU and stage them for GPU consumption. Cross-NUMA data loaders contend for inter-socket bandwidth and add latency to every batch transfer. PyTorch’s own performance tuning guide (https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html) explicitly recommends binding training processes to a single NUMA node.

For GPU workloads where the CPU is on the data path to the GPU, NUMA locality has a direct and measurable impact on performance.

Kubernetes offers several knobs for CPU isolation, each providing stronger guarantees at the cost of more constraints.
cpuManagerPolicy: static

By default, Kubernetes lets the operating system’s CPU scheduler move a container’s processes across any available core. This is efficient for overall CPU utilization, but it means your container’s threads may move across cores, invalidating caches and sharing physical cores with other containers.

Setting cpuManagerPolicy: static in the kubelet config changes this. Containers in Guaranteed QoS (https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/#guaranteed) pods (where requests == limits) with integer CPU requests get exclusive, pinned logical cores. Kubernetes won’t assign those exclusive CPUs to another container, and your processes stay put. Host daemons and kernel threads can still run there unless the platform also reserves or isolates CPUs for the OS.

The kubelet pins CPUs by constraining the container’s cpuset cgroup to the assigned CPU list. The assigned cores can be seen in cpuset.cpus on cgroup v1 and cpuset.cpus.effective on cgroup v2. On the node, the exact path depends on the cgroup driver, runtime, QoS class, and pod UID formatting. With systemd-style kubepods slices, the files are roughly located here:

    KUBEPODS="<kubepods.slice/.../kubepods-pod<uid>.slice/<container>.scope>"

    # cgroup v1
    cat /sys/fs/cgroup/cpuset/$KUBEPODS/cpuset.cpus

    # cgroup v2
    cat /sys/fs/cgroup/$KUBEPODS/cpuset.cpus.effective

This alone improves performance consistency. Cache affinity improves because threads aren’t migrating across cores, and container-to-container CPU contention is reduced.

But there’s a subtlety: you’re pinning logical cores (hyperthreads), not physical cores. Two containers can still end up sharing a physical core if one gets one hyperthread and the other gets the sibling. They’ll contend for that physical core’s L1 and L2 cache.

What it requires from workload owners:

- Set requests == limits for CPU and memory on all containers (including init containers and sidecars) to get Guaranteed QoS.
- CPU requests must be integers for the containers where pinning is desired.

full-pcpus-only

The full-pcpus-only (https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/#static-policy-options) CPU manager policy option (cpuManagerPolicyOptions: full-pcpus-only=true) takes isolation further. Instead of allocating individual logical cores, it allocates entire physical cores. Both hyperthreads of each core go to the same container. This eliminates L1/L2 cache contention between containers that would otherwise share a physical core.

The trade-off: containers that receive exclusive CPUs must request a multiple of the SMT thread count, typically 2. A pinned container requesting 3 CPUs fails with an SMTAlignmentError (covered in Failure Modes below). Any existing pinned containers with odd CPU counts on the node need to be resized before you enable this option.

What it requires from workload owners: even CPU request values. Audit all containers, including sidecars and init containers. Fractional CPU values on sidecars and init containers are fine, as those containers use the shared CPU pool and don’t get pinned.

single-numa-node

CPU pinning ensures your cores are dedicated, and full-pcpus-only ensures the container gets full physical cores. Neither guarantees that all your cores come from the same NUMA node. With the default static policy options, the kubelet’s CPU manager uses a packed allocation strategy that fills one NUMA node before spilling to the next (more on this below), but depending on node fragmentation, your container’s CPUs can still span NUMA boundaries.

The topologyManagerPolicy: single-numa-node (https://kubernetes.io/docs/tasks/administer-cluster/topology-manager/) setting addresses this.
The topology manager sits above the CPU manager, device manager, and memory manager, and coordinates resource allocation by collecting topology hints from each. With single-numa-node, it requires that hinted resources can be satisfied from a single NUMA node. If they can’t, the pod is rejected at admission time with a TopologyAffinityError.

The default scope (topologyManagerScope: container) computes alignment independently for each container. That’s usually fine when one main container owns the GPU and the exclusive CPUs, while sidecars are unrelated to the latency-critical path and use fractional CPU from the shared pool. topologyManagerScope: pod is stricter: it asks whether the pod’s effective request fits on one NUMA node. Use it when multiple containers in the same pod are performance-coupled, not just because a logging or metrics sidecar exists.

Caveat: Topology Manager only enforces resources that report topology hints. CPU hints come from CPU Manager, GPU hints come through Device Manager from plugins such as the nvidia-device-plugin, and memory hints come from Memory Manager. If the GPU device plugin does not report NUMA TopologyInfo, Topology Manager cannot force CPU-GPU locality.

For guaranteed NUMA alignment, set memoryManagerPolicy: Static too. This makes requested memory part of topology admission along with CPU and GPU resources. The workload’s memory request must fit within the target NUMA node. Kubernetes also requires reservedMemory when memoryManagerPolicy: Static is enabled.

This gives you the strongest isolation available. The container stays within one NUMA node and communicates with the GPU on that NUMA node without crossing the socket interconnect. CPU cache locality can still vary within that NUMA node (depending on NPS configuration), but in testing, we have found that a container occupying an entire NUMA node with no overlap from other workloads is materially less affected by cache-thrashing or CPU-intensive noisy neighbors.
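To make the hint-merging idea concrete, here is a toy model of a single-numa-node admission decision. It is only a sketch under stated assumptions, not the kubelet's implementation: the function names (node_hints, admit_single_numa), the resource keys, and the free-capacity numbers are all hypothetical, and the choice of the lowest-numbered surviving node is an arbitrary tie-break.

```python
from typing import Dict, List, Optional, Set


def node_hints(free_per_node: List[int], amount: int) -> Set[int]:
    """NUMA nodes that could satisfy `amount` of one resource on their own."""
    return {node for node, free in enumerate(free_per_node) if free >= amount}


def admit_single_numa(
    free: Dict[str, List[int]], request: Dict[str, int]
) -> Optional[int]:
    """Toy single-numa-node check (hypothetical helper, not kubelet code).

    `free` maps a resource name to its free amount per NUMA node.
    Only resources named in `request` produce hints, mirroring the caveat
    that resources without topology hints are not aligned.
    Returns a NUMA node index, or None to stand in for TopologyAffinityError.
    """
    num_nodes = len(next(iter(free.values())))
    merged = set(range(num_nodes))
    for resource, amount in request.items():
        merged &= node_hints(free[resource], amount)
    # Assumption: pick the lowest-numbered node that survives the merge.
    return min(merged) if merged else None


if __name__ == "__main__":
    request = {"cpu": 22, "gpu": 1, "memory_gib": 64}
    roomy = {"cpu": [24, 90], "gpu": [2, 0], "memory_gib": [200, 400]}
    split = {"cpu": [2, 90], "gpu": [1, 0], "memory_gib": [200, 400]}
    print(admit_single_numa(roomy, request))  # CPUs, GPU, memory all fit on node 0
    print(admit_single_numa(split, request))  # free GPU and free CPUs sit on different nodes
```

In the second case the machine as a whole has plenty of CPUs and a free GPU, but they are on different NUMA nodes, so the merged hint set is empty and the toy model rejects the request.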
What it requires from workload owners: The performance-critical container, or the effective pod request when using topologyManagerScope: pod, must fit within a single NUMA node. That means understanding the machine topology and resizing the workload as hardware or its configuration changes.

A minimal kubelet config for a dedicated NUMA-aligned GPU pool looks like:

    cpuManagerPolicy: static
    cpuManagerPolicyOptions:
      full-pcpus-only: "true"
    topologyManagerPolicy: single-numa-node
    # Default is container. Use pod only when the whole pod should fit on one NUMA node.
    topologyManagerScope: pod
    memoryManagerPolicy: Static
    # memoryManagerPolicy: Static requires reservedMemory to be configured.

Roll this out on dedicated, drained nodes. When changing CPU or memory manager policies, clear the kubelet state files before restarting kubelet: <kubelet-root-dir>/cpu_manager_state and <kubelet-root-dir>/memory_manager_state.

Understanding the allocation algorithm helps explain how and when NUMA spillover happens. When cpuManagerPolicy: static is enabled with the default policy options, the kubelet uses a packed (bin-pack) allocation strategy: takeByTopologyNUMAPacked (https://sourcegraph.com/r/github.com/kubernetes/kubernetes@78994b5cf1fd09d94f8f1748fac83d15eb83c479/-/blob/pkg/kubelet/cm/cpumanager/cpu_assignment.go?L776). It works top-down through the topology.

The sort order is key: at every level, it prefers NUMA nodes with fewer remaining free CPUs. This packs nearly-exhausted NUMA nodes before touching less-used ones. The allocator keeps CPUs NUMA-local when possible, but locality is not guaranteed, and “when possible” is doing a lot of work there.

Consider a 2-socket machine with 48 cores per socket (96 vCPUs per socket with SMT). In NPS1 mode, this gives 2 NUMA nodes of 96 vCPUs each. After system and kube-reserved, suppose 90 vCPUs are allocatable per NUMA node, with 4 GPUs per NUMA node. Each pod requests 22 vCPUs.
The first 4 pods land on NUMA 0: 4 x 22 = 88 vCPUs used, leaving only 2 allocatable vCPUs. The 5th pod requests 22 vCPUs, but only 2 remain on NUMA 0. The CPU manager takes those 2 from NUMA 0 and the remaining 20 from NUMA 1.

[Diagram: Pod 5’s CPUs split across NUMA 0 and NUMA 1.] Credit to my colleague Roman Lishtaba (https://www.linkedin.com/in/rlishtaba/) for identifying this pattern in our GPU inference workloads.

Without topology manager enforcement, the kubelet allocates CPUs from multiple NUMA nodes. Pod 5 runs fine, but its performance is degraded. Nothing in Kubernetes will tell you about this.

With topologyManagerPolicy: single-numa-node, the system keeps the allocation bounded within one NUMA node. In this scenario, NUMA 1 still has 90 vCPUs free, so Pod 5 would land there entirely. TopologyAffinityError only fires when no single NUMA node can satisfy the request.

Both cpuManagerPolicyOptions: full-pcpus-only=true and topologyManagerPolicy: single-numa-node introduce hard failure modes that are worth understanding before enabling them.

When full-pcpus-only is enabled, the kubelet rejects any container that would receive exclusive CPUs but does not request a multiple of the SMT thread count, typically 2. The pod goes into Failed state with an SMTAlignmentError and stays there until someone deletes it. Workload controllers (Deployments, StatefulSets) will recreate the pod, but the replacement hits the same error on any node where full-pcpus-only is in effect.

When topologyManagerPolicy: single-numa-node is enabled, the kubelet rejects any pod whose containers’ hinted resource requests can’t be satisfied from a single NUMA node. With topologyManagerScope: pod, that check applies to the pod’s effective request. The sequence ends in a TopologyAffinityError, with the same failure semantics as SMTAlignmentError: the pod is Failed and the scheduler won’t retry.
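The 22-vCPU scenario can be replayed with a small toy model. This is a deliberately simplified sketch, not the kubelet's takeByTopologyNUMAPacked: it ignores sockets, cores, and SMT siblings and only tracks free vCPUs per NUMA node. The function names are hypothetical, and picking the lowest-numbered node that fits in the single-NUMA case is an assumption made for illustration.

```python
from typing import Dict, List, Optional


def packed_allocate(free: List[int], request: int) -> Dict[int, int]:
    """Toy packed allocation: drain the NUMA nodes with the fewest free vCPUs first.

    Mutates `free` and returns {numa_node: vcpus_taken}.
    """
    if request > sum(free):
        raise ValueError("not enough free vCPUs on the machine")
    taken: Dict[int, int] = {}
    remaining = request
    nodes = [n for n in range(len(free)) if free[n] > 0]
    for node in sorted(nodes, key=lambda n: (free[n], n)):
        take = min(free[node], remaining)
        free[node] -= take
        taken[node] = take
        remaining -= take
        if remaining == 0:
            break
    return taken


def single_numa_allocate(free: List[int], request: int) -> Optional[int]:
    """Toy single-numa-node behaviour: all vCPUs from one node, or reject.

    Returns the chosen node, or None to stand in for TopologyAffinityError.
    """
    for node, available in enumerate(free):
        if available >= request:
            free[node] -= request
            return node
    return None


if __name__ == "__main__":
    free = [90, 90]  # allocatable vCPUs per NUMA node, as in the example
    for pod in range(1, 6):
        print(f"packed pod {pod}:", packed_allocate(free, 22))

    free = [90, 90]
    for pod in range(1, 6):
        print(f"single-NUMA pod {pod}: node", single_numa_allocate(free, 22))

    # 40 vCPUs free in aggregate, but no single node can hold 22.
    print("fragmented machine:", single_numa_allocate([20, 20], 22))
```

In the packed run, the fifth pod ends up with 2 vCPUs from NUMA 0 and 20 from NUMA 1, while the single-NUMA run places it entirely on NUMA 1. The last call shows the rejection case: the request fits the machine's total free capacity but not any one NUMA node.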
The confusing part is that the node has sufficient aggregate capacity, but the pod still fails because no individual NUMA node has enough room. If you’re not thinking in NUMA terms, this is disorienting.

cpuManagerPolicy: static on its own doesn’t introduce these failure modes. They come from the additional constraints of full-pcpus-only and single-numa-node. Both are node-level kubelet settings that apply to every pod on the node, which means enabling single-numa-node can break existing workloads that don’t fit in a single NUMA node. Dedicated node pools for NUMA-aligned workloads are a practical approach to mitigate this.

A practical problem with single-numa-node is that the default Kubernetes scheduler sees only aggregate node resources. It doesn’t know that a node’s 60 free vCPUs are split 20/40 across two NUMA nodes. The scheduler can place a pod on a node, only for the kubelet to reject it at admission. The workload controller then creates a replacement, which may fail on the next node too.

The NodeResourceTopologyMatch scheduler plugin (https://github.com/kubernetes-sigs/scheduler-plugins/tree/master/pkg/noderesourcetopology) reduces this gap. It gives the scheduler per-NUMA-node resource visibility, so it can filter out nodes that can’t satisfy the topology constraints before placing the pod. Deploying it requires:

- NodeResourceTopology CRD and one NodeResourceTopology custom resource per node
- NodeResourceTopologyMatch scheduler plugin configured as a filter and scorer

That’s additional infrastructure for the platform team: a DaemonSet, a cluster-scoped CRD, one custom resource per node refreshed roughly every 60 seconds, and a scheduler plugin with its own cache. Without it, pods repeatedly fail on nodes that look like they have enough capacity.

Getting NUMA alignment right is not something either platform admins or workload owners can do alone. It requires collaboration and shared understanding.
With NUMA alignment, platform admins can’t just hand out node pools and let workload owners request whatever CPU/memory they want. They need to publish clear guidance, including whether full-pcpus-only, single-numa-node, and any non-default topology manager scope are in effect and what failure modes they introduce.

For example, on a 2-socket machine with 48 cores per socket in NPS1 mode: each NUMA node has 96 vCPUs, about 90 allocatable after reservations, with 4 GPUs per NUMA node. The recommended CPU request per GPU might be 22 vCPUs (22 x 4 = 88, fitting within the 90 available). If the fleet has multiple SKUs with different core counts or NPS configurations, this becomes a matrix of recommendations.

GPU workloads are inherently more hardware-aware than typical Kubernetes workloads. Unlike a stateless web service where you declare CPU and memory and let the platform figure out placement, GPU inference and training benefit from understanding the machine topology. This means, among other things, requesting even CPU counts where full-pcpus-only is in effect.

At the node level, start by confirming the hardware topology:

    lscpu -e=CPU,CORE,SOCKET,NODE
    numactl -H
    nvidia-smi topo -m   # NVIDIA GPU nodes

Inside a running container, check the workload process’s CPU affinity and compare it with the node’s lscpu output to confirm the allowed CPUs sit within the expected NUMA node:

    kubectl exec <pod-name> -c <container-name> -- taskset -cp 1

If taskset is unavailable:

    kubectl exec <pod-name> -c <container-name> -- grep Cpus_allowed_list /proc/1/status

taskset -cp 1 checks PID 1 in the container. If your workload runs as a different PID, check that process instead.

Kubernetes has several CPU manager policy options (https://kubernetes.io/docs/concepts/workloads/resource-managers/#cpu-policy-static--options) adjacent to the path described above.

- strict-cpu-reservation keeps regular workloads off CPUs reserved for the OS and Kubernetes daemons, which helps reduce system noise on pinned workloads.
- prefer-align-cpus-by-uncorecache is a best-effort cache-locality option that tries to keep a container’s CPUs within the same L3 or uncore cache group.
- align-by-socket is useful when a container is too large to fit in one NUMA node and must use multiple NUMA nodes. It asks CPU Manager to keep that allocation within one socket when possible.

These can improve isolation or latency, but they do not replace topologyManagerPolicy: single-numa-node for keeping a GPU workload NUMA-local. align-by-socket is also not compatible with single-numa-node.

The Kubernetes DRA (Dynamic Resource Allocation) CPU driver (https://github.com/kubernetes-sigs/dra-driver-cpu) is interesting because it may allow NUMA-aware CPU placement to happen through the scheduling layer, without some of the post-scheduling admission issues described above. I haven’t explored it deeply enough to recommend it here. I’ll write a follow-up after I spend more time with it.

There’s a clear progression of CPU isolation in Kubernetes:

| Level | Config | What you get | What it requires |
|---|---|---|---|
| 1 | cpuManagerPolicy: static | Dedicated logical cores, reduced CPU contention | Guaranteed QoS, integer CPU requests |
| 2 | + full-pcpus-only=true | Full physical cores, L1/L2 cache isolation | Even CPU request values |
| 3 | + topologyManagerPolicy: single-numa-node and memoryManagerPolicy: Static | CPU, GPU, and memory admitted only if they fit one NUMA node | Critical container fits in a NUMA node, device plugin topology hints, reservedMemory, topology-aware scheduler, sizing guidance from platform team |

Each level introduces stricter constraints in exchange for stronger performance isolation. For GPU inference, where the CPU is directly on the data path to the GPU, best-effort alignment is not always good enough. If a pod is misaligned, Kubernetes will not tell you, but the workload may still show worse tail latency.
For consistency, a hard failure like TopologyAffinityError is often better than silently serving degraded traffic. Getting it right takes effort from both sides: platform teams publishing topology guidance and workload owners sizing containers to match. It is more work than treating compute as a black box, but GPU workloads typically need to be more aware of the underlying hardware than ordinary services.
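
Since Kubernetes does not surface misalignment, the manual affinity check described earlier can be scripted. The sketch below is one possible approach, assuming you have already collected two kinds of strings: a process's Cpus_allowed_list value from /proc/<pid>/status, and each NUMA node's CPU list from /sys/devices/system/node/node<N>/cpulist on the host. The function names and the CPU numbering in the example are hypothetical; real numbering varies by SKU and BIOS settings.

```python
from typing import Dict, List, Set


def parse_cpulist(text: str) -> Set[int]:
    """Parse a Linux CPU list such as "0-3,8,10-11" into a set of CPU ids."""
    cpus: Set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def numa_nodes_spanned(allowed: str, node_cpulists: Dict[int, str]) -> List[int]:
    """Return the NUMA nodes that contain at least one of the allowed CPUs."""
    allowed_cpus = parse_cpulist(allowed)
    return sorted(
        node
        for node, cpulist in node_cpulists.items()
        if allowed_cpus & parse_cpulist(cpulist)
    )


if __name__ == "__main__":
    # Hypothetical layout for a 2-socket, 48-core-per-socket machine in NPS1.
    nodes = {0: "0-47,96-143", 1: "48-95,144-191"}
    print(numa_nodes_spanned("2-12,98-108", nodes))    # stays on one node
    print(numa_nodes_spanned("46-57,142-153", nodes))  # spans both nodes
```

A result with more than one NUMA node for a pinned, latency-critical container is the silent misalignment case discussed above, and is a reasonable thing to alert on.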