NVIDIA Grove: Open-Source Kubernetes API for AI Inference

NVIDIA open-sourced Grove, a Kubernetes API for managing multi-component AI inference stacks, at KubeCon Europe 2026. Grove introduces custom resources for gang scheduling, topology-aware placement, and startup ordering to address limitations in plain Kubernetes for large language model inference. The release is part of a broader push to make Kubernetes a community-owned GPU orchestration platform, alongside donations of the DRA Driver and KAI Scheduler to the CNCF.

Running a 70B-parameter model on Kubernetes sounds like solved infrastructure, until you try to split prefill and decode onto separate GPU pools, enforce startup ordering between routers and workers, and keep everything topology-aware. At that point, plain Kubernetes runs out of primitives fast.

NVIDIA’s answer is Grove: an open-source Kubernetes API that lets you describe an entire multi-component AI inference stack in a single declarative resource. It went open-source at KubeCon Europe 2026 and is now expanding as part of a deliberate push to make Kubernetes the community-owned GPU orchestration platform for AI.

The Problem Grove Solves

Modern LLM inference is not one pod. A production deployment of a large model typically runs prefill workers (compute-bound, processing incoming prompts and generating KV cache blocks), decode workers (memory-bound, running autoregressive token generation), a request router, and monitoring sidecars. These components are tightly interdependent: the router cannot accept traffic until decode workers are ready, the prefill leader and its workers should stay within the same NVLink domain for bandwidth reasons, and scaling decode capacity without proportionally scaling prefill breaks the whole system.

Plain Kubernetes handles none of this natively. Gang scheduling, where all pods in a group must start together or none do, requires third-party tools. Topology-aware placement is bolted on via labels and node affinity rules that quickly become unmanageable at scale. Startup ordering across different deployment objects is typically hacked together with init containers and readiness gates. Every team running disaggregated inference ends up writing a custom Operator or a Helm chart that is secretly an Operator.

The failure modes are specific. A 16-GPU prefill job with 15 of 16 pods scheduled will sit idle, consuming GPU memory while producing nothing.
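
The partial-scheduling failure is what all-or-nothing (gang) admission is meant to prevent. The following is a minimal sketch of that idea only, assuming a toy cluster model; `Node` and `gang_admit` are hypothetical names for illustration and are not part of Grove, the KAI Scheduler, or Kubernetes.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_gpus: int


def gang_admit(nodes: list[Node], pods: int, gpus_per_pod: int = 1) -> dict[str, int] | None:
    """Place all `pods` or none. Returns {node name: pods placed}, or None."""
    plan: dict[str, int] = {}
    remaining = pods
    for node in nodes:
        fit = min(node.free_gpus // gpus_per_pod, remaining)
        if fit > 0:
            plan[node.name] = fit
            remaining -= fit
        if remaining == 0:
            break
    if remaining > 0:
        return None  # reject the whole group; no GPUs are reserved
    # Commit only after the full group is known to fit.
    for node in nodes:
        node.free_gpus -= plan.get(node.name, 0) * gpus_per_pod
    return plan


cluster = [Node("node-a", 8), Node("node-b", 7)]  # 15 free GPUs in total
print(gang_admit(cluster, pods=16))               # None: 15 of 16 is not enough
print([n.free_gpus for n in cluster])             # [8, 7]: nothing was held
cluster[1].free_gpus = 8
print(gang_admit(cluster, pods=16))               # {'node-a': 8, 'node-b': 8}
```

The point of the two-phase shape (plan first, commit second) is that a group that cannot fully fit leaves every GPU free for other work instead of stranding 15 of them.
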
Topology-blind placement (prefill leader and workers in different racks) cuts throughput by 40 to 60 percent as traffic crosses slow inter-rack links instead of NVLink. A router that starts before decode workers are ready begins dropping requests immediately.

What Grove Is

Grove introduces three Kubernetes custom resource definitions that map directly to the structure of disaggregated inference workloads:

- PodClique: an atomic unit of one or more pods that must be gang-scheduled together
- PodCliqueScalingGroup: a bundle of PodCliques that scale together at a fixed ratio (for example, one prefill leader to four prefill workers)
- PodCliqueSet: the top-level resource describing the entire inference system, including all component groups, startup ordering, and topology constraints

From one YAML manifest, the Grove controller coordinates hierarchical gang scheduling through the KAI Scheduler, topology-aware placement with pack or spread constraints, explicit startup ordering across component groups, and multi-level autoscaling that can scale individual pods, scaling groups, or entire replicas independently. The Grove quickstart (https://github.com/ai-dynamo/grove/blob/main/docs/quickstart.md) gets a local demo running on a kind cluster in around five minutes.

The Bigger Stack Move

Grove did not appear in isolation. NVIDIA made three significant open-source moves at KubeCon Europe 2026 that together form a community-owned GPU orchestration stack:

- DRA Driver for GPUs donated to the CNCF: The Dynamic Resource Allocation driver, which handles GPU sharing and multi-node NVLink ComputeDomains, moved from vendor governance to community ownership under the Kubernetes project SIGs. AWS, Google Cloud, Microsoft, Red Hat, Broadcom, Canonical, Nutanix, and SUSE are all collaborating on it.
- KAI Scheduler accepted as CNCF Sandbox project: NVIDIA’s topology-aware gang scheduler (originally Run:ai) is now community-governed and integrates directly with Grove for placement decisions.
- Grove integrated with llm-d: The workload lifecycle API is now being integrated into the llm-d inference stack (https://thenewstack.io/llm-d-cncf-kubernetes-inference/), backed by Red Hat, Google, IBM Research, CoreWeave, and NVIDIA.

The DRA driver move is the most significant piece. GPU resource allocation in Kubernetes has historically been handled by NVIDIA’s own device plugin, a vendor-controlled component. Moving the DRA driver to the CNCF means the core mechanism for allocating GPUs to pods becomes community-governed, similar to how networking and storage plugins evolved. That is a real structural shift, not a PR announcement. You can read the full KubeCon EU open-source announcement (https://blogs.nvidia.com/blog/nvidia-at-kubecon-2026/) on the NVIDIA blog.

When to Use Grove Now

Grove is worth deploying today if you are already running disaggregated inference on Kubernetes using vLLM with custom scripts and hitting gang-scheduling deadlocks or topology placement issues. It is also the right move for teams on llm-d v0.7 or later: Grove integration is built in. Microsoft published a reference architecture for Dynamo-Grove on AKS (https://blog.aks.azure.com/2026/06/02/dynamo-on-aks-part-4) in early June 2026, which is a practical starting point for teams on managed Kubernetes.

If you are running single-pod model serving (one deployment, one model, horizontal scaling via standard HPA), Grove adds controller overhead without meaningful benefit. And if you are using a fully managed inference service on SageMaker, Vertex AI, or Azure ML, the scheduling problem is already handled upstream. Grove is for teams who need precise control over disaggregated topology and are running their own Kubernetes clusters.

The bet NVIDIA is making is clear: Kubernetes becomes the default AI infrastructure layer, and the scheduling and orchestration primitives it lacks today get filled by open-source, community-governed components. Grove is one of those components.
The combination of DRA driver, KAI Scheduler, and Grove gives teams running serious inference workloads a coherent, vendor-neutral stack that did not exist in this form a year ago. Whether that stack becomes the reference architecture depends on adoption by the llm-d community and cloud provider tooling, both of which are moving in the right direction.

The Grove repository is at github.com/ai-dynamo/grove (https://github.com/ai-dynamo/grove). The quickstart walks through a kind-based demo deployment. For production on AKS, the Microsoft reference architecture linked above is the most complete public guide available.
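
For readers who want a feel for the model before opening the quickstart, here is a rough Python sketch of two ideas from the "What Grove Is" section: scaling a group of cliques at a fixed ratio, and deriving a startup order from declared dependencies. It is a conceptual illustration only. The class names, the `starts_after` field, and the component names are assumptions made for this example and do not reproduce Grove's actual resource schema or controller logic.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class Clique:
    """Stand-in for a PodClique: pods that are gang-scheduled together."""
    name: str
    pods: int
    starts_after: list[str] = field(default_factory=list)  # hypothetical field


@dataclass
class ScalingGroup:
    """Stand-in for a PodCliqueScalingGroup: cliques scaled at a fixed ratio."""
    name: str
    cliques: list[Clique]

    def pod_counts(self, replicas: int) -> dict[str, int]:
        # Scaling the group multiplies every member, so the ratio is preserved.
        return {c.name: c.pods * replicas for c in self.cliques}


def startup_order(cliques: list[Clique]) -> list[str]:
    """Order cliques so each one starts only after its dependencies."""
    graph = {c.name: set(c.starts_after) for c in cliques}
    return list(TopologicalSorter(graph).static_order())


prefill = ScalingGroup("prefill", [Clique("prefill-leader", 1), Clique("prefill-worker", 4)])
decode = Clique("decode-worker", 2)
router = Clique("router", 1, starts_after=["decode-worker"])

print(prefill.pod_counts(replicas=3))   # {'prefill-leader': 3, 'prefill-worker': 12}
print(startup_order([router, decode]))  # ['decode-worker', 'router']
```

Scaling the group as a unit keeps the one-to-four leader-to-worker ratio intact at every replica count, and a topological sort puts the router after the decode workers it depends on. `TopologicalSorter` raises `graphlib.CycleError` if the declared dependencies form a loop, which is a reasonable behavior to want from any startup-ordering scheme.
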