Experimenting with TPUs, GKE Managed DRANET, and Multi-cluster Inference Gateway

Google engineers deployed the Gemma 3 large language model across two GKE clusters in separate regions, using four TPU v6e chips per cluster, to test multi-cluster failover with the Inference Gateway. The setup used GKE managed DRANET, which builds on Kubernetes Dynamic Resource Allocation (DRA), and Cloud Storage FUSE to serve the model, with traffic routed to the nearest region and failing over automatically if that region went down. The experiment showed how Dynamic Resource Allocation and Inference Gateway can keep AI workloads available across regions.

What happens when your workload fails in one region but you still need access to the service? This is a common availability and uptime scenario. With recent enhancements to the Kubernetes ecosystem, such as Dynamic Resource Allocation (DRA) https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/ and Inference Gateway https://gateway-api-inference-extension.sigs.k8s.io/ , I decided to experiment with these capabilities in Google Cloud, using an AI inference workload as a simple test. In this blog, we will explore this setup. You can also jump straight into the detailed configs in this codelab: Build multi-cluster GKE Inference Gateway, with TPUs, Cloud Storage FUSE and managed DRANET https://codelabs.developers.google.com/codelabs/gke-inference-gateway-multi-cluster-tpus-dranet

To build out this experiment, use the following products, features, and tools:

Google Kubernetes Engine (GKE) managed DRANET https://docs.cloud.google.com/kubernetes-engine/docs/how-to/allocate-network-resources-dra : A managed feature that lets you request and share networking resources among Pods.
Multi-cluster GKE Inference Gateway https://docs.cloud.google.com/kubernetes-engine/docs/concepts/about-multi-cluster-inference-gateway : Load balances your AI/ML inference workloads across multiple GKE clusters. It also handles failover, which is what my experiment was intended to test. The Gateway class that supports this is gke-l7-cross-regional-internal-managed-mc.
Cloud Storage FUSE https://docs.cloud.google.com/storage/docs/cloud-storage-fuse/overview : Provides a way to store data, models, checkpoints, and logs directly in Cloud Storage. To speed up the deployment, an open Gemma model was downloaded to this storage for retrieval.
Virtual Private Cloud (VPC): The foundational global network providing isolated, secure communication for the internal load balancers and compute nodes.
GKE Fleets https://docs.cloud.google.com/kubernetes-engine/docs/fleets-overview : Fleets group the separate regional clusters under a unified management control plane.
TPU v6e: Google's custom AI accelerators that provide the high-performance compute required to serve the model. The VM type used was ct6e-standard-4t in a 2x2 slice configuration https://docs.cloud.google.com/tpu/docs/v6e

The aim is to deploy an LLM, Gemma 3, onto two GKE clusters in different regions. Each cluster uses four TPU v6e chips. The model is stored in Cloud Storage. The workload is served using GKE Inference Gateway, which supports multiple clusters. Traffic should be routed to the region closest to the user and fail over to the other region if one region fails.

Begin: Set up the environment.
Create a standard VPC https://docs.cloud.google.com/vpc/docs/create-modify-vpc-networks with firewall rules and a subnet in the region that contains the reservation's zone.
Create a proxy-only subnet https://docs.cloud.google.com/load-balancing/docs/proxy-only-subnets to be used with the internal regional Application Load Balancer attached to the GKE Inference Gateway.
Set up firewall rules allowing traffic and health checks.
Reserve static internal IP addresses in both regions for the Gateway.
Provision a Cloud Storage bucket for Cloud Storage FUSE and configure a dedicated IAM service account. Bind this to a Kubernetes Workload Identity so your Pods can securely mount the bucket and read the model weights directly.

Next: Create standard GKE clusters and node pools.
Deploy two separate GKE clusters, one in each of your chosen regions.
Enable the Gateway API (--gateway-api=standard) and the Cloud Storage FUSE CSI driver https://cloud.google.com/kubernetes-engine/docs/concepts/cloud-storage-fuse-csi-driver (--addons GcsFuseCsiDriver) during cluster creation.
Create dedicated TPU v6e node pools (ct6e-standard-4t) for both clusters.
Enable managed DRANET on these TPU node pools https://docs.cloud.google.com/kubernetes-engine/docs/how-to/allocate-network-resources-dra by setting the flags --accelerator-network-profile=auto and --node-labels=cloud.google.com/gke-networking-dra-driver=true

Next: Establish the global mesh via fleet registration.
Register both GKE clusters to a unified GKE fleet by following the fleet creation and registration setup https://cloud.google.com/kubernetes-engine/docs/how-to/creating-fleets
Enable Multi-Cluster Service Discovery and Multi-Cluster Ingress on your fleet.
Designate the cluster in your primary region as the configuration hub, which acts as the control plane for routing rules across both regions.

Next: Deploy the AI workload.
Use a temporary Kubernetes Job to download the Gemma 3 (gemma-3-27b-it) model weights directly into your Cloud Storage bucket.
Define a ResourceClaimTemplate that explicitly requests the managed DRANET device class (deviceClassName: netdev.google.com) with the allocation mode set to "All".
Deploy your inference server (for example, vLLM) on the TPU nodes in both regions. Ensure the Pod spec uses node selectors for the 2x2 TPU topology, requests exactly 4 TPU chips, and mounts the netdev claim. This ensures your Pods use the dedicated accelerator networking alongside standard Ethernet.

Next: Configure the Multi-Cluster Inference Gateway.
Install the necessary Custom Resource Definitions (CRDs) so Kubernetes can process specialized routing objects like the InferenceObjective.
Deploy an AutoscalingMetric to track hardware utilization, such as KV cache usage.
Use Helm to group the independent AI deployments from both regions into a single, logical InferencePool.
Deploy the cross-region Gateway and its associated HTTPRoute to manage incoming global traffic.
Apply health checks and backend policies to the pool so that load balancing relies on your custom hardware metrics.
Configure an InferenceObjective to instruct the gateway to route prompts to the region with the highest availability, avoiding overloaded TPUs.
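
To make the workload step more concrete, here is a minimal Python sketch that builds the ResourceClaimTemplate and the matching Pod spec fragment as plain dictionaries. Only the device class netdev.google.com, the "All" allocation mode, the 2x2 topology, and the 4-chip request come from the steps above. The object names (all-netdev, req-netdev, netdev, inference-server), the image placeholder, the resource.k8s.io/v1 API version, and the tpu-v6e-slice node selector value are my assumptions, so check them against your cluster version and the codelab before relying on them.

```python
import json

# Assumed names, not taken from the original setup.
CLAIM_TEMPLATE_NAME = "all-netdev"


def netdev_claim_template(name: str = CLAIM_TEMPLATE_NAME) -> dict:
    """Build a ResourceClaimTemplate requesting all managed DRANET devices."""
    return {
        # Assumption: the cluster serves the resource.k8s.io/v1 DRA API.
        "apiVersion": "resource.k8s.io/v1",
        "kind": "ResourceClaimTemplate",
        "metadata": {"name": name},
        "spec": {
            "spec": {
                "devices": {
                    "requests": [
                        {
                            "name": "req-netdev",
                            "exactly": {
                                # Device class and mode named in the steps above.
                                "deviceClassName": "netdev.google.com",
                                "allocationMode": "All",
                            },
                        }
                    ]
                }
            }
        },
    }


def inference_pod_spec(image: str, claim_template: str = CLAIM_TEMPLATE_NAME) -> dict:
    """Build a Pod spec fragment: 2x2 TPU selectors, 4 chips, netdev claim."""
    return {
        "nodeSelector": {
            # Assumption: label value for TPU v6e node pools.
            "cloud.google.com/gke-tpu-accelerator": "tpu-v6e-slice",
            "cloud.google.com/gke-tpu-topology": "2x2",
        },
        "containers": [
            {
                "name": "inference-server",
                "image": image,  # placeholder, supply your own server image
                "resources": {
                    "requests": {"google.com/tpu": "4"},
                    "limits": {"google.com/tpu": "4"},
                },
            }
        ],
        # Pod-level claim that points at the template built above.
        "resourceClaims": [
            {"name": "netdev", "resourceClaimTemplateName": claim_template}
        ],
    }


if __name__ == "__main__":
    # kubectl accepts JSON as well as YAML manifests.
    print(json.dumps(netdev_claim_template(), indent=2))
```

Running the script prints the claim template as JSON, which could be piped to kubectl apply -f - in each cluster. The dictionary returned by inference_pod_spec is only a fragment and would sit under the Pod template of whatever Deployment runs your inference server.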