What happens when your workload fails in one region but you still need access to the service? This is a common scenario when designing for availability and uptime. With recent enhancements to the Kubernetes ecosystem, such as Dynamic Resource Allocation (DRA) and Inference Gateway, I decided to experiment with these capabilities in Google Cloud, using an AI inference workload as a simple test.
In this blog, we will explore this setup. You can also jump straight into the detailed configs in the codelab "Build multi-cluster GKE Inference Gateway, with TPUs, Cloud Storage FUSE and managed DRANET".
To build out this experiment, use the following products, features, and tools:
Google Kubernetes Engine (GKE) managed DRANET: This is a managed feature that lets you request and share resources among Pods. This supports
Multi-cluster GKE Inference Gateway: Load balances your AI/ML inference workloads across multiple GKE clusters. It also handles failover, which is what my experiment set out to test. The GatewayClass that supports this is `gke-l7-cross-regional-internal-managed-mc`.
Cloud Storage FUSE: Provides a way to store data, models, checkpoints, and logs directly in Cloud Storage. To speed up the deployment, an open source Gemma model was downloaded to this storage for retrieval.
Virtual Private Cloud (VPC): The foundational global network providing isolated, secure communication for the internal load balancers and compute nodes.
GKE Fleets: Fleets group the separate regional clusters under a unified management control plane
TPU v6e: Google's custom AI accelerators that provide the high-performance compute required to serve the model. The VM family type used was the `ct6e-standard-4t` in a 2x2 slice.
The aim is to deploy an LLM (Gemma 3) onto two GKE clusters in different regions. Each cluster uses 4 TPU v6e chips. The model is stored in Cloud Storage. The workload is served using GKE Inference Gateway, which supports multiple clusters. Traffic should be routed to the region closest to the user and fail over to the other region if one region fails.
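The routing goal described above (closest region first, fail over when it is unhealthy) can be modelled in a few lines. This is only a conceptual sketch of the desired behaviour, not how the gateway is implemented; the `Region` type, the region names, and the latency figures are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    latency_ms: float  # hypothetical client-to-region latency
    healthy: bool      # outcome of a health check, however it is obtained


def pick_region(regions: list[Region]) -> Region:
    """Return the closest healthy region, or raise if none is healthy."""
    healthy = [r for r in regions if r.healthy]
    if not healthy:
        raise RuntimeError("no healthy region available")
    return min(healthy, key=lambda r: r.latency_ms)


regions = [
    Region("region-a", latency_ms=12.0, healthy=True),
    Region("region-b", latency_ms=85.0, healthy=True),
]
print(pick_region(regions).name)  # region-a (closest)

# Simulate the closest region failing: traffic moves to the other region.
regions[0].healthy = False
print(pick_region(regions).name)  # region-b (failover)
```

The rest of the setup exists to get this behaviour from managed components rather than from application code.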
**Begin:** Set up the environment.
Create a standard VPC with firewall rules and a subnet in the same region as the reservation. Also create a proxy-only subnet; this will be used with the regional internal Application Load Balancer attached to the GKE Inference Gateway.
Set up firewall rules allowing traffic and health checks.
Reserve static internal IP addresses in both regions for the Gateway.
Provision a Cloud Storage FUSE bucket and configure a dedicated IAM Service Account. Bind this to a Kubernetes Workload Identity so your pods can securely mount the bucket and read the model weights directly.
**Next:** Create standard GKE clusters and node pools.
Deploy two separate GKE clusters, one in each of your chosen regions.
Enable the Gateway API (`--gateway-api=standard`) and the [Cloud Storage FUSE CSI driver](https://cloud.google.com/kubernetes-engine/docs/concepts/cloud-storage-fuse-csi-driver) (`--addons GcsFuseCsiDriver`) during cluster creation.
Create dedicated TPU v6e node pools (`ct6e-standard-4t`) for both clusters.
Enable managed DRANET on these TPU node pools by setting the flags `--accelerator-network-profile=auto` and `--node-labels=cloud.google.com/gke-networking-dra-driver=true`.
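To see how the flags above fit together, here is a minimal Python sketch that assembles the two `gcloud` invocations per cluster and prints them for review. The cluster names and region placeholders are hypothetical, and the commands are deliberately incomplete: a real setup also needs network, Workload Identity, and TPU capacity settings that are not shown here, so treat this as an outline and follow the codelab for the full commands.

```python
import shlex

# Hypothetical cluster names and region placeholders; replace with your own.
CLUSTERS = {"inference-a": "REGION_A", "inference-b": "REGION_B"}


def cluster_create_cmd(name: str, region: str) -> list[str]:
    """Cluster creation with the Gateway API and Cloud Storage FUSE CSI driver."""
    return [
        "gcloud", "container", "clusters", "create", name,
        "--location", region,
        "--gateway-api=standard",
        "--addons", "GcsFuseCsiDriver",
    ]


def tpu_node_pool_cmd(cluster: str, region: str) -> list[str]:
    """TPU v6e node pool with the managed DRANET flags from this post."""
    return [
        "gcloud", "container", "node-pools", "create", f"{cluster}-tpu",
        "--cluster", cluster,
        "--location", region,
        "--machine-type", "ct6e-standard-4t",
        "--accelerator-network-profile=auto",
        "--node-labels=cloud.google.com/gke-networking-dra-driver=true",
    ]


# Print the commands instead of running them, so they can be reviewed first.
for name, region in CLUSTERS.items():
    print(shlex.join(cluster_create_cmd(name, region)))
    print(shlex.join(tpu_node_pool_cmd(name, region)))
```

Generating both commands from one function per resource keeps the two regions identical apart from name and location, which makes failover behaviour easier to reason about.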
**Next:** Establish the global mesh via Fleet Registration.
Register both GKE clusters to a unified GKE Fleet by following the fleet creation and registration setup.
Enable Multi-Cluster Service Discovery and Multi-Cluster Ingress on your fleet.
Designate your primary region as the configuration hub to act as the control plane for routing rules across both regions.
**Next:** Deploy the AI workload.
Use a temporary Kubernetes Job to download the Gemma 3 (`gemma-3-27b-it`) model weights directly into your Cloud Storage bucket.
Define a `ResourceClaimTemplate` that explicitly requests the managed DRANET device class (`deviceClassName: netdev.google.com`) with the allocation mode set to `All`.
Deploy your inference server (e.g. vLLM) on the TPU nodes in both regions. Ensure the Pod spec uses node selectors for the 2x2 TPU topology, requests exactly 4 TPUs, and references the `netdev` claim. This ensures your Pods use the dedicated accelerator networking alongside standard Ethernet.
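The `ResourceClaimTemplate` and Pod spec described above might look roughly like the following, written as Python dictionaries and dumped to JSON (which `kubectl apply -f` accepts alongside YAML). Several details are assumptions rather than something this post specifies: the `resource.k8s.io/v1beta1` API version (newer Kubernetes versions nest these fields differently), the object and container names, the image placeholder, and the TPU node selector labels.

```python
import json

CLAIM_TEMPLATE_NAME = "all-netdev"  # hypothetical name

resource_claim_template = {
    # Assumption: v1beta1 schema; check which DRA API version your cluster serves.
    "apiVersion": "resource.k8s.io/v1beta1",
    "kind": "ResourceClaimTemplate",
    "metadata": {"name": CLAIM_TEMPLATE_NAME},
    "spec": {"spec": {"devices": {"requests": [{
        "name": "req-netdev",                    # hypothetical request name
        "deviceClassName": "netdev.google.com",  # managed DRANET device class
        "allocationMode": "All",                 # claim every matching device
    }]}}},
}

# Only the fields relevant to this step; not a complete Deployment.
pod_spec_fragment = {
    "nodeSelector": {
        # Assumption: GKE node labels for a single-host TPU v6e 2x2 slice.
        "cloud.google.com/gke-tpu-accelerator": "tpu-v6e-slice",
        "cloud.google.com/gke-tpu-topology": "2x2",
    },
    "resourceClaims": [{
        "name": "netdev",
        "resourceClaimTemplateName": CLAIM_TEMPLATE_NAME,
    }],
    "containers": [{
        "name": "inference-server",    # hypothetical
        "image": "IMAGE_PLACEHOLDER",  # e.g. a vLLM TPU image
        "resources": {
            "requests": {"google.com/tpu": "4"},
            "limits": {"google.com/tpu": "4"},
            "claims": [{"name": "netdev"}],  # must match resourceClaims[].name
        },
    }],
}

print(json.dumps(resource_claim_template, indent=2))
print(json.dumps(pod_spec_fragment, indent=2))
```

The container-level `claims` entry is what ties the Pod to the devices allocated through the template; without it, the claim is created but not attached to the container.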
**Next:** Configure the Multi-Cluster Inference Gateway.
Install the necessary Custom Resource Definitions (CRDs) so Kubernetes can process specialized routing objects like the `InferenceObjective`.
Deploy an `AutoscalingMetric` to track hardware utilization, such as KV cache usage.
Use Helm to group the independent AI deployments from both regions into a single, logical `InferencePool`.
Deploy the cross-region Gateway and its associated `HTTPRoute` to manage incoming global traffic.
Apply health checks and backend policies to the pool to ensure load balancing relies on your custom hardware metrics.
Configure an `InferenceObjective` to instruct the gateway to route prompts to the region with the highest availability, avoiding overloaded TPUs.
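For the Gateway and `HTTPRoute` step above, a rough outline of the two objects is sketched here as Python dictionaries dumped to JSON. The GatewayClass name comes from this post; everything else is an assumption to verify against the codelab: the object names, the plain HTTP listener on port 80, and in particular the backend reference kind (`GCPInferencePoolImport` in the `networking.gke.io` group), which is my understanding of how a multi-cluster pool is referenced. The reserved static IP addresses, health check policy, and backend policy are omitted.

```python
import json

GATEWAY_NAME = "cross-region-gateway"  # hypothetical
POOL_NAME = "gemma-pool"               # hypothetical

gateway = {
    "apiVersion": "gateway.networking.k8s.io/v1",
    "kind": "Gateway",
    "metadata": {"name": GATEWAY_NAME},
    "spec": {
        "gatewayClassName": "gke-l7-cross-regional-internal-managed-mc",
        # Assumption: a single plain HTTP listener for internal traffic.
        "listeners": [{"name": "http", "protocol": "HTTP", "port": 80}],
    },
}

http_route = {
    "apiVersion": "gateway.networking.k8s.io/v1",
    "kind": "HTTPRoute",
    "metadata": {"name": "gemma-route"},  # hypothetical
    "spec": {
        "parentRefs": [{"name": GATEWAY_NAME}],
        "rules": [{
            "backendRefs": [{
                # Assumption: group/kind for a multi-cluster inference pool;
                # confirm against the codelab before applying.
                "group": "networking.gke.io",
                "kind": "GCPInferencePoolImport",
                "name": POOL_NAME,
            }],
        }],
    },
}

for manifest in (gateway, http_route):
    print(json.dumps(manifest, indent=2))
```

The route attaches to the Gateway through `parentRefs`, so the Gateway name must match in both objects.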