Experimenting with TPUs, GKE Managed DRANET, and Multi-cluster Inference Gateway


[ARTICLE · art-19252] src=cloud.google.com ↗ pub=2026-06-02T07:00Z topic=artificial-intelligence verified=true sentiment=↑ positive


Google engineers deployed a Gemma 3 large language model across two GKE clusters in separate regions, using four TPU v6e chips per cluster, to test multi-cluster failover with the Inference Gateway. The setup used GKE managed DRANET, which builds on Kubernetes Dynamic Resource Allocation (DRA), and Cloud Storage FUSE to serve the model, with traffic routed to the nearest region and failing over automatically if one region went down. The experiment demonstrated how Dynamic Resource Allocation and Inference Gateway can maintain AI workload availability across regions.

Published Jun 2, 2026 · 4 min read

What happens when your workload fails in one region but you still need access to the service? This is a common availability and uptime concern. With recent enhancements to the Kubernetes ecosystem, including capabilities like Dynamic Resource Allocation (DRA) and Inference Gateway, I decided to experiment with these capabilities in Google Cloud, using an AI inference workload as a simple test.

This post walks through the setup. You can also jump straight to the detailed configs in the codelab "Build multi-cluster GKE Inference Gateway, with TPUs, Cloud Storage FUSE and managed DRANET".

To build out this experiment, use the following products, features, and tools:

Google Kubernetes Engine (GKE) managed DRANET: A managed feature that lets you request and share resources among Pods. This supports …

Multi-cluster GKE Inference Gateway: Load balances your AI/ML inference workloads across multiple GKE clusters, including in a failover situation, which is what my experiment set out to test. The GatewayClass that supports this is gke-l7-cross-regional-internal-managed-mc.

Cloud Storage FUSE: Provides a way to store data, models, checkpoints, and logs directly in Cloud Storage. To speed up deployment, an open source Gemma model was downloaded to this storage for retrieval.

Virtual Private Cloud (VPC): The foundational global network providing isolated, secure communication for the internal load balancers and compute nodes.

GKE Fleets: Fleets group the separate regional clusters under a unified management control plane.

TPU v6e: Google's custom AI accelerators that provide the high-performance compute required to serve the model. The VM type used was ct6e-standard-4t in a 2x2 slice.

The aim is to deploy an LLM (Gemma 3) onto two GKE clusters in different regions. Each cluster uses four TPU v6e chips. The model is stored in Cloud Storage. The workload is served through GKE Inference Gateway, which supports multiple clusters. Traffic should be routed to the region closest to the user and fail over to the other region if one region fails.
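The routing goal stated above (send traffic to the closest region, fail over when that region is down) can be illustrated with a small simulation. This is only a sketch of the intended behavior, not how the gateway is implemented; the `Region` class, the region names, and the latency figures are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Region:
    name: str
    latency_ms: float  # hypothetical client-to-region latency
    healthy: bool = True


def pick_region(regions: List[Region]) -> Optional[str]:
    """Return the closest healthy region, or None if every region is down."""
    healthy = [r for r in regions if r.healthy]
    if not healthy:
        return None
    return min(healthy, key=lambda r: r.latency_ms).name


regions = [Region("region-a", 12.0), Region("region-b", 48.0)]
print(pick_region(regions))  # region-a

regions[0].healthy = False   # simulate a regional outage
print(pick_region(regions))  # region-b
```

With both regions healthy the nearer one wins; once it is marked unhealthy, requests move to the remaining region, which is the failover case the experiment tests.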

**Begin:** Set up the environment.

Create a standard VPC with firewall rules and a subnet in the region that contains the zone of your TPU reservation. Also create a proxy-only subnet; it is used by the internal regional Application Load Balancer attached to the GKE Inference Gateway.

Set up firewall rules allowing traffic and health checks.

Reserve static internal IP addresses in both regions for the Gateway.

Provision a Cloud Storage bucket to mount through Cloud Storage FUSE and configure a dedicated IAM service account. Bind it to a Kubernetes service account through Workload Identity so your Pods can securely mount the bucket and read the model weights directly.
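The Workload Identity binding in the last step comes down to two pieces: an IAM binding on the Google service account and an annotation on the Kubernetes service account. The sketch below builds both as plain Python data. The project, namespace, and account names are hypothetical placeholders, and the helper function is illustrative rather than part of any Google library.

```python
def workload_identity_binding(project_id, namespace, ksa_name, gsa_name):
    """Build the KSA manifest and IAM binding that link a KSA to a GSA."""
    gsa_email = f"{gsa_name}@{project_id}.iam.gserviceaccount.com"
    ksa_manifest = {
        "apiVersion": "v1",
        "kind": "ServiceAccount",
        "metadata": {
            "name": ksa_name,
            "namespace": namespace,
            # Tells GKE which Google service account this KSA acts as.
            "annotations": {"iam.gke.io/gcp-service-account": gsa_email},
        },
    }
    iam_binding = {
        # Granted on the Google service account itself.
        "role": "roles/iam.workloadIdentityUser",
        "member": f"serviceAccount:{project_id}.svc.id.goog[{namespace}/{ksa_name}]",
    }
    return ksa_manifest, iam_binding


# Placeholder values, not taken from the original article.
ksa, binding = workload_identity_binding(
    "my-project", "default", "model-reader", "gcs-model-reader"
)
print(ksa["metadata"]["annotations"])
print(binding["member"])
```

The Google service account also needs read access to the bucket (for example `roles/storage.objectViewer`); that grant is left out here to keep the sketch short.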

**Next:** Create standard GKE clusters and node pools.

Deploy two separate GKE clusters, one in each of your chosen regions.

Enable the Gateway API (`--gateway-api=standard`) and the [Cloud Storage FUSE CSI driver](https://cloud.google.com/kubernetes-engine/docs/concepts/cloud-storage-fuse-csi-driver) (`--addons GcsFuseCsiDriver`) during cluster creation.

Create dedicated TPU v6e node pools (ct6e-standard-4t) for both clusters.

Enable managed DRANET on these TPU node pools by setting the flags `--accelerator-network-profile=auto` and `--node-labels=cloud.google.com/gke-networking-dra-driver=true`.

**Next:** Establish the global mesh via Fleet Registration.

Enable Multi-Cluster Service Discovery and Multi-Cluster Ingress on your fleet.

Designate the cluster in your primary region as the configuration hub (the config cluster), so it acts as the control plane for routing rules across both regions.
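The cluster, node-pool, and fleet steps so far can be collected into a handful of `gcloud` invocations. The sketch below only assembles and prints the commands; it does not run them. The cluster, pool, and region names are placeholders, the `fleet` subcommands are my assumption about which commands correspond to the two fleet features named above, and a real setup needs further flags (Workload Identity pool, node locations, reservation, fleet registration) that are omitted here.

```python
import shlex

DRANET_LABEL = "cloud.google.com/gke-networking-dra-driver=true"


def cluster_create_cmd(cluster, region):
    """Cluster with the Gateway API and the Cloud Storage FUSE CSI driver."""
    return [
        "gcloud", "container", "clusters", "create", cluster,
        f"--region={region}",
        "--gateway-api=standard",
        "--addons", "GcsFuseCsiDriver",
    ]


def tpu_node_pool_cmd(pool, cluster, region):
    """TPU v6e node pool with managed DRANET enabled."""
    return [
        "gcloud", "container", "node-pools", "create", pool,
        f"--cluster={cluster}",
        f"--region={region}",
        "--machine-type=ct6e-standard-4t",
        "--accelerator-network-profile=auto",
        f"--node-labels={DRANET_LABEL}",
    ]


def fleet_cmds(config_membership):
    """Fleet features; assumed mapping to the two features named in the text."""
    return [
        ["gcloud", "container", "fleet", "multi-cluster-services", "enable"],
        ["gcloud", "container", "fleet", "ingress", "enable",
         f"--config-membership={config_membership}"],
    ]


# Placeholder names; substitute your own regions and cluster names.
for region in ("REGION_A", "REGION_B"):
    cluster = f"inference-{region.lower()}"
    print(shlex.join(cluster_create_cmd(cluster, region)))
    print(shlex.join(tpu_node_pool_cmd("tpu-v6e-pool", cluster, region)))

for cmd in fleet_cmds("inference-region_a"):
    print(shlex.join(cmd))
```

Keeping the commands as argument lists makes it easy to review the flags per region before anything is executed.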

**Next:** Deploy the AI workload.

Use a temporary Kubernetes Job to download the Gemma 3 (gemma-3-27b-it) model weights directly into your Cloud Storage bucket.

Define a ResourceClaimTemplate that explicitly requests the managed DRANET device class (deviceClassName: netdev.google.com) with the allocation mode set to "All".

Deploy your inference server (e.g. vLLM) on the TPU nodes in both regions. Ensure the Pod spec uses node selectors for the 2x2 TPU topology, requests exactly 4 TPUs, and references the netdev claim. This ensures your Pods use the dedicated accelerator networking alongside standard Ethernet.
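The claim template and Pod spec described above might look roughly like the manifests below, written here as Python dictionaries that are printed as JSON. Several details are assumptions rather than values from the original article: the `resource.k8s.io/v1` API version (older DRA versions put `deviceClassName` and `allocationMode` directly on the request instead of under `exactly`), all object names, the `tpu-v6e-slice` accelerator label value, the image, and the bucket.

```python
import json


def netdev_claim_template(name="all-netdev"):
    """ResourceClaimTemplate requesting all devices of the DRANET class."""
    return {
        "apiVersion": "resource.k8s.io/v1",  # assumption: depends on cluster version
        "kind": "ResourceClaimTemplate",
        "metadata": {"name": name},
        "spec": {"spec": {"devices": {"requests": [{
            "name": "req-netdev",
            "exactly": {
                "deviceClassName": "netdev.google.com",
                "allocationMode": "All",
            },
        }]}}},
    }


def inference_pod(image, bucket, claim_template="all-netdev"):
    """Pod pinned to a 2x2 TPU v6e slice, with 4 TPUs and the netdev claim."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": "gemma-server",  # hypothetical name
            "annotations": {"gke-gcsfuse/volumes": "true"},
        },
        "spec": {
            "serviceAccountName": "model-reader",  # hypothetical KSA
            "nodeSelector": {
                "cloud.google.com/gke-tpu-accelerator": "tpu-v6e-slice",
                "cloud.google.com/gke-tpu-topology": "2x2",
            },
            "resourceClaims": [{
                "name": "netdev",
                "resourceClaimTemplateName": claim_template,
            }],
            "containers": [{
                "name": "inference-server",
                "image": image,
                "resources": {
                    "requests": {"google.com/tpu": "4"},
                    "limits": {"google.com/tpu": "4"},
                },
                "volumeMounts": [{"name": "model", "mountPath": "/models",
                                  "readOnly": True}],
            }],
            "volumes": [{"name": "model", "csi": {
                "driver": "gcsfuse.csi.storage.gke.io",
                "readOnly": True,
                "volumeAttributes": {"bucketName": bucket},
            }}],
        },
    }


print(json.dumps(netdev_claim_template(), indent=2))
print(json.dumps(inference_pod("IMAGE_PLACEHOLDER", "BUCKET_PLACEHOLDER"), indent=2))
```

In practice the Pod would usually be wrapped in a Deployment per cluster; a bare Pod is used here only to keep the three requirements (topology selector, TPU count, claim reference) visible in one place.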

**Next:** Configure the Multi-Cluster Inference Gateway.

Install the necessary Custom Resource Definitions (CRDs) so Kubernetes can process specialized routing objects like the InferenceObjective.

Deploy an AutoscalingMetric to track hardware utilization, such as KV cache usage.

Use Helm to group the independent AI deployments from both regions into a single, logical InferencePool.

Deploy the Cross-Region Gateway and its associated HTTPRoute to manage incoming global traffic.

Apply health checks and backend policies to the pool to ensure load balancing relies on your custom hardware metrics.

Configure an InferenceObjective to instruct the gateway to route prompts to the region with the highest availability, avoiding overloaded TPUs.
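To make the idea of metric-aware routing concrete, here is a toy model of the decision: stay in the caller's local region while its KV cache has headroom, and otherwise spill to the least-loaded region. This is an illustration of the concept only, not the gateway's actual algorithm; the function name, the threshold, and the utilization numbers are all hypothetical.

```python
def route_prompt(local_region, kv_cache_utilization, overload_threshold=0.8):
    """Pick a region given per-region KV cache utilization in [0, 1].

    Regions that are down are simply absent from the mapping.
    """
    if not kv_cache_utilization:
        raise ValueError("no region reported metrics")
    local = kv_cache_utilization.get(local_region)
    if local is not None and local < overload_threshold:
        return local_region
    # Local region is overloaded or missing: use the least-loaded region.
    return min(kv_cache_utilization, key=kv_cache_utilization.get)


print(route_prompt("region-a", {"region-a": 0.35, "region-b": 0.50}))  # region-a
print(route_prompt("region-a", {"region-a": 0.95, "region-b": 0.50}))  # region-b
print(route_prompt("region-a", {"region-b": 0.50}))                    # region-b
```

The third call covers the failover case from the experiment: when the local region reports nothing at all, traffic goes to whichever region is still serving.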

source & further reading

cloud.google.com — original article


