Scaling LLM Inference: Multi-Node KV Cache Offloading with GKE & Managed Lustre

Google Cloud introduced a multi-node KV cache offloading solution using GKE and Managed Lustre for large language model inference, achieving over 50% TCO savings and nearly 60% reduction in GPU-hour requirements for Llama-3.3-70B on a six-node A3 Mega cluster. The approach offloads shared prefilled KV caches to Lustre's high-performance tier with a 95% cache hit rate, and a hybrid CPU RAM offload extension improves TTFT by 40% and end-to-end latency by 30%.

Significant contributors to this article include Sneha Aradhey, Software Engineer, Google Kubernetes Engine, and Michael MacDonald, Sr Software Engineer, Google Cloud Managed Lustre.

Enterprise production environments are shifting to distributed, multi-node architectures to serve long context windows and agentic AI. As these workloads scale, KV caches often outgrow local CPU RAM and host SSD cache tiers. To handle this, some setups pool node-local storage into a distributed layer, such as multi-node pooled NVMe arrays. Pooling SSDs aggregates raw capacity and often uses otherwise spare local drives, which are clear advantages. The limitation is that the compute cluster must manage its own complex data distribution and cross-node replication.

An alternative is to offload the attention state to a dedicated, high-performance external parallel filesystem. We use Google Cloud Managed Lustre with the llm-d offloading stack as a cluster-wide decentralized attention cache tier, bypassing host-level capacity limits and eliminating the networking overhead of managing local pooled drives.

With this approach, we achieve efficiency at scale: Google Cloud Managed Lustre enables over 50% TCO savings and reduces GPU-hour requirements for Llama-3.3-70B inference on a six-node A3 Mega cluster by nearly 60%. These gains are realized by offloading shared, prefilled KV caches to Lustre's high-performance tier with a 95% cache hit rate.

Benchmark Configuration

- Model: Llama-3.3-70B
- Context Dynamics: Prompt length of 50,000 tokens, input question length of 256 tokens, and output length of 512 tokens.

Extension of Lustre KV Cache solution with CPU RAM offload

The Managed Lustre KV Cache offload architecture can be extended by integrating offload to CPU RAM.
This hybrid approach significantly improves performance compared to CPU offload only (https://github.com/llm-d/llm-d/tree/main/guides/tiered-prefix-cache#llm-d-fs-connector--lustre), delivering approximately 40% improvement in Time to First Token (TTFT) and a 30% reduction in end-to-end latency for Llama-3.3-70B inference.

User Guide

Architectural Components

- GKE GPU Nodes: Dedicated accelerator resources provisioned exclusively for high-throughput model execution and tensor-parallel operations.
- Managed Lustre: A shared, high-bandwidth parallel filesystem acting as a centralized external tier that caches prefilled attention states to eliminate redundant prefill computation.
- PVC Evictor: A scalable, distributed garbage collection service that tracks file access patterns and automatically removes Least-Recently-Used (LRU) cache chunks to maintain healthy storage headroom.

Target Models

This guide provides two distinct, validated tracks for deployment depending on your model preference:

- Qwen Series: Qwen/Qwen3.5-35B-A3B
- Gemma 4 Architecture: google/gemma-4-31B-it

[Figure: Architectural Diagram]

Before You Begin

Before starting this deployment, ensure your Google Cloud project is properly configured:

- Quota: Verify you have sufficient quota for the selected accelerators in your chosen region, as well as adequate general CPU, memory, and Managed Lustre quotas.
- Validate Required IAM Permissions for Managed Lustre.
- Prepare your Environment to Connect to Managed Lustre: Complete the "Before You Begin" steps (https://docs.cloud.google.com/managed-lustre/docs/lustre-csi-driver-new-volume#before_you_begin) to enable APIs, set up environment variables, and set up your VPC.
- GKE Version: The Managed Lustre CSI driver (https://docs.cloud.google.com/kubernetes-engine/docs/concepts/managed-lustre) is supported on GKE versions 1.33 or later. For the best experience and default port 988 usage, GKE version 1.33.2-gke.4780000 or later is recommended.
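As a quick way to check the GKE version recommendation above, the following sketch compares a version string against 1.33.2-gke.4780000. The function name `gke_version_at_least` is hypothetical, and the comparison assumes versions of the form MAJOR.MINOR.PATCH-gke.BUILD; treat it as a convenience check rather than an official compatibility test.

```shell
# Hypothetical helper (not part of gcloud or kubectl): succeeds when the GKE
# version in $1 is greater than or equal to the minimum version in $2.
# Assumes the MAJOR.MINOR.PATCH-gke.BUILD format, e.g. 1.33.2-gke.4780000.
gke_version_at_least() {
  local have want h w i
  # Turn "1.33.2-gke.4780000" into the space-separated fields "1 33 2 4780000".
  have=$(printf '%s' "$1" | sed 's/-gke\./ /; s/\./ /g')
  want=$(printf '%s' "$2" | sed 's/-gke\./ /; s/\./ /g')
  # Compare major, minor, patch, and build in order; the first difference decides.
  for i in 1 2 3 4; do
    h=$(printf '%s' "$have" | cut -d' ' -f"$i")
    w=$(printf '%s' "$want" | cut -d' ' -f"$i")
    h=${h:-0}
    w=${w:-0}
    if [ "$h" -gt "$w" ]; then return 0; fi
    if [ "$h" -lt "$w" ]; then return 1; fi
  done
  return 0  # every field is equal
}
```

One way to obtain the version to test is `gcloud container clusters describe "$CLUSTER_NAME" --zone "$ZONE" --format="value(currentMasterVersion)"`, passing its output as the first argument and `1.33.2-gke.4780000` as the second.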
Overview of Required Steps

- Create the GKE Cluster
- Create the GPU Compute node pool
- Provision Lustre storage
- Deploy vLLM Serving Engine with Lustre
- Deploy the PVC Evictor
- Clean Up

1. Create the GKE Cluster

Create a rapid-channel GKE cluster with Workload Identity and all necessary CSI storage add-ons enabled (Lustre, GCSFuse, and Persistent Disk).

    export CLUSTER_NAME="<INSERT_CLUSTER_NAME>"
    export ZONE="<INSERT_ZONE>"
    export PROJECT_ID="<INSERT_PROJECT>"
    export NETWORK_NAME="<INSERT_NETWORK>"

    gcloud container clusters create "$CLUSTER_NAME" \
      --zone "$ZONE" \
      --release-channel "rapid" \
      --num-nodes "1" \
      --network "${NETWORK_NAME}" \
      --addons "HorizontalPodAutoscaling,HttpLoadBalancing,GcePersistentDiskCsiDriver,GcsFuseCsiDriver,LustreCsiDriver" \
      --workload-pool "${PROJECT_ID}.svc.id.goog" \
      --enable-managed-prometheus \
      --enable-ip-alias \
      --enable-shielded-nodes \
      --shielded-integrity-monitoring \
      --no-shielded-secure-boot \
      --node-locations "$ZONE" \
      --gateway-api=standard

2. Create the GPU Compute Node Pool

Provision a GPU VM node pool (e.g., a3-megagpu-4g, a4-highgpu-4g, etc.).

    gcloud beta container node-pools create gpu-vm-nodepool \
      --location="$ZONE" \
      --cluster="$CLUSTER_NAME" \
      --project="$PROJECT_ID" \
      --accelerator="type=<INSERT_GPU_ACCELERATOR_NAME>,count=<INSERT_GPU_COUNT>,gpu-driver-version=LATEST" \
      --machine-type="<INSERT_GPU_COMPUTE_VM_MACHINE_TYPE>" \
      --num-nodes="<INSERT_NODE_COUNT>" \
      --enable-gvnic \
      --no-enable-autoupgrade

    # Fetch cluster credentials
    gcloud container clusters get-credentials "$CLUSTER_NAME" --zone "$ZONE"

3. Provision Lustre Storage (Auto-provisioned)

Before deploying vLLM, you need to provision the Lustre storage. We use an auto-provisioned Lustre instance via a StorageClass and a PersistentVolumeClaim (PVC). Create a file named lustre-pvc.yaml with the following content:

    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: lustre-class
    provisioner: lustre.csi.storage.gke.io
    volumeBindingMode: Immediate
    reclaimPolicy: Delete
    mountOptions:
      - localflock
    parameters:
      perUnitStorageThroughput: "<CHOOSE_PERFORMANCE_TIER>"  # See options below.
      network: "<INSERT_NETWORK_NAME>"
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: lustre-pvc
    spec:
      accessModes:
        - ReadWriteMany
      resources:
        requests:
          # Range from 9000Gi to 84016000Gi; increments and ranges are Lustre tier-dependent.
          storage: <INSERT_CAPACITY_GiB>
      storageClassName: lustre-class

Notes: Performance tier options are "125", "250", "500", and "1000". Per-tier capacity ranges and increments are listed at https://docs.cloud.google.com/managed-lustre/docs/performance-tiers.

Apply this manifest to provision the Lustre instance and observe provisioning:

    # 1. Submit the file to the cluster (finishes instantly)
    kubectl apply -f lustre-pvc.yaml

    # 2. Watch the live provisioning stream until it says "Bound"
    kubectl get pvc lustre-pvc -w

4. Deploy vLLM Serving Engine with Lustre

Step 4a: Create the Hugging Face Access Secret

Before submitting the deployment manifest, you must provision your Hugging Face API token (https://huggingface.co/docs/hub/en/security-tokens) as a secure secret within the cluster.
Run the following command, replacing <INSERT_HF_TOKEN> with your token:

    kubectl create secret generic hf-token-secret \
      --from-literal=token="<INSERT_HF_TOKEN>" \
      --namespace=default

Step 4b: Create the vLLM Deployment Manifest

This complete Kubernetes manifest deploys the vLLM engine, configures the llmd-fs-connector for high-performance KV caching, and mounts your parallel Lustre storage (lustre-pvc). Save it as vllm-lustre-deployment.yaml.

Common Manifest (choose between Qwen3.5 or Gemma 4)

Replace the placeholder values between < and > with appropriate values for your environment.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: vllm-storage
      namespace: default
      labels:
        app: vllm-storage
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: vllm-storage
      template:
        metadata:
          labels:
            app: vllm-storage
        spec:
          nodeSelector:
            cloud.google.com/gke-accelerator: nvidia-h100-80gb
          tolerations:
            - key: "nvidia.com/gpu"
              operator: "Exists"
              effect: "NoSchedule"
          securityContext:
            fsGroup: <YOUR_NON_ROOT_GID>
            runAsUser: <YOUR_NON_ROOT_UID>
          volumes:
            - name: lustre-storage
              persistentVolumeClaim:
                claimName: lustre-pvc
            - name: shm
              emptyDir:
                medium: Memory
                sizeLimit: "200Gi"
          containers:
            - name: vllm-storage
              image: vllm/vllm-openai:v0.23.0-cu129
              volumeMounts:
                - mountPath: /mnt/files-storage
                  name: lustre-storage
                # Mount the shm volume declared above as shared memory for the tensor-parallel workers.
                - mountPath: /dev/shm
                  name: shm
              command:
                - "/bin/bash"
              args:
                - "-c"
                - |
                  set -x
                  export USER=vllm
                  export LOGNAME=vllm
                  pip install --user msgpack
                  pip install 'llmd-fs-connector==0.23' --extra-index-url https://llm-d.github.io/llm-d-kv-cache/simple/

                  # <MODEL_NAME>: google/gemma-4-31B-it OR Qwen/Qwen3.5-35B-A3B
                  # <BLOCK_SIZE>: 256 for Gemma or 528 for Qwen3.5
                  vllm serve <MODEL_NAME> \
                    --download-dir /model/models \
                    --load-format auto \
                    --kv-transfer-config '{
                      "kv_connector": "MultiConnector",
                      "kv_role": "kv_both",
                      "kv_connector_extra_config": {
                        "connectors": [
                          {
                            "kv_connector": "OffloadingConnector",
                            "kv_role": "kv_both",
                            "kv_connector_extra_config": {
                              "cpu_bytes_to_use": 64424509440,
                              "lazy_offload": true
                            }
                          },
                          {
                            "kv_connector": "OffloadingConnector",
                            "kv_role": "kv_both",
                            "kv_connector_extra_config": {
                              "spec_name": "SharedStorageOffloadingSpec",
                              "spec_module_path": "llmd_fs_backend.spec",
                              "shared_storage_path": "/mnt/files-storage/llmd-kv-cache/",
                              "threads_per_gpu": 32,
                              "block_size": <BLOCK_SIZE>
                            }
                          }
                        ]
                      }
                    }' \
                    --distributed-executor-backend "mp" \
                    --port 8000 \
                    --max-num-batched-tokens 16384 \
                    --enable-chunked-prefill \
                    --max-model-len 32000 \
                    --gpu-memory-utilization 0.92 \
                    --tensor-parallel-size "4" \
                    --prefix-caching-hash-algo sha256_cbor \
                    --enable-prefix-caching \
                    --enforce-eager \
                    --no-disable-hybrid-kv-cache-manager
              env:
                - name: HUGGING_FACE_HUB_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: hf-token-secret
                      key: token
              # ... probes ...
              resources:
                requests:
                  nvidia.com/gpu: "4"
                limits:
                  nvidia.com/gpu: "4"

Note: Qwen3.5 specifically requires a block size of 528 to avoid fragmentation, while Gemma 4 works well with the default of 256.
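The `cpu_bytes_to_use` value in the manifest above is a raw byte count: 64424509440 bytes is exactly 60 GiB. If you want to size the CPU RAM tier differently, a small conversion helper such as the sketch below avoids hand-computing the number. The function names `gib_to_bytes` and `bytes_to_gib` are hypothetical and are not part of vLLM or llm-d.

```shell
# Hypothetical helper: convert a whole number of GiB to bytes (1 GiB = 1024^3 bytes).
gib_to_bytes() {
  echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Hypothetical inverse: convert a byte count to whole GiB, rounding down.
bytes_to_gib() {
  echo $(( $1 / (1024 * 1024 * 1024) ))
}

# The CPU tier size used in the manifest, expressed in GiB.
gib_to_bytes 60
```

Leave headroom for the vLLM process itself and the 200Gi memory-backed shm volume when choosing a value, since all of them draw on the same host RAM.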
Step 4c: Apply and Verify Deployment

To apply this manifest to your cluster, run:

    kubectl apply -n default -f vllm-lustre-deployment.yaml

Step 4d: Track Model Download Status

Because large models can take some time to download on first boot, wait for the rollout to complete and stream the container logs to follow initialization:

    # Wait for the deployment rollout to complete
    kubectl rollout status deployment/vllm-storage

    # Stream the container logs to follow the model download and startup
    kubectl logs -f deployment/vllm-storage

5. Deploy the PVC Evictor

PVC Evictor Overview

Architecture & Role

The llmd_fs_backend connector offloads KV-cache blocks to Lustre but does not natively delete old cache files. Over time, the cache will fill the shared filesystem. The PVC Evictor acts as an external garbage collector that continuously monitors disk usage and evicts least-recently-used (LRU) files to maintain healthy storage headroom.

Scaling & Sharding

The PVC Evictor supports sharding and can be scaled to multiple replicas to match the capacity and performance of your Lustre instance. As a rule of thumb, deploy 1 evictor replica for each 72 TB of Lustre capacity to distribute the eviction load effectively without overwhelming the metadata servers. For large-scale deployments, the evictor can be configured to run with multiple shards. When running in multi-replica mode, the workload is partitioned across pods, with each pod managing a specific shard of the cache namespace. This prevents redundant metadata scans and race conditions.

High-Performance Resource Requirements

Running the evictor at high scale (e.g., with 16 parallel crawler processes) requires significant CPU and memory resources to handle the rapid scanning and queue management of millions of files.
Ensure that the pods are provisioned with sufficient resources (e.g., 12 CPU requests and 8Gi memory requests) and scheduled on appropriate node types such as c4-standard-16.

PVC Evictor Deployment Steps

The PVC Evictor is deployed via Helm using the chart located in kv_connectors/pvc_evictor/helm.

Step 5a: Create a Dedicated Node Pool for the Evictor

Running the evictor at high scale requires significant CPU and memory. First, create a dedicated node pool using a high-performance machine type such as c4-standard-16 to accommodate the 12 CPU and 8Gi memory requests needed per pod. A c4-standard-16 node has 16 vCPUs, so each node fits one such pod; provision one node per evictor replica.

    # Create a dedicated node pool for the PVC Evictor (one node per evictor replica)
    gcloud container node-pools create evictor-pool \
      --location="$ZONE" \
      --cluster="$CLUSTER_NAME" \
      --project="$PROJECT_ID" \
      --machine-type="c4-standard-16" \
      --num-nodes="2"

Step 5b: Install via Helm (High-Performance Configuration)

Deploy a scaled, high-performance evictor pool with 2 replicas to monitor lustre-pvc. This configuration uses 16 crawler processes per pod to handle massive file namespaces.

Note on Security Contexts: To allow the evictor pod to delete files created by vLLM, it must run with matching security context IDs. Ensure the placeholders <YOUR_NON_ROOT_GID> and <YOUR_NON_ROOT_UID> exactly match the non-root values used in the securityContext of your vLLM deployment to ensure shared POSIX file permissions.
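When choosing a replica count, the rule of thumb given earlier (1 evictor replica for each 72 TB of Lustre capacity) can be turned into a quick calculation. The sketch below rounds up so that a partially filled 72 TB increment still gets its own replica; the function name `evictor_replicas` is hypothetical, and the round-up behavior is an assumption of this sketch rather than documented chart behavior.

```shell
# Hypothetical helper: number of PVC Evictor replicas for a Lustre capacity
# given in whole TB, using the rule of thumb of 1 replica per 72 TB.
# Rounding up (ceiling division) is an assumption made for this sketch.
evictor_replicas() {
  local capacity_tb="$1"
  local per_replica_tb=72
  local replicas=$(( (capacity_tb + per_replica_tb - 1) / per_replica_tb ))
  # Always run at least one replica, even for small instances.
  if [ "$replicas" -lt 1 ]; then replicas=1; fi
  echo "$replicas"
}

# Example: replica count for a 144 TB instance.
evictor_replicas 144
```

The printed value is what you would pass to Helm as replicaCount, and it also tells you how many evictor nodes to provision.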
    git clone --depth 1 https://github.com/llm-d/llm-d-kv-cache.git
    cd llm-d-kv-cache/kv_connectors/pvc_evictor

    helm install pvc-evictor ./helm \
      --namespace default \
      --set replicaCount=2 \
      --set config.numCrawlerProcesses=16 \
      --set config.deletionBatchSize=5000 \
      --set config.fileQueueMinSize=1000000 \
      --set config.fileQueueMaxsize=2000000 \
      --set config.fileAccessTimeThresholdMinutes=10 \
      --set securityContext.container.runAsNonRoot=false \
      --set pvc.name="lustre-pvc" \
      --set config.cleanupThreshold=85.0 \
      --set config.targetThreshold=70.0 \
      --set config.cacheDirectory="llmd-kv-cache" \
      --set securityContext.pod.fsGroup=<YOUR_NON_ROOT_GID> \
      --set securityContext.container.runAsUser=<YOUR_NON_ROOT_UID> \
      --set resources.requests.cpu=12 \
      --set resources.requests.memory=8Gi \
      --set resources.limits.cpu=15 \
      --set resources.limits.memory=16Gi \
      --set nodeSelector."cloud\.google\.com/gke-nodepool"=evictor-pool \
      --set securityContext.pod.seLinuxOptions.level="s0:c0\,c1"

Critical Parameters Explained:

- replicaCount=2: Deploys 2 evictor pods. The Helm chart automatically configures sharding (totalShards=2) when multiple replicas are used.
- config.numCrawlerProcesses=16: Runs 16 parallel crawler processes per pod to scan the filesystem rapidly.
- config.deletionBatchSize=5000: Deletes files in batches of 5000 to reduce metadata overhead.
- config.fileQueueMinSize & config.fileQueueMaxsize: Configures large memory queues (1M min, 2M max) to buffer files for deletion, matching the high crawler throughput.
- config.fileAccessTimeThresholdMinutes=10: Aggressively evicts files that haven't been accessed in the last 10 minutes when the cleanup threshold is triggered.
- securityContext.container.runAsNonRoot=false: Required if the evictor needs root-like permissions to manage or delete files across different user ownerships on the shared storage.
- resources.requests & limits: Allocates 12-15 CPUs and 8-16Gi of memory per pod so that the large number of crawler processes is not CPU-throttled and does not run out of memory (OOM).

Step 5c: Verify and Monitor

    # Verify pod status
    kubectl get pods -l app.kubernetes.io/name=pvc-evictor -n default

6. Clean Up

Because this deployment provisions significant, high-cost hardware, be sure to clean up your environment when you are done to avoid unnecessary charges.

    helm uninstall pvc-evictor && kubectl delete -f vllm-lustre-deployment.yaml

    kubectl delete pvc lustre-pvc

    # Delete the cluster (this also deletes the associated node pools)
    gcloud container clusters delete "$CLUSTER_NAME" \
      --zone "$ZONE" \
      --project "$PROJECT_ID" \
      --quiet

    # Note: The Lustre StorageClass reclaimPolicy is set to Delete,
    # so destroying the PVC or Cluster will automatically clean up the underlying Lustre storage.

Appendix: Reference Configuration for Llama-3.3-70B Benchmark

The following configuration is a representation of the deployment manifest used to generate the Llama-3.3-70B benchmark results referenced in this post. It is provided for completeness and transparency.

Note: This configuration uses an earlier iteration of the software stack (vLLM v0.15.0) and specific infrastructure flags that were active in the benchmarking environment at the time the data was collected.
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: vllm-storage
      namespace: default
      labels:
        app: vllm-storage
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: vllm-storage
      template:
        metadata:
          labels:
            app: vllm-storage
        spec:
          volumes:
            - name: lustre-storage
              persistentVolumeClaim:
                claimName: lustre-pvc
            - name: shm
              emptyDir:
                medium: Memory
                sizeLimit: "200Gi"
            - name: kv-store-disk
              persistentVolumeClaim:
                claimName: lustre-pvc
          containers:
            - name: vllm-storage
              image: vllm/vllm-openai:v0.15.0
              command:
                - "/bin/bash"
              args:
                - "-c"
                - |
                  pip install https://raw.githubusercontent.com/kfirtoledo/llm-d-kv-cache-manager/connector/kv_connectors/llmd_fs_backend/wheels/llmd_fs_connector-0.1.0-cp312-cp312-linux_x86_64.whl
                  mkdir -p /tmp/prometheus_metrics
                  export PROMETHEUS_MULTIPROC_DIR=/tmp/prometheus_metrics
                  vllm serve meta-llama/Llama-3.3-70B-Instruct \
                    --download-dir /model/models \
                    --load-format runai_streamer \
                    --kv-transfer-config '{
                      "kv_connector": "OffloadingConnector",
                      "kv_role": "kv_both",
                      "kv_connector_extra_config": {
                        "spec_name": "SharedStorageOffloadingSpec",
                        "spec_module_path": "llmd_fs_backend.spec",
                        "shared_storage_path": "/mnt/files-storage/llmd-kv-cache/",
                        "block_size": 1024,
                        "threads_per_gpu": "64"
                      }
                    }' \
                    --distributed-executor-backend "mp" \
                    --port 8000 \
                    --max-num-batched-tokens 16384 \
                    --enable-chunked-prefill \
                    --tensor-parallel-size 8 \
                    --enable-prefix-caching \
                    --gpu-memory-utilization 0.9
              env:
                - name: HUGGING_FACE_HUB_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: hf-token-secret
                      key: token
                - name: VLLM_EXECUTE_MODEL_TIMEOUT_SECONDS
                  value: "3000"
                - name: PYTHONHASHSEED
                  value: "123"
              ports:
                - containerPort: 8000
              resources:
                limits:
                  nvidia.com/gpu: "8"
                requests:
                  cpu: "200"
                  memory: 1024G
                  ephemeral-storage: 5120Gi
                  nvidia.com/gpu: "8"
              volumeMounts:
                - name: lustre-storage
                  mountPath: /model
                - mountPath: /root/.cache/huggingface
                  name: lustre-storage
                  subPath: huggingface-cache
                - name: shm
                  mountPath: /dev/shm
                - mountPath: /mnt/files-storage
                  name: kv-store-disk
              # ... probes omitted for brevity ...
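To give a rough sense of the data volume behind the benchmark configuration (Llama-3.3-70B with a 50,000-token prompt), the following sketch estimates the KV cache size per token and per prompt. The model dimensions used here (80 layers, 8 KV heads, head dimension 128, 2-byte bf16 elements) are assumptions taken from the public Llama-3.3-70B model configuration, not from this post, and the function name `kv_bytes_per_token` is hypothetical.

```shell
# Hypothetical helper: estimate KV cache bytes per token as
# 2 (K and V) x layers x KV heads x head dimension x bytes per element.
kv_bytes_per_token() {
  local layers="$1" kv_heads="$2" head_dim="$3" bytes_per_elem="$4"
  echo $(( 2 * layers * kv_heads * head_dim * bytes_per_elem ))
}

# Assumed Llama-3.3-70B dimensions (from the public model config, not from this post).
per_token=$(kv_bytes_per_token 80 8 128 2)
prompt_tokens=50000   # prompt length from the benchmark configuration
prompt_bytes=$(( per_token * prompt_tokens ))

echo "KV bytes per token: ${per_token}"
echo "KV bytes for a ${prompt_tokens}-token prompt: ${prompt_bytes}"
```

Under these assumptions, each token needs 327,680 bytes (320 KiB) of KV cache, so a single 50,000-token prompt comes to roughly 16.4 GB. That is the amount of prefill state a cache hit on Lustre avoids recomputing, and it shows how quickly shared long prompts can exceed host RAM and local SSD tiers.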