# Scaling LLM Inference: Multi-Node KV Cache Offloading with GKE & Managed Lustre

> Source: <https://cloud.google.com/blog/topics/developers-practitioners/scaling-llm-inference-multi-node-kv-cache-offloading-with-gke-managed-lustre/>
> Published: 2026-07-01 07:00:00+00:00

*Significant contributors to this article include **Sneha Aradhey**, Software Engineer, Google Kubernetes Engine, and **Michael MacDonald**, Sr Software Engineer, Google Cloud Managed Lustre.*

Enterprise production environments are shifting to distributed, multi-node architectures to serve long-context and agentic AI workloads. As these workloads scale, KV caches often outgrow the local CPU RAM and host SSD cache tiers.

To handle this, some setups pool node-local storage into a distributed layer (such as multi-node pooled NVMe arrays). Pooling SSDs aggregates raw capacity and can put spare local drives to use, which are clear advantages. The limitation is that the compute cluster must manage its own complex data distribution and cross-node replication.

An alternative is to offload the attention state to a dedicated, high-performance external parallel filesystem. We utilize **Google Cloud Managed Lustre with the llm-d offloading stack** as a cluster-wide decentralized attention cache tier, bypassing host-level capacity limits and eliminating the networking overhead of managing local pooled drives.

With this approach, we achieve efficiency at scale:

**Google Cloud Managed Lustre enables over 50% TCO savings and reduces GPU-hour requirements for Llama-3.3-70B inference on a six-node A3 Mega cluster by nearly 60%. These gains are realized by offloading shared, prefilled KV caches to Lustre’s high-performance tier with a 95% cache hit rate.**

#### Benchmark Configuration

- **Model:** Llama-3.3-70B
- **Context Dynamics:** Prompt length of 50,000 tokens, input question length of 256 tokens, and output length of 512 tokens.
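
To see why a 50,000-token prompt strains node-local cache tiers, the KV cache footprint of one prompt can be estimated from the model shape. The sketch below assumes the published Llama-3.3-70B architecture (80 layers, 8 KV heads under grouped-query attention, head dimension 128) and 2-byte cache entries; the function name is illustrative, and the estimate ignores quantization and framework overhead.

```shell
# Rough KV cache size estimate. Assumes Llama-3.3-70B's shape:
# 80 layers, 8 KV heads (GQA), head dimension 128, 2 bytes per value.
kv_cache_bytes() {
  tokens=$1
  layers=80
  kv_heads=8
  head_dim=128
  bytes_per_value=2
  # The factor 2 covers the separate key and value tensors in every layer.
  echo $(( 2 * layers * kv_heads * head_dim * bytes_per_value * tokens ))
}

prompt_tokens=50000
total=$(kv_cache_bytes "$prompt_tokens")
echo "Per token: $(kv_cache_bytes 1) bytes"
echo "Prompt of ${prompt_tokens} tokens: ${total} bytes (~$(( total / 1000000000 )) GB)"
```

Under these assumptions each token costs 327,680 bytes, so a single 50,000-token prefix holds roughly 16.4 GB of attention state. That is the data a cache hit on the shared tier avoids recomputing.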

#### Extension of Lustre KV Cache solution with CPU RAM offload

The Managed Lustre KV cache offload architecture can be extended by also offloading to CPU RAM. This hybrid approach [significantly improves performance compared to CPU offload only](https://github.com/llm-d/llm-d/tree/main/guides/tiered-prefix-cache#llm-d-fs-connector--lustre), delivering an approximately 40% improvement in Time to First Token (TTFT) and a 30% reduction in end-to-end latency for Llama-3.3-70B inference.

### User Guide

#### Architectural Components

- **GKE GPU Nodes:** Dedicated accelerator resources provisioned exclusively for high-throughput model execution and tensor-parallel operations.
- **Managed Lustre:** A shared, high-bandwidth parallel filesystem acting as a centralized external tier that caches prefilled attention states to eliminate redundant prefill computation.
- **PVC Evictor:** A scalable, distributed garbage collection service that tracks file access patterns and automatically removes Least-Recently-Used (LRU) cache chunks to maintain healthy storage headroom.

#### Target Models

This guide provides two distinct, validated tracks for deployment depending on your model preference:

**Qwen Series:** `Qwen/Qwen3.5-35B-A3B`

**Gemma 4 Architecture:** `google/gemma-4-31B-it`

#### Before You Begin

Before starting this deployment, ensure your Google Cloud project is properly configured:

- **Quota:** Verify you have sufficient quota for the selected accelerators in your chosen region, as well as adequate general CPU, memory, and Managed Lustre quotas.
- **IAM:** Validate that you have the required IAM permissions for Managed Lustre.
- **Prepare your Environment to Connect to Managed Lustre:** Complete the “[Before You Begin](https://docs.cloud.google.com/managed-lustre/docs/lustre-csi-driver-new-volume#before_you_begin)” steps to enable APIs, set up environment variables, and set up your VPC.
- **GKE Version:** The [Managed Lustre CSI driver](https://docs.cloud.google.com/kubernetes-engine/docs/concepts/managed-lustre) is supported on GKE versions **1.33 or later**. For the best experience and default port (988) usage, GKE version **1.33.2-gke.4780000 or later** is recommended.
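
The version recommendation can be checked mechanically. The following is a minimal sketch that assumes GNU `sort -V` is available for version ordering; `version_at_least` and `check_cluster_version` are illustrative helper names, and the second one reads the `currentMasterVersion` field from `gcloud container clusters describe`.

```shell
# Succeeds when version $1 is at least version $2 (GNU `sort -V` ordering).
version_at_least() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical wrapper: reads the control-plane version of an existing cluster.
# Expects CLUSTER_NAME and ZONE to be exported, as in the cluster creation step.
check_cluster_version() {
  current=$(gcloud container clusters describe "$CLUSTER_NAME" \
    --zone "$ZONE" --format='value(currentMasterVersion)')
  if version_at_least "$current" "1.33.2-gke.4780000"; then
    echo "OK: $current meets the recommended minimum"
  else
    echo "Upgrade recommended: $current is older than 1.33.2-gke.4780000"
  fi
}
```

Call `check_cluster_version` once the cluster exists; before that, `version_at_least` can be used on any version string you are considering.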

#### Overview of Required Steps

- Create the GKE Cluster
- Create the GPU Compute node pool
- Provision Lustre storage
- Deploy vLLM Serving Engine with Lustre
- Deploy the PVC Evictor
- Clean Up

#### 1. Create the GKE Cluster

Create a rapid-channel GKE cluster with Workload Identity and all necessary CSI storage add-ons enabled (Lustre, GCSFuse and Persistent Disk).

```shell
export CLUSTER_NAME="<INSERT CLUSTER NAME>"
export ZONE="<INSERT ZONE>"
export PROJECT_ID="<INSERT PROJECT>"
export NETWORK_NAME="<INSERT NETWORK>"

gcloud container clusters create "$CLUSTER_NAME" \
  --zone "$ZONE" \
  --release-channel "rapid" \
  --num-nodes "1" \
  --network "${NETWORK_NAME}" \
  --addons "HorizontalPodAutoscaling,HttpLoadBalancing,GcePersistentDiskCsiDriver,GcsFuseCsiDriver,LustreCsiDriver" \
  --workload-pool "${PROJECT_ID}.svc.id.goog" \
  --enable-managed-prometheus \
  --enable-ip-alias \
  --enable-shielded-nodes \
  --shielded-integrity-monitoring \
  --no-shielded-secure-boot \
  --node-locations "$ZONE" \
  --gateway-api=standard
```

#### 2. Create the GPU Compute Node Pool

Provision a GPU VM node pool (e.g., `a3-megagpu-4g`, `a4-highgpu-4g`).

```shell
gcloud beta container node-pools create gpu-vm-nodepool \
  --location="$ZONE" \
  --cluster="$CLUSTER_NAME" \
  --project="$PROJECT_ID" \
  --accelerator="type=<INSERT GPU_ACCELERATOR_NAME>,count=<INSERT GPU_COUNT>,gpu-driver-version=LATEST" \
  --machine-type="<INSERT GPU_COMPUTE_VM_MACHINE TYPE>" \
  --num-nodes="<INSERT NODE_COUNT>" \
  --enable-gvnic \
  --no-enable-autoupgrade

# Fetch cluster credentials
gcloud container clusters get-credentials "$CLUSTER_NAME" --zone "$ZONE"
```

#### 3. Provision Lustre Storage (Auto-provisioned)

Before deploying vLLM, you need to provision the Lustre storage. We use an auto-provisioned Lustre instance via a `StorageClass` and a `PersistentVolumeClaim` (PVC).

Create a file named `lustre-pvc.yaml` with the following content:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: lustre-class
provisioner: lustre.csi.storage.gke.io
volumeBindingMode: Immediate
reclaimPolicy: Delete
mountOptions:
  - localflock
parameters:
  perUnitStorageThroughput: "<CHOOSE_PERFORMANCE_TIER>" # See options below.
  network: "<INSERT NETWORK_NAME>"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: lustre-pvc
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: <INSERT CAPACITY_GiB> # Range from 9000Gi to 84016000Gi, increments and ranges are Lustre tier-dependent.
  storageClassName: lustre-class
```

Note: Performance tier options are “125”, “250”, “500”, and “1000”. Per-tier capacity ranges and increments are listed in the [Managed Lustre performance tiers documentation](https://docs.cloud.google.com/managed-lustre/docs/performance-tiers).
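
When choosing a tier, it helps to translate it into aggregate bandwidth for the instance. The sketch below assumes the tier value is throughput in MB/s per TiB of provisioned capacity (verify this against the performance tiers documentation); the function name and the 18,000 GiB capacity are illustrative only and may not be a valid size for every tier.

```shell
# Assumption: perUnitStorageThroughput is in MB/s per TiB of provisioned capacity.
lustre_throughput_mbps() {
  capacity_gib=$1   # PVC storage request, in GiB
  tier=$2           # perUnitStorageThroughput value: 125, 250, 500, or 1000
  echo $(( capacity_gib * tier / 1024 ))
}

# Example: a hypothetical 18,000 GiB instance at each tier.
for tier in 125 250 500 1000; do
  echo "tier ${tier}: $(lustre_throughput_mbps 18000 "$tier") MB/s aggregate"
done
```

Because throughput scales with capacity under this assumption, a KV cache tier that needs more bandwidth can get it either from a higher tier or from a larger instance.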

Apply this manifest to provision the Lustre instance and observe provisioning:

```shell
# 1. Submit the file to the cluster (finishes instantly)
kubectl apply -f lustre-pvc.yaml

# 2. Watch the live provisioning stream until it says "Bound"
kubectl get pvc lustre-pvc -w
```

#### 4. Deploy vLLM Serving Engine with Lustre

**Step 4a: Create the Hugging Face Access Secret**

Before submitting the deployment manifest, you must provision your Hugging Face API [token](https://huggingface.co/docs/hub/en/security-tokens) as a secure secret within the cluster.

Run the following command, replacing `<INSERT_HF_TOKEN>` with your token:

```shell
kubectl create secret generic hf-token-secret \
  --from-literal=token="<INSERT_HF_TOKEN>" \
  --namespace=default
```

**Step 4b: Create the vLLM Deployment Manifest**

This complete Kubernetes manifest deploys the vLLM engine, configures the `llmd-fs-connector` for high-performance KV caching, and mounts your parallel Lustre storage (`lustre-pvc`).

Common Manifest (choose between Qwen3.5 or Gemma 4)

Replace the placeholder values between `<>` with appropriate values for your environment, then save the manifest as `vllm-lustre-deployment.yaml`.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-storage
  namespace: default
  labels:
    app: vllm-storage
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-storage
  template:
    metadata:
      labels:
        app: vllm-storage
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-h100-80gb # Adjust to match your node pool's accelerator.
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
      securityContext:
        fsGroup: <YOUR_NON_ROOT_GID>
        runAsUser: <YOUR_NON_ROOT_UID>
      volumes:
        - name: lustre-storage
          persistentVolumeClaim:
            claimName: lustre-pvc
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: "200Gi"
      containers:
        - name: vllm-storage
          image: vllm/vllm-openai:v0.23.0-cu129
          volumeMounts:
            - mountPath: /mnt/files-storage
              name: lustre-storage
            # Backs /dev/shm with the in-memory volume declared above,
            # as in the appendix manifest.
            - mountPath: /dev/shm
              name: shm
          command:
            - "/bin/bash"
          args:
            - "-c"
            - |
              set -x
              export USER=vllm
              export LOGNAME=vllm
              pip install --user msgpack
              pip install 'llmd-fs-connector==0.23' --extra-index-url https://llm-d.github.io/llm-d-kv-cache/simple/

              # <MODEL_NAME>: google/gemma-4-31B-it OR Qwen/Qwen3.5-35B-A3B
              # <BLOCK_SIZE>: 256 for Gemma 4 or 528 for Qwen3.5
              vllm serve <MODEL_NAME> \
                --download-dir /model/models \
                --load-format auto \
                --kv-transfer-config '{
                  "kv_connector": "MultiConnector",
                  "kv_role": "kv_both",
                  "kv_connector_extra_config": {
                    "connectors": [
                      {
                        "kv_connector": "OffloadingConnector",
                        "kv_role": "kv_both",
                        "kv_connector_extra_config": {
                          "cpu_bytes_to_use": 64424509440,
                          "lazy_offload": true
                        }
                      },
                      {
                        "kv_connector": "OffloadingConnector",
                        "kv_role": "kv_both",
                        "kv_connector_extra_config": {
                          "spec_name": "SharedStorageOffloadingSpec",
                          "spec_module_path": "llmd_fs_backend.spec",
                          "shared_storage_path": "/mnt/files-storage/llmd-kv-cache/",
                          "threads_per_gpu": 32,
                          "block_size": <BLOCK_SIZE>
                        }
                      }
                    ]
                  }
                }' \
                --distributed_executor_backend "mp" \
                --port 8000 \
                --max_num_batched_tokens 16384 \
                --enable-chunked-prefill \
                --max-model-len 32000 \
                --gpu-memory-utilization 0.92 \
                --tensor-parallel-size "4" \
                --prefix-caching-hash-algo sha256_cbor \
                --enable_prefix_caching \
                --enforce-eager \
                --no-disable-hybrid-kv-cache-manager
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token-secret
                  key: token
          # ... probes ...
          resources:
            requests:
              nvidia.com/gpu: "4"
            limits:
              nvidia.com/gpu: "4"
```
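
The `cpu_bytes_to_use` value in the first connector is a raw byte count: 64424509440 bytes is 60 GiB of host RAM for the CPU tier. A small helper, shown here as an illustrative sketch rather than part of the llm-d tooling, makes it easier to compute the value for a different budget.

```shell
# Convert a GiB budget to the byte count expected by "cpu_bytes_to_use".
gib_to_bytes() {
  echo $(( $1 * 1024 * 1024 * 1024 ))
}

gib_to_bytes 60    # prints 64424509440, the value used in the manifest
```

Whatever budget you choose has to fit in the memory available to the pod alongside the model server itself.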

Note: Qwen3.5 specifically requires a block size of `528` to avoid fragmentation, while Gemma 4 works with the default `256`.

**Step 4c: Apply and Verify Deployment**

To apply this manifest to your cluster, run:

```shell
kubectl apply -n default -f vllm-lustre-deployment.yaml
```

**Step 4d: Track Model Download Status**

Large models can take some time to download on first boot. Wait for the rollout to complete, and stream the container logs to follow the download and initialization:

```shell
# Block until the Deployment reports its replica as ready
kubectl rollout status deployment/vllm-storage

# In a second terminal, stream the container logs to follow the model download
kubectl logs -f deployment/vllm-storage
```
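
Once the rollout completes, a quick request against vLLM's OpenAI-compatible API confirms that the server is answering. The following is a minimal sketch: `completion_payload` and `smoke_test` are hypothetical helper names, the fixed three-second wait is arbitrary, and the payload builder does not escape quotes inside the prompt.

```shell
# Build an OpenAI-style completion request body for the served model.
completion_payload() {
  model=$1
  prompt=$2
  printf '{"model": "%s", "prompt": "%s", "max_tokens": 32}' "$model" "$prompt"
}

# Hypothetical helper: forward local port 8000 to the Deployment and send one request.
smoke_test() {
  kubectl port-forward deployment/vllm-storage 8000:8000 >/dev/null 2>&1 &
  pf_pid=$!
  sleep 3   # give the tunnel a moment to come up
  curl -s http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d "$(completion_payload "$1" "$2")"
  kill "$pf_pid"
}
```

For example, `smoke_test "Qwen/Qwen3.5-35B-A3B" "Explain KV cache offloading in one sentence."` should return a JSON completion. Files appearing under `/mnt/files-storage/llmd-kv-cache/` after longer prompts are a sign that offloading to Lustre is active.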

#### 5. Deploy the PVC Evictor

##### PVC Evictor Overview

**Architecture & Role**

The `llmd_fs_backend` connector offloads KV-cache blocks to Lustre but does not natively delete old cache files. Over time, the cache will fill the shared filesystem. The **PVC Evictor** acts as an external garbage collector that continuously monitors disk usage and evicts least-recently-used (LRU) files to maintain healthy storage headroom.

**Scaling & Sharding**

The PVC Evictor supports sharding and can be scaled to multiple replicas to match the capacity and performance of your Lustre instance. As a rule of thumb, you should deploy **1 evictor replica for each 72 TB of Lustre capacity** to distribute the eviction load effectively without overwhelming the metadata servers.

For large-scale deployments, the evictor can be configured to run with multiple shards. When running in multi-replica mode, the workload is partitioned across pods, with each pod managing a specific shard of the cache namespace. This prevents redundant metadata scans and race conditions.
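
The sizing rule of thumb can be turned into a small calculation. This sketch assumes the capacity is given in whole TB and that partial multiples of 72 TB round up to the next replica, which is an interpretation of the rule rather than something the chart enforces; `evictor_replicas` is an illustrative name.

```shell
# Rule of thumb from the text: one evictor replica per 72 TB of Lustre capacity.
evictor_replicas() {
  capacity_tb=$1
  # Integer ceiling division, with a minimum of one replica.
  replicas=$(( (capacity_tb + 71) / 72 ))
  if [ "$replicas" -lt 1 ]; then
    replicas=1
  fi
  echo "$replicas"
}

for tb in 18 72 100 144 200; do
  echo "${tb} TB -> $(evictor_replicas "$tb") replica(s)"
done
```

The result is the value to pass as `replicaCount` when installing the Helm chart.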

**High-Performance Resource Requirements**

Running the evictor at high scale (e.g., with 16 parallel crawler processes) requires significant CPU and memory resources to handle the rapid scanning and queue management of millions of files. Ensure that the pods are provisioned with sufficient resources (e.g., 12 CPU requests and 8Gi memory requests) and scheduled on appropriate node types (such as `c4-standard-16`).

**PVC Evictor Deployment Steps**

The PVC Evictor is deployed via Helm using the chart located in `kv_connectors/pvc_evictor/helm`.

**Step 5a: Create a Dedicated Node Pool for the Evictor**

Running the evictor at high scale requires significant CPU and memory. First, create a dedicated node pool using a high-performance machine type (such as `c4-standard-16`) to accommodate the 12 CPU and 8Gi memory requests needed per pod.

```shell
# Create a dedicated node pool for the PVC Evictor.
# Each evictor pod requests 12 CPUs, so a 16-vCPU node fits one replica:
# provision one node per replica (2 for the two-replica configuration).
gcloud container node-pools create evictor-pool \
  --location="$ZONE" \
  --cluster="$CLUSTER_NAME" \
  --project="$PROJECT_ID" \
  --machine-type="c4-standard-16" \
  --num-nodes="2"
```

**Step 5b: Install via Helm (High-Performance Configuration)**

Deploy a scaled, high-performance evictor pool with 2 replicas to monitor `lustre-pvc`. This configuration uses 16 crawler processes per pod to handle massive file namespaces.

**Note on Security Contexts**: To allow the evictor pod to delete files created by vLLM, it must run with matching security context IDs. Ensure the placeholders `<YOUR_NON_ROOT_GID>` and `<YOUR_NON_ROOT_UID>` exactly match the non-root values used in the `securityContext` of your vLLM deployment so that both workloads share POSIX file permissions.

```shell
git clone --depth 1 https://github.com/llm-d/llm-d-kv-cache.git
cd llm-d-kv-cache/kv_connectors/pvc_evictor

# Two replicas; each pod requests 12 CPUs, so the evictor node pool
# needs one c4-standard-16 node per replica.
helm install pvc-evictor ./helm \
  --namespace default \
  --set replicaCount=2 \
  --set config.numCrawlerProcesses=16 \
  --set config.deletionBatchSize=5000 \
  --set config.fileQueueMinSize=1000000 \
  --set config.fileQueueMaxsize=2000000 \
  --set config.fileAccessTimeThresholdMinutes=10 \
  --set securityContext.container.runAsNonRoot=false \
  --set pvc.name="lustre-pvc" \
  --set config.cleanupThreshold=85.0 \
  --set config.targetThreshold=70.0 \
  --set config.cacheDirectory="llmd-kv-cache" \
  --set securityContext.pod.fsGroup=<YOUR_NON_ROOT_GID> \
  --set securityContext.container.runAsUser=<YOUR_NON_ROOT_UID> \
  --set resources.requests.cpu=12 \
  --set resources.requests.memory=8Gi \
  --set resources.limits.cpu=15 \
  --set resources.limits.memory=16Gi \
  --set nodeSelector."cloud\.google\.com/gke-nodepool"=evictor-pool \
  --set securityContext.pod.seLinuxOptions.level="s0:c0\,c1"
```

#### Critical Parameters Explained

- `replicaCount=2`: Deploys 2 evictor pods. The Helm chart automatically configures sharding (`totalShards=2`) when multiple replicas are used.
- `config.numCrawlerProcesses=16`: Runs 16 parallel crawler processes per pod to scan the filesystem rapidly.
- `config.deletionBatchSize=5000`: Deletes files in batches of 5000 to reduce metadata overhead.
- `config.fileQueueMinSize` & `config.fileQueueMaxsize`: Configures large memory queues (1M min, 2M max) to buffer files for deletion, matching the high crawler throughput.
- `config.fileAccessTimeThresholdMinutes=10`: Aggressively evicts files that haven't been accessed in the last 10 minutes when the cleanup threshold is triggered.
- `securityContext.container.runAsNonRoot=false`: Required if the evictor needs root-like permissions to manage or delete files across different user ownerships on the shared storage.
- `resources.requests` & `limits`: Allocates 12-15 CPUs and 8-16Gi of memory per pod so that the large number of crawler processes is not CPU-throttled and does not run Out-Of-Memory (OOM).
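
The two threshold values in the install command (`config.cleanupThreshold=85.0` and `config.targetThreshold=70.0`) bound how much data one eviction cycle removes. The sketch below assumes eviction starts when usage reaches the cleanup threshold and stops at the target threshold, both as percentages of filesystem capacity; the function name, the integer percentages, and the 18,000 GiB capacity are illustrative.

```shell
# Assumption: eviction starts at cleanupThreshold and stops at targetThreshold,
# both expressed as a percentage of filesystem capacity.
eviction_gib() {
  capacity_gib=$1
  cleanup_pct=$2
  target_pct=$3
  echo $(( capacity_gib * (cleanup_pct - target_pct) / 100 ))
}

eviction_gib 18000 85 70   # prints 2700
```

A wider gap between the thresholds means fewer but larger eviction cycles, so the queue sizes and batch size above should be generous enough to cover that many files.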

**Step 5c: Verify and Monitor**

```shell
# Verify pod status
kubectl get pods -l app.kubernetes.io/name=pvc-evictor -n default

# Follow the evictor logs to monitor scanning and deletion activity
kubectl logs -l app.kubernetes.io/name=pvc-evictor -n default --tail=50 -f
```

#### 6. Clean Up

Because this deployment provisions significant and high-cost hardware, be sure to clean up your environment when you are done to avoid unnecessary charges.

```shell
helm uninstall pvc-evictor && kubectl delete -f vllm-lustre-deployment.yaml

# The Lustre StorageClass reclaimPolicy is set to Delete, so deleting the PVC
# removes the underlying Lustre storage. Delete the PVC before the cluster
# so the CSI driver can complete that cleanup.
kubectl delete pvc lustre-pvc

# Delete the cluster (this also deletes the associated node pools)
gcloud container clusters delete "$CLUSTER_NAME" \
  --zone "$ZONE" \
  --project "$PROJECT_ID" \
  --quiet
```

### Appendix: Reference Configuration for Llama-3.3-70B Benchmark

The following configuration represents the deployment manifest used to generate the Llama-3.3-70B benchmark results referenced in this post. It is provided for completeness and transparency.

Note: This configuration utilizes an earlier iteration of the software stack (vLLM v0.15.0) and specific infrastructure flags that were active in the benchmarking environment at the time the data was collected.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-storage
  namespace: default
  labels:
    app: vllm-storage
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-storage
  template:
    metadata:
      labels:
        app: vllm-storage
    spec:
      volumes:
        - name: lustre-storage
          persistentVolumeClaim:
            claimName: lustre-pvc
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: "200Gi"
        - name: kv-store-disk
          persistentVolumeClaim:
            claimName: lustre-pvc
      containers:
        - name: vllm-storage
          image: vllm/vllm-openai:v0.15.0
          command:
            - "/bin/bash"
          args:
            - "-c"
            - |
              pip install https://raw.githubusercontent.com/kfirtoledo/llm-d-kv-cache-manager/connector/kv_connectors/llmd_fs_backend/wheels/llmd_fs_connector-0.1.0-cp312-cp312-linux_x86_64.whl; \
              mkdir -p /tmp/prometheus_metrics;
              export PROMETHEUS_MULTIPROC_DIR=/tmp/prometheus_metrics; \
              vllm serve meta-llama/Llama-3.3-70B-Instruct \
                --download-dir /model/models \
                --load-format runai_streamer \
                --kv-transfer-config '{
                  "kv_connector": "OffloadingConnector",
                  "kv_role": "kv_both",
                  "kv_connector_extra_config": {
                    "spec_name": "SharedStorageOffloadingSpec",
                    "spec_module_path": "llmd_fs_backend.spec",
                    "shared_storage_path": "/mnt/files-storage/llmd-kv-cache/",
                    "block_size": 1024,
                    "threads_per_gpu": "64"
                  }
                }' \
                --distributed_executor_backend "mp" \
                --port 8000 \
                --max_num_batched_tokens 16384 \
                --enable-chunked-prefill \
                --tensor-parallel-size 8 \
                --enable_prefix_caching \
                --gpu-memory-utilization 0.9
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token-secret
                  key: token
            - name: VLLM_EXECUTE_MODEL_TIMEOUT_SECONDS
              value: "3000"
            - name: PYTHONHASHSEED
              value: "123"
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: "8"
            requests:
              cpu: "200"
              memory: 1024G
              ephemeral-storage: 5120Gi
              nvidia.com/gpu: "8"
          volumeMounts:
            - name: lustre-storage
              mountPath: /model
            - mountPath: /root/.cache/huggingface
              name: lustre-storage
              subPath: huggingface-cache
            - name: shm
              mountPath: /dev/shm
            - mountPath: /mnt/files-storage
              name: kv-store-disk
          # ... probes omitted for brevity ...
```
