Scaling LLM Inference: Multi-Node KV Cache Offloading with GKE & Managed Lustre

Google Cloud introduced a multi-node KV cache offloading solution using GKE and Managed Lustre for large language model inference, achieving over 50% TCO savings and a nearly 60% reduction in GPU-hour requirements for Llama-3.3-70B on a six-node A3 Mega cluster. The approach offloads shared prefilled KV caches to Lustre's high-performance tier with a 95% cache hit rate, and a hybrid CPU RAM offload extension improves TTFT by 40% and end-to-end latency by 30%.

Significant contributors to this article include Sneha Aradhey, Software Engineer, Google Kubernetes Engine, and Michael MacDonald, Sr Software Engineer, Google Cloud Managed Lustre.

Enterprise production environments are shifting to distributed, multi-node architectures to serve long context windows and agentic AI. As these workloads scale, KV caches often outgrow local CPU RAM and host SSD cache tiers. To handle this, some setups pool node-local storage into a distributed layer, such as multi-node pooled NVMe arrays. Pooling SSDs aggregates raw capacity and often puts spare local drives to use, which are clear advantages. However, the approach requires the compute cluster to manage its own complex data distribution and cross-node replication.

An alternative is to offload the attention state to a dedicated, high-performance external parallel filesystem. We use Google Cloud Managed Lustre with the llm-d offloading stack as a cluster-wide decentralized attention cache tier, bypassing host-level capacity limits and eliminating the networking overhead of managing local pooled drives.

With this approach, we achieve efficiency at scale: Google Cloud Managed Lustre enables over 50% TCO savings and reduces GPU-hour requirements for Llama-3.3-70B inference on a six-node A3 Mega cluster by nearly 60%.
These gains are realized by offloading shared, prefilled KV caches to Lustre's high-performance tier with a 95% cache hit rate.

Benchmark Configuration

- Model: Llama-3.3-70B
- Context Dynamics: Prompt length of 50,000 tokens, input question length of 256 tokens, and output length of 512 tokens.

Extension of Lustre KV Cache solution with CPU RAM offload

The Managed Lustre KV Cache offload architecture can be extended by adding offload to CPU RAM. For Llama-3.3-70B inference, this hybrid approach significantly improves performance compared to CPU offload only (https://github.com/llm-d/llm-d/tree/main/guides/tiered-prefix-cache llm-d-fs-connector--lustre), delivering an approximately 40% improvement in Time to First Token (TTFT) and a 30% reduction in end-to-end latency.

User Guide

Architectural Components

- GKE GPU Nodes: Dedicated accelerator resources provisioned exclusively for high-throughput model execution and tensor-parallel operations.
- Managed Lustre: A shared, high-bandwidth parallel filesystem acting as a centralized external tier that caches prefilled attention states to eliminate redundant prefill computation.
- PVC Evictor: A scalable, distributed garbage collection service that tracks file access patterns and automatically removes Least-Recently-Used (LRU) cache chunks to maintain healthy storage headroom.

Target Models

This guide provides two distinct, validated deployment tracks, depending on your model preference:

- Qwen Series: Qwen/Qwen3.5-35B-A3B
- Gemma 4 Architecture: google/gemma-4-31B-it

Architectural Diagram

Before You Begin

Before starting this deployment, ensure your Google Cloud project is properly configured:

- Quota: Verify you have sufficient quota for the selected accelerators in your chosen region, as well as adequate general CPU, memory, and Managed Lustre quotas.
- Validate Required IAM Permissions for Managed Lustre
- Prepare your Environment to Connect to Managed Lustre: Complete the "Before You Begin" steps (https://docs.cloud.google.com/managed-lustre/docs/lustre-csi-driver-new-volume) to enable APIs, set up environment variables, and set up your VPC.
- GKE Version: The Managed Lustre CSI driver (https://docs.cloud.google.com/kubernetes-engine/docs/concepts/managed-lustre) is supported on GKE versions 1.33 or later. For the best experience and default port 988 usage, GKE version 1.33.2-gke.4780000 or later is recommended.

Overview of Required Steps

- Create the GKE Cluster
- Create the GPU Compute node pool
- Provision Lustre storage
- Deploy vLLM Serving Engine with Lustre
- Deploy the PVC Evictor
- Clean Up

1. Create the GKE Cluster

Create a rapid-channel GKE cluster with Workload Identity and all necessary CSI storage add-ons enabled (Lustre, GCSFuse, and Persistent Disk).

- code block -
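Once the cluster exists, the version it reports can be compared against the two thresholds noted under Before You Begin (1.33 as the minimum for the Managed Lustre CSI driver, 1.33.2-gke.4780000 as the recommended version). The following is a minimal Python sketch, not part of the original guide: the helper names `parse_gke_version`, `meets_minimum`, and `classify` are hypothetical, and it assumes version strings shaped like `1.33.2-gke.4780000`.

```python
import re

# Thresholds quoted from the "Before You Begin" section of this guide.
MIN_SUPPORTED = "1.33"
RECOMMENDED = "1.33.2-gke.4780000"

# Assumed format: MAJOR.MINOR[.PATCH][-gke.BUILD]
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)(?:\.(\d+))?(?:-gke\.(\d+))?$")


def parse_gke_version(version: str) -> tuple:
    """Split a GKE version string into (major, minor, patch, build) integers.

    Missing patch or build components are treated as 0.
    """
    match = _VERSION_RE.match(version.strip())
    if match is None:
        raise ValueError(f"unrecognized GKE version string: {version!r}")
    return tuple(int(part) if part is not None else 0 for part in match.groups())


def meets_minimum(version: str, minimum: str) -> bool:
    """Return True if `version` is at or above `minimum` (tuple comparison)."""
    return parse_gke_version(version) >= parse_gke_version(minimum)


def classify(version: str) -> str:
    """Label a version as 'unsupported', 'supported', or 'recommended'."""
    if not meets_minimum(version, MIN_SUPPORTED):
        return "unsupported"
    if meets_minimum(version, RECOMMENDED):
        return "recommended"
    return "supported"


if __name__ == "__main__":
    for v in ("1.32.9-gke.1000000", "1.33.1-gke.2000000", "1.33.2-gke.4780000"):
        print(v, "->", classify(v))
```

Comparing the parsed components as integer tuples avoids the pitfalls of plain string comparison, where, for example, `1.9` would sort after `1.33`.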