{"slug": "scaling-ray-serve-llm-on-gke-performance-without-losing-the-developer-experience", "title": "Scaling Ray Serve LLM on GKE: Performance without losing the developer experience", "summary": "Google Cloud and Anyscale announced optimizations for Ray Serve LLM on Google Kubernetes Engine (GKE) that deliver up to 5x higher throughput and 8x lower latency for large language model inference. The improvements include HAProxy integration, direct token streaming, and a v2 Ray executor backend for vLLM, enabling production-grade performance without sacrificing developer experience.", "body_md": "Developers looking for LLM inference and model serving often turn to [Ray Serve](https://docs.ray.io/en/latest/serve/index.html), a scalable model serving library with developer-friendly, Python-native APIs built by Anyscale. Combined with Google Kubernetes Engine (GKE), it gives developers a powerful, unified platform optimized for demanding LLM serving use cases, from initial model development to online production serving.\n\nUntil now, that flexibility and feature set came at a cost to performance. Today, in partnership with Anyscale, [we are delivering up to 5x higher throughput and 8x lower latency in Ray Serve](https://www.anyscale.com/blog/high-performance-distributed-inference-ray-serve-llm-vllm-google-kubernetes-gke), meeting the growing demands and rigorous performance requirements of state-of-the-art distributed inference without sacrificing ease of use.\n\nThrough our joint engineering partnership, we are introducing three major architectural optimizations that dramatically improve Ray Serve LLM's performance characteristics:\n\n**Ray Serve HAProxy integration**: Ray Serve now builds in HAProxy to manage internal request routing and load balancing. 
This setup drastically reduces proxy overhead and prevents the Python runtime from saturating under high traffic.\n\n**Direct token streaming architecture**: This architecture decouples the initial request path from the return stream. Tokens stream directly from individual model replicas back to the proxy, bypassing the ingress router completely for the streaming data path to cut latency.\n\n**v2 Ray executor backend for vLLM**: The revamped Ray backend for vLLM moves Ray out of the data plane to enable asynchronous scheduling. This unifies the code path with native vLLM executors, closing the performance gap and helping to ensure Ray users benefit from the latest engine-level optimizations.\n\nWe’ve also collaborated with Anyscale to benchmark the updated Ray Serve LLM on GKE clusters using next-generation AI hardware, including Google Cloud A4 VMs powered by [NVIDIA HGX B200](https://www.nvidia.com/en-us/data-center/hgx/) systems. We chose Gemma 4 E2B, a small, efficient model, to isolate bottlenecks introduced by orchestration and routing. Our benchmarks compared the new Ray Serve LLM to its prior performance, as well as to a plain vLLM setup using the Ray executor.\n\nThese enhancements deliver up to **5x higher throughput and 8x lower latency** compared to previous Ray Serve configurations.\n\nOn a serving cluster with eight replicas, the improved Ray Serve LLM scaled far better than the previous version and performed comparably to running vLLM natively, without giving up the flexibility that Ray brings to the table.\n\nWe observe that with an increasing number of concurrent users, Ray is now able to scale up throughput while maintaining a low 99th percentile time-to-first-token, where previously it struggled. 
Now LLM practitioners don’t have to sacrifice Ray’s rich features and ecosystem to get production-grade performance on Kubernetes.\n\nGKE provides the foundational infrastructure that makes these software optimizations shine. When using the [Ray Operator add-on](https://docs.cloud.google.com/kubernetes-engine/docs/add-on/ray-on-gke/concepts/overview) for GKE, you get turnkey deployment across Google Cloud's AI [accelerators](https://docs.cloud.google.com/kubernetes-engine/docs/tutorials/serve-llm-tpu-ray), including automated horizontal scaling, [monitoring](https://docs.cloud.google.com/kubernetes-engine/docs/add-on/ray-on-gke/how-to/collect-view-logs-metrics), [multi-cluster scaling](https://docs.cloud.google.com/kubernetes-engine/docs/how-to/serve-multi-cluster-ray-inference-gateway), and built-in fault tolerance. GKE abstracts the complex parts of orchestrating distributed physical hardware, so your team can focus on refining your models and application logic with Ray.\n\nWe encourage developers to try out these enhancements in the latest Ray release (2.56 and later) and experience the future of high-performance LLM serving on GKE.\n\nFor more details, check out the following resources:", "url": "https://wpnews.pro/news/scaling-ray-serve-llm-on-gke-performance-without-losing-the-developer-experience", "canonical_source": "https://cloud.google.com/blog/products/containers-kubernetes/improving-ray-serve-llm-on-gke-throughput-latency/", "published_at": "2026-06-18 16:00:00+00:00", "updated_at": "2026-06-18 16:30:17.455498+00:00", "lang": "en", "topics": ["large-language-models", "ai-infrastructure", "ai-products", "ai-tools"], "entities": ["Google Cloud", "Anyscale", "Ray Serve", "Google Kubernetes Engine", "vLLM", "NVIDIA", "Gemma 4"], "alternates": {"html": "https://wpnews.pro/news/scaling-ray-serve-llm-on-gke-performance-without-losing-the-developer-experience", "markdown": 
"https://wpnews.pro/news/scaling-ray-serve-llm-on-gke-performance-without-losing-the-developer-experience.md", "text": "https://wpnews.pro/news/scaling-ray-serve-llm-on-gke-performance-without-losing-the-developer-experience.txt", "jsonld": "https://wpnews.pro/news/scaling-ray-serve-llm-on-gke-performance-without-losing-the-developer-experience.jsonld"}}