Scaling Ray Serve LLM on GKE: Performance without losing the developer experience


[ARTICLE · art-32847] src=cloud.google.com ↗ pub=2026-06-18T16:00Z topic=large-language-models verified=true sentiment=↑ positive

Google Cloud and Anyscale announced optimizations for Ray Serve LLM on Google Kubernetes Engine (GKE) that deliver up to 5x higher throughput and 8x lower latency for large language model inference. The improvements include HAProxy integration, direct token streaming, and a v2 Ray executor backend for vLLM, enabling production-grade performance without sacrificing developer experience.

3 min read · published Jun 18, 2026

Developers looking for LLM inference and model serving often turn to Ray Serve, a scalable model serving library built by Anyscale with developer-friendly, Python-native APIs. Combined with Google Kubernetes Engine (GKE), Ray Serve gives developers a powerful, unified platform optimized for demanding LLM serving use cases, from initial model development to online production serving.

That flexibility and feature set used to come at a cost to performance. Today, in partnership with Anyscale, we are delivering up to 5x higher throughput and 8x lower latency in Ray Serve, meeting the rigorous performance requirements of state-of-the-art distributed inference without sacrificing ease of use.

Through our joint engineering partnership, we are introducing three major architectural optimizations that dramatically improve Ray Serve LLM's performance characteristics:

Ray Serve HAProxy integration: Ray Serve now builds in HAProxy to manage internal request routing and load balancing. This setup drastically reduces proxy overhead and prevents the Python runtime from saturating under high traffic.

Direct token streaming architecture: This architecture decouples the initial request path from the return stream. Tokens stream directly from individual model replicas back to the proxy, bypassing the ingress router completely for the streaming data path to cut latency.

v2 Ray executor backend for vLLM: The revamped Ray backend for vLLM moves Ray out of the data plane to enable asynchronous scheduling. This unifies the code path with native vLLM executors, closing the performance gap and helping to ensure Ray users benefit from the latest engine-level optimizations.
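The split between request path and return stream in the direct token streaming design can be illustrated with a toy model. The sketch below is not Ray Serve code: `Replica`, `Router`, and `proxy_handle` are hypothetical names invented for illustration, and the "tokens" are just uppercased words. It only shows the shape of the idea, in which the router chooses a replica and the proxy then reads tokens from that replica directly.

```python
import asyncio
from typing import AsyncIterator


class Replica:
    """Hypothetical model replica that yields tokens one at a time."""

    def __init__(self, name: str) -> None:
        self.name = name

    async def generate(self, prompt: str) -> AsyncIterator[str]:
        for word in prompt.split():
            await asyncio.sleep(0)  # stand-in for one decode step
            yield word.upper()


class Router:
    """Hypothetical ingress router: it only picks a replica (round robin)."""

    def __init__(self, replicas: list[Replica]) -> None:
        self.replicas = replicas
        self.next_index = 0
        self.tokens_relayed = 0  # stays 0 because tokens never pass through here

    def assign(self) -> Replica:
        replica = self.replicas[self.next_index % len(self.replicas)]
        self.next_index += 1
        return replica


async def proxy_handle(router: Router, prompt: str) -> list[str]:
    # Request path: the proxy asks the router which replica should serve.
    replica = router.assign()
    # Return path: the proxy reads the token stream from the replica itself.
    return [token async for token in replica.generate(prompt)]


async def main() -> tuple[list[str], int]:
    router = Router([Replica("replica-0"), Replica("replica-1")])
    tokens = await proxy_handle(router, "hello from ray serve")
    return tokens, router.tokens_relayed


tokens, relayed = asyncio.run(main())
print(tokens, relayed)
```

In this toy version the router's `tokens_relayed` counter never moves, which is the property the article attributes to the new streaming data path: the router stays on the request path only.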

We’ve also collaborated with Anyscale to benchmark the updated Ray Serve LLM on GKE clusters using next-generation AI hardware, including Google Cloud A4 VMs powered by NVIDIA HGX B200 systems. We chose Gemma 4 E2B because a small, efficient model isolates the bottlenecks introduced by orchestration and routing. Our benchmarks compared the new Ray Serve LLM to its prior version and to a plain vLLM setup using the Ray executor.

Together, these enhancements deliver up to 5x higher throughput and 8x lower latency compared to previous Ray Serve configurations.

On a serving cluster with eight replicas, the improved Ray Serve LLM scaled far better than the previous version and performed comparably to running vLLM natively, an option that lacks the flexibility Ray brings to the table.

We observe that with an increasing number of concurrent users, Ray is now able to scale up throughput while maintaining a low 99th percentile time-to-first-token, where previously it struggled. Now LLM practitioners don’t have to sacrifice Ray’s rich features and ecosystem to get production-grade performance on Kubernetes.
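For readers unfamiliar with the metric, a 99th percentile time-to-first-token (TTFT) is the value that 99% of requests come in at or under. The sketch below computes it with the nearest-rank method, which is one common convention and not necessarily the one used in the benchmarks above. The sample values are made up for illustration and are not benchmark results.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


# Hypothetical TTFT measurements in milliseconds (illustrative, not measured data).
ttft_ms = [42.0, 45.0, 39.0, 41.0, 300.0, 44.0, 40.0, 43.0, 46.0, 38.0]

p50 = percentile(ttft_ms, 50)
p99 = percentile(ttft_ms, 99)
print(f"p50={p50} ms, p99={p99} ms")
```

A single slow request barely moves the median but dominates the 99th percentile, which is why tail TTFT is a useful signal for a proxy or router that saturates under concurrency.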

GKE provides the foundational infrastructure that makes these software optimizations shine. When using the Ray Operator add-on for GKE, you get turnkey deployment across Google Cloud's AI accelerators, including automated horizontal scaling, monitoring, multi-cluster scaling, and built-in fault tolerance. GKE abstracts the complex parts of orchestrating distributed physical hardware, so your team can focus on refining your models and application logic with Ray.

We encourage developers to try out these enhancements in the latest Ray release (2.56 and later) and experience the future of high-performance LLM serving on GKE.

For more details, check out the following resources:

source & further reading

cloud.google.com — original article


Read original on cloud.google.com → cloud.google.com/blog/products/containers-kubern…
