{"slug": "high-performance-distributed-inference-with-ray-serve-llm", "title": "High Performance Distributed Inference with Ray Serve LLM", "summary": "Ray Serve LLM, in partnership with Google Kubernetes Engine, announced major performance improvements achieving up to 4.4x higher throughput on prefill-heavy workloads and 24x higher on decode-heavy workloads through optimizations including direct streaming, a new vLLM Ray executor backend, and HAProxy integration. The updates enable Ray Serve LLM to match the performance of the high-performance vllm-router, marking a significant milestone in distributed LLM inference.", "body_md": "# High Performance Distributed Inference with Ray Serve LLM\n\n[Seiji Eicher](/blog?author=seiji-eicher), [Jeffrey Wang](/blog?author=jeffrey-wang), [Kourosh Hakhamaneshi](/blog?author=kourosh-hakhamaneshi) and [Spencer Peterson (Google)](/blog?author=spencer-peterson-google) | June 18, 2026\n\nToday, in partnership with the Google Kubernetes Engine (GKE) team at Google Cloud, we are announcing a major milestone in [Ray Serve LLM](https://docs.ray.io/en/master/serve/llm/index.html)’s throughput and latency characteristics, driven by architecture changes across the stack. We include comparisons to a known high-performance, Rust-based routing framework, [vllm-router](https://vllm.ai/blog/2025-12-13-vllm-router-release), as well as a retrospective performance comparison, to illustrate the progress Ray Serve LLM has made in reducing orchestration overhead.\n\nRay is a popular choice for complex distributed computing batch inference pipelines with heterogeneous hardware. 
In addition, we believe that Ray’s powerful primitives for fault tolerance, observability, and flexibility across Kubernetes and VMs will enable the next generation of optimizations as LLM inference deployments become increasingly complex.\n\nBelow, we cover three major optimizations to the Ray Serve LLM + vLLM stack: direct streaming, a new vLLM Ray executor backend, and HAProxy integration. As a result, we see up to 4.4x higher request throughput than previous versions on prefill-heavy workloads, and up to 24x higher request throughput on decode-heavy workloads.\n\nCumulative Effect of Optimizations: The figure above shows the cumulative effect of the incremental optimizations compared to vLLM behind vllm-router. Ray Serve LLM now matches vllm-router performance in both prefill- and decode-heavy workloads, representing 4.4x and 24.8x improvements, respectively, over the Ray Serve LLM baseline prior to the optimization effort.1\n\n## What’s new?\n\nThree major optimizations contribute to Ray Serve LLM’s new performance capabilities.\n\n### Ray Serve LLM: Direct Streaming\n\nRay 2.56 introduces direct streaming mode for Ray Serve LLM. This new architecture decouples the request routing control plane from the request/response streaming data plane.\n\nOn the forward path, the HAProxy ingress load balancer queries an *ingress request router* with the request content for a routing decision, based on a user-configured routing policy. Next, HAProxy establishes a direct HTTP connection with the selected target replica and streams tokens directly back to the client.\n\nThe new design resolves a bottleneck in the legacy architecture where the intermediate routing deployment (OpenAiIngress) was also responsible for forwarding response tokens back to HAProxy, taxing its event loop and adding to time per output token (TPOT). 
Try this out by setting `RAY_SERVE_LLM_ENABLE_DIRECT_STREAMING=1`.\n\nSee the [docs](https://docs.ray.io/en/master/serve/llm/user-guides/direct-streaming.html) for usage.\n\nRay Serve LLM Direct Streaming: In the figure above, LLMRouter serves as the direct streaming application’s ingress request router. After a routing decision is served, HAProxy can establish a connection directly to the target replica for data-plane communication. OpenAiIngress was the intermediate routing deployment used in the legacy architecture.\n\n### vLLM: Ray Executor Backend V2\n\nThe [revamped](https://github.com/vllm-project/vllm/pull/36836) Ray backend for vLLM, `RayExecutorV2`, is enabled by default in vLLM 0.21.0 and combines Ray’s process management capabilities with the battle-tested feature set of the `mp` backend’s data and control planes. In addition, the new Ray backend facilitates the inheritance of other features such as asynchronous scheduling.\n\n### Ray Serve: HAProxy\n\nIn Ray 2.55, we released two major optimizations to Ray Serve: a C-based HAProxy ingress load balancer and high throughput mode optimizations. For LLM serving, this also included disabling TCP small-packet buffering (Nagle’s algorithm) by default for improved streaming performance. Details are covered in the announcement [blog post](https://www.anyscale.com/blog/ray-serve-inference-lower-latency-higher-throughput-haproxy) and [docs](https://docs.ray.io/en/latest/serve/advanced-guides/performance.html#prerequisites).\n\nIn Ray 2.56, HAProxy is available in all [rayproject/ray](https://hub.docker.com/r/rayproject/ray) container images, including `rayproject/ray-llm:2.56-py312-cu130`, our recommended container image for LLM serving, which includes extras from the vLLM base images, such as DeepGEMM.\n\nIf the Ray Docker images can’t be used, in Ray 2.56, HAProxy can be installed via `pip install ray-haproxy` and enabled with `RAY_SERVE_EXPERIMENTAL_PIP_HAPROXY=1`. 
The binary will be automatically included and enabled with `pip install ray[serve]` in Ray 2.57.\n\n## Benchmarks\n\nWe considered workloads with varying input sequence length (ISL) to output sequence length (OSL) ratios to simulate generic prefill- and decode-heavy workloads, and a multi-turn agentic workload to demonstrate request routing and cache reuse capabilities. In particular, these were:\n\n- Randomized prefill-heavy workload with ISL=8000, OSL=50\n- Randomized decode-heavy workload with ISL=50, OSL=500\n- Simulated prompt and traffic pattern traces from a multi-turn coding agent capped at 20 turns\n\nThe random workloads are intended to isolate orchestration overhead, since randomized prompts get no benefit from prefix caching. Prefill-heavy workloads tend to highlight time to first token (TTFT), while decode-heavy workloads highlight time per output token (TPOT). For these experiments, we sweep concurrency and measure TTFT, TPOT, and throughput for each of the tested frameworks after a set of warm-up requests to eliminate cold-start artifacts.\n\nFor the third case, we generated a synthetic agentic workload using Dynamo’s [aiperf](https://github.com/ai-dynamo/aiperf) benchmark suite. With this benchmark suite, we are able to describe scenarios like the number of multi-turn coding sessions, the distribution of wait times for tools and human interactions, and the number of shared or separate context tokens for sessions. 
In particular, we emulated a workload with the following characteristics:\n\n- Fixed number of 20 turns per session\n- Mean initial context = 25,000 tokens and median = 24,000 tokens\n- Mean new tokens = 1,000 and median = 400, modeling short and long tool call responses\n- Mean generation length = 230 and median = 70\n- Median inter-turn latency of 1.2 seconds\n- Effective shared prefix rate of 96% per session\n\nThis workload simulates traffic patterns coming from a coding agent, with simulated wait times between turns when the agent is waiting on tool calls. We can use this workload to compare different routing policies as well as frameworks. In particular, we compared:\n\n- vllm-router’s consistent hashing algorithm\n- Ray Serve LLM with consistent hashing\n\nFor agentic workloads, we can include a session ID with requests and use a consistent hashing algorithm to do load-balancing. See the Ray Serve docs on [consistent hashing](https://docs.ray.io/en/master/serve/advanced-guides/custom-request-router.html#experimental-use-the-consistent-hash-request-router-for-session-stickiness) for more.\n\nTo isolate framework overhead, we used very small models: `Qwen/Qwen3-0.6B` for the eight-replica trials and `microsoft/Phi-tiny-MoE-instruct` for the prefill/decode disaggregation and WideEP trials.\n\n## Results\n\n### Routing across eight Qwen3-0.6B replicas\n\nAcross all three multi-replica workloads, Ray Serve LLM matches vllm-router’s aggregate throughput at every concurrency level tested. Each row in the figure corresponds to a workload: prefill-heavy, decode-heavy, and agentic coding. Each column shows one metric: mean TTFT, mean TPOT, and throughput measured in requests per second, comparing Ray Serve LLM to vllm-router across parameterized user request concurrencies (batch size) on the x-axis.\n\nFor the concurrency 256 random workloads, **Ray Serve LLM matches or beats vllm-router on TTFT: 355ms vs. 
vllm-router’s 389ms on prefill-heavy workloads, and 165ms vs. 190ms on decode-heavy**. Throughput tracks closely for all experiments. On the realistic agentic multi-turn workload with KV-aware/session-affinity routing, Ray Serve LLM tracks vllm-router closely on TPOT, and is slightly ahead in TTFT and request throughput.\n\nWe investigated the divergence in decode-heavy TTFT between the two frameworks, and found that TTFT matched closely from the engine perspective at concurrency 256 (14.7ms Ray Serve LLM vs. 17.7ms vllm-router mean). This suggests that Ray Serve LLM’s lower client-perspective TTFT is driven by efficiency in the HAProxy ingress data plane.\n\n### WideEP and Prefill/Decode Disaggregation on Phi-tiny-MoE\n\nIn the disaggregated 4P4D Wide-EP configuration (one DP4EP4 prefill replica, one DP4EP4 decode replica), Ray Serve LLM beats vllm-router output throughput across the full concurrency range using the same agentic workload from the eight-replica scaling trials above. **At high concurrency, Ray’s mean TPOT/ITL is slightly better: 13.6ms vs. vllm-router’s 14.8ms at concurrency 256.** Additionally, the effect of Ray Serve LLM’s prefill/decode disaggregation architecture is shown in reduced TTFT compared to the baseline; tokenization is done once and reused, reducing frontend overhead for long prompts. 
For more information on Ray Serve LLM’s prefill/decode disaggregation and Wide-EP APIs, see [here](https://www.anyscale.com/blog/ray-serve-llm-anyscale-apis-wide-ep-disaggregated-serving-vllm).\n\n### Acknowledgements\n\nThis milestone would not have been possible without Anyscale and Ray’s ongoing engineering collaboration with the Google Kubernetes Engine Ray team, who were key in advocating for and validating the HAProxy and Direct Streaming architectures.\n\nYou can see more details in the [GKE partner blog post](https://cloud.google.com/blog/products/containers-kubernetes/improving-ray-serve-llm-on-gke-throughput-latency), which covers DeepSeek-V4 + Gemma 4 results on B200.\n\n### Conclusion\n\nWith optimizations across the stack (HAProxy at the Ray Serve layer, direct streaming in Ray Serve LLM, and the v2 Ray executor backend in vLLM), we have significantly reduced the orchestration overhead that previously separated Ray Serve LLM from standalone vLLM.\n\nAcross prefill-heavy, decode-heavy, and agentic multi-turn workloads, Ray Serve LLM now matches vllm-router on aggregate throughput while preserving Ray's fault tolerance, observability, and heterogeneous-hardware primitives. 
These same primitives extend cleanly to disaggregated prefill/decode and wide-EP topologies, giving developers a single substrate for both the simple single-replica case and the most complex production serving patterns.\n\nTry it out in Ray 2.56, and join us on the [Ray Slack](https://www.ray.io/join-slack) to share feedback!\n\n## Appendix\n\n### Reproduction Notes\n\nBenchmark code: https://github.com/anyscale/llm-direct-streaming-benchmarks\n\n- vLLM version: 0.22.0\n- Ray version: 2.56 nightly\n- vllm-router: 0.1.14\n- AIPerf: 0.8.0\n- GPUs: 8x NVIDIA H100 80GB HBM3\n- GPU driver: 580.126.20\n- CUDA env version: 13.0.0\n- NCCL env version: 2.27.7\n- CPU: AMD EPYC 7R13 Processor\n- CPU topology: 192 logical CPUs, 2 sockets, 48 cores/socket, 2 threads/core, 2 NUMA nodes\n- Memory: 2.0 TiB\n\n1 *In Ray versions prior to 2.54, we implemented a batching mechanism to mitigate Python event-loop contention in the default streaming path. This batching reduced orchestrator overhead and improved streaming performance by decreasing event-loop pressure. For the comparison shown in this chart, those batching-based mitigations were intentionally disabled. 
We compare the unbatched baseline of the earlier version against the unbatched configuration with the new optimizations enabled, ensuring an apples-to-apples comparison.*", "url": "https://wpnews.pro/news/high-performance-distributed-inference-with-ray-serve-llm", "canonical_source": "https://anyscale.com/blog/high-performance-distributed-inference-ray-serve-llm-vllm-google-kubernetes-gke", "published_at": "2026-06-18 09:00:00+00:00", "updated_at": "2026-06-18 16:44:08.085092+00:00", "lang": "en", "topics": ["large-language-models", "ai-infrastructure", "ai-tools", "ai-products", "machine-learning"], "entities": ["Ray", "Ray Serve LLM", "Google Kubernetes Engine", "Google Cloud", "vLLM", "HAProxy", "vllm-router", "OpenAiIngress"], "alternates": {"html": "https://wpnews.pro/news/high-performance-distributed-inference-with-ray-serve-llm", "markdown": "https://wpnews.pro/news/high-performance-distributed-inference-with-ray-serve-llm.md", "text": "https://wpnews.pro/news/high-performance-distributed-inference-with-ray-serve-llm.txt", "jsonld": "https://wpnews.pro/news/high-performance-distributed-inference-with-ray-serve-llm.jsonld"}}