High Performance Distributed Inference with Ray Serve LLM

Ray Serve LLM, in partnership with Google Kubernetes Engine, announced major performance improvements achieving up to 4.4x higher throughput on prefill-heavy workloads and 24x higher on decode-heavy workloads through optimizations including direct streaming, a new vLLM Ray executor backend, and HAProxy integration. The updates enable Ray Serve LLM to match the performance of the high-performance vllm-router, marking a significant milestone in distributed LLM inference.

Seiji Eicher, Jeffrey Wang, Kourosh Hakhamaneshi and Spencer Peterson (Google) | June 18, 2026

Today, in partnership with the Google Kubernetes Engine (GKE) team at Google Cloud, we are announcing a major milestone in the throughput and latency characteristics of Ray Serve LLM (https://docs.ray.io/en/master/serve/llm/index.html), driven by architecture changes across the stack. We include comparisons to a known high-performance, Rust-based routing framework, vllm-router (https://vllm.ai/blog/2025-12-13-vllm-router-release), as well as a retrospective performance comparison, to illustrate the progress Ray Serve LLM has made in reducing orchestration overhead.

Ray is a popular choice for complex distributed computing, including batch inference pipelines on heterogeneous hardware. In addition, we believe that Ray's powerful primitives for fault tolerance, observability, and flexibility across Kubernetes and VMs will enable the next generation of optimizations as LLM inference deployments become increasingly complex.

Below, we cover three major optimizations to the Ray Serve LLM + vLLM stack: direct streaming, a new vLLM Ray executor backend, and HAProxy integration. As a result, we see up to 4.4x higher request throughput than previous versions on prefill-heavy workloads, and up to 24x higher request throughput on decode-heavy workloads.

Cumulative Effect of Optimizations: The figure above shows the cumulative effect of the incremental optimizations compared to vLLM behind vllm-router. Ray Serve LLM now matches vllm-router performance in both prefill- and decode-heavy workloads, representing a 4.4x and 24.8x improvement, respectively, over the Ray Serve LLM baseline prior to the optimization effort. [1]

What's new?

Three major optimizations contribute to Ray Serve LLM's new performance capabilities.

Ray Serve LLM: Direct Streaming

Ray 2.56 introduces direct streaming mode for Ray Serve LLM. This new architecture decouples the request routing control plane from the request/response streaming data plane. On the forward path, the HAProxy ingress load balancer sends the request content to an ingress request router, which returns a routing decision based on a user-configured routing policy. HAProxy then establishes a direct HTTP connection with the selected target replica and streams tokens directly back to the client.

The new design resolves a bottleneck in the legacy architecture, where the intermediate routing deployment, OpenAiIngress, was also responsible for forwarding response tokens back to HAProxy, taxing its event loop and adding to time per output token (TPOT). Try this out by setting RAY_SERVE_LLM_ENABLE_DIRECT_STREAMING=1. See the docs (https://docs.ray.io/en/master/serve/llm/user-guides/direct-streaming.html) for usage.

Ray Serve LLM Direct Streaming: In the figure above, LLMRouter serves as the direct streaming application's ingress request router. After LLMRouter returns a routing decision, HAProxy can establish a connection directly to the target replica for data-plane communication. OpenAiIngress was the intermediate routing deployment used in the legacy architecture.

vLLM: Ray Executor Backend V2

The revamped Ray backend for vLLM (https://github.com/vllm-project/vllm/pull/36836), RayExecutorV2, is enabled by default in vLLM 0.21.0 and combines Ray's process management capabilities with the battle-tested feature set of the mp backend's data and control planes. In addition, the new Ray backend makes it easier to inherit other features, such as asynchronous scheduling.

Ray Serve: HAProxy

In Ray 2.55, we released two major optimizations to Ray Serve: a C-based HAProxy ingress load balancer and high throughput mode optimizations. For LLM serving, this also included disabling TCP small-packet buffering (Nagle's algorithm) by default for improved streaming performance.
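To make the Nagle's algorithm change concrete: at the socket level, disabling it corresponds to setting the TCP_NODELAY option, so that small writes (such as individual streamed tokens) are sent immediately rather than coalesced. The sketch below uses Python's standard socket module purely as an illustration. It is not how HAProxy or Ray Serve is configured, and the helper names open_streaming_socket and nagle_disabled are hypothetical.

```python
import socket


def open_streaming_socket(disable_nagle: bool = True) -> socket.socket:
    """Create a TCP socket, optionally with Nagle's algorithm disabled.

    Hypothetical helper for illustration only.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    if disable_nagle:
        # TCP_NODELAY=1 turns off Nagle's algorithm: small writes are
        # pushed to the network immediately instead of being buffered
        # while earlier data is still unacknowledged.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


def nagle_disabled(sock: socket.socket) -> bool:
    """Report whether TCP_NODELAY is set on the given socket."""
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0


if __name__ == "__main__":
    s = open_streaming_socket()
    print("Nagle disabled:", nagle_disabled(s))
    s.close()
```

Token streaming tends to produce many writes of only a few bytes each, which is the traffic pattern where Nagle-style coalescing is most likely to add visible per-token delay.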
Details are covered in the announcement blog post (https://www.anyscale.com/blog/ray-serve-inference-lower-latency-higher-throughput-haproxy) and the prerequisites docs (https://docs.ray.io/en/latest/serve/advanced-guides/performance.html).

In Ray 2.56, HAProxy is available in all rayproject/ray (https://hub.docker.com/r/rayproject/ray) container images, including rayproject/ray-llm:2.56-py312-cu130, our recommended container image for LLM serving, which includes extras from the vLLM base images, such as DeepGEMM. If the Ray Docker images can't be used, in Ray 2.56 HAProxy can be installed via pip install ray-haproxy and enabled with RAY_SERVE_EXPERIMENTAL_PIP_HAPROXY=1. The binary will be automatically included and enabled with pip install "ray[serve]" in Ray 2.57.

Benchmarks

We considered workloads with varying input sequence length (ISL) to output sequence length (OSL) ratios to simulate generic prefill- and decode-heavy workloads, and a multi-turn agentic workload to demonstrate request routing and cache reuse capabilities. In particular, these were:

- Randomized prefill-heavy workload with ISL=8000, OSL=50
- Randomized decode-heavy workload with ISL=50, OSL=500
- Simulated prompt and traffic pattern traces from a multi-turn coding agent, capped at 20 turns

The random workloads are intended to isolate orchestration overhead, since they offer no prefix-caching benefit. For example, prefill-heavy workloads tend to highlight time to first token (TTFT), while decode-heavy workloads highlight time per output token (TPOT). For these experiments, we sweep concurrency and measure TTFT, TPOT, and throughput for each of the tested frameworks after a set of warm-up requests to eliminate cold-start artifacts.

For the third case, we generated a synthetic agentic workload using Dynamo's aiperf (https://github.com/ai-dynamo/aiperf) benchmark suite.
With this benchmark suite, we can describe scenarios such as the number of multi-turn coding sessions, the distribution of wait times for tools and human interactions, and the number of shared or separate context tokens across sessions. In particular, we emulated a workload with the following characteristics:

- Fixed number of 20 turns per session
- Initial context: mean = 25,000 tokens, median = 24,000 tokens
- New tokens: mean = 1,000, median = 400, modeling short and long tool call responses
- Generation length: mean = 230, median = 70
- Median inter-turn latency of 1.2 seconds
- Effective shared prefix rate of 96% per session

This workload simulates traffic from a coding agent, including wait times between turns while the agent waits on tool calls. We can use it to compare different routing policies as well as frameworks. In particular, we compared:

- vllm-router's consistent hashing algorithm
- Ray Serve LLM with consistent hashing

For agentic workloads, we can include a session ID with each request and use a consistent hashing algorithm for load balancing, which keeps requests from the same session on the same replica. See the Ray Serve docs on consistent hashing (https://docs.ray.io/en/master/serve/advanced-guides/custom-request-router.html#experimental-use-the-consistent-hash-request-router-for-session-stickiness) for more.

To isolate framework overhead, we used very small models: Qwen/Qwen3-0.6B for the eight-replica trials and microsoft/Phi-tiny-MoE-instruct for the prefill/decode disaggregation and WideEP trials.

Results

Routing across eight Qwen3-0.6B replicas

Across all three multi-replica workloads, Ray Serve LLM matches vllm-router's aggregate throughput at every concurrency level tested. Each row in the figure corresponds to a workload: prefill-heavy, decode-heavy, and agentic coding. Each column shows one metric: mean TTFT, mean TPOT, and throughput measured in requests per second. Each panel compares Ray Serve LLM to vllm-router across the swept user request concurrencies (batch size) on the x-axis.
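As a rough guide to reading these metrics, the sketch below computes mean TTFT, mean TPOT, and request throughput from per-request timestamps. It uses common definitions (TPOT as decode time divided by the number of tokens after the first) and is an assumption about how such metrics are typically derived, not aiperf's exact implementation. RequestTrace and its fields are hypothetical names.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class RequestTrace:
    """Hypothetical per-request record; all times are in seconds."""
    sent_at: float
    first_token_at: float
    last_token_at: float
    output_tokens: int


def ttft(trace: RequestTrace) -> float:
    """Time to first token: request send until first streamed token."""
    return trace.first_token_at - trace.sent_at


def tpot(trace: RequestTrace) -> Optional[float]:
    """Time per output token, averaged over tokens after the first.

    Returns None when fewer than two tokens were generated, since there
    is no inter-token interval to measure.
    """
    if trace.output_tokens < 2:
        return None
    decode_time = trace.last_token_at - trace.first_token_at
    return decode_time / (trace.output_tokens - 1)


def summarize(traces: list[RequestTrace]) -> dict[str, float]:
    """Aggregate mean TTFT, mean TPOT, and requests per second."""
    if not traces:
        raise ValueError("need at least one trace")
    tpots = [v for v in (tpot(t) for t in traces) if v is not None]
    # Throughput is completed requests over the wall-clock window
    # spanning the first send and the last received token.
    wall = max(t.last_token_at for t in traces) - min(t.sent_at for t in traces)
    return {
        "mean_ttft_s": mean(ttft(t) for t in traces),
        "mean_tpot_s": mean(tpots) if tpots else float("nan"),
        "requests_per_s": len(traces) / wall if wall > 0 else float("nan"),
    }


if __name__ == "__main__":
    demo = [
        RequestTrace(0.0, 0.2, 1.2, 11),
        RequestTrace(0.5, 0.8, 2.8, 21),
    ]
    print(summarize(demo))
```

Under these definitions, TTFT captures queueing, routing, and prefill cost as seen by the client, while TPOT captures the steady-state streaming path, which is why the two metrics respond differently to prefill-heavy and decode-heavy workloads.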
For the random workloads at concurrency 256, Ray Serve LLM matches or beats vllm-router on TTFT: 355ms vs. vllm-router's 389ms on the prefill-heavy workload, and 165ms vs. 190ms on the decode-heavy workload. Throughput tracks closely for all experiments. On the realistic agentic multi-turn workload with KV-aware/session-affinity routing, Ray Serve LLM tracks vllm-router closely on TPOT, and is slightly ahead in TTFT and request throughput.

We investigated the divergence in decode-heavy TTFT between the two frameworks, and found that TTFT matched closely from the engine perspective at concurrency 256 (mean 14.7ms for Ray Serve LLM vs. 17.7ms for vllm-router). This suggests that the reduced client-perspective TTFT of Ray Serve LLM is driven by efficiency in the HAProxy ingress data plane.

WideEP and Prefill/Decode Disaggregation on Phi-tiny-MoE

In the disaggregated 4P4D Wide-EP configuration (one DP4EP4 prefill replica, one DP4EP4 decode replica), Ray Serve LLM beats vllm-router on output throughput across the full concurrency range, using the same agentic workload as the eight-replica scaling trials above. At high concurrency, Ray's mean TPOT/ITL is slightly better: 13.6ms vs. vllm-router's 14.8ms at concurrency 256. Additionally, Ray Serve LLM's prefill/decode disaggregation architecture shows reduced TTFT compared to the baseline: tokenization is done once and reused, reducing frontend overhead for long prompts. For more information on Ray Serve LLM's prefill/decode disaggregation and Wide-EP APIs, see https://www.anyscale.com/blog/ray-serve-llm-anyscale-apis-wide-ep-disaggregated-serving-vllm.

Acknowledgements

This milestone would not have been possible without Anyscale and Ray's ongoing engineering collaboration with the Google Kubernetes Engine Ray team, who were key in advocating for and validating the HAProxy and Direct Streaming architectures.
You can see more details in the GKE partner blog post: DeepSeek-V4 + Gemma 4 results on B200 (https://cloud.google.com/blog/products/containers-kubernetes/improving-ray-serve-llm-on-gke-throughput-latency).

Conclusion

With optimizations across the stack (HAProxy at the Ray Serve layer, direct streaming in Ray Serve LLM, and the v2 Ray executor backend in vLLM), we have significantly reduced the orchestration overhead that previously separated Ray Serve LLM from standalone vLLM. Across prefill-heavy, decode-heavy, and agentic multi-turn workloads, Ray Serve LLM now matches vllm-router on aggregate throughput while preserving Ray's fault tolerance, observability, and heterogeneous-hardware primitives. These same primitives extend cleanly to disaggregated prefill/decode and wide-EP topologies, giving developers a single substrate for both the simple single-replica case and the most complex production serving patterns.

Try it out in Ray 2.56, and join us on the Ray Slack (https://www.ray.io/join-slack) to share feedback.

Appendix

Reproduction Notes

- Benchmark code: https://github.com/anyscale/llm-direct-streaming-benchmarks
- vLLM version: 0.22.0
- Ray version: 2.56 nightly
- vllm-router: 0.1.14
- AIPerf: 0.8.0
- GPUs: 8x NVIDIA H100 80GB HBM3
- GPU driver: 580.126.20
- CUDA env version: 13.0.0
- NCCL env version: 2.27.7
- CPU: AMD EPYC 7R13 Processor
- CPU topology: 192 logical CPUs, 2 sockets, 48 cores/socket, 2 threads/core, 2 NUMA nodes
- Memory: 2.0 TiB

[1] In Ray versions prior to 2.54, we implemented a batching mechanism to mitigate Python event-loop contention in the default streaming path. This batching reduced orchestrator overhead and improved streaming performance by decreasing event-loop pressure. For the comparison shown in this chart, those batching-based mitigations were intentionally disabled. We compare the unbatched baseline of the earlier version against the unbatched configuration with the new optimizations enabled, ensuring an apples-to-apples comparison.