67% Cost Savings with PD Disaggregation Using Ray and vLLM on AMD MI325X

Engineers achieved up to 67% cost savings and 2.7x better goodput by using Prefill-Decode disaggregation with Ray and vLLM on AMD MI325X GPUs, separating prefill and decode phases onto dedicated hardware to eliminate mutual interference and improve throughput under latency SLAs.

Kourosh Hakhamaneshi | June 12, 2026

In LLM serving, the optimization objective is deceptively simple: given a set of latency SLA targets (time to first token, TTFT; time per output token, TPOT; end-to-end latency, E2E), maximize the queries per second (QPS) you can sustain, also known as the "goodput". One of the most powerful levers for breaking through the goodput ceiling is Prefill-Decode (PD) disaggregation (https://haoailab.com/blogs/distserve-retro/). In this post, we cover how we used Ray Serve LLM to orchestrate PD disaggregation inference workloads on AMD, achieving up to 2.7x better goodput.

The key advantage of PD disaggregation is that instead of running both phases on the same GPUs, where they compete for compute, memory bandwidth, and scheduling budget, PD separates them onto dedicated hardware. Prefill GPUs handle prompt processing. Decode GPUs handle token generation. By eliminating mutual interference, each phase runs closer to its theoretical throughput, and the system as a whole serves more requests under the same SLA constraints.

However, this is no free lunch. PD adds operational complexity: the KV cache must be transferred across nodes, and the prefill-to-decode ratio must be tuned per workload.

We show results where, under the same GPU budget and SLA, PD disaggregation on Ray + vLLM serves 1.3x to 2.3x more QPS than aggregated serving (up to a 67% compute cost reduction, depending on the workload), and also results where it does not help, so you can make the right decision for your workload. We tested two large MoE models across a range of workloads, varying input/output lengths, KV cache hit rates, and P:D ratios, to find where PD disaggregation saves cost and where it doesn't.
This post walks through the core intuition behind why PD disaggregation helps, the AMD stack needed to enable PD disaggregation (RIXL for KV transfer), and how to set it up with Ray Serve. For a managed solution, Anyscale (https://www.anyscale.com/) can bring similar cost savings to your workloads.

Core Intuition – Why PD Works and When It Doesn't

This section covers the four key insights you need to reason about PD disaggregation for any workload. Each insight includes data from our experiments, plus clear guidance on when PD disaggregation loses. These results are agnostic to hardware setup and should be taken as general guidelines and conclusions about prefill-decode disaggregation.

Insight 1: PD does NOT make prefill faster – it can actually hurt TTFT

The most common misconception about PD is that it speeds up everything. It does not. On the metric that matters most for interactive responsiveness, time to first token, PD is consistently slower than aggregated serving on the same GPU footprint.

Why aggregated TTFT is already good. In vLLM's scheduler, there is no separate "prefill phase" or "decode phase." The scheduler schedules all currently active requests, both prefill and decode, before admitting new requests from the waiting queue. Chunked prefill is enabled by default for all decoder-only models: long prompts are split into chunks sized by max_num_batched_tokens (default 8192), and each chunk runs as one scheduler iteration.

Typically, decode steps consume trivially little budget per iteration. For example, a batch of 128 concurrent decode requests uses at most 128 tokens out of the 8192-token budget, leaving the vast majority of each iteration available for prefill tokens. When a new request is added to the batch, the 8064 unused tokens of budget are allocated to the prefill of that request. This inflates the TPOT of that iteration, but adds little to the TTFT.
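The budget arithmetic above can be sketched in a few lines. This is a simplified model, not vLLM's scheduler: it assumes every running decode request consumes exactly one token of the per-iteration budget and that the whole remainder goes to a single new prompt. The function names and the 16K-token example prompt are illustrative assumptions.

```python
import math


def prefill_budget(max_num_batched_tokens: int, num_decode_requests: int) -> int:
    """Tokens left for prefill in one iteration.

    Assumption: each running decode request uses one token of the budget.
    """
    return max(max_num_batched_tokens - num_decode_requests, 0)


def prefill_iterations(prompt_tokens: int, budget_per_iteration: int) -> int:
    """Scheduler iterations needed to prefill a prompt in chunks."""
    if budget_per_iteration <= 0:
        raise ValueError("no token budget left for prefill")
    return math.ceil(prompt_tokens / budget_per_iteration)


# Numbers from the text: 8192-token budget, 128 concurrent decode requests.
budget = prefill_budget(8192, 128)
print(f"prefill budget per iteration: {budget} tokens")

# Hypothetical 16K-token prompt (16 * 1024 tokens), chunked across iterations.
print(f"iterations for a 16K prompt: {prefill_iterations(16 * 1024, budget)}")
```

Under these assumptions the decode batch costs the new prompt only a small fraction of each iteration, which is why contention with decode adds little to aggregated TTFT.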
TTFT is therefore dominated by the raw compute time of the prefill forward pass (attention + MoE routing), not by contention with decode.

What PD changes. PD adds a KV cache transfer step after prefill completes. The prefill node sends KV data over the network (RDMA/RoCE) to the decode node. This transfer has inherent overhead that depends on model architecture, KV cache size, and network conditions. Under high load and KV cache pressure, requests can also queue up on the prefill nodes, adding queuing delay on top of transfer overhead.

The net effect. On the same GPU footprint, aggregated serving consistently achieves equal or lower TTFT than PD. If your SLA is measured purely on time to first token (e.g., interactive search, auto-complete), aggregated will consistently beat PD. As Figure 3 shows, PD's TTFT baseline on DeepSeek-V3 is ~330ms (due to KV transfer overhead), while Agg stays at ~260ms across all QPS levels:

Under a TTFT < 300ms SLA, PD cannot serve any traffic (its baseline TTFT exceeds the target), while Agg sustains 7.0+ QPS.

Under a TTFT < 500ms SLA, Agg sustains 7.0+ QPS vs PD's 5.0 QPS, so Agg wins by at least 1.4x.

Agg's advantage here is structural: there is no KV transfer step, and prefill load is naturally distributed across replicas. If you need both fast TTFT and fast TPOT, consider accepting a slightly relaxed TTFT target. Even a small relaxation can unlock major TPOT and E2E improvements through PD.

Bottom line: If your SLA is strictly TTFT-limited, aggregated is the simpler and better choice.

Insight 2: PD's real win is flat, stable TPOT under load

This is the core mechanism behind PD's value. In aggregated serving, prefill and decode share the same GPU. Each scheduler iteration that includes prefill tokens is compute-heavier than a pure-decode iteration. As QPS rises, more prefill work stacks up, and decode tokens wait longer. TPOT degrades linearly or worse with increasing QPS, because every new prefill request steals compute from all in-flight decode steps.
PD eliminates this entirely. Decode runs on dedicated GPUs that never see a prefill token. TPOT stays nearly flat regardless of how much prefill work is happening on other nodes.

Insight 3: TPOT savings compound over output sequence length

PD's per-token TPOT advantage looks modest in isolation (5-10ms). But it multiplies across every output token:

Total savings = TPOT delta × output length

This compounding is why PD wins on E2E latency despite losing on TTFT.

When PD loses: short output. For short-output workloads (classification, extraction, short QA), the savings don't accumulate enough to justify the complexity. In these cases you should use aggregated serving.

Insight 4: The optimal P:D ratio depends on your workload

This is the most practical insight for practitioners. The P:D ratio determines how GPU resources are split between prefill and decode. Getting it wrong can make PD worse than aggregated serving.

Key findings across workloads:

| Workload | KV cache hit rate | Bottleneck |
| --- | --- | --- |
| Long input, short output (ISL=16K, OSL=1K) | 0% | Prefill throughput |
| Long input, long output (ISL=16K, OSL=4K) | 0% | Decode throughput |
| Multi-turn with high cache reuse | 80% | Decode throughput |
| Multi-turn with moderate cache reuse | 30–60% | Mixed |

Rule of thumb: The marginal GPU should go to wherever the bottleneck is. High cache hit rates make prefill cheap, so allocate more to decode. Low cache hit rates with long inputs make prefill expensive, so allocate more to prefill.

The most common PD pitfall is deploying with a ratio that does not match the workload. This can make PD strictly worse than aggregated serving on every metric.

What's Special About AMD – RIXL and the KV Transfer Stack

PD disaggregation requires high-bandwidth KV cache transfer between prefill and decode nodes. On NVIDIA hardware, this is handled by NIXL (NVIDIA Interconnect eXchange Library) over NVLink, InfiniBand, or EFA. On AMD, we use RIXL (ROCm Interconnect eXchange Library), a plug-and-play replacement for NIXL that uses UCX transport over RDMA (RoCE or InfiniBand).
The key point is that RIXL exposes the same NixlConnector interface in vLLM. Zero code changes are needed in the serving layer. If your vLLM config says kv_connector: NixlConnector, it works on both NVIDIA (via NIXL) and AMD (via RIXL).

Container and Software Stack

Our container image is built from Dockerfile.v0.18.dev with the following components:

| Component | Version / source |
| --- | --- |
| Base image | |
| ROCm | 7.0 |
| vLLM | 0.18.0 via |
| RIXL | Built from source |
| UCX | Built from source ROCm/ucx commit |
| ROCm Triton | Built from source |
| Python | 3.12 |

vLLM environment variables:

| Environment variable | Purpose |
| --- | --- |
| VLLM_ROCM_USE_AITER=1 | AMD AI Tensor Engine Runtime |
| VLLM_ROCM_USE_AITER_MOE=1 | Optimized MoE kernels |
| VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4 | Fast all-reduce with INT4 quantization |
| VLLM_ROCM_QUICK_REDUCE_CAST_BF16_TO_FP16=1 | BF16-to-FP16 cast for quick-reduce |

A Critical Operational Note

Network transport quality is everything. With proper UCX/RDMA configuration, cross-node KV transfer performs comparably to intra-node transfer. With TCP fallback, throughput degrades catastrophically: we observed up to 19x degradation in testing. Always validate the RDMA transport layer before benchmarking PD.

The UCX transport configuration in our deployments:

UCX_TLS=rc,sm,self,rocm_copy,rocm_ipc
UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1

This configures 8x Mellanox ConnectX interfaces for the RoCE fabric, using reliable connected (RC) transport with shared memory and ROCm GPU direct paths.

Hardware tested: AMD MI325X with 288GB HBM3e, 8 GPUs per node.

How to Reproduce

Everything needed to reproduce these results is consolidated in a single repository: Dockerfile, serve configs, and benchmark scripts. The instructions assume you have a Ray cluster running on AMD MI325X nodes via Anyscale (https://docs.anyscale.com/get-started), KubeRay (https://docs.ray.io/en/latest/cluster/kubernetes/getting-started.html), or bare metal. All artifacts are available at https://github.com/anyscale/pd-amd-blogpost.
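Before running the benchmarks, it can help to confirm that the UCX settings from the operational note are actually present in the serving environment. The sketch below is a hypothetical pre-flight helper (the name `check_ucx_env` and its rules are assumptions, not part of the repository). It only inspects environment variables; it does not prove that RDMA is used at runtime, so measuring real transfer throughput is still necessary.

```python
import os


def check_ucx_env(env=None):
    """Return a list of problems found in the UCX-related environment variables.

    Assumption: the target setup is the RC + ROCm configuration described in
    this post. An empty list means no problems were detected.
    """
    env = os.environ if env is None else env
    problems = []

    # UCX_TLS is a comma-separated list of allowed transports.
    tls = [t.strip() for t in env.get("UCX_TLS", "").split(",") if t.strip()]
    if not tls:
        problems.append("UCX_TLS is unset, so transports are not pinned")
    else:
        if "rc" not in tls:
            problems.append("UCX_TLS does not include 'rc' (reliable connected)")
        if "tcp" in tls:
            problems.append("UCX_TLS allows 'tcp', so a TCP fallback is possible")
        if not any(t.startswith("rocm") for t in tls):
            problems.append("UCX_TLS has no rocm_* transport for GPU memory")

    # UCX_NET_DEVICES is a comma-separated list such as mlx5_0:1.
    devices = [d.strip() for d in env.get("UCX_NET_DEVICES", "").split(",") if d.strip()]
    if not devices:
        problems.append("UCX_NET_DEVICES is unset, so NICs are not pinned")

    return problems


if __name__ == "__main__":
    found = check_ucx_env()
    for problem in found:
        print("WARNING:", problem)
    if not found:
        print("UCX environment matches the expected RDMA setup")
```

Running a check like this on every prefill and decode node before a benchmark gives an early signal of a misconfiguration that would otherwise surface only as unexplained throughput loss.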
Conclusion

PD disaggregation on Ray + vLLM delivers 1.3x to 2.3x more QPS under the same GPU budget and SLA, which is up to a 67% cost reduction on AMD MI325X. To summarize the main insights:

PD wins when your SLA is TPOT- or E2E latency-sensitive and output is long enough for per-token savings to compound.

Aggregated wins when TTFT is the binding constraint, output is short, or cache hit rates are high enough to eliminate prefill-decode contention.

The P:D ratio matters. The optimal value varies from workload to workload.

Get Started

Reproduce our results. Clone the repo (https://github.com/anyscale/pd-amd-blogpost), deploy a config, and run the benchmark CLI against your workload.

Need a managed solution? Anyscale gives you managed Ray services. You can deploy Anyscale on Kubernetes, and from then on you don't have to manage Ray clusters.

Talk to us. Found a workload where PD behaves differently? We want to hear about it. Reach out on Ray's Slack or the Ray community.