Disaggregating LLM Inference: Inside AMD's ATOM and ATOMesh Stack

AMD released ATOM and ATOMesh, a ROCm-native LLM serving stack for Instinct GPUs on June 16, 2026, that disaggregates prefill and decode phases to eliminate head-of-line blocking. The open-source stack splits inference into compute-bound prefill and memory-bandwidth-bound decode on separate GPU pools, improving hardware utilization and latency for self-hosted LLM infrastructure.

Cloud & Infra https://www.devclubhouse.com/c/cloud Article Disaggregating LLM Inference: Inside AMD's ATOM and ATOMesh Stack AMD's native ROCm serving stack splits prefill and decode to eliminate head-of-line blocking on Instinct hardware. Ji-ho Choi https://www.devclubhouse.com/u/jiho choi Large language model LLM serving is undergoing a fundamental architectural shift. For years, the standard deployment pattern has been co-location: running both the prefill and decode phases of inference on the same GPU. However, as context windows expand and concurrency demands rise, this unified approach has become a major bottleneck. On June 16, 2026, AMD released ATOM and ATOMesh , a paired, ROCm https://rocm.docs.amd.com -native LLM serving stack designed specifically for Instinct GPUs such as the MI300X . Shipped as an early preview, this release targets prefill/decode P/D disaggregation —the practice of splitting these two phases onto separate, dedicated pools of GPUs. This release is more than a minor software update; it represents a mature, open-source alternative to proprietary inference orchestration layers. By decoupling the compute-bound prefill phase from the memory-bandwidth-bound decode phase, developers running self-hosted LLM infrastructure on AMD silicon can unlock significantly higher hardware utilization and lower latency. The Physics of the Split: Why Co-location Fails To understand why disaggregation is becoming a production necessity, one must look at the roofline limits of GPU execution. LLM inference is a tale of two entirely different workloads: Prefill The Compute-Bound Phase : When a prompt is submitted, the model processes the entire input sequence in parallel to compute the initial Key-Value KV cache. This phase is dominated by dense General Matrix Multiply GEMM operations. Because there is high arithmetic intensity lots of math per byte of memory read , prefill is limited by the GPU's raw compute units TFLOPs . Decode The Memory-Bandwidth-Bound Phase : Once the prefill is complete, the model generates output tokens autoregressively, one by one. Each step requires reading the entire model weights and the accumulated KV cache from High Bandwidth Memory HBM to generate a single token. The arithmetic intensity here is incredibly low; the bottleneck is strictly how fast memory can be streamed into the tensor cores. When these two phases share a single GPU pool, they fight for resources. A massive, compute-heavy prefill request will stall the execution of active, memory-bound decode streams—a phenomenon known as head-of-line blocking . Conversely, during periods dominated by decode steps, the GPU's expensive compute engines sit idle, waiting for memory transfers. Disaggregation solves this by establishing a physical separation, routing prompts to a "prefill pool" and token generation to a "decode pool." sequenceDiagram autonumber actor Client participant Gateway as ATOMesh Gateway participant Prefill as Prefill Pool GPU participant Decode as Decode Pool GPU Client- Gateway: Prompt Request Gateway- Prefill: Route Prompt Compute-Bound Note over Prefill: Run Prefill Phase<br/ Generate KV Cache Prefill-- Decode: Push KV Cache RDMA / MORI-IO Prefill- Gateway: Initial Token / Metadata Gateway- Decode: Route Generation Memory-Bound loop Autoregressive Generation Note over Decode: Run Decode Phase<br/ Stream Weights & KV Cache Decode- Gateway: Output Token Gateway- Client: Stream Token end Inside the ATOM + ATOMesh Architecture AMD’s disaggregated stack is split into two primary layers: ATOMesh at the orchestration layer, and ATOM at the engine layer. ATOMesh: The Orchestration Gateway ATOMesh acts as the distributed gateway. It exposes an OpenAI-compatible API and manages request routing, worker health, retries, and scaling. Crucially, it features a unified placement core that handles transport-neutral routing for both HTTP and gRPC. Instead of maintaining separate routing pipelines, ATOMesh uses a centralized planner to generate a placement plan. For disaggregated workloads, this plan pairs a prefill worker with a decode worker. ATOMesh also implements KV-aware scheduling , routing requests based on where KV-cache blocks already reside for prefix caching to minimize unnecessary data transfers. ATOM: The Execution Engine Below the orchestration layer sits the ATOM GitHub https://github.com/ROCm/ATOM engine, a lightweight, vLLM-like implementation optimized specifically for ROCm. ATOM leverages several key lower-level libraries: Shadow GPS — know where it is, always Real-time GPS tracking for vehicles, gear and loved ones. No monthly contracts. https://www.devclubhouse.com/go/ad/12 AITER: Provides highly optimized ROCm-native GPU kernels using Triton, Composable Kernel, and assembly to accelerate compute-heavy operations. MORI: Handles distributed, RDMA-oriented communication. For multi-GPU setups, MORI manages tensor, data, and expert parallelism EP via optimized all-to-all collectives. MORI-IO & Mooncake: These protocols manage the high-speed transfer of the KV cache from the prefill pool to the decode pool over RDMA using a push-mode architecture. Additionally, ATOM supports Two-Batch Overlap TBO , which splits execution batches into micro-batches to pipeline compute and communication streams, effectively hiding expert-parallel communication latency. The Developer Angle: Deploying Disaggregated Serving Implementing P/D disaggregation is not a simple software toggle; it requires strict hardware coordination, particularly on the networking side. Because the KV cache must be transferred from the prefill pool to the decode pool before generation can begin, the interconnect is the primary point of failure. 1. Hardware and Network Prerequisites To run a disaggregated setup on AMD Instinct GPUs like the MI300X , you must have: RDMA-Capable NICs: Broadcom Thor2 BCM-57608 or NVIDIA/Mellanox ConnectX network cards are required. RoCEv2 or InfiniBand: The network must support RDMA over Converged Ethernet RoCEv2 with Priority Flow Control enabled or native InfiniBand, properly cabled and switch-configured. ROCm 6.3 or later installed on the host. 2. Setting Up the Container Environment When launching Docker containers for disaggregated serving, you must map both the GPU compute devices /dev/kfd , /dev/dri and the RDMA network devices /dev/infiniband into the container. docker run -it --rm \ --network=host \ --ipc=host \ --shm-size 32G \ --device=/dev/kfd \ --device=/dev/dri \ --device=/dev/infiniband \ --device=/dev/infiniband/rdma cm \ --privileged \ --cap-add=SYS ADMIN \ --cap-add=SYS PTRACE \ --security-opt seccomp=unconfined \ -v $ pwd :/workspace \ lmsysorg/sglang:v0.4.9-rocm630 Note: SGLang can act as an execution backend behind ATOMesh, using SGLang's native ROCm runner or delegating to ATOM via plugins . 3. Launching the ATOM Engine For teams utilizing the native ATOM engine, starting an OpenAI-compatible server with tensor parallelism TP and FP8 KV caching is straightforward. For instance, to serve a large model like DeepSeek-R1 across 8 GPUs: python -m atom.entrypoints.openai server \ --model deepseek-ai/DeepSeek-R1 \ --kv cache dtype fp8 \ -tp 8 To run disaggregated serving, developers configure ATOMesh to manage two distinct backend groups—one started with a --role prefill flag and another with --role decode —allowing ATOMesh to orchestrate the MORI-IO RDMA push-mode transfers between them. Strategic Trade-offs and the Road Ahead While P/D disaggregation offers clear throughput advantages, it introduces new architectural trade-offs that infrastructure teams must evaluate. | Metric / Feature | Co-located Serving | |---| Hardware Efficiency Latency Consistency Network Dependency Operational Complexity If your workload consists of short prompts and short generations, the overhead of transferring the KV cache over the network may eclipse the benefits of disaggregation. However, for modern workloads characterized by long-context retrieval RAG , multi-turn agentic workflows, and high concurrency , disaggregation is the only viable path to maintaining a low cost-per-token. AMD's release of ATOM and ATOMesh proves that the ROCm software ecosystem is maturing rapidly. By providing native support for advanced serving primitives like Mooncake RDMA transfers, piecewise torch.compile with CUDA graphs, and Two-Batch Overlap, AMD is closing the software gap with NVIDIA's TensorRT-LLM. For enterprise teams building custom, large-scale inference clusters, this stack makes AMD Instinct hardware a highly competitive, production-ready target. Sources & further reading - AMD ATOM + ATOMesh: Prefill/decode Disaggregation on ROCm https://dev.to/pueding/amd-atom-atomesh-prefilldecode-disaggregation-on-rocm-2p0a — dev.to - ATOMesh: Unlocking AMD Hardware for Scalable LLM Serving — ROCm Blogs https://rocm.blogs.amd.com/software-tools-optimization/atomesh-inference/README.html — rocm.blogs.amd.com - GitHub - ROCm/ATOM: AiTer Optimized Model · GitHub https://github.com/ROCm/ATOM — github.com - LLM distributed inference and PD disaggregation on AMD Instinct GPUs — Tutorials for AI developers 13.0 https://rocm.docs.amd.com/projects/ai-developer-hub/en/latest/notebooks/inference/SGlang PD Disagg On AMD GPU.html — rocm.docs.amd.com Ji-ho Choi https://www.devclubhouse.com/u/jiho choi · Security & Cloud Editor Ji-ho covers the increasingly tangled overlap between cloud architecture and security, drawing on a background as a penetration tester to keep his reporting grounded in real-world attack paths. He never lets a vendor claim go unquestioned and insists that every buzzword come with a proof of concept. Discussion 0 No comments yet Be the first to weigh in.