Disaggregating LLM Inference: Inside AMD's ATOM and ATOMesh Stack

wpnews.pro

AMD's native ROCm serving stack splits prefill and decode to eliminate head-of-line blocking on Instinct hardware.

Large language model (LLM) serving is undergoing a fundamental architectural shift. For years, the standard deployment pattern has been co-location: running both the prefill and decode phases of inference on the same GPU. However, as context windows expand and concurrency demands rise, this unified approach has become a major bottleneck.

On June 16, 2026, AMD released ATOM and ATOMesh, a paired, ROCm-native LLM serving stack designed specifically for Instinct GPUs (such as the MI300X). Shipped as an early preview, this release targets prefill/decode (P/D) disaggregation—the practice of splitting these two phases onto separate, dedicated pools of GPUs.

This release is more than a minor software update; it represents a mature, open-source alternative to proprietary inference orchestration layers. By decoupling the compute-bound prefill phase from the memory-bandwidth-bound decode phase, developers running self-hosted LLM infrastructure on AMD silicon can unlock significantly higher hardware utilization and lower latency.

The Physics of the Split: Why Co-location Fails #

To understand why disaggregation is becoming a production necessity, one must look at the roofline limits of GPU execution. LLM inference is a tale of two entirely different workloads:

Prefill (The Compute-Bound Phase): When a prompt is submitted, the model processes the entire input sequence in parallel to compute the initial Key-Value (KV) cache. This phase is dominated by dense General Matrix Multiply (GEMM) operations. Because there is high arithmetic intensity (lots of math per byte of memory read), prefill is limited by the GPU's raw compute units (TFLOPs).Decode (The Memory-Bandwidth-Bound Phase): Once the prefill is complete, the model generates output tokens autoregressively, one by one. Each step requires reading the entire model weights and the accumulated KV cache from High Bandwidth Memory (HBM) to generate a single token. The arithmetic intensity here is incredibly low; the bottleneck is strictly how fast memory can be streamed into the tensor cores.

When these two phases share a single GPU pool, they fight for resources. A massive, compute-heavy prefill request will stall the execution of active, memory-bound decode streams—a phenomenon known as head-of-line blocking. Conversely, during periods dominated by decode steps, the GPU's expensive compute engines sit idle, waiting for memory transfers.

Disaggregation solves this by establishing a physical separation, routing prompts to a "prefill pool" and token generation to a "decode pool."

sequenceDiagram
    autonumber
    actor Client
    participant Gateway as ATOMesh Gateway
    participant Prefill as Prefill Pool (GPU)
    participant Decode as Decode Pool (GPU)
    
    Client->>Gateway: Prompt Request
    Gateway->>Prefill: Route Prompt (Compute-Bound)
    Note over Prefill: Run Prefill Phase<br/>Generate KV Cache
    Prefill-->>Decode: Push KV Cache (RDMA / MORI-IO)
    Prefill->>Gateway: Initial Token / Metadata
    Gateway->>Decode: Route Generation (Memory-Bound)
    loop Autoregressive Generation
        Note over Decode: Run Decode Phase<br/>Stream Weights & KV Cache
        Decode->>Gateway: Output Token
        Gateway->>Client: Stream Token
    end

Inside the ATOM + ATOMesh Architecture #

AMD’s disaggregated stack is split into two primary layers: ATOMesh at the orchestration layer, and ATOM at the engine layer.

ATOMesh: The Orchestration Gateway

ATOMesh acts as the distributed gateway. It exposes an OpenAI-compatible API and manages request routing, worker health, retries, and scaling. Crucially, it features a unified placement core that handles transport-neutral routing for both HTTP and gRPC.

Instead of maintaining separate routing pipelines, ATOMesh uses a centralized planner to generate a placement plan. For disaggregated workloads, this plan pairs a prefill worker with a decode worker. ATOMesh also implements KV-aware scheduling, routing requests based on where KV-cache blocks already reside (for prefix caching) to minimize unnecessary data transfers.

ATOM: The Execution Engine

Below the orchestration layer sits the ATOM GitHub engine, a lightweight, vLLM-like implementation optimized specifically for ROCm. ATOM leverages several key lower-level libraries:

Shadow GPS — know where it is, always Real-time GPS tracking for vehicles, gear and loved ones. No monthly contracts.

AITER: Provides highly optimized ROCm-native GPU kernels (using Triton, Composable Kernel, and assembly) to accelerate compute-heavy operations.MORI: Handles distributed, RDMA-oriented communication. For multi-GPU setups, MORI manages tensor, data, and expert parallelism (EP) via optimized all-to-all collectives.MORI-IO & Mooncake: These protocols manage the high-speed transfer of the KV cache from the prefill pool to the decode pool over RDMA using a push-mode architecture.

Additionally, ATOM supports Two-Batch Overlap (TBO), which splits execution batches into micro-batches to pipeline compute and communication streams, effectively hiding expert-parallel communication latency.

The Developer Angle: Deploying Disaggregated Serving #

Implementing P/D disaggregation is not a simple software toggle; it requires strict hardware coordination, particularly on the networking side. Because the KV cache must be transferred from the prefill pool to the decode pool before generation can begin, the interconnect is the primary point of failure.

1. Hardware and Network Prerequisites

To run a disaggregated setup on AMD Instinct GPUs (like the MI300X), you must have:

RDMA-Capable NICs: Broadcom Thor2 (BCM-57608) or NVIDIA/Mellanox ConnectX network cards are required.RoCEv2 or InfiniBand: The network must support RDMA over Converged Ethernet (RoCEv2 with Priority Flow Control enabled) or native InfiniBand, properly cabled and switch-configured.ROCm 6.3 or later installed on the host.

2. Setting Up the Container Environment

When launching Docker containers for disaggregated serving, you must map both the GPU compute devices (/dev/kfd

, /dev/dri

) and the RDMA network devices (/dev/infiniband

) into the container.

docker run -it --rm \
  --network=host \
  --ipc=host \
  --shm-size 32G \
  --device=/dev/kfd \
  --device=/dev/dri \
  --device=/dev/infiniband \
  --device=/dev/infiniband/rdma_cm \
  --privileged \
  --cap-add=SYS_ADMIN \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  -v $(pwd):/workspace \
  lmsysorg/sglang:v0.4.9-rocm630

(Note: SGLang can act as an execution backend behind ATOMesh, using SGLang's native ROCm runner or delegating to ATOM via plugins).

3. Launching the ATOM Engine

For teams utilizing the native ATOM engine, starting an OpenAI-compatible server with tensor parallelism (TP) and FP8 KV caching is straightforward. For instance, to serve a large model like DeepSeek-R1 across 8 GPUs:

python -m atom.entrypoints.openai_server \
  --model deepseek-ai/DeepSeek-R1 \
  --kv_cache_dtype fp8 \
  -tp 8

To run disaggregated serving, developers configure ATOMesh to manage two distinct backend groups—one started with a --role prefill

flag and another with --role decode

—allowing ATOMesh to orchestrate the MORI-IO RDMA push-mode transfers between them.

Strategic Trade-offs and the Road Ahead #

While P/D disaggregation offers clear throughput advantages, it introduces new architectural trade-offs that infrastructure teams must evaluate.

| Metric / Feature | Co-located Serving ( | |---|

Hardware EfficiencyLatency ConsistencyNetwork Dependency****Operational Complexity If your workload consists of short prompts and short generations, the overhead of transferring the KV cache over the network may eclipse the benefits of disaggregation. However, for modern workloads characterized by long-context retrieval (RAG), multi-turn agentic workflows, and high concurrency, disaggregation is the only viable path to maintaining a low cost-per-token.

AMD's release of ATOM and ATOMesh proves that the ROCm software ecosystem is maturing rapidly. By providing native support for advanced serving primitives like Mooncake RDMA transfers, piecewise torch.compile

with CUDA graphs, and Two-Batch Overlap, AMD is closing the software gap with NVIDIA's TensorRT-LLM. For enterprise teams building custom, large-scale inference clusters, this stack makes AMD Instinct hardware a highly competitive, production-ready target.

Sources & further reading #

AMD ATOM + ATOMesh: Prefill/decode Disaggregation on ROCm— dev.to - ATOMesh: Unlocking AMD Hardware for Scalable LLM Serving — ROCm Blogs— rocm.blogs.amd.com - GitHub - ROCm/ATOM: AiTer Optimized Model · GitHub— github.com - LLM distributed inference and PD disaggregation on AMD Instinct GPUs — Tutorials for AI developers 13.0— rocm.docs.amd.com

Ji-ho Choi· Security & Cloud Editor

Ji-ho covers the increasingly tangled overlap between cloud architecture and security, drawing on a background as a penetration tester to keep his reporting grounded in real-world attack paths. He never lets a vendor claim go unquestioned and insists that every buzzword come with a proof of concept.

Discussion 0 #

No comments yet

Be the first to weigh in.

source & further reading

devclubhouse.com — original article Orchestrating Chaos: Dynamic Multi-Agent Workflows in Claude Code Beyond the Demo: Engineering Reliable, Production-Grade AI Agents Analyze Images and PDFs with Google Gemini's Multimodal API in Python