AMD ATOM + ATOMesh: Prefill/Decode Disaggregation on ROCm

AMD shipped ATOM + ATOMesh, a ROCm-native LLM serving stack for Instinct GPUs that implements prefill/decode disaggregation, splitting the two inference phases onto separate GPU pools to optimize for their opposite bottlenecks. The stack, released as an alpha preview on June 16, 2026, includes ATOM as an AITER-optimized inference engine and ATOMesh as the orchestration layer, evaluated serving DeepSeek-V4-Pro on Instinct hardware.

What: AMD shipped ATOM + ATOMesh, a ROCm-native LLM serving stack whose headline trick is prefill/decode disaggregation: splitting the two phases of inference onto separate pools of GPUs instead of crowding them onto one.

Why: Prefill and decode have opposite bottlenecks (prefill is compute-bound, decode is memory-bandwidth-bound), so running them on the same worker wastes hardware and lets one long prompt stall everyone else's token stream.

vs prior: A co-located server (vanilla single-pool vLLM) interleaves prefill and decode on the same GPUs; disaggregation runs each on its own pool tuned for its bottleneck, paying for it by shipping the KV cache across the interconnect between them.

A restaurant kitchen that splits the prep station from the plating line.

      ORDER (the prompt)
           │
           ▼
    ┌──────────────┐     KV cart     ┌──────────────┐
    │ PREP STATION │    down the     │ PLATING LINE │
    │   prefill    │═══ hallway ════▶│    decode    │──▶ tokens
    │ compute-heavy│  (KV transfer)  │ memory-bound │
    └──────────────┘                 └──────────────┘
     chops a whole                    plates dishes
     order at once                    one at a time

Prefill: The first phase of inference. The model reads your entire prompt in parallel in one pass, building the KV cache. It does a lot of math per byte of memory it touches, so it is compute-bound.

Decode: The second phase. The model generates one output token at a time, and each step must read the whole KV cache plus all the weights to produce that single token. It moves a lot of memory for little math, so it is memory-bandwidth-bound.

KV cache: The stored keys and values for every token already processed, so the model never recomputes them. It is the dominant memory cost of inference and, in a disaggregated stack, the thing that has to travel from the prefill pool to the decode pool.

Compute-bound vs memory-bound: The roofline distinction. A job is compute-bound when the GPU's math units are the limit, and memory-bound when memory bandwidth is.
Prefill and decode sit on opposite sides of that line, which is the whole reason to split them.

Disaggregation: Running prefill and decode on separate pools of workers instead of one shared pool, so each pool can be sized and scheduled for its own bottleneck.

KV-aware scheduling: A scheduler that routes a request with knowledge of where its KV-cache blocks already live, so it can reuse a cached prefix (prefix caching) or steer a request to the worker that avoids a transfer.

ROCm / AITER / MORI / Instinct: ROCm is AMD's CUDA-equivalent software stack and Instinct its datacenter GPU line. AITER supplies the optimized ROCm kernels (the analogue of CUDA kernels), while MORI handles the distributed, RDMA-style communication for tensor/expert parallelism (AMD's own collective library, RCCL, is the closer NCCL analogue).

The news. On June 16, 2026, AMD published ATOM + ATOMesh, a paired ROCm-native LLM serving stack for Instinct GPUs, shipped as an early alpha preview. ATOM is an AITER-optimized inference engine (kernel acceleration via AITER, distributed communication via MORI); ATOMesh is the orchestration layer on top: it exposes an OpenAI-compatible API, manages multiple engine backends, and applies prefill/decode disaggregation and KV-aware scheduling, evaluated serving DeepSeek-V4-Pro on Instinct hardware. In AMD's framing it deliberately mirrors the vLLM/SGLang design: the same serving primitives, now on AMD silicon.

Picture a restaurant kitchen where one cook does everything. First they prep an order: chopping, slicing, mixing every ingredient the dish needs, all at once, in a furious burst of knife work. Then they plate it, assembling the dish one component at a time, walking back to the fridge for each piece. Prep is a flat-out, hands-busy job; plating is a lot of trips to the fridge and not much knife work.
Cram both onto one cook and they fight: a big prep order makes every waiting plate go cold, and during the slow plating trips the knives sit idle.

That single overloaded cook is one GPU running an LLM, and the two jobs are prefill and decode. When a model answers, it first runs prefill: it reads your entire prompt in one parallel pass, doing dense matrix math and filling the KV cache. Then it runs decode: it emits output one token per step, and every step drags the whole KV cache and all the weights out of memory to produce that single token. Prefill is compute-bound, limited by the GPU's math units, while decode is memory-bandwidth-bound, limited by how fast it can stream the cache out of memory. They are the prep cook and the plating cook: opposite appetites, forced to share one station.

That opposite-appetites problem is why a single shared worker wastes hardware. Pack prefill and decode together and a long prompt's prefill burst blocks the queue of decode steps behind it (a head-of-line stall), while the memory-bound decodes leave the expensive compute units sitting idle. You can never shape one machine to be right for both jobs at once.

Disaggregation is the fix: give prep and plating their own stations. Prefill runs on one pool of GPUs, scheduled for compute-heavy bursts; decode runs on a separate pool, scheduled for steady memory-bound streaming with large batches. When a request finishes prefill, the prefill worker hands its KV cache across the interconnect to a decode worker, which then streams the tokens out. Each pool is now sized and tuned for the one bottleneck it actually has, and AMD's ATOMesh is the orchestration layer that does exactly this routing on ROCm. This is the same playbook vLLM and SGLang made standard; ATOM + ATOMesh shows AMD building a ROCm-native path to it.

But disaggregation is not free, and the bill comes due at the handoff. After prefill, the KV cache has to physically travel from the prefill pool to the decode pool.
For a 70B-class model with a 2,048-token prompt, that cache is 2 × 80 layers × 8 KV-heads × 128 dim × 2,048 tokens × 2 bytes ≈ 0.67 GB (illustrative, Llama-3.1-70B with grouped-query attention). Move it over PCIe 4.0 and you pay roughly 21 ms; over NVLink, about 0.75 ms.

| Phase | What it processes | Bottleneck (roofline) | What it wants from the hardware |
|---|---|---|---|
| Prefill | The whole prompt, in one parallel pass | Compute-bound (high arithmetic intensity) | Raw matmul throughput; fewer, fatter GPUs |
| Decode | One output token per step, reading the full KV cache | Memory-bandwidth-bound (low arithmetic intensity) | Memory bandwidth and large batches to amortize the weight reads |

The honest caveat: ATOM + ATOMesh ship as an early alpha preview, and AMD's post describes the mechanism, not head-to-head numbers. It reports that ATOMesh mirrors the vLLM/SGLang design and was evaluated serving DeepSeek-V4-Pro, but the post text gives no usable throughput or latency figures, so treat any performance claim as not yet quantified and check the source for benchmarks. The KV-transfer figures above are illustrative, sized to a representative model rather than measured on ATOM. But the durable lesson stands: once you see that prefill and decode sit on opposite sides of the roofline, "one GPU does both" stops looking efficient, and a serving stack's real job is to split the two phases and move the KV cache between them cheaply.

Prefill/decode disaggregation is a serving design that runs the two phases of LLM inference on separate pools of GPUs. Prefill (reading the whole prompt in one parallel, compute-heavy pass) runs on one pool, and decode (generating output one token at a time, bottlenecked by memory bandwidth) runs on another. After prefill, the request's KV cache is transferred across the interconnect to a decode worker.
Splitting them lets each pool be sized and scheduled for its own bottleneck instead of compromising on one shared machine.

The two phases are split because they have opposite bottlenecks. Prefill is compute-bound (limited by the GPU's math units), while decode is memory-bandwidth-bound (limited by how fast it streams the KV cache and weights out of memory). On one shared worker a long prefill stalls the decode steps queued behind it, and the memory-bound decodes leave the compute units idle. Running each phase on hardware tuned for its own limit avoids that mutual interference, at the cost of moving the KV cache between the two pools.

ATOM is a ROCm-native inference engine (optimized kernels via AITER, cross-GPU communication via MORI), and ATOMesh is the orchestration layer above it: an OpenAI-compatible API that applies prefill/decode disaggregation and KV-aware scheduling. AMD describes it as deliberately mirroring the vLLM/SGLang design, so the contribution is not a new algorithm but the same modern serving primitives brought to AMD Instinct GPUs: a second-vendor implementation of the stack the LLM Serving track teaches.

Originally posted on Learn AI Visually: https://learnaivisually.com/ai-explained/amd-atom-prefill-decode-disaggregation
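
Worked example: the article's KV-handoff arithmetic (2 × layers × KV-heads × head dim × tokens × bytes per value) is easy to check yourself. The sketch below is only that, a back-of-the-envelope check and not anything from ATOM or ATOMesh. The function names are made up for illustration, and the link bandwidths (about 32 GB/s for PCIe 4.0 x16, about 900 GB/s for NVLink) are assumed nominal peaks, picked because they roughly reproduce the article's figures; real transfers will be slower.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bytes_per_value=2):
    """Size of one request's KV cache in bytes.

    The leading 2 counts one key tensor plus one value tensor per layer.
    bytes_per_value=2 assumes 16-bit (FP16/BF16) storage.
    """
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value


def transfer_ms(num_bytes, link_gb_per_s):
    """Idealized transfer time in milliseconds at a given bandwidth (GB/s).

    Ignores latency, protocol overhead, and contention, so it is a lower bound.
    """
    return num_bytes / (link_gb_per_s * 1e9) * 1e3


# Illustrative shape from the article: Llama-3.1-70B-style model with
# grouped-query attention and a 2,048-token prompt.
size = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, tokens=2048)

# ASSUMPTION: nominal peak bandwidths, not measured numbers.
ASSUMED_LINKS_GB_PER_S = {
    "PCIe 4.0 x16": 32.0,
    "NVLink": 900.0,
}

if __name__ == "__main__":
    print(f"KV cache: {size / 1e9:.2f} GB")
    for name, bandwidth in ASSUMED_LINKS_GB_PER_S.items():
        print(f"{name}: {transfer_ms(size, bandwidth):.2f} ms")
```

Under those assumptions the script lands on about 0.67 GB, 21 ms over PCIe and 0.75 ms over NVLink, matching the illustrative figures in the article. Changing `tokens` or `kv_heads` shows how quickly the handoff cost grows with longer prompts or with models that do not use grouped-query attention.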