{"slug": "glm5-2-on-amd-mi355x-at-2626-tok-s-node-at-over-2x-lower-cost-than-blackwell", "title": "GLM5.2 on AMD MI355X at 2626 tok/s/node at over 2x lower cost than Blackwell", "summary": "Wafer served GLM5.2 on AMD MI355X GPUs at 2626 tokens per second per node with over 2x lower cost than NVIDIA Blackwell, achieving 213 tok/s single stream. The company used MXFP4 quantization via AMD Quark and sglang inference engine, demonstrating competitive performance per dollar despite AMD's software ecosystem lag.", "body_md": "# Performance per dollar is getting faster and cheaper\n\nHow we served GLM5.2 on AMD MI355X at 2626 tok/s/node and 213 tok/s single stream at over 2x lower cost than Blackwell.\n\nHave you noticed we like AMD?\n\nThe demand for inference is skyrocketing and outpacing supply. With frontier models being released almost every other week — Claude Fable, GLM5.2, and Minimax M3, to name a few — the token craze is only getting crazier, and there aren’t enough Blackwells going around to support it. Thus, NVIDIA GPU prices are climbing fast, and tokens are getting really expensive.\n\nIn comes AMD. At around 2.75x cheaper per GPU on average (MI355X vs B300) with comparable hardware specs, the solution to cheap inference is hiding in plain sight — a message we at Wafer have been preaching for months. But although AMD’s Instinct MI350 series competes with Blackwells at the silicon level, NVIDIA’s software advantage and day-0 support typically allow providers to serve inference much faster on their hardware with much less friction.\n\nConversely, on the MI355X / ROCm stack, SOTA performance rarely comes out of the box for these frontier models (sometimes it does!). In fact, you’re lucky if you can find an image that runs them at all. Without this day-0 support, building and optimizing for the newest models can require weeks of engineering and compute. 
By then, a newer model has already been released, so AMD is always playing catch-up.\n\nBut as agents improve at kernel and model optimization, this gap is closing in real time. At Wafer, we’ve proven this time and time again.\n\nAnd again: on a 20k in / 1k out, 60% cache hit rate workload, we hit an aggregate throughput of 2626 tok/s/node @ 2.4 rps with a defined knee of ≤5s TTFT. That is roughly 80% of the performance we measured on a B200, on hardware that is over 2x cheaper.\n\n| Sustained RPS | Aggregate tok/s/node | TTFT p50 / p95 | Success |\n|---|---|---|---|\n| 0.5 | 449 | 0.59s / 0.60s | 100% |\n| 1.0 | 974 | 0.60s / 0.81s | 100% |\n| 1.5 | 1913 | 0.62s / 1.03s | 100% |\n| 2.0 | 1944 | 0.62s / 1.05s | 100% |\n| 2.25 | 2089 | 0.63s / 1.23s | 100% |\n| 2.4 (saturation) | 2626 | 0.81s / 2.22s | 100% |\n\nWe also hit 213 tok/s on GLM5.2 on 10k input tokens / 1.5k output tokens single stream, following [Artificial Analysis standards](https://artificialanalysis.ai/methodology/performance-benchmarking), served on AMD MI355X capacity from TensorWave. Though this number doesn’t top the AA leaderboard, it still wins on performance per dollar.\n\n## How we did it\n\nThe first step with any model work is to choose a quantization and framework. We quantized the base bf16 GLM-5.2 to MXFP4 with AMD Quark. In comparison to z-ai’s official FP8 quantization, our MXFP4 was effectively lossless on GPQA-Diamond, tau2, and GSM8K: the GSM8K and GPQA-Diamond deltas fall within the reported error bars, and tau2 came out slightly ahead.\n\n| Eval | FP8 baseline | MXFP4 | Δ (MXFP4 − FP8) |\n|---|---|---|---|\n| GSM8K (200q, 5-shot, greedy) | 0.965 ± 0.013 | 0.955 ± 0.014 | −0.010 |\n| GPQA-Diamond (198q × 2 seeds, temp 1.0) | 0.9217 ± 0.027 | 0.9026 ± 0.029 | −0.019 |\n| tau2 macro | 0.819 | 0.834 | +0.015 |\n\nAs for the inference framework, we had three options — vLLM, ATOM, and sglang. Among the three, we chose sglang — vLLM had no working MXFP4 + GlmMoeDsa path so the MXFP4 weights provided no benefit, and ATOM’s output degraded at long context. 
Sglang was the inference engine with the least friction to native support, able to take advantage of the quantization while remaining coherent.\n\nThe next natural step to improving throughput was enabling speculative decode on sglang. However, the sglang ROCm image does not support this out of the box. There were two fixes needed before MTP (multi-token prediction) worked properly.\n\nFirst, the MTP head, like every other layer, keeps its single shared expert stored in bf16, not MXFP4. However, the MTP head is registered under a different module prefix than the main decoder stack (Quark names its bf16 shared expert `model.layers.78.mlp.shared_experts.*`, while the MTP layer’s real prefix is `model.decoder.*`). Because of the mismatch, sglang’s quantization lookup fails and defaults to building that shared expert as MXFP4. At load it then tries to read a full-width bf16 weight into a half-width 4-bit slot and the init crashes on a shape mismatch. Quark records which weights to leave un-quantized as a list of layer names, so we copied over the layer 78 entries to that list a second time under the decoder name sglang actually uses. This fix unblocked speculative decode, netting us close to a 3x gain in single stream throughput.\n\nSecond, deep speculative decode (such as the 5/1/6 config z-ai suggests) was still blocked. The fused multi-step metadata kernel needed for draft depth ≥4 contains `#include <cuda_runtime.h>` with no ROCm guard. Fix: one `#ifdef USE_ROCM` guard.\n\nTwo trivial but necessary changes to take full advantage of speculative decode. With spec dec working properly, alongside a few config optimizations (such as `--kv-cache-dtype fp8_e4m3` and `--enable-aiter-allreduce-fusion`), we reached our headline single stream decode number at 213 tok/s.\n\nBut for aggregate throughput, especially with our defined workload, decode optimizations are necessary but insufficient. 
At 20k in @ 60% cache, the workload is primarily prefill bound.\n\nAt TP8, which was the configuration optimized for single stream decode, the MI355X can run GLM5.2-MXFP4 at 1461 tok/s/node. Switching to TP4×DP2 netted a massive improvement on this workload, getting us to 1944 tok/s/node at 2.0 RPS — still relatively slow compared to our measured Blackwell performance, which hit 3192 tok/s/node at 3.0 RPS. A big reason for the poor prefill performance on the MI355X is that on the sglang image, GLM-5.2’s fp4 MoE was silently on a slow FlyDSL heuristic fallback (aiter only shipped tuned configs for the a8w8/fp8 path). We tuned the MoE kernel selection ourselves on GLM’s fp4 shapes (`model_dim 6144, moe_inter 2048, E=256, topk=8`), which allowed us to reach 2626 tok/s/node at 2.4 RPS. Much better.\n\n## Why this matters\n\nAlthough there was some degree of friction, achieving the best performance per dollar ratio on the MI355X wasn’t particularly hard. There were some framework-related bugs, but unlike our work with Qwen3.5 397B, you’ll notice that we didn’t actually write any custom kernels this time. Though this study doesn’t take multi-node performance into consideration, single-node deployments remain highly prevalent in practice.\n\nSOTA on AMD is becoming more a matter of support, not software. 
The CUDA moat is eroding in real time.", "url": "https://wpnews.pro/news/glm5-2-on-amd-mi355x-at-2626-tok-s-node-at-over-2x-lower-cost-than-blackwell", "canonical_source": "https://www.wafer.ai/blog/glm52-amd", "published_at": "2026-07-03 21:49:06+00:00", "updated_at": "2026-07-03 22:19:55.919231+00:00", "lang": "en", "topics": ["large-language-models", "ai-infrastructure", "ai-chips", "ai-tools", "ai-research"], "entities": ["Wafer", "AMD", "NVIDIA", "GLM5.2", "MI355X", "Blackwell", "TensorWave", "sglang"], "alternates": {"html": "https://wpnews.pro/news/glm5-2-on-amd-mi355x-at-2626-tok-s-node-at-over-2x-lower-cost-than-blackwell", "markdown": "https://wpnews.pro/news/glm5-2-on-amd-mi355x-at-2626-tok-s-node-at-over-2x-lower-cost-than-blackwell.md", "text": "https://wpnews.pro/news/glm5-2-on-amd-mi355x-at-2626-tok-s-node-at-over-2x-lower-cost-than-blackwell.txt", "jsonld": "https://wpnews.pro/news/glm5-2-on-amd-mi355x-at-2626-tok-s-node-at-over-2x-lower-cost-than-blackwell.jsonld"}}