MI355X hits about 80% of B200 throughput at over 2x lower cost, but only after hand-tuning the ROCm stack.
Priya Nair
AMD's Instinct MI355X has never had a silicon problem. On paper it trades blows with NVIDIA's Blackwell parts, and it's cheaper per GPU. The problem has always been the week or three of engineering between "the frontier model shipped" and "the frontier model runs well on ROCm." NVIDIA ships day-0 kernels and recipes; AMD tends to arrive later, by which point the next model has dropped and the catch-up clock resets.
Two independent data points now suggest that gap has narrowed to something a competent inference team can close by hand, and that when they do, the economics flip in AMD's favor. That's the real story here, and it matters more for procurement than any single throughput number.
The numbers, and who's saying them
The headline comes from Wafer, which ran GLM-5.2 on MI355X capacity from TensorWave and hit an aggregate 2,626 tok/s/node at 2.4 requests per second on a 20k-in / 1k-out workload with a 60% cache-hit rate, holding time-to-first-token under a 5s knee. Their own comparison puts that at roughly 80% of what they measured on a B200 (which topped out around 3,192 tok/s/node at 3.0 RPS). The same team also reports 213 tok/s single-stream on a 10k/1.5k workload following Artificial Analysis conventions.
Here's the ramp, which tells you more than the peak does:
| Sustained RPS | Aggregate tok/s/node | TTFT p50 / p95 |
|---|---|---|
| 0.5 | 449 | 0.59s / 0.60s |
| 1.0 | 974 | 0.60s / 0.81s |
| 1.5 | 1,913 | 0.62s / 1.03s |
| 2.0 | 1,944 | 0.62s / 1.05s |
| 2.25 | 2,089 | 0.63s / 1.23s |
| 2.4 (saturation) | 2,626 | 0.81s / 2.22s |
The interesting bit is what that 20% throughput deficit buys you back. Wafer pegs MI355X at about 2.75x cheaper per GPU than a B300 with comparable specs. Do the arithmetic: ~80% of the performance at ~2.75x lower hardware cost lands you north of 2x better tokens-per-dollar. That's the whole thesis in one line.
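That arithmetic can be sketched in a few lines. This is a back-of-envelope check, not Wafer's methodology: it assumes throughput and hardware price are the only inputs, and it combines a B200-relative throughput with a B300-relative price, which is a shortcut. The function name is illustrative.

```python
def relative_tokens_per_dollar(relative_throughput: float, price_ratio: float) -> float:
    """Tokens-per-dollar of the cheaper part relative to the baseline.

    relative_throughput: cheaper part's throughput as a fraction of the
        baseline (0.80 means 80%).
    price_ratio: how many times cheaper the part is than the baseline
        (2.75 means 2.75x cheaper).
    """
    if relative_throughput <= 0 or price_ratio <= 0:
        raise ValueError("both inputs must be positive")
    return relative_throughput * price_ratio


# Figures quoted in the article: ~80% of the throughput, ~2.75x cheaper per GPU.
advantage = relative_tokens_per_dollar(0.80, 2.75)
print(f"{advantage:.2f}x tokens per dollar")  # 2.20x tokens per dollar

# Same calculation with the measured peaks (2,626 vs 3,192 tok/s/node)
# instead of the rounded 80%.
measured = relative_tokens_per_dollar(2626 / 3192, 2.75)
print(f"{measured:.2f}x tokens per dollar")
```

Anything that moves either input (a smaller price gap, or a workload where the throughput deficit is wider) moves the result proportionally.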
And it isn't just a vendor with AMD capacity talking its book. SemiAnalysis, running GLM-5 (the prior release) in FP8 on SGLang, found MI355X undercutting a B200 on cost per million tokens across most of the single-node Pareto frontier, with a peak gap of 1.41x (about a 40% reduction) at 18 tokens/sec/user. Their TCO model puts MI355X at $1.48/GPU/hr against B200 at $1.95. Different model, different precision, different methodology, same direction. When a vendor and an independent shop reach the same conclusion by different roads, the conclusion is worth taking seriously.
One honest caveat on the comparison: the perf number is measured against a B200, while the price advantage is quoted against a B300. Not strictly apples-to-apples, and both figures are self-reported by parties with a point to prove. Treat the exact ratio as directional, not gospel.
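One way to cross-check the two sources is a break-even calculation: at a given pair of hourly rates, what fraction of the B200's throughput must the MI355X reach before its cost per token is lower? The sketch below uses the SemiAnalysis TCO rates quoted above and assumes 8 GPUs per node on both sides and full utilization. It mixes figures from two different methodologies, so read the result as a rough sanity check only.

```python
def cost_per_million_tokens(gpu_hourly_usd: float, gpus: int, tokens_per_second: float) -> float:
    """USD per one million tokens for a node running flat out."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd * gpus / tokens_per_hour * 1_000_000


def break_even_throughput_fraction(cheap_hourly_usd: float, baseline_hourly_usd: float) -> float:
    """Fraction of baseline throughput at which both parts cost the same per token."""
    return cheap_hourly_usd / baseline_hourly_usd


MI355X_USD_PER_GPU_HR = 1.48  # SemiAnalysis TCO figure quoted in the article
B200_USD_PER_GPU_HR = 1.95    # SemiAnalysis TCO figure quoted in the article
GPUS_PER_NODE = 8             # assumption: same node size on both sides

fraction = break_even_throughput_fraction(MI355X_USD_PER_GPU_HR, B200_USD_PER_GPU_HR)
print(f"break-even at {fraction:.0%} of B200 throughput")  # break-even at 76% of B200 throughput

# Verify with a placeholder throughput (1,000 tok/s is hypothetical, not a measurement):
# at exactly the break-even fraction, both nodes cost the same per million tokens.
baseline_cost = cost_per_million_tokens(B200_USD_PER_GPU_HR, GPUS_PER_NODE, 1000.0)
cheap_cost = cost_per_million_tokens(MI355X_USD_PER_GPU_HR, GPUS_PER_NODE, 1000.0 * fraction)
assert abs(baseline_cost - cheap_cost) < 1e-9
```

Under these assumptions, Wafer's roughly 80% sits above the break-even line, which is consistent with both sources pointing the same way.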
The catch is spelled R-O-C-m
Getting there was not push-button, and the details are the actual lesson. GLM-5.2 is a ~753B-parameter sparse MoE (256 experts, top-8 routing, ~40B activated) built on the `glm_moe_dsa` architecture, which pairs DeepSeek Sparse Attention with MLA-style KV compression. It's a big, awkward model with a built-in MTP head for speculative decoding. Making it fly on AMD took a chain of unglamorous fixes.
Start with quantization. Wafer took the BF16 weights down to MXFP4 using AMD's Quark toolkit, and their evals show it holding accuracy against Z.ai's official FP8: GSM8K 0.955 vs 0.965, GPQA-Diamond 0.9026 vs 0.9217, tau2 macro actually up at 0.834 vs 0.819. Call it lossless within noise. Worth noting the format divergence: AMD's path is MXFP4 (open microscaling), while NVIDIA publishes an NVFP4 build for Blackwell that quantizes only the MoE expert linears. Both are 4-bit, neither is portable across the other's stack.
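To put the eval gaps side by side, here is a small sketch that computes the absolute and relative change of each MXFP4 score against the FP8 baseline. The scores are the ones quoted above; the helper is illustrative. Whether a given gap counts as noise depends on run-to-run variance, which the figures here do not include.

```python
# (MXFP4 score, official FP8 score) as quoted in the article.
evals = {
    "GSM8K": (0.955, 0.965),
    "GPQA-Diamond": (0.9026, 0.9217),
    "tau2 macro": (0.834, 0.819),
}


def deltas(scores):
    """Absolute and relative change of the MXFP4 score versus the FP8 baseline."""
    out = {}
    for name, (mxfp4, fp8) in scores.items():
        diff = mxfp4 - fp8
        out[name] = (diff, diff / fp8)
    return out


for name, (abs_delta, rel_delta) in deltas(evals).items():
    print(f"{name:<13} {abs_delta:+.4f} ({rel_delta:+.1%})")
```

The largest drop is just under two points on GPQA-Diamond, and tau2 macro moves in the other direction by a similar margin.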
Framework selection was a process of elimination. SGLang won because vLLM had no working MXFP4 + `GlmMoeDsa` path (so the 4-bit weights bought nothing) and ATOM degraded at long context. Even on SGLang, the ROCm image fought back:
- Speculative decode crashed on load because the MTP head's BF16 shared expert is registered under a different module prefix (`model.decoder.*`) than the decoder stack Quark labeled (`model.layers.78.mlp.shared_experts.*`). The quant lookup missed, tried to jam a full-width BF16 tensor into a 4-bit slot, and died on a shape mismatch. The fix was copying the layer-78 skip entries under the prefix SGLang actually uses. That alone netted close to 3x on single-stream throughput.
- Deep spec decode (the 5/1/6 config Z.ai suggests) was blocked because a fused multi-step metadata kernel did `#include <cuda_runtime.h>` with no ROCm guard. One `#ifdef USE_ROCM` unblocked it.
- The big aggregate win came from prefill. At 20k input with 60% cache, this workload is prefill-bound, and GLM-5.2's FP4 MoE was silently falling back to a slow FlyDSL heuristic because aiter only shipped tuned configs for the FP8 path. Hand-tuning MoE kernel selection on the FP4 shapes, plus switching from TP8 to TP4×DP2, moved the node from 1,461 to 2,626 tok/s.
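The first fix in the list above amounts to aliasing entries in the quantizer's skip list so the loader finds them under either prefix. The sketch below shows the idea on a plain list of module names. The list layout, the helper name, the projection names, and the exact prefix mapping are all assumptions for illustration; this is not the actual Quark config schema or SGLang's module naming.

```python
def alias_skip_entries(skip_entries, src_prefix, dst_prefix):
    """Return skip_entries plus a copy of every src_prefix entry re-rooted at dst_prefix.

    Existing entries are kept, order is preserved, and aliases that are
    already present are not duplicated.
    """
    aliased = list(skip_entries)
    seen = set(skip_entries)
    for entry in skip_entries:
        if entry.startswith(src_prefix):
            alias = dst_prefix + entry[len(src_prefix):]
            if alias not in seen:
                aliased.append(alias)
                seen.add(alias)
    return aliased


# Hypothetical skip list: modules the quantizer left in BF16.
skip = [
    "model.layers.78.mlp.shared_experts.gate_proj",
    "model.layers.78.mlp.shared_experts.up_proj",
    "model.layers.78.mlp.shared_experts.down_proj",
    "lm_head",
]

# Hypothetical mapping from the quantizer's prefix to the one the engine uses.
patched = alias_skip_entries(skip, "model.layers.78.", "model.decoder.")
for name in patched:
    print(name)
```

Copying entries rather than renaming them keeps the original names valid, so the same config still works for a loader that uses the quantizer's own prefix.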
None of these are hard once you know where to look. Every one requires an engineer who can read a stack trace across a quantizer, an inference engine, and a GPU kernel, and isn't afraid to patch the container image. That skill set is the actual moat NVIDIA has been renting out via day-0 support.
What this means if you buy or run inference
Readers split into two camps.
If you consume GLM-5.2 through an API (it's live via Vercel AI Gateway and OpenRouter), this is pure upside you don't have to work for. The provider eats the ROCm plumbing; you get cheaper tokens on the same model. Nothing changes in your code. Watch for a widening price spread between AMD-served and NVIDIA-served endpoints for the same open-weight models, and route accordingly.
If you run your own fleet, the calculus is more interesting and more demanding. The procurement pitch is real: for high-throughput batch and RAG-style workloads (long input, short output, cache-friendly), MI355X can beat Blackwell on cost per token today, and the raw memory helps. GLM-5.2 in BF16 is ~1.5TB, which fits a single MI355X node's 2.3TB HBM, whereas on the NVIDIA side you're reaching for 8×B300 or a multi-node layout. Per SGLang's own guidance, FP8 is still the recommended deployment and BF16 needs the big boxes.
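The memory claim is easy to check with weights-only arithmetic. The sketch below uses the ~753B parameter count and the 2.3TB node figure from the article, and deliberately ignores quantization scales, KV cache, and activations, so real headroom will be smaller than what it prints.

```python
def weight_footprint_tb(params_billion: float, bits_per_param: float) -> float:
    """Weight storage in decimal terabytes, ignoring scales, KV cache, and activations."""
    total_bytes = params_billion * 1e9 * bits_per_param / 8
    return total_bytes / 1e12


PARAMS_BILLION = 753  # ~753B parameters, per the article
NODE_HBM_TB = 2.3     # single MI355X node, per the article

# "4-bit" here counts element bits only; block-scale overhead is left out.
for name, bits in [("BF16", 16), ("FP8", 8), ("4-bit", 4)]:
    size = weight_footprint_tb(PARAMS_BILLION, bits)
    headroom = NODE_HBM_TB - size
    print(f"{name:<5} {size:.2f} TB weights, {headroom:.2f} TB left on the node")
```

The BF16 row comes out at about 1.5TB, matching the figure quoted above, and each halving of precision frees a corresponding share of HBM for KV cache and batching.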
But price the software tax honestly. You are signing up to either hire ROCm-fluent engineers or accept that you're 4 to 14 weeks behind on day-0 for each new frontier model. SemiAnalysis measured roughly a quarter between GLM-5's release and AMD reaching FP8 feature parity upstream. If your business depends on serving the newest model the week it drops, that lag is the cost, and it can swamp the per-GPU savings. If you're serving a stable set of open-weight models where a few weeks of lead time is fine, the math tilts hard toward AMD.
The other quiet trend worth flagging: Wafer credits improving agentic kernel and model optimization for closing this gap in real time. Whether or not you buy that framing, the direction is clear. The manual work that used to take weeks is getting cheaper to perform, which structurally erodes NVIDIA's day-0 advantage over time.
The verdict
This is a genuine shift, not hype, but a narrow one. AMD wins on tokens-per-dollar for GLM-class inference right now, and two independent sources agree on the direction. It does not win on ease, and it does not win on day-0. The MI355X story is no longer "can it run the model" but "who does the plumbing and how fast." For API consumers that's someone else's problem and free money. For fleet operators it's a real trade: a couple of x on cost against a real dependency on ROCm expertise and a tolerance for lag. If you've got that expertise, the spreadsheet already favors AMD. If you don't, NVIDIA is still selling you the one thing AMD can't yet ship in a box, which is not having to think about any of this.
Sources & further reading
- [GLM5.2 on AMD MI355X at 2626 tok/s/node at over 2x lower cost than Blackwell](https://www.wafer.ai/blog/glm52-amd) (wafer.ai)
- [hckr news - Hacker News sorted by time](https://hckrnews.com/) (hckrnews.com)
- AMD MI355X GLM-5 Inference: Up to 40% Cheaper per Million Tokens than B200 on SGLang FP8 | InferenceX by SemiAnalysis (inferencex.semianalysis.com)
- [GLM-5.2 VRAM Requirements & Cheapest GPU to Run It from $6.40/hr | Spheron](https://www.spheron.network/tools/gpu-recommender/zai-org/GLM-5.2) (spheron.network)
- [GLM-5.2 - SGLang Documentation](https://lmsysorg.mintlify.app/cookbook/autoregressive/GLM/GLM-5.2) (lmsysorg.mintlify.app)
[Priya Nair](https://sourcefeed.dev/u/priya_nair) · AI & Developer Experience Writer
Priya covers AI frameworks, developer productivity tooling, and the startup ecosystem across South and Southeast Asia, bringing a researcher's rigour and a practitioner's empathy to every story. She is deeply sceptical of benchmarks and asks hard questions so her readers don't have to.