AMD's GLM-5.2 win over Blackwell is a software story AMD's Instinct MI355X achieves roughly 80% of NVIDIA B200 throughput on GLM-5.2 inference at over 2x lower cost, delivering more than 2x better tokens-per-dollar after hand-tuning the ROCm software stack. Independent tests from Wafer and SemiAnalysis confirm the economic advantage, though achieving parity required manual optimization of quantization and kernel integration. AI https://sourcefeed.dev/c/ai Article AMD's GLM-5.2 win over Blackwell is a software story MI355X hits about 80% of B200 throughput at over 2x lower cost, but only after hand-tuning the ROCm stack. Priya Nair https://sourcefeed.dev/u/priya nair AMD's Instinct MI355X https://rocm.docs.amd.com has never had a silicon problem. On paper it trades blows with NVIDIA's Blackwell parts, and it's cheaper per GPU. The problem has always been the week or three of engineering between "the frontier model shipped" and "the frontier model runs well on ROCm." NVIDIA ships day-0 kernels and recipes; AMD tends to arrive later, by which point the next model has dropped and the catch-up clock resets. Two independent data points now suggest that gap has narrowed to something a competent inference team can close by hand, and that when they do, the economics flip in AMD's favor. That's the real story here, and it matters more for procurement than any single throughput number. The numbers, and who's saying them The headline comes from Wafer https://www.wafer.ai , which ran GLM-5.2 https://huggingface.co/zai-org/GLM-5.2 on MI355X capacity from TensorWave and hit an aggregate 2,626 tok/s/node at 2.4 requests per second on a 20k-in / 1k-out workload with a 60% cache-hit rate, holding time-to-first-token under a 5s knee. Their own comparison puts that at roughly 80% of what they measured on a B200 which topped out around 3,192 tok/s/node at 3.0 RPS . Same team also reports 213 tok/s single-stream on a 10k/1.5k workload following Artificial Analysis conventions. Here's the ramp, which tells you more than the peak does: | Sustained RPS | Aggregate tok/s/node | TTFT p50 / p95 | |---|---|---| | 0.5 | 449 | 0.59s / 0.60s | | 1.0 | 974 | 0.60s / 0.81s | | 1.5 | 1,913 | 0.62s / 1.03s | | 2.0 | 1,944 | 0.62s / 1.05s | | 2.25 | 2,089 | 0.63s / 1.23s | | 2.4 saturation | 2,626 | 0.81s / 2.22s | The interesting bit is what that 20% throughput deficit buys you back. Wafer pegs MI355X at about 2.75x cheaper per GPU than a B300 with comparable specs. Do the arithmetic: ~80% of the performance at ~2.75x lower hardware cost lands you north of 2x better tokens-per-dollar. That's the whole thesis in one line. And it isn't just a vendor with AMD capacity talking its book. SemiAnalysis, running GLM-5 the prior release in FP8 on SGLang, found MI355X undercutting a B200 on cost per million tokens across most of the single-node Pareto frontier, with a peak gap of 1.41x about a 40% reduction at 18 tokens/sec/user. Their TCO model puts MI355X at $1.48/GPU/hr against B200 at $1.95. Different model, different precision, different methodology, same direction. When a vendor and an independent shop reach the same conclusion by different roads, the conclusion is worth taking seriously. One honest caveat on the comparison: the perf number is measured against a B200, while the price advantage is quoted against a B300. Not strictly apples-to-apples, and both figures are self-reported by parties with a point to prove. Treat the exact ratio as directional, not gospel. The catch is spelled R-O-C-m Getting there was not push-button, and the details are the actual lesson. GLM-5.2 is a ~753B-parameter sparse MoE 256 experts, top-8 routing, ~40B activated built on the glm moe dsa architecture, which pairs DeepSeek Sparse Attention with MLA-style KV compression. It's a big, awkward model with a built-in MTP head for speculative decoding. Making it fly on AMD took a chain of unglamorous fixes. Start with quantization. Wafer took the BF16 weights down to MXFP4 using AMD's Quark toolkit, and their evals show it holding accuracy against Z.ai's official FP8: GSM8K 0.955 vs 0.965, GPQA-Diamond 0.9026 vs 0.9217, tau2 macro actually up at 0.834 vs 0.819. Call it lossless within noise. Worth noting the format divergence: AMD's path is MXFP4 open microscaling , while NVIDIA publishes an NVFP4 build for Blackwell that quantizes only the MoE expert linears. Both are 4-bit, neither is portable across the other's stack. Framework selection was a process of elimination. SGLang https://github.com/sgl-project/sglang won because vLLM had no working MXFP4 + GlmMoeDsa path so the 4-bit weights bought nothing and ATOM degraded at long context. Even on SGLang, the ROCm image fought back: - Speculative decode crashed on load because the MTP head's BF16 shared expert is registered under a different module prefix model.decoder. than the decoder stack Quark labeled model.layers.78.mlp.shared experts. . The quant lookup missed, tried to jam a full-width BF16 tensor into a 4-bit slot, and died on a shape mismatch. The fix was copying the layer-78 skip entries under the prefix SGLang actually uses. That alone netted close to 3x on single-stream throughput. - Deep spec decode the 5/1/6 config Z.ai suggests was blocked because a fused multi-step metadata kernel did include