# Wafer pushes GLM-5.2 Fast onto AMD as Nvidia inference costs bite

> Source: <https://runtimewire.com/article/wafer-glm-5-2-fast-amd-mi355x-vercel-openrouter>
> Published: 2026-07-03 23:43:34+00:00

[Wafer](https://www.wafer.ai/?ref=runtimewire), the inference startup founded by [Steven Arellano (@gpusteve)](https://x.com/gpusteve?ref=runtimewire) and [Emilio Andere (@gpuemi)](https://x.com/gpuemi?ref=runtimewire), published a new [engineering post](https://www.wafer.ai/blog/glm52-amd?ref=runtimewire) arguing that AMD's MI355X can carry frontier open-model inference close to Nvidia Blackwell performance if the software stack is tuned hard enough.

The claim is narrow, useful and vendor-published. On GLM-5.2, Wafer says it hit 2626 aggregate tokens per second per node at 2.4 requests per second on a workload of 20,000 input tokens and 1,000 output tokens with a 60% cache hit rate. At that operating point, Wafer reports a 0.81-second p50 time to first token, a 2.22-second p95 time to first token and a 100% success rate. The company also says the run reached about 80% of the B200 performance it measured itself, on GPUs it describes as over 2x cheaper.

The benchmark is not a neutral MLPerf-style result. Wafer chose the workload, published the comparison and set the cost assumptions. The number still matters because it comes with distribution rather than sitting in a lab note: [Vercel](https://vercel.com/changelog/glm-5-2-fast-via-wafer-now-available-on-ai-gateway?ref=runtimewire) made GLM 5.2 Fast via Wafer available on AI Gateway, and [OpenRouter](https://openrouter.ai/provider/wafer?ref=runtimewire) lists Wafer as a provider. Vercel says its own benchmarking found that Wafer delivered 2x higher throughput than other serverless GLM-5.2 providers, with 170-plus tok/s in small-context tests and 200-plus tok/s in large-context tests.

For Arellano and Andere, the AMD post is the cleanest expression yet of Wafer's founding thesis. In Wafer's April seed-round note, the founders outlined complementary backgrounds: Arellano has worked on high-performance computing and AI infrastructure at Two Sigma, Google, and Sei Labs; Andere did ML security research at the University of Chicago and trained weather models at Argonne National Laboratory. Wafer raised 4 million dollars in that round, with backers including [Fifty Years](https://fiftyyears.com/?ref=runtimewire), [Y Combinator](https://www.ycombinator.com/?ref=runtimewire), [Liquid 2](https://www.liquid2.vc/?ref=runtimewire), and [NVIDIA Inception](https://www.nvidia.com/en-us/startups/inception/?ref=runtimewire).

[Emilio Andere on X](https://x.com/gpuemi/status/2072922100650901831?ref=runtimewire)

### The AMD bet

Wafer's post makes the case that AMD's gap with Nvidia in inference is increasingly a software and support problem. Wafer says MI355X GPUs are around 2.75x cheaper per GPU on average than B300s, with comparable hardware specs, while Nvidia's day-zero model support and CUDA software advantage usually let providers serve new models faster and with less friction.

That is the economic opening Wafer is trying to sell. Demand for inference capacity keeps moving toward long-context coding agents, tool-calling systems and batch workloads that burn through tokens. If every new model launch turns into a race for Blackwell capacity, providers with access to cheaper accelerators and enough systems talent can win accounts on price and speed.

Wafer's homepage positions the company around serverless and dedicated inference for open-source LLMs. It lists GLM-5.2-Fast at 3.00 dollars per million input tokens, 10.25 dollars per million output tokens and 0.50 dollars per million cached tokens, alongside a lower-priced GLM-5.2 tier at 1.20 dollars input, 4.10 dollars output and 0.20 dollars cached per million tokens. That spread shows the commercial shape of the product: Wafer can sell a fast tier to latency-sensitive users while keeping a cheaper tier for developers and agents that optimize for cost.
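As a rough illustration of what that spread means per request, the sketch below prices both tiers on the 20,000-input, 1,000-output, 60%-cache-hit mix from Wafer's benchmark. It assumes cached input tokens bill at the cached rate and the remaining input at the full input rate, which the listing does not spell out. `PRICES` and `request_cost` are illustrative names, not part of any Wafer API.

```python
# Listed prices in dollars per million tokens, as quoted above.
PRICES = {
    "GLM-5.2-Fast": {"input": 3.00, "output": 10.25, "cached": 0.50},
    "GLM-5.2": {"input": 1.20, "output": 4.10, "cached": 0.20},
}


def request_cost(tier, input_tokens, output_tokens, cache_hit_rate):
    """Estimate the cost in dollars of one request on a given tier.

    Assumption: cached input tokens bill at the cached rate and the rest
    of the input bills at the full input rate.
    """
    price = PRICES[tier]
    cached = input_tokens * cache_hit_rate
    fresh = input_tokens - cached
    total = (
        fresh * price["input"]
        + cached * price["cached"]
        + output_tokens * price["output"]
    )
    return total / 1_000_000


for tier in PRICES:
    cost = request_cost(tier, 20_000, 1_000, 0.60)
    print(f"{tier}: {cost:.4f} dollars per request")
```

Under those assumptions the Fast tier comes to about 0.040 dollars per request and the standard tier to about 0.016 dollars, a 2.5x premium that matches the ratio between the two price lists on every line item.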

### What Wafer actually changed

The engineering path was less glamorous than the headline number. Wafer says it quantized the bf16 GLM-5.2 model to MXFP4 using AMD Quark, then compared vLLM, ATOM and sglang as serving options. Wafer picked sglang because vLLM lacked a working MXFP4 plus GlmMoeDsa path for the weights and ATOM degraded at long context, according to Wafer's post.

Speculative decoding then required two framework fixes on ROCm. Wafer says one issue came from a mismatch in how the MTP head's bf16 shared expert was named, causing sglang's quantization lookup to treat it as MXFP4 and crash on load. Wafer says copying the layer entries under the decoder name used by sglang unblocked speculative decoding and gave close to a 3x gain in single-stream throughput. A second issue came from a fused multi-step metadata kernel that included `cuda_runtime.h` without a ROCm guard; Wafer describes the fix as one `#ifdef USE_ROCM` guard.
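The naming mismatch behind the first fix can be pictured with a toy lookup. This is a sketch of the general mechanism only, not sglang's loader or Wafer's patch: every layer name, prefix and function here is hypothetical. It assumes the loader treats any layer missing from a bf16 list as MXFP4.

```python
def resolve_dtype(layer_name, bf16_layers, default="mxfp4"):
    """Toy lookup: a layer is bf16 only if its exact name is listed."""
    return "bf16" if layer_name in bf16_layers else default


def add_aliases(bf16_layers, old_prefix, new_prefix):
    """Copy each entry under old_prefix so it is also listed under new_prefix."""
    aliased = set(bf16_layers)
    for name in bf16_layers:
        if name.startswith(old_prefix):
            aliased.add(new_prefix + name[len(old_prefix):])
    return aliased


# Hypothetical names: the checkpoint lists the layer one way,
# while the loader asks for it under a different prefix.
checkpoint_bf16 = {"mtp.layers.0.shared_expert"}
loader_name = "model.decoder.layers.0.shared_expert"

before = resolve_dtype(loader_name, checkpoint_bf16)
patched = add_aliases(checkpoint_bf16, "mtp.", "model.decoder.")
after = resolve_dtype(loader_name, patched)
print(before, "->", after)
```

In this toy version the lookup misses the layer before aliasing and falls back to the quantized default, which mirrors the failure Wafer describes. Once the entry also exists under the name the loader uses, the layer resolves as bf16.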

Those fixes got Wafer to 213 tok/s single stream on a 10,000 input token, 1,500 output token GLM-5.2 workload using AMD MI355X capacity from TensorWave, following [Artificial Analysis methodology](https://artificialanalysis.ai/methodology/performance-benchmarking?ref=runtimewire). For the aggregate workload, Wafer says the bottleneck shifted to prefill. TP8, the single-stream configuration, reached 1461 tok/s/node; TP4xDP2 reached 1944 tok/s/node at 2.0 requests per second. Wafer says tuning MoE kernel selection for GLM's fp4 shapes pushed the result to 2626 tok/s/node at 2.4 requests per second.
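To put those aggregate figures in proportion, the short sketch below computes the gain at each step and the cumulative gain over TP8. The stage labels are paraphrases of the article's description, and the comparison is approximate because the reported points were measured at different request rates.

```python
# Aggregate tok/s/node figures Wafer reports for each configuration.
STAGES = [
    ("TP8", 1461),
    ("TP4xDP2", 1944),
    ("after MoE kernel-selection tuning", 2626),
]


def relative_gains(stages):
    """Return (label, gain over previous stage, gain over first stage)."""
    baseline = stages[0][1]
    previous = baseline
    rows = []
    for label, tokens_per_second in stages:
        rows.append(
            (label, tokens_per_second / previous, tokens_per_second / baseline)
        )
        previous = tokens_per_second
    return rows


for label, step_gain, total_gain in relative_gains(STAGES):
    print(f"{label}: x{step_gain:.2f} vs previous, x{total_gain:.2f} vs TP8")
```

On these figures, moving from TP8 to TP4xDP2 is worth roughly 1.33x, the kernel-selection tuning roughly another 1.35x, and the two together about 1.8x over the TP8 starting point.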

Wafer emphasizes that the gains came from framework fixes, quantization choices, and kernel-selection tuning.

### Distribution is the second half of the story

Wafer's benchmark would have limited force without Vercel and OpenRouter. Vercel's AI Gateway gives developers a single place to call models, track usage and cost, and configure retries and failover. Vercel says users can call GLM 5.2 Fast by setting the model to `zai/glm-5.2-fast` in the AI SDK.

That is valuable distribution for Wafer because inference buyers rarely want another undifferentiated endpoint. They want the fastest provider for a model at the moment they route traffic, with fallbacks when speed or availability breaks. OpenRouter plays a similar routing role for developers who switch among model providers. Wafer's commercial opportunity is to become the fast path inside those routers, especially for open models where performance and price move every few weeks.

There is still a hard ceiling on what can be concluded from Wafer's post. The cost comparison depends on GPU pricing assumptions that Wafer did not fully disclose, and the B200 comparison uses Wafer's own measured reference point. The benchmark is single-node, and Wafer explicitly says the study does not cover multi-node performance. Production buyers will also care about queueing behavior, regional capacity, reliability under burst traffic and quality differences introduced by quantization.

Even with those limits, the Wafer post is a useful marker for the inference market. The dominant assumption has been that Nvidia wins because Nvidia combines hardware, CUDA, libraries and early model support. Wafer is betting that a small systems team can chip away at that advantage one model launch at a time, using AMD capacity where the price-performance gap is large enough to reward the work. GLM-5.2 Fast is the current proof point. The larger question for Wafer is how often six people can repeat that cycle before the next model, next framework bug and next hardware target arrive.
