Wafer pushes GLM-5.2 Fast onto AMD as Nvidia inference costs bite

Inference startup Wafer says it reached about 80% of its own measured Nvidia B200 performance on AMD MI355X GPUs running GLM-5.2, on hardware it describes as over 2x cheaper, by tuning the software stack. The vendor-published benchmark shows 2626 tokens per second per node at 2.4 requests per second, and the model is now available via Vercel and OpenRouter. Wafer's thesis is that AMD's inference gap with Nvidia is primarily a software problem, and the startup aims to win customers by offering a cheaper alternative for long-context AI workloads.

Wafer (https://www.wafer.ai/?ref=runtimewire), the inference startup founded by Steven Arellano (@gpusteve, https://x.com/gpusteve?ref=runtimewire) and Emilio Andere (@gpuemi, https://x.com/gpuemi?ref=runtimewire), published a new engineering post (https://www.wafer.ai/blog/glm52-amd?ref=runtimewire) arguing that AMD's MI355X can carry frontier open-model inference close to Nvidia Blackwell performance if the software stack is tuned hard enough.

The claim is narrow, useful and vendor-published: on GLM-5.2, Wafer says it hit 2626 aggregate tokens per second per node at 2.4 requests per second on a 20,000 input token, 1,000 output token workload with a 60% cache hit rate. Wafer says that operating point held a 0.81-second p50 time to first token, a 2.22-second p95 time to first token and a 100% success rate. Wafer also says the run reached about 80% of its own measured B200 performance while using GPUs it describes as over 2x cheaper.

The benchmark is not a neutral MLPerf-style result. Wafer chose the workload, published the comparison and set the cost assumptions. The number still matters because it is attached to distribution rather than left as a lab note: Vercel (https://vercel.com/changelog/glm-5-2-fast-via-wafer-now-available-on-ai-gateway?ref=runtimewire) made GLM 5.2 Fast via Wafer available on AI Gateway, and OpenRouter (https://openrouter.ai/provider/wafer?ref=runtimewire) lists Wafer as a provider. Vercel says its own benchmarking found Wafer delivered 2x higher throughput than other serverless GLM-5.2 providers, with 170-plus tok/s in small-context tests and 200-plus tok/s in large-context tests.

For Arellano and Andere, the AMD post is the cleanest expression yet of Wafer's founding thesis. In Wafer's April seed-round note, the founders outlined complementary backgrounds: Arellano has worked on high-performance computing and AI infrastructure at Two Sigma, Google, and Sei Labs; Andere did ML security research at the University of Chicago and trained weather models at Argonne National Laboratory.
Wafer raised $4 million in that round, with backers including Fifty Years (https://fiftyyears.com/?ref=runtimewire), Y Combinator (https://www.ycombinator.com/?ref=runtimewire), Liquid 2 (https://www.liquid2.vc/?ref=runtimewire), and NVIDIA Inception (https://www.nvidia.com/en-us/startups/inception/?ref=runtimewire).

Emilio Andere on X: https://x.com/gpuemi/status/2072922100650901831?ref=runtimewire

The AMD bet

Wafer's post makes the case that AMD's gap with Nvidia in inference is increasingly a software and support problem. Wafer says MI355X GPUs are around 2.75x cheaper per GPU on average than B300s, with comparable hardware specs, while Nvidia's day-zero model support and CUDA software advantage usually let providers serve new models faster and with less friction.

That is the economic opening Wafer is trying to sell. Demand for inference capacity keeps moving toward long-context coding agents, tool-calling systems and batch workloads that burn through tokens. If every new model launch turns into a race for Blackwell capacity, providers with access to cheaper accelerators and enough systems talent can win accounts on price and speed.

Wafer's homepage positions the company around serverless and dedicated inference for open-source LLMs. It lists GLM-5.2-Fast at $3.00 per million input tokens, $10.25 per million output tokens and $0.50 per million cached tokens. It also lists a lower-priced GLM-5.2 tier at $1.20 per million input tokens, $4.10 per million output tokens and $0.20 per million cached tokens. That spread shows the commercial shape of the product: Wafer can sell a fast tier to latency-sensitive users while keeping a cheaper tier for developers and agents that optimize for cost.

What Wafer actually changed

The engineering path was less glamorous than the headline number. Wafer says it quantized the bf16 GLM-5.2 model to MXFP4 using AMD Quark, then compared vLLM, ATOM and sglang as serving options.
Wafer picked sglang because vLLM lacked a working MXFP4 plus GlmMoeDsa path for the weights and ATOM degraded at long context, according to Wafer's post.

Speculative decoding then required two framework fixes on ROCm. Wafer says one issue came from a mismatch in how the MTP head's bf16 shared expert was named, causing sglang's quantization lookup to treat it as MXFP4 and crash on load. Wafer says copying the layer entries under the decoder name used by sglang unblocked speculative decoding and gave close to a 3x gain in single-stream throughput. A second issue came from a fused multi-step metadata kernel that included cuda_runtime.h without a ROCm guard; Wafer describes the fix as a single #ifdef USE_ROCM guard.

Those fixes got Wafer to 213 tok/s single stream on a 10,000 input token, 1,500 output token GLM-5.2 workload using AMD MI355X capacity from TensorWave, following Artificial Analysis methodology (https://artificialanalysis.ai/methodology/performance-benchmarking?ref=runtimewire).

For the aggregate workload, Wafer says the bottleneck shifted to prefill. TP8, the single-stream configuration, reached 1461 tok/s/node; TP4xDP2 reached 1944 tok/s/node at 2.0 requests per second. Wafer says tuning MoE kernel selection for GLM's fp4 shapes pushed the result to 2626 tok/s/node at 2.4 requests per second. Wafer emphasizes that the gains came from framework fixes, quantization choices, and kernel-selection tuning.

Distribution is the second half of the story

Wafer's benchmark would have limited force without Vercel and OpenRouter. Vercel's AI Gateway gives developers a single place to call models, track usage and cost, and configure retries and failover. Vercel says users can call GLM 5.2 Fast by setting the model to zai/glm-5.2-fast in the AI SDK. That is valuable distribution for Wafer because inference buyers rarely want another undifferentiated endpoint.
They want the fastest provider for a model at the moment they route traffic, with fallbacks when speed or availability breaks. OpenRouter plays a similar routing role for developers who switch among model providers. Wafer's commercial opportunity is to become the fast path inside those routers, especially for open models where performance and price move every few weeks.

There is still a hard ceiling on what can be concluded from Wafer's post. The cost comparison depends on GPU pricing assumptions that Wafer did not fully disclose. The B200 comparison is Wafer's own measured reference point. The benchmark is single-node, and Wafer explicitly says the study does not cover multi-node performance. Production buyers will also care about queueing behavior, regional capacity, reliability under burst traffic and quality differences introduced by quantization.

Even with those limits, the Wafer post is a useful marker for the inference market. The dominant assumption has been that Nvidia wins because it combines hardware, CUDA, libraries and early model support. Wafer is betting that a small systems team can chip away at that advantage one model launch at a time, using AMD capacity where the price-performance gap is large enough to reward the work. GLM-5.2 Fast is the current proof point. The larger question for Wafer is how often six people can repeat that cycle before the next model, next framework bug and next hardware target arrive.
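
To make the pricing spread concrete, the sketch below applies Wafer's listed per-million-token prices to the benchmark's request shape (20,000 input tokens, 1,000 output tokens, 60% cache hit rate). It is an illustration only: it assumes that cache-hit input tokens are billed at the cached rate and all other input tokens at the input rate, which Wafer's post does not spell out, and the names PRICES and cost_per_request are hypothetical.

```python
# Listed prices in dollars per million tokens, as reported in the article.
PRICES = {
    "GLM-5.2-Fast": {"input": 3.00, "output": 10.25, "cached": 0.50},
    "GLM-5.2": {"input": 1.20, "output": 4.10, "cached": 0.20},
}


def cost_per_request(tier, input_tokens=20_000, output_tokens=1_000,
                     cache_hit_rate=0.60):
    """Estimate the dollar cost of one request on the given tier.

    Assumption (not from Wafer): cached input tokens are billed at the
    cached rate, and the remaining input tokens at the full input rate.
    """
    price = PRICES[tier]
    cached_tokens = input_tokens * cache_hit_rate
    uncached_tokens = input_tokens - cached_tokens
    total = (
        uncached_tokens * price["input"]
        + cached_tokens * price["cached"]
        + output_tokens * price["output"]
    )
    return total / 1_000_000  # prices are quoted per million tokens


fast = cost_per_request("GLM-5.2-Fast")
standard = cost_per_request("GLM-5.2")

print(f"GLM-5.2-Fast: ${fast:.5f} per request")
print(f"GLM-5.2:      ${standard:.5f} per request")
print(f"Fast tier costs {fast / standard:.2f}x the standard tier")
```

Under those assumptions the fast tier comes to roughly 4 cents per request and the standard tier to about 1.6 cents, a 2.5x premium for speed. Changing cache_hit_rate or the token counts shows how sensitive that gap is to workload shape.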