Jian Chen (@jianchen1799), Yesheng Liang, and Zhijian Liu (@zhijianliu) have pushed DFlash, Z Lab's block diffusion speculative decoding method, into SGLang through a collaboration with Modal and LMSYS, with the teams reporting up to 4.31x higher throughput than a non-speculative baseline on Qwen 3.5 397B A17B. The June 15 release is not another model launch dressed up as infrastructure. It is a serving-stack bet: if open-weight frontier models keep getting larger, the bottleneck shifts from ...
Supercharging LLM inference on Google TPUs: Achieving 3X speedups with diffusion-style speculative decoding