{"slug": "sglang-adds-dflash-to-push-qwen-3-5-397b-a17b-inference-up-to-4-3x-faster", "title": "SGLang adds DFlash to push Qwen 3.5 397B-A17B inference up to 4.3x faster", "summary": "Jian Chen, Yesheng Liang, and Zhijian Liu integrated Z Lab's DFlash block diffusion speculative decoding method into SGLang, collaborating with Modal and LMSYS. The team reports up to 4.31x higher throughput on Qwen 3.5 397B A17B inference compared to a non-speculative baseline, marking a significant infrastructure advance for large model serving.", "body_md": "Jian Chen (@jianchen1799), Yesheng Liang and Zhijian Liu (@zhijianliu ) have pushed DFlash, Z Lab's block diffusion speculative decoding method, into SGLang through a collaboration with Modal and LMSYS, with the teams reporting up to 4.31x higher throughput over a non-speculative baseline on Qwen 3.5 397B A17B. The June 15 release is not another model launch dressed up as infrastructure. It is a serving stack bet: if open-weight frontier models keep getting larger, the bottleneck shifts from ...", "url": "https://wpnews.pro/news/sglang-adds-dflash-to-push-qwen-3-5-397b-a17b-inference-up-to-4-3x-faster", "canonical_source": "https://runtimewire.com/article/sglang-dflash-qwen-397b-inference-throughput", "published_at": "2026-06-15 17:54:35+00:00", "updated_at": "2026-06-15 18:12:29.820048+00:00", "lang": "en", "topics": ["large-language-models", "ai-infrastructure", "ai-research"], "entities": ["SGLang", "DFlash", "Z Lab", "Modal", "LMSYS", "Qwen 3.5", "Jian Chen", "Zhijian Liu"], "alternates": {"html": "https://wpnews.pro/news/sglang-adds-dflash-to-push-qwen-3-5-397b-a17b-inference-up-to-4-3x-faster", "markdown": "https://wpnews.pro/news/sglang-adds-dflash-to-push-qwen-3-5-397b-a17b-inference-up-to-4-3x-faster.md", "text": "https://wpnews.pro/news/sglang-adds-dflash-to-push-qwen-3-5-397b-a17b-inference-up-to-4-3x-faster.txt", "jsonld": 
"https://wpnews.pro/news/sglang-adds-dflash-to-push-qwen-3-5-397b-a17b-inference-up-to-4-3x-faster.jsonld"}}