Running GLM-5.2 (753B DeepSeek-Sparse-Attention MoE) on 8x A100 80GB with vLLM: TRITON_MLA_SPARSE backend (PR #38476), no-recompile install, benchmarks

A developer confirmed that GLM-5.2, a 753B-parameter DeepSeek-Sparse-Attention MoE model, runs on 8x A100 80GB GPUs using vLLM PR #38476, which adds a Triton sparse-MLA backend for Ampere architectures. The setup achieved ~56 tok/s single-stream and ~625 tok/s aggregate decode (32-way) with AWQ-INT4 quantization, and needed only a Python-only cherry-pick with no CUDA recompilation.

TL;DR. GLM-5.2 (`glm_moe_dsa`, DeepSeek Sparse Attention) does not run on Ampere (A100, `sm_80`) with stock vLLM: the sparse-MLA attention backend (`FLASHMLA_SPARSE`) and the lightning-indexer's `fp8_mqa_logits` (DeepGEMM) are Hopper/Blackwell-only. vLLM PR #38476 (issue #38006, https://github.com/vllm-project/vllm/issues/38006) adds a Triton sparse-MLA backend (`TRITON_MLA_SPARSE`) plus a bf16 Triton indexer fallback, both of which run on Ampere. Cherry-picking it onto current main is a Python-only change, with no CUDA recompile. Result: GLM-5.2 AWQ-INT4 serves on 8× A100 at ~56 tok/s single-stream and ~625 tok/s aggregate decode (32-way), with coherent output.

This is an independent 8× A100 confirmation of PR #38476 (the author validated on 32× A100), plus a no-recompile install note. Credit to @haosdent for the PR.

- 8× A100 80GB (`sm_80`). ~410 GiB VRAM used at TP=8, so all 8 GPUs.
- A recent vLLM main (≈ 0.23.1rc1 era), torch + triton matching that build (this was tested with torch 2.11 / triton 3.6), and uv (or pip).
- ~440 GB free disk for the weights.

```bash
hf download cyankiwi/GLM-5.2-AWQ-INT4   # ~440 GB, compressed-tensors INT4 (Marlin)
```

```bash
git clone https://github.com/vllm-project/vllm && cd vllm

# Simplest: check out the PR branch directly (gh fetches it for you)
gh pr checkout 38476

# Or, to put it on top of current main (what we did): the PR commit is NOT on main,
# so fetch it explicitly first; otherwise the cherry-pick errors with "fatal: bad revision".
git rev-parse --is-shallow-repository | grep -q true && git fetch --unshallow origin
git fetch origin pull/38476/head:pr-38476
git cherry-pick pr-38476   # then resolve the one conflict below
```

Notes when cherry-picking onto recent main:

- One conflict, in `vllm/model_executor/layers/sparse_attn_indexer.py`. main added an XPU dispatch branch that the PR's base lacked. Resolve to a three-way dispatch: `if is_xpu` → `elif use_deep_gemm` → `else`
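
To make the shape of that conflict resolution concrete, here is a minimal Python sketch of a three-way dispatch. It is not the actual code from the indexer file: the function name `select_indexer_path` and the returned labels are hypothetical, and mapping the final `else` to the PR's bf16 Triton fallback is an assumption based on the description above. Only the branch order (XPU first, then DeepGEMM, then the fallback) comes from the post.

```python
# Illustrative only: the function name and return labels are hypothetical
# stand-ins, not the real contents of the vLLM indexer module.

def select_indexer_path(is_xpu: bool, use_deep_gemm: bool) -> str:
    """Return which indexer implementation a three-way dispatch would pick."""
    if is_xpu:
        # Branch that current main added for XPU devices.
        return "xpu"
    elif use_deep_gemm:
        # fp8 DeepGEMM path, which the post describes as Hopper/Blackwell-only.
        return "deep_gemm_fp8"
    else:
        # Assumed here to be the PR's bf16 Triton fallback (the Ampere path).
        return "triton_bf16"


if __name__ == "__main__":
    # An A100 is not an XPU and cannot use the DeepGEMM path,
    # so it falls through to the last branch.
    print(select_indexer_path(is_xpu=False, use_deep_gemm=False))
```

The point of keeping all three branches is that neither side of the conflict is dropped: main's XPU branch stays first, and the PR's fallback becomes the final `else` instead of replacing the existing two-way check.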