Running GLM-5.2 (753B DeepSeek-Sparse-Attention MoE) on 8x A100 80GB with vLLM — TRITON_MLA_SPARSE backend (PR #38476), no-recompile install, benchmarks

A developer confirmed that GLM-5.2, a 753B-parameter DeepSeek-Sparse-Attention MoE model, runs on 8x A100 80GB GPUs using vLLM PR #38476, which adds a Triton sparse-MLA backend for Ampere architectures. The setup achieved ~56 tok/s single-stream and ~625 tok/s aggregate decode (32-way) with AWQ-INT4 quantization, requiring only a Python-only cherry-pick without CUDA recompilation.

TL;DR. GLM-5.2 (`glm_moe_dsa`, DeepSeek Sparse Attention) does not run on Ampere (A100, sm_80) with stock vLLM: the sparse-MLA attention backend (`FLASHMLA_SPARSE`) and the lightning-indexer's `fp8_mqa_logits` (DeepGEMM) are Hopper/Blackwell-only. vLLM PR #38476 (issue #38006, https://github.com/vllm-project/vllm/issues/38006) adds a Triton sparse-MLA backend (`TRITON_MLA_SPARSE`) plus a bf16 Triton indexer fallback that run on Ampere. Cherry-picking it onto current `main` is a Python-only change, with no CUDA recompile. Result: GLM-5.2 AWQ-INT4 serves on 8× A100 at ~56 tok/s single-stream and ~625 tok/s aggregate decode (32-way), with coherent output.

This is an independent 8× A100 confirmation of PR #38476 (the author validated on 32× A100), plus a no-recompile install note. Credit to @haosdent for the PR.

Requirements:

- 8× A100 80GB (sm_80). ~410 GiB VRAM used at TP=8, so all 8 GPUs.
- A recent vLLM `main` (≈ 0.23.1rc1 era), torch + triton matching that build (this was tested with torch 2.11 / triton 3.6), and `uv` (or `pip`).
- ~440 GB free disk for the weights.

Get the weights and the PR:

```bash
hf download cyankiwi/GLM-5.2-AWQ-INT4   # ~440 GB, compressed-tensors INT4 (Marlin)

git clone https://github.com/vllm-project/vllm && cd vllm

# Simplest: check out the PR branch directly (gh fetches it for you)
gh pr checkout 38476

# Or, to put it on top of current main (what we did): the PR commit is NOT on main,
# so fetch it explicitly first; otherwise the cherry-pick errors with "fatal: bad revision".
git rev-parse --is-shallow-repository | grep -q true && git fetch --unshallow origin
git fetch origin pull/38476/head:pr-38476
git cherry-pick pr-38476   # then resolve the one conflict below
```

Notes when cherry-picking onto recent `main`:

- One conflict, in `vllm/model_executor/layers/sparse_attn_indexer.py`. `main` added an XPU dispatch branch that the PR's base lacked. Resolve to a three-way dispatch: `if is_xpu` → `elif use_deep_gemm` → `else <triton fallback>`.
- In the indexer `__init__`, replace `main`'s hard `RuntimeError` when DeepGEMM is missing with the PR's warn-and-fallback on `not is_deep_gemm_supported()`. That is what routes Ampere to the Triton path instead of aborting.
- Drop the PR's `csrc/.../fp4/nvfp4_quant_entry.cu` stub (an SM100/MXFP4 link shim made obsolete by an upstream refactor). Removing it keeps the changeset Python + docs only (~10 files): the new `TRITON_MLA_SPARSE` backend, `ops/mqa_logits_triton.py`, `ops/triton_mla_sparse_kernel.py`, and registration in `attention/backends/registry.py` + `platforms/cuda.py`.

Because the changeset is Python-only, reuse vLLM's precompiled extension instead of building:

```bash
uv venv --python 3.12 .venv-glm52 && source .venv-glm52/bin/activate
# install torch/triton matching your vLLM build, then:
VLLM_USE_PRECOMPILED=1 uv pip install -e .   # editable install over the prebuilt wheel, no nvcc build
```

Sanity check on an A100:

```python
import vllm; print(vllm.__version__)
from vllm.platforms import current_platform
from vllm.utils.deep_gemm import is_deep_gemm_supported
print(is_deep_gemm_supported())  # -> False on A100 (so the Triton fallback is used)
```

Launch the server:

```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7   # all 8 GPUs: the 410 GB INT4 weights need TP=8

# --kv-cache-dtype auto gives bf16 KV; do NOT use fp8 KV on Ampere
VLLM_ATTENTION_BACKEND=TRITON_MLA_SPARSE \
vllm serve cyankiwi/GLM-5.2-AWQ-INT4 \
  --tensor-parallel-size 8 \
  --no-async-scheduling \
  --gpu-memory-utilization 0.90 \
  --max-model-len 32768 \
  --trust-remote-code \
  --kv-cache-dtype auto \
  --port 8000
```

Confirm in the startup log (this is the proof it's on the sparse path, not a dense fallback):

```
[cuda.py] Using TRITON_MLA_SPARSE attention backend out of potential backends: ['TRITON_MLA_SPARSE']
[sparse_attn_indexer.py] DeepGEMM not supported on this platform; using Triton fallback for sparse attention indexer
```

Send a test request:

```bash
curl -s localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d '{
  "model": "cyankiwi/GLM-5.2-AWQ-INT4",
  "messages": [{"role":"user","content":"What is 17 multiplied by 24? Explain briefly."}],
  "max_tokens": 300}'
```

→ coherent reasoning ending in 408. GLM-5.2 is a reasoning model: it emits `<think>` chain-of-thought, then the final answer.

Benchmarks:

| Concurrency | Aggregate decode tok/s | Total tok/s (prompt+gen) |
|---|---|---|
| 1 | 56 | 282 |
| 4 | 186 | 928 |
| 8 | 335 | 1,674 |
| 16 | 548 | 2,740 |
| 32 | 625 | 3,127 |

Single-stream decode is ~56 tok/s; aggregate decode scales to ~625 tok/s. For reference, llama.cpp serving the GGUF Q4 of the same model (it also implements `glm-dsa`) tops out at ~70 tok/s aggregate decode, so vLLM is ~2.3× single-stream and ~9× in aggregate.

Caveats:

- Cold start is ~7 min (410 GB weights + CUDA-graph capture).
- All 8 GPUs must be visible (`CUDA_VISIBLE_DEVICES`): TP=8 is required for the 410 GB weights.
- `Unknown vLLM environment variable detected: VLLM_ATTENTION_BACKEND` is benign: the variable is still honored (the backend is also auto-selected on sm_80).
- sm_80 disables SymmMem → falls back to CUSTOM/PyNccl all-reduce (benign).
- One-time Triton JIT compile on the first request, so send a warm-up request.
- Use bf16 KV cache (`--kv-cache-dtype auto`); fp8 KV is not supported on Ampere here.
- Prompts longer than `--max-model-len` (32k above) need a larger window (which costs concurrency, since MLA KV grows with context) or chunking.
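The three-way indexer dispatch described in the cherry-pick notes can be sketched as plain control flow. This is only an illustration of the branch order (XPU first, then DeepGEMM, then the Triton fallback): `select_indexer_path` and its string labels are hypothetical names made up for this sketch, not vLLM's actual code, where each branch would call the corresponding kernel.

```python
def select_indexer_path(is_xpu: bool, use_deep_gemm: bool) -> str:
    """Return a label for the logits path the sparse-attention indexer takes.

    Hypothetical helper for illustration: each branch returns a label
    instead of launching a kernel, so the ordering is easy to inspect.
    """
    if is_xpu:
        # Branch that current main added and the PR's base lacked.
        return "xpu"
    elif use_deep_gemm:
        # Hopper/Blackwell path (DeepGEMM available).
        return "deep_gemm"
    else:
        # Ampere ends up here because DeepGEMM is not supported on sm_80.
        return "triton_fallback"


# On an A100 DeepGEMM is unsupported, so use_deep_gemm is False.
print(select_indexer_path(is_xpu=False, use_deep_gemm=False))
```

Keeping the Triton path as the final `else` means any platform without XPU or DeepGEMM support falls through to it rather than raising.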
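Since the first request pays a one-time Triton JIT compile, a small warm-up client can be handy. The sketch below uses only the Python standard library and assumes the server from this post is listening on `localhost:8000`; the helper names (`build_chat_request`, `warm_up`), the `"Hello"` prompt, and the 600-second timeout are arbitrary choices for illustration.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"      # port from the serve command
MODEL = "cyankiwi/GLM-5.2-AWQ-INT4"


def build_chat_request(prompt: str, max_tokens: int = 16) -> urllib.request.Request:
    """Build a POST to the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def warm_up(timeout_s: float = 600.0) -> dict:
    """Send one short request so the JIT compile happens before real traffic.

    The generous timeout is an assumption; adjust it to your cold-start time.
    """
    request = build_chat_request("Hello")
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.load(response)
```

With the server up, calling `warm_up()` once returns the parsed JSON response; the reply itself can be discarded.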
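To read the benchmark numbers as scaling behaviour, the short script below derives per-stream throughput and the fraction of ideal linear scaling at each concurrency level. It uses only the figures reported in this post; the function name `scaling_table` is a hypothetical helper for this sketch.

```python
# (concurrency, aggregate decode tok/s) as reported in the benchmark table
RESULTS = [(1, 56), (4, 186), (8, 335), (16, 548), (32, 625)]
LLAMA_CPP_AGGREGATE = 70  # approximate llama.cpp aggregate decode tok/s from the post


def scaling_table(results):
    """Return (concurrency, per-stream tok/s, fraction of linear scaling) rows."""
    single_stream = results[0][1]
    rows = []
    for concurrency, aggregate in results:
        per_stream = aggregate / concurrency
        efficiency = aggregate / (single_stream * concurrency)
        rows.append((concurrency, per_stream, efficiency))
    return rows


for concurrency, per_stream, efficiency in scaling_table(RESULTS):
    print(f"{concurrency:>2} streams: {per_stream:5.1f} tok/s per stream, "
          f"{efficiency:.0%} of linear scaling")

best_aggregate = max(aggregate for _, aggregate in RESULTS)
print(f"vLLM vs llama.cpp aggregate: {best_aggregate / LLAMA_CPP_AGGREGATE:.1f}x")
```

Per-stream throughput drops from 56 tok/s at concurrency 1 to about 19.5 tok/s at 32 (roughly 35% of linear scaling), and 625 / 70 ≈ 8.9 matches the "~9×" aggregate comparison.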