TL;DR. GLM-5.2 (`glm_moe_dsa`, DeepSeek Sparse Attention) does not run on Ampere (A100, sm_80) with stock vLLM: the sparse-MLA attention backend (`FLASHMLA_SPARSE`) and the lightning-indexer's `fp8_mqa_logits` (DeepGEMM) are Hopper/Blackwell-only. vLLM PR #38476 (issue #38006) adds a Triton sparse-MLA backend (`TRITON_MLA_SPARSE`) plus a bf16 Triton indexer fallback, both of which run on Ampere. Cherry-picking it onto current `main` is a Python-only change, with no CUDA recompile. Result: GLM-5.2 AWQ-INT4 serves on 8× A100 at ~56 tok/s single-stream and ~625 tok/s aggregate decode (32-way), with coherent output.
This is an independent 8× A100 confirmation of PR #38476 (the author validated on 32× A100), plus a no-recompile install note. Credit to @haosdent for the PR.
- 8× A100 80GB (sm_80). ~410 GiB VRAM used at TP=8, so all 8 GPUs are needed.
- A recent vLLM `main` (≈ 0.23.1rc1 era), `torch` + `triton` matching that build (this was tested with torch 2.11 / triton 3.6), and `uv` (or pip).
- ~440 GB free disk for the weights.
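The "all 8 GPUs" requirement can be sanity-checked with a little arithmetic. The sketch below takes the ~410 GiB figure from this post as the total footprint and assumes 80 GiB per GPU with a 0.90 memory-utilization cap; restricting tensor parallelism to power-of-two sizes is a simplifying assumption of this example, not a rule stated by vLLM, and `smallest_tp` is a hypothetical helper.

```python
def smallest_tp(total_gib: float, per_gpu_gib: float = 80.0, util: float = 0.90) -> int:
    """Smallest power-of-two GPU count whose usable memory covers total_gib.

    Assumption: only power-of-two tensor-parallel sizes are considered.
    """
    tp = 1
    while tp * per_gpu_gib * util < total_gib:
        tp *= 2
    return tp


FOOTPRINT_GIB = 410.0  # figure quoted in this post

tp = smallest_tp(FOOTPRINT_GIB)
print(f"TP={tp}, about {FOOTPRINT_GIB / tp:.2f} GiB per GPU")
```

Under these assumptions four GPUs offer 288 GiB of usable memory, which is short of 410 GiB, so eight is the first size that fits.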
hf download cyankiwi/GLM-5.2-AWQ-INT4 # ~440 GB, compressed-tensors INT4 (Marlin)
git clone https://github.com/vllm-project/vllm && cd vllm
gh pr checkout 38476
Notes when cherry-picking onto recent `main`:
- One conflict, in `vllm/model_executor/layers/sparse_attn_indexer.py`. `main` added an XPU dispatch branch the PR's base lacked. Resolve to a three-way dispatch: `if is_xpu() → elif use_deep_gemm → else <triton fallback>`. In the indexer `__init__`, replace `main`'s hard `RuntimeError` (when DeepGEMM is missing) with the PR's warn-and-fallback on `not is_deep_gemm_supported()`; that's what routes Ampere to the Triton path instead of aborting.
- Drop the PR's `csrc/.../fp4/nvfp4_quant_entry.cu` stub (a SM100/MXFP4 link shim made obsolete by an upstream refactor). Removing it keeps the changeset Python + docs only (~10 files): the new `TRITON_MLA_SPARSE` backend, `ops/mqa_logits_triton.py`, `ops/triton_mla_sparse_kernel.py`, and registration in `attention/backends/registry.py` + `platforms/cuda.py`.
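The resolved three-way dispatch can be pictured with a small standalone sketch. This is not the code from the PR or from vLLM: `pick_indexer_path` and its return labels are hypothetical, and the two booleans stand in for whatever `is_xpu()` and `is_deep_gemm_supported()` report on a given machine.

```python
import logging

logger = logging.getLogger("indexer_dispatch_sketch")


def pick_indexer_path(is_xpu: bool, deep_gemm_supported: bool) -> str:
    """Illustrative three-way dispatch: XPU, then DeepGEMM, then Triton fallback."""
    if is_xpu:
        return "xpu"
    if deep_gemm_supported:
        return "deep_gemm"
    # Warn and fall back instead of raising, so Ampere keeps going.
    logger.warning("DeepGEMM not supported on this platform; using Triton fallback")
    return "triton_fallback"


# An A100 is not an XPU and has no DeepGEMM support.
print(pick_indexer_path(is_xpu=False, deep_gemm_supported=False))
```

The point of the ordering is that the missing-DeepGEMM case becomes an ordinary branch with a warning rather than a hard error.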
Because the changeset is Python-only, reuse vLLM's precompiled extension instead of building:
uv venv --python 3.12 .venv-glm52 && source .venv-glm52/bin/activate
VLLM_USE_PRECOMPILED=1 uv pip install -e . # editable install over the prebuilt wheel — no nvcc build
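Before relying on the precompiled extension, it may be worth checking that the resolved changeset really contains no native sources. A minimal sketch, assuming you feed it the file list from something like `git diff --name-only` against your base; the suffix list and the `native_files` helper are assumptions of this example, not part of vLLM's tooling.

```python
from pathlib import PurePosixPath

# File types that would normally force a native rebuild (assumed list).
NATIVE_SUFFIXES = {".cu", ".cuh", ".cpp", ".cc", ".c", ".h", ".hpp", ".cmake"}
NATIVE_NAMES = {"CMakeLists.txt"}


def native_files(changed_paths):
    """Return the changed paths that look like compiled sources or build files."""
    flagged = []
    for raw in changed_paths:
        path = PurePosixPath(raw)
        if path.suffix in NATIVE_SUFFIXES or path.name in NATIVE_NAMES:
            flagged.append(raw)
    return flagged


changed = [
    "vllm/model_executor/layers/sparse_attn_indexer.py",
    "csrc/.../fp4/nvfp4_quant_entry.cu",  # the stub this post suggests dropping
]
print(native_files(changed))
```

An empty result suggests the precompiled route is safe; any flagged file means the stub (or another native change) is still in the tree.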
Sanity check on an A100:
import vllm; print(vllm.__version__)
from vllm.platforms import current_platform
from vllm.utils.deep_gemm import is_deep_gemm_supported
print(is_deep_gemm_supported()) # -> False on A100 (so the Triton fallback is used)
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 # all 8 GPUs: the 410 GB INT4 weights need TP=8
# --kv-cache-dtype auto gives a bf16 KV cache; do NOT use fp8 KV on Ampere
VLLM_ATTENTION_BACKEND=TRITON_MLA_SPARSE \
vllm serve cyankiwi/GLM-5.2-AWQ-INT4 \
--tensor-parallel-size 8 \
--no-async-scheduling \
--gpu-memory-utilization 0.90 \
--max-model-len 32768 \
--trust-remote-code \
--kv-cache-dtype auto \
--port 8000
Confirm in the startup log (this is the proof it's on the sparse path, not a dense fallback):
[cuda.py] Using TRITON_MLA_SPARSE attention backend out of potential backends: ['TRITON_MLA_SPARSE']
[sparse_attn_indexer.py] DeepGEMM not supported on this platform; using Triton fallback for sparse attention indexer
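If the server output is captured to a file, the check for those two lines can be automated. A minimal sketch: the marker substrings are copied from the log lines above, while `missing_markers` and the `serve.log` filename mentioned below are hypothetical.

```python
# Substrings taken from the two startup log lines quoted above.
EXPECTED_MARKERS = (
    "Using TRITON_MLA_SPARSE attention backend",
    "using Triton fallback for sparse attention indexer",
)


def missing_markers(log_text: str) -> list[str]:
    """Return the expected markers that do not appear anywhere in the log text."""
    return [marker for marker in EXPECTED_MARKERS if marker not in log_text]


sample = (
    "[cuda.py] Using TRITON_MLA_SPARSE attention backend out of potential backends\n"
    "[sparse_attn_indexer.py] DeepGEMM not supported on this platform; "
    "using Triton fallback for sparse attention indexer\n"
)
print(missing_markers(sample))
```

In practice you would pass the contents of your own log, for example `missing_markers(open("serve.log").read())`, and treat a non-empty result as a sign the sparse path was not selected.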
curl -s localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d '{
"model": "cyankiwi/GLM-5.2-AWQ-INT4",
"messages": [{"role":"user","content":"What is 17 multiplied by 24? Explain briefly."}],
"max_tokens": 300}'
→ coherent reasoning ending in 408. (GLM-5.2 is a reasoning model: it emits `<think>` chain-of-thought, then the final answer.)
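For scripted checks it helps to separate the reasoning from the final answer. A minimal sketch, assuming the reasoning arrives inline in the message content wrapped in `<think>` and a matching `</think>` tag; depending on how the server is configured it may instead be returned in a separate field, in which case this helper is unnecessary. `split_reasoning` is a hypothetical name.

```python
import re

# Assumes a single <think>...</think> span inside the message content.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Split model output into (reasoning, answer); reasoning is '' if no tags."""
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


reasoning, answer = split_reasoning("<think>17 * 24 = 340 + 68</think>The answer is 408.")
print(answer)
```

A smoke test can then assert that `"408"` appears in the answer part rather than somewhere in the chain-of-thought.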
| Concurrency | Aggregate decode tok/s | Total tok/s (prompt+gen) |
|---|---|---|
| 1 | 56 | 282 |
| 4 | 186 | 928 |
| 8 | 335 | 1,674 |
| 16 | 548 | 2,740 |
| 32 | 625 | 3,127 |
Single-stream decode is ~56 tok/s; aggregate decode scales to ~625 tok/s at 32-way concurrency. For reference, llama.cpp serving the GGUF Q4 of the same model (it also implements `glm-dsa`) tops out at ~70 tok/s aggregate decode, so vLLM is ~2.3× faster single-stream and ~9× faster in aggregate. Cold start is ~7 min (410 GB weights + CUDA-graph capture).
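To see how far the scaling is from linear, the table's decode column can be turned into per-stream throughput and a scaling efficiency relative to the single-stream rate. The numbers are copied from the table; `scaling` is a hypothetical helper for this illustration.

```python
# (concurrency, aggregate decode tok/s) copied from the table above.
ROWS = [(1, 56), (4, 186), (8, 335), (16, 548), (32, 625)]


def scaling(rows):
    """Per-stream decode rate and efficiency versus perfect linear scaling."""
    base_conc, base_decode = rows[0]
    per_stream_base = base_decode / base_conc
    results = []
    for conc, decode in rows:
        results.append({
            "concurrency": conc,
            "per_stream": decode / conc,
            "efficiency": decode / (per_stream_base * conc),
        })
    return results


for row in scaling(ROWS):
    print(f"{row['concurrency']:>2}-way: {row['per_stream']:5.1f} tok/s per stream, "
          f"{row['efficiency']:.0%} of linear")
```

Per-stream decode falls from 56 tok/s to roughly 19.5 tok/s at 32-way, and doubling concurrency from 16 to 32 adds only about 14% aggregate throughput, which suggests the setup is close to saturation there.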
- All 8 GPUs must be visible (`CUDA_VISIBLE_DEVICES`); TP=8 is required for the 410 GB weights.
- `Unknown vLLM environment variable detected: VLLM_ATTENTION_BACKEND` is benign: the variable is still honored (the backend is also auto-selected on sm_80).
- sm_80 disables SymmMem → falls back to CUSTOM/PyNccl all-reduce (benign).
- One-time Triton JIT compile on the first request, so send a warm-up request.
- Use bf16 KV cache (`--kv-cache-dtype auto`); fp8 KV is not supported on Ampere here.
- Prompts longer than `--max-model-len` (32k above) need a larger window (costs concurrency, since MLA KV grows with context) or chunking.
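The last point can be checked client-side before sending a request. A minimal sketch, assuming the server counts prompt tokens plus the requested completion tokens against the 32768-token window, and that you already have the prompt as a list of token ids from the model's tokenizer; `fits` and `chunk_tokens` are hypothetical helpers, and naive fixed-size chunking is only one possible strategy.

```python
MAX_MODEL_LEN = 32768  # matches --max-model-len in the serve command above


def fits(prompt_tokens: int, max_new_tokens: int, max_model_len: int = MAX_MODEL_LEN) -> bool:
    """True if prompt plus requested completion stays within the context window."""
    return prompt_tokens + max_new_tokens <= max_model_len


def chunk_tokens(token_ids, max_new_tokens: int, max_model_len: int = MAX_MODEL_LEN):
    """Split token ids into pieces that each leave room for max_new_tokens."""
    budget = max_model_len - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens leaves no room for a prompt")
    return [token_ids[i:i + budget] for i in range(0, len(token_ids), budget)]


print(fits(32000, 300), fits(32600, 300))
print([len(c) for c in chunk_tokens(list(range(70000)), 300)])
```

Each chunk would then be sent as its own request (for example, to summarize it), which trades a single long context for several shorter ones.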