# Running GLM-5.2 (753B DeepSeek-Sparse-Attention MoE) on 8x A100 80GB with vLLM — TRITON_MLA_SPARSE backend (PR #38476), no-recompile install, benchmarks

> Source: <https://gist.github.com/timinar/c8d2eca4e2ea7d11db57a1e6e62d06a2>
> Published: 2026-06-20 19:13:29+00:00

**TL;DR.** GLM-5.2 (`glm_moe_dsa`, DeepSeek Sparse Attention) does **not** run on Ampere (A100, sm_80) with stock vLLM: the sparse-MLA attention backend (`FLASHMLA_SPARSE`) and the lightning-indexer's `fp8_mqa_logits` (DeepGEMM) are Hopper/Blackwell-only. vLLM **PR #38476** ([issue #38006](https://github.com/vllm-project/vllm/issues/38006)) adds a Triton sparse-MLA backend (`TRITON_MLA_SPARSE`) and a bf16 Triton indexer fallback, both of which run on Ampere. Cherry-picking it onto current `main` is a **Python-only** change: no CUDA recompile. Result: **GLM-5.2 AWQ-INT4 serves on 8× A100 at ~56 tok/s single-stream and ~625 tok/s aggregate decode (32-way), with coherent output.**

This is an independent **8× A100** confirmation of PR #38476 (the author validated on 32× A100), plus a no-recompile install note. Credit to **@haosdent** for the PR.

- 8× A100 80GB (sm_80). ~410 GiB VRAM used at TP=8, so all 8 GPUs.
- A recent vLLM `main` (≈ 0.23.1rc1 era), `torch` + `triton` matching that build (this was tested with torch 2.11 / triton 3.6), and `uv` (or pip).
- ~440 GB free disk for the weights.

```
hf download cyankiwi/GLM-5.2-AWQ-INT4    # ~440 GB, compressed-tensors INT4 (Marlin)
git clone https://github.com/vllm-project/vllm && cd vllm

# Simplest: check out the PR branch directly (gh fetches it for you)
gh pr checkout 38476

# Or, to put it on top of current main (what we did): the PR commit is NOT on
# main, so fetch it explicitly first — otherwise the cherry-pick errors with
# "fatal: bad revision".
#   git rev-parse --is-shallow-repository | grep -q true && git fetch --unshallow origin
#   git fetch origin pull/38476/head:pr-38476
#   git cherry-pick pr-38476        # then resolve the one conflict below
```

Notes when cherry-picking onto recent `main`:

- **One conflict**, in `vllm/model_executor/layers/sparse_attn_indexer.py`. `main` added an XPU dispatch branch the PR's base lacked. Resolve to a three-way dispatch: `if is_xpu() → elif use_deep_gemm → else <triton fallback>`. In the indexer `__init__`, replace `main`'s hard `RuntimeError` (when DeepGEMM is missing) with the PR's warn-and-fallback on `not is_deep_gemm_supported()`; that's what routes Ampere to the Triton path instead of aborting.
- **Drop** the PR's `csrc/.../fp4/nvfp4_quant_entry.cu` stub (a SM100/MXFP4 link shim made obsolete by an upstream refactor). Removing it keeps the changeset **Python + docs only** (~10 files): the new `TRITON_MLA_SPARSE` backend, `ops/mqa_logits_triton.py`, `ops/triton_mla_sparse_kernel.py`, and registration in `attention/backends/registry.py` + `platforms/cuda.py`.
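The resolved dispatch order can be pictured with a small standalone sketch. This is not the vLLM source: `IndexerPath`, `select_indexer_path`, and `resolve_deep_gemm` are hypothetical names used only to illustrate the XPU, DeepGEMM, Triton ordering and the warn-and-fallback behaviour described above.

```python
import logging
from enum import Enum

logger = logging.getLogger("indexer_dispatch_sketch")


class IndexerPath(Enum):
    """Hypothetical labels for the three indexer code paths."""
    XPU = "xpu"
    DEEP_GEMM = "deep_gemm"
    TRITON = "triton"


def resolve_deep_gemm(deep_gemm_supported: bool) -> bool:
    """Warn and fall back instead of raising when DeepGEMM is unavailable."""
    if not deep_gemm_supported:
        # The pre-merge behaviour on main was a hard RuntimeError at this point.
        logger.warning(
            "DeepGEMM not supported on this platform; "
            "using Triton fallback for sparse attention indexer"
        )
    return deep_gemm_supported


def select_indexer_path(is_xpu: bool, deep_gemm_supported: bool) -> IndexerPath:
    """Three-way dispatch: XPU first, then DeepGEMM, otherwise the Triton fallback."""
    use_deep_gemm = resolve_deep_gemm(deep_gemm_supported)
    if is_xpu:
        return IndexerPath.XPU
    elif use_deep_gemm:
        return IndexerPath.DEEP_GEMM
    else:
        return IndexerPath.TRITON
```

On an A100, both inputs are false (not XPU, DeepGEMM unsupported), so the sketch lands on `IndexerPath.TRITON`, which is the path the conflict resolution is meant to preserve.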

Because the changeset is Python-only, reuse vLLM's precompiled extension instead of building:

```
uv venv --python 3.12 .venv-glm52 && source .venv-glm52/bin/activate
# install torch/triton matching your vLLM build, then:
VLLM_USE_PRECOMPILED=1 uv pip install -e .   # editable install over the prebuilt wheel — no nvcc build
```

Sanity check on an A100:

``` python
import vllm; print(vllm.__version__)
from vllm.platforms import current_platform
from vllm.utils.deep_gemm import is_deep_gemm_supported
print(is_deep_gemm_supported())   # -> False on A100 (so the Triton fallback is used)
```

Launch the server:

```
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7   # all 8 GPUs: the 410 GB INT4 weights need TP=8
# --kv-cache-dtype auto gives bf16 KV; do NOT use fp8 KV on Ampere.
# (Keep comments off the continued lines: text after a trailing "\" breaks the command.)
VLLM_ATTENTION_BACKEND=TRITON_MLA_SPARSE \
vllm serve cyankiwi/GLM-5.2-AWQ-INT4 \
  --tensor-parallel-size 8 \
  --no-async-scheduling \
  --gpu-memory-utilization 0.90 \
  --max-model-len 32768 \
  --trust-remote-code \
  --kv-cache-dtype auto \
  --port 8000
```

Confirm in the startup log (this is the proof it's on the sparse path, not a dense fallback):

```
[cuda.py] Using TRITON_MLA_SPARSE attention backend out of potential backends: ['TRITON_MLA_SPARSE']
[sparse_attn_indexer.py] DeepGEMM not supported on this platform; using Triton fallback for sparse attention indexer
```

Smoke test:

```
curl -s localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d '{
  "model": "cyankiwi/GLM-5.2-AWQ-INT4",
  "messages": [{"role":"user","content":"What is 17 multiplied by 24? Explain briefly."}],
  "max_tokens": 300}'
```

→ coherent reasoning ending in **408**. (GLM-5.2 is a reasoning model: it emits `<think>` chain-of-thought, then the final answer.)
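The same smoke test can be scripted, which also doubles as a warm-up request. The sketch below uses only the standard library and assumes the server from the `vllm serve` command is listening on `localhost:8000`. It also assumes the reasoning arrives inline in `content` as `<think>...</think>`; if a reasoning parser is enabled on the server, the reasoning may instead arrive in a separate field. The helper names are illustrative.

```python
import json
import re
import urllib.request

# Assumed endpoint and model name, taken from the serve and curl commands above.
BASE_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "cyankiwi/GLM-5.2-AWQ-INT4"


def build_payload(prompt: str, max_tokens: int = 300) -> dict:
    """Build the same chat-completions body as the curl example."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def split_reasoning(text: str) -> tuple[str, str]:
    """Split the first <think>...</think> block from the text that follows it."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return "", text.strip()
    return match.group(1).strip(), text[match.end():].strip()


def ask(prompt: str, timeout: float = 600.0) -> tuple[str, str]:
    """POST one request and return (reasoning, answer)."""
    request = urllib.request.Request(
        BASE_URL,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    content = body["choices"][0]["message"].get("content") or ""
    return split_reasoning(content)

# Usage, once the server is up:
#   reasoning, answer = ask("What is 17 multiplied by 24? Explain briefly.")
```

The generous default timeout is deliberate: the first request triggers the one-time Triton JIT compile, so it can take noticeably longer than later ones.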

| Concurrency | Aggregate decode tok/s | Total tok/s (prompt+gen) |
|---|---|---|
| 1 | 56 | 282 |
| 4 | 186 | 928 |
| 8 | 335 | 1,674 |
| 16 | 548 | 2,740 |
| 32 | 625 | 3,127 |

Single-stream decode is ~56 tok/s; aggregate decode scales to ~625 tok/s. For reference, **llama.cpp** serving the GGUF Q4 of the same model (it also implements `glm-dsa`) tops out at **~70 tok/s aggregate decode**, so vLLM is ~2.3× single-stream and ~9× in aggregate. Cold start is ~7 min (410 GB weights + CUDA-graph capture).
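To see how decode throughput scales with concurrency, the table's numbers can be turned into per-stream rates and scaling efficiency. The sketch below uses only the figures reported above; `scaling_rows` is an illustrative helper, and "efficiency" here simply means aggregate speedup divided by concurrency.

```python
# Aggregate decode tok/s by concurrency, copied from the benchmark table.
AGGREGATE_DECODE = {1: 56, 4: 186, 8: 335, 16: 548, 32: 625}
# Approximate llama.cpp aggregate decode figure quoted in the text.
LLAMA_CPP_AGGREGATE = 70


def scaling_rows(table: dict[int, float]) -> list[tuple[int, float, float, float]]:
    """Return (concurrency, per-stream tok/s, speedup vs 1 stream, efficiency)."""
    base = table[1]
    rows = []
    for concurrency, aggregate in sorted(table.items()):
        per_stream = aggregate / concurrency
        speedup = aggregate / base
        efficiency = speedup / concurrency
        rows.append((concurrency, per_stream, speedup, efficiency))
    return rows


for concurrency, per_stream, speedup, efficiency in scaling_rows(AGGREGATE_DECODE):
    print(
        f"{concurrency:>2} streams: {per_stream:5.1f} tok/s each, "
        f"{speedup:4.1f}x single-stream, {efficiency:4.0%} efficiency"
    )

ratio = AGGREGATE_DECODE[32] / LLAMA_CPP_AGGREGATE
print(f"vs llama.cpp aggregate decode: {ratio:.1f}x")
```

Read this way, the table shows per-stream decode dropping from 56 tok/s at one stream to about 19.5 tok/s at 32 streams, with most of the aggregate gain already reached by 16-way concurrency.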

- **All 8 GPUs must be visible** (`CUDA_VISIBLE_DEVICES`): TP=8 is required for the 410 GB weights.
- `Unknown vLLM environment variable detected: VLLM_ATTENTION_BACKEND` is benign: it's still honored (the backend is also auto-selected on sm_80).
- sm_80 disables SymmMem → falls back to CUSTOM/PyNccl all-reduce (benign).
- One-time Triton JIT compile on the first request, so send a warm-up request.
- Use **bf16 KV cache** (`--kv-cache-dtype auto`); fp8 KV is not supported on Ampere here.
- Prompts longer than `--max-model-len` (32k above) need a larger window (costs concurrency, since MLA KV grows with context) or chunking.
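For the chunking option in the last note, one naive approach is to split an already-tokenized prompt into windows that leave room for the generated tokens. The source does not specify a chunking strategy, so this is only a sketch under assumptions: `chunk_token_ids` is a hypothetical helper, token IDs are assumed to come from the model's own tokenizer, and the defaults mirror the `--max-model-len 32768` and `max_tokens: 300` values used above. Chat-template overhead is not accounted for, so a real budget would need some extra headroom.

```python
def chunk_token_ids(
    token_ids: list[int],
    max_model_len: int = 32768,
    reserve_for_output: int = 300,
    overlap: int = 0,
) -> list[list[int]]:
    """Split token IDs into windows that fit the context with room for output.

    Consecutive windows share `overlap` tokens so that context is not cut
    abruptly at a chunk boundary.
    """
    budget = max_model_len - reserve_for_output
    if budget <= 0:
        raise ValueError("reserve_for_output must be smaller than max_model_len")
    if not 0 <= overlap < budget:
        raise ValueError("overlap must be in [0, budget)")
    if not token_ids:
        return []

    step = budget - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + budget])
        if start + budget >= len(token_ids):
            break  # this window already reaches the end of the prompt
        start += step
    return chunks
```

Each chunk would then be sent as its own request, with the results combined by whatever scheme suits the task (for example, summarizing chunk by chunk); that part is application-specific and not covered here.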
