Source: gist.github.com

Running GLM-5.2 (753B DeepSeek-Sparse-Attention MoE) on 8x A100 80GB with vLLM — TRITON_MLA_SPARSE backend (PR #38476), no-recompile install, benchmarks

A developer confirmed that GLM-5.2, a 753B-parameter DeepSeek-Sparse-Attention MoE model, runs on 8x A100 80GB GPUs using vLLM PR #38476, which adds a Triton sparse-MLA backend for Ampere architectures. The setup achieved ~56 tok/s single-stream and ~625 tok/s aggregate decode (32-way) with AWQ-INT4 quantization, requiring only a Python-only cherry-pick without CUDA recompilation.

Published Jun 20, 2026

TL;DR. GLM-5.2 (glm_moe_dsa, DeepSeek Sparse Attention) does not run on Ampere (A100, sm_80) with stock vLLM: the sparse-MLA attention backend (FLASHMLA_SPARSE) and the lightning-indexer's fp8_mqa_logits (DeepGEMM) are Hopper/Blackwell-only. vLLM PR #38476 (issue #38006) adds a Triton sparse-MLA backend (TRITON_MLA_SPARSE) plus a bf16 Triton indexer fallback, both of which run on Ampere. Cherry-picking it onto current main is a Python-only change with no CUDA recompile. Result: GLM-5.2 AWQ-INT4 serves on 8× A100 at ~56 tok/s single-stream and ~625 tok/s aggregate decode (32-way), with coherent output.

This is an independent 8× A100 confirmation of PR #38476 (the author validated on 32× A100), plus a no-recompile install note. Credit to @haosdent for the PR.

  • 8× A100 80GB (sm_80). ~410 GiB VRAM used at TP=8, so all 8 GPUs.
  • A recent vLLM main (≈ 0.23.1rc1 era), torch + triton matching that build (this was tested with torch 2.11 / triton 3.6), and uv (or pip).
  • ~440 GB free disk for the weights.
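The requirements above can be checked before downloading anything. The following is a minimal sketch, not part of the original write-up: the `Preflight` class and the `problems` and `probe` helpers are hypothetical names, and the thresholds (8 GPUs, sm_80, 440 GB) are simply the figures listed above.

```python
import shutil
from dataclasses import dataclass


@dataclass
class Preflight:
    gpu_count: int
    capability: tuple  # (major, minor), e.g. (8, 0) for an A100
    free_disk_gb: float


def problems(p: Preflight, need_gpus: int = 8, need_disk_gb: float = 440.0) -> list:
    """Return human-readable issues; an empty list means the host looks OK."""
    issues = []
    if p.gpu_count < need_gpus:
        issues.append(f"need {need_gpus} visible GPUs, found {p.gpu_count}")
    if tuple(p.capability) < (8, 0):
        issues.append(f"compute capability {p.capability} is below sm_80 (untested here)")
    if p.free_disk_gb < need_disk_gb:
        issues.append(f"need ~{need_disk_gb:.0f} GB free disk, found {p.free_disk_gb:.0f} GB")
    return issues


def probe(weights_dir: str = ".") -> Preflight:
    """Collect the real values from this host (requires torch with CUDA)."""
    import torch  # imported lazily so problems() works without a GPU stack

    count = torch.cuda.device_count()
    cap = torch.cuda.get_device_capability(0) if count else (0, 0)
    free_gb = shutil.disk_usage(weights_dir).free / 1e9
    return Preflight(count, cap, free_gb)
```

On the target machine, `print(problems(probe("/path/to/hf/cache")))` should print an empty list; the path is a placeholder for wherever the weights will be stored.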

hf download cyankiwi/GLM-5.2-AWQ-INT4    # ~440 GB, compressed-tensors INT4 (Marlin)
git clone https://github.com/vllm-project/vllm && cd vllm

gh pr checkout 38476

Notes when cherry-picking onto recent main:

  • One conflict, in vllm/model_executor/layers/sparse_attn_indexer.py. main added an XPU dispatch branch the PR's base lacked. Resolve to a three-way dispatch: if is_xpu() → elif use_deep_gemm → else <triton fallback>. In the indexer __init__, replace main's hard RuntimeError (when DeepGEMM is missing) with the PR's warn-and-fallback on not is_deep_gemm_supported(); that's what routes Ampere to the Triton path instead of aborting.
  • Drop the PR's csrc/.../fp4/nvfp4_quant_entry.cu stub (a SM100/MXFP4 link shim made obsolete by an upstream refactor). Removing it keeps the changeset Python + docs only (~10 files): the new TRITON_MLA_SPARSE backend, ops/mqa_logits_triton.py, ops/triton_mla_sparse_kernel.py, and registration in attention/backends/registry.py + platforms/cuda.py.
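The shape of the resolved conflict can be illustrated in isolation. This is a schematic sketch only, not the code in sparse_attn_indexer.py: `pick_indexer_path` and its boolean arguments are hypothetical stand-ins for the real platform checks, and the returned strings just label the three branches described above.

```python
import warnings


def pick_indexer_path(is_xpu: bool, deep_gemm_supported: bool) -> str:
    """Three-way dispatch: XPU first, then DeepGEMM, else the Triton fallback."""
    if is_xpu:
        return "xpu"
    if deep_gemm_supported:
        return "deep_gemm"
    # Warn and fall back instead of raising RuntimeError, so that Ampere
    # (where DeepGEMM is unavailable) lands on the Triton path.
    warnings.warn(
        "DeepGEMM not supported on this platform; "
        "using Triton fallback for sparse attention indexer"
    )
    return "triton"
```

The point of the design is the last branch: a missing DeepGEMM becomes a warning plus a slower path rather than a startup failure.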

Because the changeset is Python-only, reuse vLLM's precompiled extension instead of building:

uv venv --python 3.12 .venv-glm52 && source .venv-glm52/bin/activate
VLLM_USE_PRECOMPILED=1 uv pip install -e .   # editable install over the prebuilt wheel — no nvcc build

Sanity check on an A100:

import vllm; print(vllm.__version__)
from vllm.platforms import current_platform
from vllm.utils.deep_gemm import is_deep_gemm_supported
print(is_deep_gemm_supported())   # -> False on A100 (so the Triton fallback is used)

Then launch the server:

# All 8 GPUs must be visible: the 410 GB INT4 weights need TP=8.
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
# --kv-cache-dtype auto keeps the KV cache in bf16; do NOT use fp8 KV on Ampere.
VLLM_ATTENTION_BACKEND=TRITON_MLA_SPARSE \
vllm serve cyankiwi/GLM-5.2-AWQ-INT4 \
  --tensor-parallel-size 8 \
  --no-async-scheduling \
  --gpu-memory-utilization 0.90 \
  --max-model-len 32768 \
  --trust-remote-code \
  --kv-cache-dtype auto \
  --port 8000
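Because the server takes several minutes to load weights and capture CUDA graphs, a script that talks to it needs to wait first. Below is a small polling sketch using only the standard library. It assumes the server exposes a GET /health route on the port chosen above; `http_ok` and `wait_until_ready` are hypothetical helper names, and the 15-minute deadline is an arbitrary margin over the observed cold start.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout_s: float = 5.0) -> bool:
    """True if the URL answers with HTTP 200, False on any connection error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    deadline_s: float = 900.0,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Call probe() until it returns True or deadline_s has elapsed."""
    start = clock()
    while True:
        if probe():
            return True
        if clock() - start >= deadline_s:
            return False
        sleep(interval_s)
```

Usage would look like `wait_until_ready(lambda: http_ok("http://localhost:8000/health"))`, which returns True once the server answers and False if the deadline passes.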

Confirm in the startup log (this is the proof it's on the sparse path, not a dense fallback):

[cuda.py] Using TRITON_MLA_SPARSE attention backend out of potential backends: ['TRITON_MLA_SPARSE']
[sparse_attn_indexer.py] DeepGEMM not supported on this platform; using Triton fallback for sparse attention indexer
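If the startup log is captured to a file, the two lines above can be checked automatically. This is a minimal sketch: `missing_markers` is a hypothetical helper, and the marker strings are copied from the log excerpt above, so they may need adjusting if the wording differs in another vLLM build.

```python
# Substrings taken from the two log lines quoted above.
REQUIRED_MARKERS = (
    "Using TRITON_MLA_SPARSE attention backend",
    "using Triton fallback for sparse attention indexer",
)


def missing_markers(log_text: str) -> list:
    """Return the expected log markers that do NOT appear in log_text."""
    return [marker for marker in REQUIRED_MARKERS if marker not in log_text]
```

An empty result from `missing_markers(open("server.log").read())` suggests the server is on the sparse path; `server.log` is a placeholder for wherever the output was redirected.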

Then send a test request:

curl -s localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d '{
  "model": "cyankiwi/GLM-5.2-AWQ-INT4",
  "messages": [{"role":"user","content":"What is 17 multiplied by 24? Explain briefly."}],
  "max_tokens": 300}'

→ coherent reasoning ending in 408. (GLM-5.2 is a reasoning model: it emits <think> chain-of-thought, then the final answer.)
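Client code usually wants the final answer without the chain-of-thought. The sketch below separates the two, assuming the reasoning arrives inline in the message content and is closed with a `</think>` tag (the closing tag is an assumption; only the opening tag is mentioned above). `split_reasoning` is a hypothetical helper name.

```python
import re

# Non-greedy match so only the first <think>...</think> span is captured.
_THINK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple:
    """Return (reasoning, answer); reasoning is '' when no think block is found."""
    match = _THINK.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[: match.start()] + text[match.end():]).strip()
    return reasoning, answer
```

Applied to the `choices[0].message.content` field of the response, this yields the reasoning and the answer as separate strings. If the response is cut off by `max_tokens` before the block closes, the whole text is returned as the answer.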

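| Concurrency | Aggregate decode tok/s | Total tok/s (prompt+gen) |
|-------------|------------------------|--------------------------|
| 1           | 56                     | 282                      |
| 4           | 186                    | 928                      |
| 8           | 335                    | 1,674                    |
| 16          | 548                    | 2,740                    |
| 32          | 625                    | 3,127                    |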
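The benchmark figures also show how much each individual stream slows down as concurrency rises. The short calculation below derives per-stream decode speed and its ratio to the single-stream rate; it uses only the numbers reported above, and `scaling` is a hypothetical helper name.

```python
# (concurrency, aggregate decode tok/s, total tok/s) from the benchmark results.
ROWS = [
    (1, 56, 282),
    (4, 186, 928),
    (8, 335, 1674),
    (16, 548, 2740),
    (32, 625, 3127),
]


def scaling(rows):
    """Return (concurrency, per-stream tok/s, fraction of single-stream rate)."""
    base = rows[0][1] / rows[0][0]
    result = []
    for concurrency, decode_tps, _total_tps in rows:
        per_stream = decode_tps / concurrency
        result.append((concurrency, per_stream, per_stream / base))
    return result


for concurrency, per_stream, fraction in scaling(ROWS):
    print(f"{concurrency:>2} streams: {per_stream:5.1f} tok/s each ({fraction:.0%} of single-stream)")
```

At 32-way concurrency each stream decodes at roughly 19.5 tok/s, about 35% of the single-stream rate, while aggregate throughput is about 11× higher.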

Single-stream decode is ~56 tok/s; aggregate decode scales to ~625 tok/s. For reference, llama.cpp serving the GGUF Q4 of the same model (it also implements glm-dsa) tops out around 70 tok/s aggregate decode, so vLLM is ~2.3× faster single-stream and ~9× faster in aggregate. Cold start is ~7 min (410 GB weights + CUDA-graph capture).
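The comparison ratios can be reproduced from the quoted numbers. In the sketch below the aggregate ratio follows directly from 625 and 70 tok/s; the llama.cpp single-stream rate is not stated in the text, so the value computed here is only what the ~2.3× claim implies, not a measurement.

```python
VLLM_AGGREGATE_TPS = 625.0      # 32-way aggregate decode, from the benchmark
VLLM_SINGLE_TPS = 56.0          # single-stream decode, from the benchmark
LLAMACPP_AGGREGATE_TPS = 70.0   # quoted llama.cpp aggregate ceiling
CLAIMED_SINGLE_SPEEDUP = 2.3    # quoted single-stream ratio

aggregate_speedup = VLLM_AGGREGATE_TPS / LLAMACPP_AGGREGATE_TPS
# Back out the llama.cpp single-stream rate that the 2.3x claim would imply.
implied_llamacpp_single_tps = VLLM_SINGLE_TPS / CLAIMED_SINGLE_SPEEDUP

print(f"aggregate speedup: {aggregate_speedup:.1f}x")
print(f"implied llama.cpp single-stream: {implied_llamacpp_single_tps:.1f} tok/s")
```

This gives an aggregate speedup of about 8.9×, consistent with the rounded ~9× figure, and an implied llama.cpp single-stream rate of about 24 tok/s.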

  • All 8 GPUs must be visible (CUDA_VISIBLE_DEVICES); TP=8 is required for the 410 GB weights.
  • The warning "Unknown vLLM environment variable detected: VLLM_ATTENTION_BACKEND" is benign; the variable is still honored (the backend is also auto-selected on sm_80).
  • sm_80 disables SymmMem → falls back to CUSTOM/PyNccl all-reduce (benign).
  • One-time Triton JIT compile on the first request; send a warm-up request.
  • Use bf16 KV cache (--kv-cache-dtype auto); fp8 KV is not supported on Ampere here.
  • Prompts longer than --max-model-len (32k above) need a larger window (costs concurrency, since MLA KV grows with context) or chunking.
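For the chunking option in the last point, one simple approach is to split an already-tokenized prompt into pieces that each leave room for the generated tokens. The sketch below is an illustration under assumptions: `chunk_tokens` is a hypothetical helper, the 300-token reserve mirrors the max_tokens used in the test request, and any chat-template overhead would need to be added to that reserve.

```python
def chunk_tokens(token_ids, max_model_len=32768, reserve_for_output=300):
    """Split token_ids into consecutive chunks that fit the context window.

    Each chunk holds at most (max_model_len - reserve_for_output) tokens, so
    prompt plus generation stays within max_model_len.
    """
    budget = max_model_len - reserve_for_output
    if budget <= 0:
        raise ValueError("reserve_for_output must be smaller than max_model_len")
    return [token_ids[i:i + budget] for i in range(0, len(token_ids), budget)]
```

A 70,000-token prompt would be split into three chunks under these defaults. How the per-chunk results are combined afterwards (summarize then merge, for example) depends on the task and is not covered here.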
