{"slug": "running-glm-5-2-753b-deepseek-sparse-attention-moe-on-8x-a100-80gb-with-vllm-mla", "title": "Running GLM-5.2 (753B DeepSeek-Sparse-Attention MoE) on 8x A100 80GB with vLLM — TRITON_MLA_SPARSE backend (PR #38476), no-recompile install, benchmarks", "summary": "A developer confirmed that GLM-5.2, a 753B-parameter DeepSeek-Sparse-Attention MoE model, runs on 8x A100 80GB GPUs using vLLM PR #38476, which adds a Triton sparse-MLA backend for Ampere architectures. The setup achieved ~56 tok/s single-stream and ~625 tok/s aggregate decode (32-way) with AWQ-INT4 quantization, requiring only a Python-only cherry-pick without CUDA recompilation.", "body_md": "**TL;DR.** GLM-5.2 (`glm_moe_dsa`, DeepSeek Sparse Attention) does **not** run on Ampere (A100, sm_80) with stock vLLM: the sparse-MLA attention backend (`FLASHMLA_SPARSE`) and the lightning-indexer's `fp8_mqa_logits` (DeepGEMM) are Hopper/Blackwell-only. vLLM **PR #38476** ([issue #38006](https://github.com/vllm-project/vllm/issues/38006)) adds a Triton sparse-MLA backend (`TRITON_MLA_SPARSE`) + a bf16 Triton indexer fallback that run on Ampere. Cherry-picking it onto current `main` is a **Python-only** change: no CUDA recompile. Result: **GLM-5.2 AWQ-INT4 serves on 8× A100 at ~56 tok/s single-stream and ~625 tok/s aggregate decode (32-way), with coherent output.**\n\nThis is an independent **8× A100** confirmation of PR #38476 (the author validated on 32× A100), plus a no-recompile install note. Credit to **@haosdent** for the PR.\n\n- 8× A100 80GB (sm_80). ~410 GiB VRAM used at TP=8, so all 8 GPUs.\n- A recent vLLM `main` (≈ 0.23.1rc1 era), `torch` + `triton` matching that build (this was tested with torch 2.11 / triton 3.6), and `uv` (or pip). 
- ~440 GB free disk for the weights.\n\n```\nhf download cyankiwi/GLM-5.2-AWQ-INT4    # ~440 GB, compressed-tensors INT4 (Marlin)\ngit clone https://github.com/vllm-project/vllm && cd vllm\n\n# Simplest: check out the PR branch directly (gh fetches it for you)\ngh pr checkout 38476\n\n# Or, to put it on top of current main (what we did): the PR commit is NOT on\n# main, so fetch it explicitly first; otherwise the cherry-pick errors with\n# \"fatal: bad revision\".\n#   git rev-parse --is-shallow-repository | grep -q true && git fetch --unshallow origin\n#   git fetch origin pull/38476/head:pr-38476\n#   git cherry-pick pr-38476        # then resolve the one conflict below\n```\n\nNotes when cherry-picking onto recent `main`:\n\n- **One conflict**, in `vllm/model_executor/layers/sparse_attn_indexer.py`. `main` added an XPU dispatch branch the PR's base lacked. Resolve to a three-way dispatch: `if is_xpu() → elif use_deep_gemm → else <triton fallback>`. In the indexer `__init__`, replace `main`'s hard `RuntimeError` (when DeepGEMM is missing) with the PR's warn-and-fallback on `not is_deep_gemm_supported()`; that's what routes Ampere to the Triton path instead of aborting.\n- **Drop** the PR's `csrc/.../fp4/nvfp4_quant_entry.cu` stub (a SM100/MXFP4 link shim made obsolete by an upstream refactor). Removing it keeps the changeset **Python + docs only** (~10 files): the new `TRITON_MLA_SPARSE` backend, `ops/mqa_logits_triton.py`, `ops/triton_mla_sparse_kernel.py`, and registration in `attention/backends/registry.py` + `platforms/cuda.py`.\n\nBecause the changeset is Python-only, reuse vLLM's precompiled extension instead of building:\n\n```\nuv venv --python 3.12 .venv-glm52 && source .venv-glm52/bin/activate\n# install torch/triton matching your vLLM build, then:\nVLLM_USE_PRECOMPILED=1 uv pip install -e .   
# editable install over the prebuilt wheel; no nvcc build\n```\n\nSanity check on an A100:\n\n``` python\nimport vllm; print(vllm.__version__)\nfrom vllm.platforms import current_platform\nfrom vllm.utils.deep_gemm import is_deep_gemm_supported\nprint(is_deep_gemm_supported())   # -> False on A100 (so the Triton fallback is used)\n```\n\nThen launch the server:\n\n```\nexport CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7   # all 8 GPUs: the 410 GB INT4 weights need TP=8\n# --kv-cache-dtype auto gives a bf16 KV cache; do NOT use fp8 KV on Ampere\nVLLM_ATTENTION_BACKEND=TRITON_MLA_SPARSE \\\nvllm serve cyankiwi/GLM-5.2-AWQ-INT4 \\\n  --tensor-parallel-size 8 \\\n  --no-async-scheduling \\\n  --gpu-memory-utilization 0.90 \\\n  --max-model-len 32768 \\\n  --trust-remote-code \\\n  --kv-cache-dtype auto \\\n  --port 8000\n```\n\nConfirm in the startup log (this is the proof it's on the sparse path, not a dense fallback):\n\n```\n[cuda.py] Using TRITON_MLA_SPARSE attention backend out of potential backends: ['TRITON_MLA_SPARSE']\n[sparse_attn_indexer.py] DeepGEMM not supported on this platform; using Triton fallback for sparse attention indexer\n```\n\nThen send a test request:\n\n```\ncurl -s localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d '{\n  \"model\": \"cyankiwi/GLM-5.2-AWQ-INT4\",\n  \"messages\": [{\"role\":\"user\",\"content\":\"What is 17 multiplied by 24? Explain briefly.\"}],\n  \"max_tokens\": 300}'\n```\n\n→ coherent reasoning ending in **408**. (GLM-5.2 is a reasoning model: it emits `<think>` chain-of-thought, then the final answer.)\n\n| Concurrency | Aggregate decode tok/s | Total tok/s (prompt+gen) |\n|---|---|---|\n| 1 | 56 | 282 |\n| 4 | 186 | 928 |\n| 8 | 335 | 1,674 |\n| 16 | 548 | 2,740 |\n| 32 | 625 | 3,127 |\n\nSingle-stream decode ~56 tok/s; aggregate decode scales to ~625 tok/s. For reference, **llama.cpp** serving the GGUF Q4 of the same model (it also implements `glm-dsa`) tops out at **~70 tok/s aggregate decode**, so vLLM is ~2.3× single-stream and ~9× in aggregate. 
Cold start is ~7 min (410 GB weights + CUDA-graph capture).\n\n- **All 8 GPUs must be visible** (`CUDA_VISIBLE_DEVICES`): TP=8 is required for the 410 GB weights.\n- `Unknown vLLM environment variable detected: VLLM_ATTENTION_BACKEND` is benign: it's still honored (the backend is also auto-selected on sm_80).\n- sm_80 disables SymmMem → falls back to CUSTOM/PyNccl all-reduce (benign).\n- One-time Triton JIT compile on the first request; send a warm-up request.\n- Use **bf16 KV cache** (`--kv-cache-dtype auto`); fp8 KV is not supported on Ampere here.\n- Prompts longer than `--max-model-len` (32k above) need a larger window (costs concurrency, since MLA KV grows with context) or chunking.", "url": "https://wpnews.pro/news/running-glm-5-2-753b-deepseek-sparse-attention-moe-on-8x-a100-80gb-with-vllm-mla", "canonical_source": "https://gist.github.com/timinar/c8d2eca4e2ea7d11db57a1e6e62d06a2", "published_at": "2026-06-20 19:13:29+00:00", "updated_at": "2026-06-24 01:14:21.661746+00:00", "lang": "en", "topics": ["large-language-models", "artificial-intelligence", "developer-tools"], "entities": ["GLM-5.2", "DeepSeek", "vLLM", "A100", "Triton", "AWQ", "haosdent", "cyankiwi"], "alternates": {"html": "https://wpnews.pro/news/running-glm-5-2-753b-deepseek-sparse-attention-moe-on-8x-a100-80gb-with-vllm-mla", "markdown": "https://wpnews.pro/news/running-glm-5-2-753b-deepseek-sparse-attention-moe-on-8x-a100-80gb-with-vllm-mla.md", "text": "https://wpnews.pro/news/running-glm-5-2-753b-deepseek-sparse-attention-moe-on-8x-a100-80gb-with-vllm-mla.txt", "jsonld": "https://wpnews.pro/news/running-glm-5-2-753b-deepseek-sparse-attention-moe-on-8x-a100-80gb-with-vllm-mla.jsonld"}}