19:13
2026-06-20
gist.github.com
large-language-models
Running GLM-5.2 (753B DeepSeek-Sparse-Attention MoE) on 8x A100 80GB with vLLM: TRITON_MLA_SPARSE backend (PR #38476), no-recompile install, benchmarks
A developer confirmed that GLM-5.2, a 753B-parameter DeepSeek-Sparse-Attention MoE model, runs on 8x A100 80GB GPUs using vLLM PR #38476, which adds a Triton sparse-MLA backend for Ampere architecture…