{"slug": "qwen-3-6-35b-a3b-fp8-moe-3b-active-on-dgx-spark-gb10", "title": "Qwen 3.6-35B-A3B FP8 (MoE, 3B active) on DGX Spark GB10", "summary": "A developer deployed the Qwen 3.6-35B-A3B FP8 mixture-of-experts model (3 billion active parameters) on a DGX Spark GB10 system using vLLM 0.20.0, serving a 262,144-token context window with an FP8 KV cache. The configuration enables the qwen3_next_mtp speculative decoding method with two speculative tokens and sets GPU memory utilization to 0.7069, which, per the author's comment, restores the KV cache budget that 0.70 provided before vLLM 0.20.0. The setup runs as a Docker Compose service exposing an OpenAI-compatible API on port 8888.", "body_md": "    # Qwen 3.6-35B-A3B FP8 (MoE, 3B active) on DGX Spark GB10\n    # API: http://localhost:8888/v1\n    services:\n      vllm-qwen36-35b:\n        image: vllm/vllm-openai:v0.20.0-aarch64-cu130-ubuntu2404\n        container_name: qwen36-35b-vllm\n        runtime: nvidia\n        ipc: host\n        shm_size: \"64gb\"\n        ulimits:\n          memlock: -1\n          stack: 67108864\n        environment:\n          - NVIDIA_VISIBLE_DEVICES=all\n          - HF_TOKEN=${HF_TOKEN}\n          - CUDA_MANAGED_FORCE_DEVICE_ALLOC=1\n          - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True\n          - VLLM_MARLIN_USE_ATOMIC_ADD=1\n          - TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas\n          - OMP_NUM_THREADS=4\n        volumes:\n          - ${HOME}/.cache/huggingface:/root/.cache/huggingface\n          - ${HOME}/.cache/vllm:/root/.cache/vllm\n        ports:\n          - \"8888:8000\"\n        command:\n          - \"--model\"\n          - \"Qwen/Qwen3.6-35B-A3B-FP8\"\n          - \"--served-model-name\"\n          - \"qwen3.6-35b-a3b\"\n          - \"--host\"\n          - \"0.0.0.0\"\n          - \"--port\"\n          - \"8000\"\n          - \"--attention-backend\"\n          - \"flashinfer\"\n          - \"--max-model-len\"\n          - \"262144\"\n          # 0.20.0+ CUDA-graph memory profiling shaves ~1pp; 0.7069 restores the pre-0.20.0 0.70 KV budget.\n          - \"--gpu-memory-utilization\"\n          - \"0.7069\"\n          - \"--kv-cache-dtype\"\n          - \"fp8\"\n          - \"--max-num-seqs\"\n          - \"20\"\n          - \"--max-num-batched-tokens\"\n          - \"32768\"\n          - \"--enable-prefix-caching\"\n          - \"--enable-auto-tool-choice\"\n          - \"--tool-call-parser\"\n          - \"qwen3_coder\"\n          - \"--reasoning-parser\"\n          - \"qwen3\"\n          - \"--speculative-config\"\n          - '{\"method\":\"qwen3_next_mtp\",\"num_speculative_tokens\":2}'\n          - \"--trust-remote-code\"\n        deploy:\n          resources:\n            reservations:\n              devices:\n                - driver: nvidia\n                  count: all\n                  capabilities: [gpu]\n            limits:\n              memory: 100g\n        healthcheck:\n          test: [\"CMD-SHELL\", \"python3 -c \\\"import urllib.request; urllib.request.urlopen('http://localhost:8000/health')\\\"\"]\n          interval: 30s\n          timeout: 10s\n          retries: 30\n          start_period: 900s\n        restart: unless-stopped", "url": "https://wpnews.pro/news/qwen-3-6-35b-a3b-fp8-moe-3b-active-on-dgx-spark-gb10", "canonical_source": "https://gist.github.com/wshobson/d32c98a5537ca4d51c92fea7e54aef40", "published_at": "2026-05-01 17:05:58+00:00", "updated_at": "2026-05-29 22:12:32.745711+00:00", "lang": "en", "topics": ["large-language-models", "ai-infrastructure", "ai-tools", "artificial-intelligence", "machine-learning"], "entities": ["Qwen", "DGX Spark GB10", "vLLM", "NVIDIA", "Hugging Face"], "alternates": {"html": "https://wpnews.pro/news/qwen-3-6-35b-a3b-fp8-moe-3b-active-on-dgx-spark-gb10", "markdown": "https://wpnews.pro/news/qwen-3-6-35b-a3b-fp8-moe-3b-active-on-dgx-spark-gb10.md", "text": "https://wpnews.pro/news/qwen-3-6-35b-a3b-fp8-moe-3b-active-on-dgx-spark-gb10.txt", "jsonld": "https://wpnews.pro/news/qwen-3-6-35b-a3b-fp8-moe-3b-active-on-dgx-spark-gb10.jsonld"}}