Qwen 3.6-35B-A3B FP8 (MoE, 3B active) on DGX Spark GB10 A developer deployed the Qwen 3.6-35B-A3B FP8 mixture-of-experts model (3 billion active parameters) on a DGX Spark GB10 system using vLLM, achieving inference with a 262,144-token context window and FP8 KV cache. The configuration includes speculative decoding with two extra tokens and a GPU memory utilization of 0.7069 to maximize the KV cache budget. The setup runs as a Docker container exposing an OpenAI-compatible API on port 8888. | Qwen 3.6-35B-A3B FP8 MoE, 3B active on DGX Spark GB10 | | | API: http://localhost:8888/v1 | | | services: | | | vllm-qwen36-35b: | | | image: vllm/vllm-openai:v0.20.0-aarch64-cu130-ubuntu2404 | | | container name: qwen36-35b-vllm | | | runtime: nvidia | | | ipc: host | | | shm size: "64gb" | | | ulimits: | | | memlock: -1 | | | stack: 67108864 | | | environment: | | | - NVIDIA VISIBLE DEVICES=all | | | - HF TOKEN=${HF TOKEN} | | | - CUDA MANAGED FORCE DEVICE ALLOC=1 | | | - PYTORCH CUDA ALLOC CONF=expandable segments:True | | | - VLLM MARLIN USE ATOMIC ADD=1 | | | - TRITON PTXAS PATH=/usr/local/cuda/bin/ptxas | | | - OMP NUM THREADS=4 | | | volumes: | | | - ${HOME}/.cache/huggingface:/root/.cache/huggingface | | | - ${HOME}/.cache/vllm:/root/.cache/vllm | | | ports: | | | - "8888:8000" | | | command: | | | - "--model" | | | - "Qwen/Qwen3.6-35B-A3B-FP8" | | | - "--served-model-name" | | | - "qwen3.6-35b-a3b" | | | - "--host" | | | - "0.0.0.0" | | | - "--port" | | | - "8000" | | | - "--attention-backend" | | | - "flashinfer" | | | - "--max-model-len" | | | - "262144" | | | 0.20.0+ CUDA-graph memory profiling shaves ~1pp; 0.7069 restores the pre-0.20.0 0.70 KV budget. | | | - "--gpu-memory-utilization" | | | - "0.7069" | | | - "--kv-cache-dtype" | | | - "fp8" | | | - "--max-num-seqs" | | | - "20" | | | - "--max-num-batched-tokens" | | | - "32768" | | | - "--enable-prefix-caching" | | | - "--enable-auto-tool-choice" | | | - "--tool-call-parser" | | | - "qwen3 coder" | | | - "--reasoning-parser" | | | - "qwen3" | | | - "--speculative-config" | | | - '{"method":"qwen3 next mtp","num speculative tokens":2}' | | | - "--trust-remote-code" | | | deploy: | | | resources: | | | reservations: | | | devices: | | | - driver: nvidia | | | count: all | | | capabilities: gpu | | | limits: | | | memory: 100g | | | healthcheck: | | | test: "CMD-SHELL", "python3 -c \"import urllib.request; urllib.request.urlopen 'http://localhost:8000/health' \"" | | | interval: 30s | | | timeout: 10s | | | retries: 30 | | | start period: 900s | | | restart: unless-stopped |