cd /news/large-language-models/qwen-3-6-35b-a3b-fp8-moe-3b-active-o… · home topics large-language-models article
[ARTICLE · art-18143] src=gist.github.com pub= topic=large-language-models verified=true sentiment=· neutral

Qwen 3.6-35B-A3B FP8 (MoE, 3B active) on DGX Spark GB10

A developer deployed the Qwen 3.6-35B-A3B FP8 mixture-of-experts model (3 billion active parameters) on a DGX Spark GB10 system using vLLM, achieving inference with a 262,144-token context window and FP8 KV cache. The configuration includes speculative decoding with two extra tokens and a GPU memory utilization of 0.7069 to maximize the KV cache budget. The setup runs as a Docker container exposing an OpenAI-compatible API on port 8888.

read2 min publishedMay 1, 2026
| # Qwen 3.6-35B-A3B FP8 (MoE, 3B active) on DGX Spark GB10 | |
| # API: http://localhost:8888/v1 | |

| services: | |

| vllm-qwen36-35b: | |
| image: vllm/vllm-openai:v0.20.0-aarch64-cu130-ubuntu2404 | |
| container_name: qwen36-35b-vllm | |

| runtime: nvidia | | | ipc: host | | | shm_size: "64gb" | | | ulimits: | | | memlock: -1 | | | stack: 67108864 | | | environment: | |

| - NVIDIA_VISIBLE_DEVICES=all | |
| - HF_TOKEN=${HF_TOKEN} | |

| - CUDA_MANAGED_FORCE_DEVICE_ALLOC=1 | |

| - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True | |
| - VLLM_MARLIN_USE_ATOMIC_ADD=1 | |

| - TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas | | | - OMP_NUM_THREADS=4 | | | volumes: | |

| - ${HOME}/.cache/huggingface:/root/.cache/huggingface | |
| - ${HOME}/.cache/vllm:/root/.cache/vllm | |

| ports: | | | - "8888:8000" | | | command: | |

| - "--model" | |
| - "Qwen/Qwen3.6-35B-A3B-FP8" | |
| - "--served-model-name" | |
| - "qwen3.6-35b-a3b" | |
| - "--host" | |

| - "0.0.0.0" | | | - "--port" | | | - "8000" | | | - "--attention-backend" | | | - "flashinfer" | | | - "--max-model-len" | | | - "262144" | | | # 0.20.0+ CUDA-graph memory profiling shaves ~1pp; 0.7069 restores the pre-0.20.0 0.70 KV budget. | | | - "--gpu-memory-utilization" | | | - "0.7069" | | | - "--kv-cache-dtype" | | | - "fp8" | | | - "--max-num-seqs" | | | - "20" | | | - "--max-num-batched-tokens" | | | - "32768" | |

| - "--enable-prefix-caching" | |
| - "--enable-auto-tool-choice" | |
| - "--tool-call-parser" | |

| - "qwen3_coder" | | | - "--reasoning-parser" | | | - "qwen3" | |

| - "--speculative-config" | |
| - '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' | |
| - "--trust-remote-code" | |

| deploy: | | | resources: | | | reservations: | | | devices: | | | - driver: nvidia | | | count: all | | | capabilities: [gpu] | | | limits: | | | memory: 100g | | | healthcheck: | | | test: ["CMD-SHELL", "python3 -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')""] | | | interval: 30s | | | timeout: 10s | | | retries: 30 | | | start_period: 900s | | | restart: unless-stopped |

── more in #large-language-models 4 stories · sorted by recency
sponsored brought to you by zahid.host 4,200+ EU-deployed projects
reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain — perfect for shipping the agent you just read about.

$git push zahid main
Live at https://your-agent.zahid.host
Get free account → Pricing
from €0/mo · no card required
LIVE [news/qwen-3-6-35b-a3b-fp8…] indexed:0 read:2min 2026-05-01 ·