Qwen 3.6-35B-A3B FP8 (MoE, 3B active) on DGX Spark GB10

wpnews.pro

cd /news/large-language-models/qwen-3-6-35b-a3b-fp8-moe-3b-active-o… · home › topics › large-language-models › article

[ARTICLE · art-18143] src=gist.github.com ↗ pub=2026-05-01T17:05Z topic=large-language-models verified=true sentiment=· neutral

Qwen 3.6-35B-A3B FP8 (MoE, 3B active) on DGX Spark GB10

A developer deployed the Qwen 3.6-35B-A3B FP8 mixture-of-experts model (3 billion active parameters) on a DGX Spark GB10 system using vLLM, achieving inference with a 262,144-token context window and FP8 KV cache. The configuration includes speculative decoding with two extra tokens and a GPU memory utilization of 0.7069 to maximize the KV cache budget. The setup runs as a Docker container exposing an OpenAI-compatible API on port 8888.

read2 min views9 publishedMay 1, 2026

| # Qwen 3.6-35B-A3B FP8 (MoE, 3B active) on DGX Spark GB10 | |
| # API: http://localhost:8888/v1 | |

| services: | |

| vllm-qwen36-35b: | |
| image: vllm/vllm-openai:v0.20.0-aarch64-cu130-ubuntu2404 | |
| container_name: qwen36-35b-vllm | |

| - NVIDIA_VISIBLE_DEVICES=all | |
| - HF_TOKEN=${HF_TOKEN} | |

| - CUDA_MANAGED_FORCE_DEVICE_ALLOC=1 | |

| - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True | |
| - VLLM_MARLIN_USE_ATOMIC_ADD=1 | |

| - ${HOME}/.cache/huggingface:/root/.cache/huggingface | |
| - ${HOME}/.cache/vllm:/root/.cache/vllm | |

| - "--model" | |
| - "Qwen/Qwen3.6-35B-A3B-FP8" | |
| - "--served-model-name" | |
| - "qwen3.6-35b-a3b" | |
| - "--host" | |

| - "0.0.0.0" | | | - "--port" | | | - "8000" | | | - "--attention-backend" | | | - "flashinfer" | | | - "--max-model-len" | | | - "262144" | | | # 0.20.0+ CUDA-graph memory profiling shaves ~1pp; 0.7069 restores the pre-0.20.0 0.70 KV budget. | | | - "--gpu-memory-utilization" | | | - "0.7069" | | | - "--kv-cache-dtype" | | | - "fp8" | | | - "--max-num-seqs" | | | - "20" | | | - "--max-num-batched-tokens" | | | - "32768" | |

| - "--enable-prefix-caching" | |
| - "--enable-auto-tool-choice" | |
| - "--tool-call-parser" | |

| - "--speculative-config" | |
| - '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' | |
| - "--trust-remote-code" | |

source & further reading

gist.github.com — original article Claude skill: direct-response copywriting — Stan Leloup's Copywriting Mania methodology distilled (drop into .claude/skills/copywriting/SKILL.md) How to Connect Claude to Stock Market Data via MCP Effect repositories to use in opencode references.

~/api · this article 200

$curl api.wpnews.pro/v1/news/qwen-3-6-35b-a3b-fp8-moe…

Read original on gist.github.com → gist.github.com/wshobson/d32c98a5537ca4d51c92fea…

mentioned entities

Qwen

DGX Spark GB10

vLLM

NVIDIA

Hugging Face

metadata

slugqwen-3-6-35b-a3b-fp8-moe-3b-active-on-dgx-spark-gb10

topic#large-language-models

secondary4 topics

sentimentneutral

canonicalgist.github.com

navigation

← prevCatalyzing scientific impact thr…

next →Ubuntu infrastructure has been d…

── more in #large-language-models 4 stories · sorted by recency

github.com · 17 Jul · #large-language-models

Show HN: Qwen3.6-35B-A3B on a 16 GB M1 Pro with SSD-streamed MoE

huggingface.co · 17 Jul · #large-language-models

Fine-tune video and image models at scale with NVIDIA NeMo Automodel and 🤗 Diffusers

netflixtechblog.medium.com · 17 Jul · #large-language-models

In-House LLM Serving at Netflix

github.com · 17 Jul · #large-language-models

Show HN: Velora – On-device macOS dictation (Whisper and a local LLM, no cloud)

── more on @qwen 3 stories trending now

wpnews · 26 May · #ai-agents

Think, Durable Objects, and the Real Shape of AI Applications

wpnews · 8 Jul · #large-language-models

Gemini 3.5 Pro Delayed to July 17: Architectural Rebuild Explained

wpnews · 8 Jul · #ai-chips

D-Matrix launches Corsair AI inference platform, challenging Nvidia’s GPU dominance

sponsored brought to you by zahid.host 4,200+ EU-deployed projects

reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain — perfect for shipping the agent you just read about.

$git push zahid main

→ Live at https://your-agent.zahid.host ✓

Get free account → Pricing

from €0/mo · no card required