Qwen3.6-35B NVFP4 runs on one H100 — A100 owners are out

NVIDIA released Qwen3.6-35B-A3B-NVFP4, a post-training FP4-quantized variant of Alibaba's 35B MoE model that fits on a single H100 by reducing VRAM from ~71 GB to ~23 GB. The quantization targets weights and activations of linear operators, achieving a 3.06× reduction with under 1-point accuracy loss on standard benchmarks. However, the format requires Hopper or Blackwell hardware and does not run on A100 or consumer GPUs.

NVIDIA published nvidia/Qwen3.6-35B-A3B-NVFP4 https://huggingface.co/nvidia/Qwen3.6-35B-A3B-NVFP4 on May 28, 2026 — a post-training FP4-quantized variant of Alibaba's 35B MoE model that fits on a single H100 by cutting VRAM from ~71 GB to ~23 GB. If you're on an A100 or consumer GPU, jump to the gotchas section first — this quantization format does not run on your hardware. NVFP4 quantization targets the weights and activations of linear operators inside transformer and MoE blocks specifically — LayerNorms, embeddings, and biases stay in BF16/F32 for numerical stability . The selective 4-bit compression yields a 3.06× reduction in disk footprint and VRAM versus the BF16 base, dropping from roughly 71 GB to ~23 GB equivalent on Hopper hardware . Quick Answer: nvidia/Qwen3.6-35B-A3B-NVFP4 fits a 35B MoE reasoning model on a single H100 by applying 4-bit quantization to linear operator weights and activations, reducing VRAM from ~71 GB to ~23 GB 3.06× with under 1-point accuracy loss on standard benchmarks. Hopper or Blackwell required — A100 and RTX 4090 lack FP4 compute paths entirely. The calibration pipeline used two datasets: cnn dailymail 300K+ English news articles and NVIDIA's Nemotron-Post-Training-Dataset-v2 for multi-turn dialogue coverage, processed with NVIDIA Model Optimizer v0.44.0 . The dual-dataset approach is worth noting: a quantization calibrated only on news articles would likely regress on structured, multi-turn instruction-following — and the benchmark results bear that out. NVIDIA's official eval suite shows the accuracy gap is narrow. NVFP4 stays within 0.5–0.8 points of BF16 across reasoning benchmarks, and marginally outperforms on instruction-following and multimodal tasks : | Benchmark | BF16 | NVFP4 | Delta | |---|---|---|---| | MMLU Pro | 85.6 | 85.0 | −0.6 | | GPQA Diamond | 84.9 | 84.8 | −0.1 | | AIME 2025 | 89.2 | 88.8 | −0.4 | | τ²-Bench Telecom | 95.5 | 94.7 | −0.8 | | SciCode | 40.8 | 40.6 | −0.2 | | IFBench | 62.3 | 62.8 | +0.5 | | MMMU Pro | 74.1 | 74.5 | +0.4 | "The NVFP4 quantized model achieves nearly identical accuracy to the BF16 original while reducing memory requirements by 3.06×, enabling deployment on hardware that would otherwise require tensor parallelism across multiple GPUs." — NVIDIA Model Optimization Team, nvidia/Qwen3.6-35B-A3B-NVFP4 model card FP4 tensor core execution paths exist only on Hopper H100, H200 and Blackwell GB200, GB300, DGX Spark GB10 architectures . The RTX 4090 Ada Lovelace, sm 89 , RTX 5090, and A100 Ampere, sm 80 have no native FP4 compute units. Passing --quantization modelopt on those cards will produce an error at load time or, worse, silently wrong output. Your fallback options on non-Hopper/Blackwell hardware: DGX Spark Blackwell, sm 120/121a is officially supported but needs extra setup: CUDA 13.0 and the vllm/vllm-openai:cu130-nightly Docker image . Stable vLLM releases do not yet include the FlashInfer CUTLASS MoE kernels for that architecture. Verify your vLLM build has compressed-tensors NVFP4 support before attempting to serve — a mismatched build will silently fall back or crash at model load. The minimum viable Hopper command. Two flags matter here: --quantization modelopt activates NVIDIA Model Optimizer's compressed-tensors backend, and --reasoning-parser qwen3 strips <think ...</think chain-of-thought blocks from API responses so callers see clean completions: vllm serve nvidia/Qwen3.6-35B-A3B-NVFP4 \ --port 8000 \ --quantization modelopt \ --max-model-len 262144 \ --reasoning-parser qwen3 DGX Spark Blackwell requires three environment variables set before launching. Omitting any of them causes a FlashInfer MoE kernel mismatch at startup — the error message is not always explicit about which variable is missing, so set all three : export VLLM USE FLASHINFER MOE FP4=0 export VLLM FP8 MOE BACKEND=flashinfer cutlass export FLASHINFER DISABLE VERSION CHECK=1 vllm serve nvidia/Qwen3.6-35B-A3B-NVFP4 \ --quantization modelopt \ --kv-cache-dtype fp8 \ --attention-backend flashinfer \ --moe-backend marlin \ --gpu-memory-utilization 0.85 \ --max-model-len 65536 \ --max-num-seqs 4 \ --enable-chunked-prefill \ --enable-prefix-caching \ --speculative-config '{"method":"mtp","num speculative tokens":3,"moe backend":"triton"}' Flag-by-flag breakdown: --kv-cache-dtype fp8 — halves KV-cache memory versus BF16, directly enabling longer usable context at 0.85 VRAM utilization --moe-backend marlin — selects the Marlin MoE kernel for Blackwell; the default selection may not be optimal on this architecture --max-num-seqs 4 — keeps total concurrent sequence memory predictable on constrained VRAM; raise cautiously and watch OOM behavior --enable-chunked-prefill — required on DGX Spark; without it, long prompts OOM well before the 65536-token cap --enable-prefix-caching — reduces time-to-first-token for repeated system prompts in multi-turn chat workloads --speculative-config '{"method":"mtp",...}' — enables the built-in Multi-Token Prediction head; no separate draft model required or loadedThe snippet below illustrative — not executed; running it requires a CUDA-enabled environment with transformers installed shows how to verify your GPU is Hopper-class before attempting to load the model. The major < 9 check is the key gate: H100 reports sm 90, A100 reports sm 80: python import torch from transformers import AutoModelForCausalLM, AutoTokenizer MODEL = "Qwen/Qwen3.6-35B-NVFP4" if not torch.cuda.is available : raise SystemExit "CUDA GPU required" name = torch.cuda.get device name 0 major, minor = torch.cuda.get device capability 0 print f"GPU: {name} sm {major}{minor} " if major < 9: raise SystemExit "NVFP4 path requires H100-class sm 90+; A100 is sm 80" tok = AutoTokenizer.from pretrained MODEL model = AutoModelForCausalLM.from pretrained MODEL, device map={"": 0} print model.generate tok "Qwen NVFP4 fits on one H100 because", return tensors="pt" .to 0 , max new tokens=8 An A100 hits the SystemExit before wasting time on a multi-minute model download. Run this check before provisioning storage or bandwidth for the weights. Four failure modes worth knowing before you lose an hour to a non-obvious error: --quantization modelopt on A100 or RTX hardware VLLM FP8 MOE BACKEND or FLASHINFER DISABLE VERSION CHECK before launch triggers a FlashInfer MoE kernel mismatch. The startup error does not always name the specific missing variable — set all three unconditionally before touching vLLM on Blackwell. --reasoning-parser qwen3 <think ...</think blocks in every completion response. Clients parsing JSON completions will see malformed output; streaming clients will surface the thinking chain directly to end users. This flag is not optional. --enable-chunked-prefill on DGX SparkOne operational caveat for DGX Spark production: the vllm/vllm-openai:cu130-nightly image is not a stable release . Pin to a specific build hash for any deployment you need to reproduce, or wait for a stable vLLM release that includes full NVFP4 Blackwell support upstream. The built-in MTP speculative decoding head achieves an 85.4% token acceptance rate at single-user baseline 512-token outputs , rising to 92.8% at 4,096-token outputs . No second draft model to load or manage — the MTP head is baked into the base checkpoint. At concurrency 1, output throughput is 55.9 tokens/s; at concurrency 32, it scales to 433.4 tokens/s . The community AEON-7 DFlash variant reports 117 tok/s greedy decoding on DGX Spark with 62–78% draft acceptance and 2.7–4.4 mean accepted tokens per target step . The native context window is 131K tokens, extended to 262,144 via RoPE scaling . On DGX Spark, cap --max-model-len at 65536 to stay within safe VRAM margins at 0.85 utilization. The full 262K context is accessible on H100/H200 with more VRAM headroom. Note that long-context RAG quality under NVFP4 versus BF16 at the 262K limit has not been independently benchmarked as of June 2026 — treat that range as best-effort until data appears. The same vLLM endpoint handles image and video inputs alongside text once the server is running on Hopper or Blackwell — no additional flags needed for multimodal prompts. Multimodal inference quality under NVFP4 quantization is also unbenchmarked publicly, so evaluate against your specific workload rather than relying on text benchmark results as a proxy. No. FP4 tensor core paths require Hopper H100, H200 or Blackwell GB200, GB300, DGX Spark architecture. The RTX 4090 is Ada Lovelace sm 89 and the A100 is Ampere sm 80 — neither has native FP4 compute units. On those cards, use community GGUF quantizations via llama.cpp or the BF16 base model if you have 71+ GB VRAM available RTX PRO 6000 96 GB or H100/A100 80 GB . --quantization modelopt actually do? It tells vLLM to route weight loading through NVIDIA Model Optimizer's compressed-tensors backend, which understands the NVFP4 format and dispatches matrix multiplications through FP4 tensor cores. Without this flag, vLLM will not recognize the quantization scheme and will either throw an error at startup or attempt to interpret the weights as a different format — neither produces usable output. 0.5–0.8 points on most benchmarks per NVIDIA's official eval suite . MMLU Pro drops from 85.6 to 85.0; GPQA Diamond from 84.9 to 84.8; AIME 2025 from 89.2 to 88.8. On instruction-following IFBench: 62.8 vs 62.3 and multimodal reasoning MMMU Pro: 74.5 vs 74.1 , NVFP4 marginally outperforms BF16 — likely a calibration dataset effect from the multi-turn Nemotron data. No. The Multi-Token Prediction head is embedded in the Qwen3.6-35B-A3B checkpoint itself. Pass --speculative-config '{"method":"mtp","num speculative tokens":3,"moe backend":"triton"}' to activate it — vLLM uses the model's own MTP head without downloading or loading a second checkpoint. CUDA 13.0 and the vllm/vllm-openai:cu130-nightly Docker image . Current stable vLLM releases lack FlashInfer CUTLASS MoE kernels for Blackwell sm 120/121a. Pin to a specific nightly build hash for any production deployment — a stable vLLM release with full NVFP4 Blackwell support had not shipped as of June 2026. On a Hopper card, the path is now practical: one vllm serve command with --quantization modelopt and --reasoning-parser qwen3 , and you have a 35B reasoning model with 262K context, built-in chain-of-thought handling, and native tool calling — on a single GPU. The 3.06× memory reduction is the operational threshold between needing four-way tensor parallelism and fitting on one card. Extend the baseline from here: add --enable-auto-tool-choice --tool-call-parser qwen3 for structured tool calling in agent workloads; toggle thinking mode off for latency-sensitive paths with --default-chat-template-kwargs '{"enable thinking": false}' ; stress-test the 262K RAG path against your actual document lengths. A RedHatAI mirror https://huggingface.co/RedHatAI/Qwen3.6-35B-A3B-NVFP4 is also on Hugging Face for enterprise environments with registry requirements. On DGX Spark: the nightly image dependency is the main operational risk. Track the AEON-7/Qwen3.6-NVFP4-DFlash https://github.com/AEON-7/Qwen3.6-NVFP4-DFlash repository for community patch status and watch upstream vLLM releases for when Blackwell sm 120/121a kernels land in a stable build. Until then, pin your nightly image hash. Last updated: 2026-06-01. Based on the nvidia/Qwen3.6-35B-A3B-NVFP4 model card released 2026-05-28 and community deployment reports reviewed as of June 2026.