Qwen3.6-35B NVFP4 runs on one H100: A100 owners are out

NVIDIA published [nvidia/Qwen3.6-35B-A3B-NVFP4](https://huggingface.co/nvidia/Qwen3.6-35B-A3B-NVFP4) on May 28, 2026. It is a post-training FP4-quantized variant of Alibaba's 35B MoE model that fits on a single H100 by cutting VRAM from ~71 GB to ~23 GB. If you're on an A100 or a consumer GPU, jump to the gotchas section first: this quantization format does not run on your hardware.

NVFP4 quantization specifically targets the weights and activations of linear operators inside the transformer and MoE blocks; LayerNorms, embeddings, and biases stay in BF16/F32 for numerical stability. The selective 4-bit compression yields a 3.06× reduction in disk footprint and VRAM versus the BF16 base, dropping from roughly 71 GB to ~23 GB on Hopper hardware.

Quick Answer: nvidia/Qwen3.6-35B-A3B-NVFP4 fits a 35B MoE reasoning model on a single H100 by applying 4-bit quantization to linear-operator weights and activations, reducing VRAM from ~71 GB to ~23 GB (3.06×) with under 1 point of accuracy loss on standard benchmarks. Hopper or Blackwell is required; the A100 and RTX 4090 lack FP4 compute paths entirely.

The calibration pipeline used two datasets, processed with NVIDIA Model Optimizer v0.44.0: cnn_dailymail (300K+ English news articles) and NVIDIA's Nemotron-Post-Training-Dataset-v2 for multi-turn dialogue coverage.

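
As a rough sanity check on the footprint figures quoted above, the arithmetic can be sketched in a few lines of Python. Only the 71 GB, 23 GB, and 3.06× values come from the article. The block layout (4-bit values with one 8-bit scale per 16-element block) and the 80 GB H100 capacity are assumptions added for illustration, and the "implied fraction" is a simplified estimate, not a number from the model card.

```python
# Back-of-envelope footprint check. Figures marked "article" are quoted in the
# text; everything marked "assumption" is illustrative and not from the source.

BF16_GB = 71.0         # article: approximate BF16 footprint
NVFP4_GB = 23.0        # article: approximate NVFP4 footprint
REPORTED_RATIO = 3.06  # article: reported reduction factor
H100_GB = 80.0         # assumption: an 80 GB H100 variant

# Assumption: NVFP4 stores 4-bit values plus one 8-bit scale per 16-value block.
VALUE_BITS = 4
SCALE_BITS = 8
BLOCK_SIZE = 16
BF16_BITS = 16


def bits_per_quantized_weight() -> float:
    """Average storage cost of one quantized weight, including its share of the block scale."""
    return VALUE_BITS + SCALE_BITS / BLOCK_SIZE


def ideal_ratio() -> float:
    """Compression ratio if every BF16 weight were stored in the 4-bit block format."""
    return BF16_BITS / bits_per_quantized_weight()


def implied_quantized_fraction(observed_ratio: float) -> float:
    """Share of weights that must be quantized to reach observed_ratio,
    assuming the rest stay in BF16 (a simplification)."""
    avg_bits = BF16_BITS / observed_ratio
    return (BF16_BITS - avg_bits) / (BF16_BITS - bits_per_quantized_weight())


print(f"ratio from rounded sizes : {BF16_GB / NVFP4_GB:.2f}x")
print(f"ideal all-quantized ratio: {ideal_ratio():.2f}x")
print(f"implied quantized share  : {implied_quantized_fraction(REPORTED_RATIO):.1%}")
print(f"headroom on one GPU      : {H100_GB - NVFP4_GB:.0f} GB")
```

Under these assumptions the all-quantized ceiling is about 3.56×, so a reported 3.06× is consistent with a small share of tensors (LayerNorms, embeddings, biases) staying in higher precision. The headroom figure is what remains for KV cache and activations, which is what actually determines usable context length and batch size.
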
The dual-dataset approach is worth noting: a quantization calibrated only on news articles would likely regress on structured, multi-turn instruction-following, and the benchmark results show no such regression.

NVIDIA's official eval suite shows the accuracy gap is narrow. NVFP4 stays within 0.1 to 0.8 points of BF16 on the reasoning benchmarks and marginally outperforms it on instruction-following and multimodal tasks:

| Benchmark | BF16 | NVFP4 | Delta |
|---|---|---|---|
| MMLU Pro | 85.6 | 85.0 | −0.6 |
| GPQA Diamond | 84.9 | 84.8 | −0.1 |
| AIME 2025 | 89.2 | 88.8 | −0.4 |
| τ²-Bench Telecom | 95.5 | 94.7 | −0.8 |
| SciCode | 40.8 | 40.6 | −0.2 |
| IFBench | 62.3 | 62.8 | +0.5 |
| MMMU Pro | 74.1 | 74.5 | +0.4 |

"The NVFP4 quantized model achieves nearly identical accuracy to the BF16 original while reducing memory requirements by 3.06×, enabling deployment on hardware that would otherwise require tensor parallelism across multiple GPUs." (NVIDIA Model Optimization Team, nvidia/Qwen3.6-35B-A3B-NVFP4 model card)

FP4 tensor core execution paths exist only on Hopper (H100, H200) and Blackwell (GB200, GB300, DGX Spark GB10) architectures. The RTX 4090 (Ada Lovelace, sm_89) and A100 (Ampere, sm_80) have no native FP4 compute units. Passing `--quantization modelopt` on those cards will produce an error at load time or, worse, silently wrong output.

Your fallback options on non-Hopper/Blackwell hardware:

DGX Spark (Blackwell, sm_120/sm_121a) is officially supported but needs extra setup: CUDA 13.0 and the `vllm/vllm-openai:cu130-nightly` Docker image. Stable vLLM releases do not yet include the FlashInfer CUTLASS MoE kernels for that architecture.

Verify that your vLLM build has compressed-tensors NVFP4 support before attempting to serve; a mismatched build will silently fall back or crash at model load.

The minimum viable Hopper command. Two flags matter here: `--quantization modelopt` activates NVIDIA Model Optimizer's compressed-tensors backend, and `--reasoning-parser qwen3` strips