# Qwen3.6-35B NVFP4 runs on one H100 — A100 owners are out

> Source: <https://dev.to/creeta/qwen36-35b-nvfp4-runs-on-one-h100-a100-owners-are-out-e60>
> Published: 2026-06-18 10:37:32+00:00

NVIDIA published [nvidia/Qwen3.6-35B-A3B-NVFP4](https://huggingface.co/nvidia/Qwen3.6-35B-A3B-NVFP4) on May 28, 2026 — a post-training FP4-quantized variant of Alibaba's 35B MoE model that fits on a single H100 by cutting VRAM from ~71 GB to ~23 GB. If you're on an A100 or consumer GPU, jump to the gotchas section first — this quantization format does not run on your hardware.

NVFP4 quantization targets the weights and activations of linear operators inside transformer and MoE blocks specifically — LayerNorms, embeddings, and biases stay in BF16/F32 for numerical stability . The selective 4-bit compression yields a **3.06× reduction** in disk footprint and VRAM versus the BF16 base, dropping from roughly 71 GB to ~23 GB equivalent on Hopper hardware .

**Quick Answer:** nvidia/Qwen3.6-35B-A3B-NVFP4 fits a 35B MoE reasoning model on a single H100 by applying 4-bit quantization to linear operator weights and activations, reducing VRAM from ~71 GB to ~23 GB (3.06×) with under 1-point accuracy loss on standard benchmarks. Hopper or Blackwell required — A100 and RTX 4090 lack FP4 compute paths entirely.

The calibration pipeline used two datasets: `cnn_dailymail`

(300K+ English news articles) and NVIDIA's `Nemotron-Post-Training-Dataset-v2`

for multi-turn dialogue coverage, processed with NVIDIA Model Optimizer v0.44.0 . The dual-dataset approach is worth noting: a quantization calibrated only on news articles would likely regress on structured, multi-turn instruction-following — and the benchmark results bear that out.

NVIDIA's official eval suite shows the accuracy gap is narrow. NVFP4 stays within 0.5–0.8 points of BF16 across reasoning benchmarks, and marginally outperforms on instruction-following and multimodal tasks :

| Benchmark | BF16 | NVFP4 | Delta |
|---|---|---|---|
| MMLU Pro | 85.6 | 85.0 | −0.6 |
| GPQA Diamond | 84.9 | 84.8 | −0.1 |
| AIME 2025 | 89.2 | 88.8 | −0.4 |
| τ²-Bench Telecom | 95.5 | 94.7 | −0.8 |
| SciCode | 40.8 | 40.6 | −0.2 |
| IFBench | 62.3 | 62.8 | +0.5 |
| MMMU Pro | 74.1 | 74.5 | +0.4 |

"The NVFP4 quantized model achieves nearly identical accuracy to the BF16 original while reducing memory requirements by 3.06×, enabling deployment on hardware that would otherwise require tensor parallelism across multiple GPUs." — NVIDIA Model Optimization Team,

[nvidia/Qwen3.6-35B-A3B-NVFP4 model card]

FP4 tensor core execution paths exist only on Hopper (H100, H200) and Blackwell (GB200, GB300, DGX Spark GB10) architectures . The RTX 4090 (Ada Lovelace, sm_89), RTX 5090, and A100 (Ampere, sm_80) have no native FP4 compute units. Passing `--quantization modelopt`

on those cards will produce an error at load time or, worse, silently wrong output.

Your fallback options on non-Hopper/Blackwell hardware:

DGX Spark (Blackwell, sm_120/121a) is officially supported but needs extra setup: CUDA 13.0 and the `vllm/vllm-openai:cu130-nightly`

Docker image . Stable vLLM releases do not yet include the FlashInfer CUTLASS MoE kernels for that architecture. Verify your vLLM build has compressed-tensors NVFP4 support before attempting to serve — a mismatched build will silently fall back or crash at model load.

The minimum viable Hopper command. Two flags matter here: `--quantization modelopt`

activates NVIDIA Model Optimizer's compressed-tensors backend, and `--reasoning-parser qwen3`

strips `<think>...</think>`

chain-of-thought blocks from API responses so callers see clean completions:

```
vllm serve nvidia/Qwen3.6-35B-A3B-NVFP4 \
  --port 8000 \
  --quantization modelopt \
  --max-model-len 262144 \
  --reasoning-parser qwen3
```

DGX Spark (Blackwell) requires three environment variables set before launching. Omitting any of them causes a FlashInfer MoE kernel mismatch at startup — the error message is not always explicit about which variable is missing, so set all three :

```
export VLLM_USE_FLASHINFER_MOE_FP4=0
export VLLM_FP8_MOE_BACKEND=flashinfer_cutlass
export FLASHINFER_DISABLE_VERSION_CHECK=1

vllm serve nvidia/Qwen3.6-35B-A3B-NVFP4 \
  --quantization modelopt \
  --kv-cache-dtype fp8 \
  --attention-backend flashinfer \
  --moe-backend marlin \
  --gpu-memory-utilization 0.85 \
  --max-model-len 65536 \
  --max-num-seqs 4 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --speculative-config '{"method":"mtp","num_speculative_tokens":3,"moe_backend":"triton"}'
```

Flag-by-flag breakdown:

`--kv-cache-dtype fp8`

— halves KV-cache memory versus BF16, directly enabling longer usable context at 0.85 VRAM utilization`--moe-backend marlin`

— selects the Marlin MoE kernel for Blackwell; the default selection may not be optimal on this architecture`--max-num-seqs 4`

— keeps total concurrent sequence memory predictable on constrained VRAM; raise cautiously and watch OOM behavior`--enable-chunked-prefill`

— required on DGX Spark; without it, long prompts OOM well before the 65536-token cap`--enable-prefix-caching`

— reduces time-to-first-token for repeated system prompts in multi-turn chat workloads`--speculative-config '{"method":"mtp",...}'`

— enables the built-in Multi-Token Prediction head; no separate draft model required or loadedThe snippet below (illustrative — not executed; running it requires a CUDA-enabled environment with `transformers`

installed) shows how to verify your GPU is Hopper-class before attempting to load the model. The `major < 9`

check is the key gate: H100 reports sm_90, A100 reports sm_80:

``` python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen3.6-35B-NVFP4"

if not torch.cuda.is_available():
    raise SystemExit("CUDA GPU required")

name = torch.cuda.get_device_name(0)
major, minor = torch.cuda.get_device_capability(0)
print(f"GPU: {name} (sm_{major}{minor})")
if major < 9:
    raise SystemExit("NVFP4 path requires H100-class sm_90+; A100 is sm_80")

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, device_map={"": 0})
print(model.generate(**tok("Qwen NVFP4 fits on one H100 because", return_tensors="pt").to(0), max_new_tokens=8))
```

An A100 hits the `SystemExit`

before wasting time on a multi-minute model download. Run this check before provisioning storage or bandwidth for the weights.

Four failure modes worth knowing before you lose an hour to a non-obvious error:

`--quantization modelopt`

on A100 or RTX hardware`VLLM_FP8_MOE_BACKEND`

or `FLASHINFER_DISABLE_VERSION_CHECK`

before launch triggers a FlashInfer MoE kernel mismatch. The startup error does not always name the specific missing variable — set all three unconditionally before touching vLLM on Blackwell.`--reasoning-parser qwen3`

`<think>...</think>`

blocks in every completion response. Clients parsing JSON completions will see malformed output; streaming clients will surface the thinking chain directly to end users. This flag is not optional.`--enable-chunked-prefill`

on DGX SparkOne operational caveat for DGX Spark production: the `vllm/vllm-openai:cu130-nightly`

image is not a stable release . Pin to a specific build hash for any deployment you need to reproduce, or wait for a stable vLLM release that includes full NVFP4 Blackwell support upstream.

The built-in MTP speculative decoding head achieves an **85.4% token acceptance rate** at single-user baseline (512-token outputs), rising to 92.8% at 4,096-token outputs . No second draft model to load or manage — the MTP head is baked into the base checkpoint. At concurrency 1, output throughput is 55.9 tokens/s; at concurrency 32, it scales to 433.4 tokens/s . The community AEON-7 DFlash variant reports 117 tok/s greedy decoding on DGX Spark with 62–78% draft acceptance and 2.7–4.4 mean accepted tokens per target step .

The native context window is 131K tokens, extended to 262,144 via RoPE scaling . On DGX Spark, cap `--max-model-len`

at 65536 to stay within safe VRAM margins at 0.85 utilization. The full 262K context is accessible on H100/H200 with more VRAM headroom. Note that long-context RAG quality under NVFP4 versus BF16 at the 262K limit has not been independently benchmarked as of June 2026 — treat that range as best-effort until data appears.

The same vLLM endpoint handles image and video inputs alongside text once the server is running on Hopper or Blackwell — no additional flags needed for multimodal prompts. Multimodal inference quality under NVFP4 quantization is also unbenchmarked publicly, so evaluate against your specific workload rather than relying on text benchmark results as a proxy.

No. FP4 tensor core paths require Hopper (H100, H200) or Blackwell (GB200, GB300, DGX Spark) architecture. The RTX 4090 is Ada Lovelace (sm_89) and the A100 is Ampere (sm_80) — neither has native FP4 compute units. On those cards, use community GGUF quantizations via llama.cpp or the BF16 base model if you have 71+ GB VRAM available (RTX PRO 6000 96 GB or H100/A100 80 GB).

`--quantization modelopt`

actually do?
It tells vLLM to route weight loading through NVIDIA Model Optimizer's compressed-tensors backend, which understands the NVFP4 format and dispatches matrix multiplications through FP4 tensor cores. Without this flag, vLLM will not recognize the quantization scheme and will either throw an error at startup or attempt to interpret the weights as a different format — neither produces usable output.

0.5–0.8 points on most benchmarks per NVIDIA's official eval suite . MMLU Pro drops from 85.6 to 85.0; GPQA Diamond from 84.9 to 84.8; AIME 2025 from 89.2 to 88.8. On instruction-following (IFBench: 62.8 vs 62.3) and multimodal reasoning (MMMU Pro: 74.5 vs 74.1), NVFP4 marginally outperforms BF16 — likely a calibration dataset effect from the multi-turn Nemotron data.

No. The Multi-Token Prediction head is embedded in the Qwen3.6-35B-A3B checkpoint itself. Pass `--speculative-config '{"method":"mtp","num_speculative_tokens":3,"moe_backend":"triton"}'`

to activate it — vLLM uses the model's own MTP head without downloading or loading a second checkpoint.

CUDA 13.0 and the `vllm/vllm-openai:cu130-nightly`

Docker image . Current stable vLLM releases lack FlashInfer CUTLASS MoE kernels for Blackwell sm_120/121a. Pin to a specific nightly build hash for any production deployment — a stable vLLM release with full NVFP4 Blackwell support had not shipped as of June 2026.

On a Hopper card, the path is now practical: one `vllm serve`

command with `--quantization modelopt`

and `--reasoning-parser qwen3`

, and you have a 35B reasoning model with 262K context, built-in chain-of-thought handling, and native tool calling — on a single GPU. The 3.06× memory reduction is the operational threshold between needing four-way tensor parallelism and fitting on one card.

Extend the baseline from here: add `--enable-auto-tool-choice --tool-call-parser qwen3`

for structured tool calling in agent workloads; toggle thinking mode off for latency-sensitive paths with `--default-chat-template-kwargs '{"enable_thinking": false}'`

; stress-test the 262K RAG path against your actual document lengths. A [RedHatAI mirror](https://huggingface.co/RedHatAI/Qwen3.6-35B-A3B-NVFP4) is also on Hugging Face for enterprise environments with registry requirements.

On DGX Spark: the nightly image dependency is the main operational risk. Track the [AEON-7/Qwen3.6-NVFP4-DFlash](https://github.com/AEON-7/Qwen3.6-NVFP4-DFlash) repository for community patch status and watch upstream vLLM releases for when Blackwell sm_120/121a kernels land in a stable build. Until then, pin your nightly image hash.

*Last updated: 2026-06-01. Based on the nvidia/Qwen3.6-35B-A3B-NVFP4 model card (released 2026-05-28) and community deployment reports reviewed as of June 2026.*
