Creating the NVIDIA Nemotron 3 Ultra NVFP4 Checkpoint with NVIDIA Model Optimizer

NVIDIA released the Nemotron 3 Ultra NVFP4 checkpoint, a quantized model that achieves up to 5.9x higher inference throughput than GLM-5.1 754B FP4 on decode-heavy workloads while matching BF16 accuracy. The checkpoint uses mixed-precision quantization, with NVFP4 for MoE routed experts and FP8 or BF16 for the remaining layers, reducing model size from 1,121 GB to 352.3 GB and halving hardware requirements. It runs on both Hopper and Blackwell architectures by converting weight formats automatically.

As context windows grow longer, moving large model weights efficiently becomes critical to performance. A common way to address this is quantization, an optimization technique that compresses model weights into a smaller data format. One quantization format is NVFP4, an innovative 4-bit floating-point format introduced with the NVIDIA Blackwell architecture.

That's the approach behind our new Nemotron 3 Ultra NVFP4 checkpoint: we quantized the model into NVFP4 using NVIDIA Model Optimizer. The result is a model that achieves up to 5.9x higher inference throughput than the GLM-5.1 754B FP4 model on decode-heavy workloads while matching BF16 accuracy across nearly every benchmark, as shown in Figure 1.

While the performance benefits of NVFP4 are well understood, the process of producing a high-quality NVFP4 checkpoint is not. This post walks through how we quantized Nemotron 3 Ultra 550B to NVFP4 with NVIDIA Model Optimizer, and shows developers how to generate the best quantized checkpoints for their own models.

The Nemotron 3 Ultra NVFP4 checkpoint

A common misconception is that every layer of an NVFP4 checkpoint is stored in NVFP4. As Table 1 shows, this isn't the case: different layers are quantized to different precision formats, chosen according to each layer's sensitivity to quantization and its impact on model accuracy.
After NVFP4 quantization, the Nemotron 3 Ultra model shrinks from 1,121 GB in BF16 down to 352.3 GB, a 3.2x reduction. The payoff is substantial, cutting the hardware footprint in half.

| Layer/operator | BF16 baseline | Quantized checkpoint precision |
|---|---|---|
| Embedding, Output classification layer, MTP layers | BF16 | BF16 |
| MoE routed experts | BF16 | NVFP4 |
| MoE shared experts | BF16 | FP8 per-tensor |
| Mamba mixer linears | BF16 | FP8 per-tensor |
| Attention linears | BF16 | BF16 |
| Latent MoE | BF16 | BF16 |
| Mamba conv1d | BF16 | BF16 |
| KV cache | BF16 | FP8 |
| Mamba SSM cache | FP32 | FP16 with stochastic rounding |

Table 1. BF16 baseline compared to the quantized checkpoint precision for each layer/operator from the Nemotron 3 Ultra paper

A key innovation of the Nemotron 3 Ultra NVFP4 checkpoint is that a single checkpoint can run on both NVIDIA Hopper and Blackwell. It achieves this by converting the weight format to match the hardware it runs on. On Hopper, which lacks native FP4 tensor cores, the serving framework automatically switches to W4A16. On Blackwell, it uses native W4A4.

While W8A8 (8-bit weights, 8-bit activations) seems like the obvious Hopper choice, its larger memory footprint leaves too little headroom to fit Multi-Token Prediction (MTP). We found that MTP could only fit alongside W4A16 (4-bit weights, 16-bit activations), so W4A16 matches or beats W8A8 across the board. Read the full Nemotron 3 Ultra technical report (Section 4.6) at https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Ultra-Technical-Report.pdf to learn more.

How we found the optimal NVFP4 checkpoint

Finding an optimal NVFP4 checkpoint takes several iterations. This section tells the developer story of how we arrived at ours.

The challenge of quantizing at FP4

With FP4 quantization, there are only 8 non-negative magnitudes (0, 0.5, 1, 1.5, 2, 3, 4, and 6) to represent an entire block of weights. We need to determine how to map the original range of values onto them.
This is controlled by a scale, essentially a multiplier that determines the granularity of the representation. Choosing a poor scale means we either waste precision on small values or clip large values, both of which hurt model quality. So how should we choose the optimal scale factor? There are several approaches.

Max scaling

Here, we set the scale so the largest value in the block maps to the maximum representable FP4 value. However, when a single large weight outlier is present, max scaling compresses every other value in the block into a narrow range, which can end up flushing these values to zero. This information loss may adversely affect accuracy.

Max scaling preserves the highest magnitude value in the block, with a potential side effect of flushing other values to zero.

Try it with NVIDIA Model Optimizer:

    # W4A4: weights + activations to NVFP4 (default, max scaling)
    model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop=forward_loop)

Max scaling (also called absmax, since the scale is set entirely by the block's absolute maximum) is the simplest option, but its sensitivity to outliers makes it rarely the best one. This is exactly the gap we hit on our prior model, NVIDIA Nemotron 3 Super: naive absmax NVFP4 PTQ left an accuracy gap, so the team evaluated a range of alternative calibration strategies that don't let a single outlier dictate the scale, from mean squared error (MSE)-based weight scaling to GPTQ (https://arxiv.org/abs/2210.17323), an efficient method that uses second-order information to encode weights.

| Algorithm | Details | MMLU-Pro | GPQA | LiveCodeBench | AA-LCR |
|---|---|---|---|---|---|
| BF16 | - | 83.49 | 79.92 | 72.907 | 53.00 |
| Default NVFP4 PTQ (baseline algorithm) | Static per-tensor scales are computed using max-value calibration; per-block scales are computed dynamically from block maximum values. | 82.99 | 79.29 | 70.18 | 55.50 |
| Weight per-block scales minimizing MSE | Weight per-block scales are swept to minimize per-block MSE. | 83.31 | 79.92 | 71.37 | 56.75 |
| Weight per-block scales to minimize output MSE | Weight per-block scales are swept independently to minimize GEMM output MSE. | 83.05 | 78.98 | 71.00 | 57.06 |
| GPTQ | GPTQ (Frantar et al., 2023) is used for weight quantization. | 83.11 | 80.05 | 69.79 | 57.87 |

Table 2. Experiment results for Nemotron 3 Super quantization. The team tried several quantization methods and evaluated the accuracy change across four tasks. For more information, see the Nemotron 3 Super paper (https://arxiv.org/abs/2604.12374).

Mean squared error scaling

Another approach is mean squared error (MSE) scaling, which searches for the scale that minimizes average reconstruction error across the whole block. However, lower MSE does not always translate to better model accuracy. MSE calibration reduced per-tensor weight error by 27.1% over four-over-six scaling (described below) in our Nemotron 3 Ultra experiments, yet produced no consistent improvement on downstream benchmarks.

Try MSE-based scaling with NVIDIA Model Optimizer:

    model = mtq.quantize(model, mtq.NVFP4_W4A4_WEIGHT_MSE_FP8_SWEEP_CFG, forward_loop=forward_loop)

For our previous model, NVIDIA Nemotron 3 Super, the final quantization recipe combined MSE-based block scaling for weights with a per-tensor FP8 sweep and dynamic max-based scaling for activations. Combining MSE weights with the FP8 activation sweep gave the best accuracy-to-size tradeoff of everything we tried, and it became our optimal NVFP4 configuration for Super.

Max and MSE scaling both choose a scale from the block as a whole, but neither pays attention to where the error comes from on the grid. For Nemotron 3 Ultra, we used a scaling method that chooses the range based on the error caused by the gap in the grid.

Four-over-six scaling

Recall that NVFP4 can represent only 8 non-negative magnitudes: 0, 0.5, 1, 1.5, 2, 3, 4, and 6. Notice that after 4, the next value jumps straight to 6.
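To make the gap between 4 and 6 concrete, the following minimal sketch rounds a few already-scaled values to the nearest magnitude on the grid listed above and prints the relative error of each. The helper name `round_to_fp4` is hypothetical and the code is plain Python for illustration only; it is not the Model Optimizer implementation, and real kernels may break ties differently.

```python
# Non-negative magnitudes representable in FP4 (E2M1), as listed in the text.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def round_to_fp4(x: float) -> float:
    """Round x to the nearest FP4 magnitude, keeping its sign (hypothetical helper).

    Values above 6 saturate at 6 because 6 is then the nearest grid point.
    """
    magnitude = min(FP4_GRID, key=lambda g: abs(abs(x) - g))
    return -magnitude if x < 0 else magnitude


# Illustrative inputs chosen for this sketch; none of them is zero.
for value in (1.4, 2.6, 4.62, 5.9):
    rounded = round_to_fp4(value)
    rel_err = abs(value - rounded) / abs(value)
    print(f"{value:>5} -> {rounded:<4} relative error {rel_err:.1%}")
```

In this sketch, values on the dense part of the grid land close to a grid point, while 4.62 falls into the wide gap and is pulled down to 4, a relative error of roughly 13%.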
Any weight that falls between 4 and 6 gets rounded aggressively to one or the other, sometimes incurring over 13% error on a single value. Four-over-six fixes this: each block of weights independently chooses between scaling to a maximum of M=4 or M=6, picking whichever minimizes reconstruction error. Four-over-six works on weights and falls back to the default NVFP4 on activations.

When M=6 wins, with block: 2, 4, 5.9, 6

Scaling to M=6 gives a scale of 1, so 2, 4, and 6 map exactly onto FP4 grid points and only 5.9 rounds to 6 at negligible cost. Scaling to M=4 gives a scale of 1.5, which pushes 2 to 2.25 and 4 to 4.5 after dequantization, introducing error.

When M=4 wins, with block: 10, 20, 30, 40

Scaling to M=6 maps 30 to 4.62 (30/6.5, where 6.5 is the ideal scale 40/6 ≈ 6.67 rounded to an E4M3 block scale), which rounds down to 4, a 13% error. Scaling to M=4 instead maps 10, 20, 30, and 40 exactly onto 1, 2, 3, and 4, with zero rounding error across the entire block. MSE: 4.33 vs 0.0.

Four-over-six was used to set the FP4 routed-expert weight scales in Nemotron 3 Ultra, raising the global per-tensor weight scale by 1.75x, with each microblock picking the M=4 or M=6 grid. Across all 49,152 projection weights in the model's 48 MoE expert layers, it cut the median reconstruction MSE by 16.4% compared to standard max calibration, and delivered the best downstream result in the balanced 5.03-BPE setting: 98.5% median recovery relative to BF16, ahead of max (96.8%) and MSE (98.4%).

Try four-over-six with NVIDIA Model Optimizer:

    model = mtq.quantize(model, mtq.NVFP4_FOUR_OVER_SIX_CFG, forward_loop=forward_loop)

NVFP4_FOUR_OVER_SIX_CFG will be released in the upcoming NVIDIA Model Optimizer 0.46 in July. View the Nemotron 3 Ultra PTQ example.

Bits-per-element

Effective bits-per-element (BPE) refers to the average number of bits per weight element required to store the model's weights. A model with all BF16 weights uses 16 effective bits-per-element, while a half-FP8, half-BF16 model uses only 12. NVFP4 adds per-block and per-tensor scaling overhead, bringing its minimum to 4.5 effective bits-per-element.
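The three figures above follow from simple arithmetic, which the sketch below reproduces. It assumes 16-element blocks that each share one 8-bit (E4M3) block scale, and it ignores the per-tensor scale. The function names are hypothetical and exist only for this illustration.

```python
def nvfp4_bits_per_element(block_size: int = 16, block_scale_bits: int = 8) -> float:
    """4-bit elements plus one block scale shared by `block_size` elements.

    Assumption for this sketch: the per-tensor scale is not counted.
    """
    return 4 + block_scale_bits / block_size


def effective_bpe(parts: list[tuple[float, float]]) -> float:
    """Size-weighted average of bits per element.

    parts: (share_of_weights, bits_per_element) pairs; shares need not sum to 1.
    """
    total_share = sum(share for share, _ in parts)
    return sum(share * bits for share, bits in parts) / total_share


print(effective_bpe([(1.0, 16)]))            # all BF16 -> 16.0
print(effective_bpe([(0.5, 8), (0.5, 16)]))  # half FP8, half BF16 -> 12.0
print(nvfp4_bits_per_element())              # NVFP4 floor -> 4.5
```

Any mixed-precision assignment can be scored the same way by passing each format's share of the weights together with its bits per element.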
The per-tensor scale's 32 bits are amortized across the full tensor and are assumed to be negligible in the overall BPE calculation.

The goal is to search for the quantization configuration that pushes effective BPE as low as possible without sacrificing accuracy. This is tricky because layers are not equally robust. Some are sensitive to quantization and must stay in higher precision, which raises the effective BPE. Since each layer can be quantized at a different level or left unquantized, the number of possible combinations grows exponentially, making an exhaustive search impractical and a smarter strategy necessary.

NVIDIA Model Optimizer AutoQuantize (mtq.auto_quantize) does this for you. Instead of a fixed config, you give it a target bit budget (for example, auto_quantize_bits=4.8) and a list of candidate formats, such as NVFP4_DEFAULT_CFG and FP8_DEFAULT_CFG. It then scores each layer's sensitivity and searches for the per-layer format assignment that meets the budget at the best accuracy, keeping the most sensitive layers in the higher-precision format or skipping them entirely.

    import modelopt.torch.quantization as mtq

    model, search_state = mtq.auto_quantize(
        model,
        constraints={"auto_quantize_bits": 4.8},
        quantization_formats=["NVFP4_DEFAULT_CFG", "FP8_DEFAULT_CFG"],
        data_loader=calib_dataloader,
        forward_step=forward_step,
        loss_func=loss_func,
    )

To find the right bits-per-element for Nemotron 3 Ultra, we swept over five operating points ranging from 4.85 to 7.19 effective bits-per-element, comparing accuracy over several benchmarks in Table 3. The key signal came from AA-LCR, where going from 4.85 to 5.03 improved the score by 2.4 points (62.25 to 64.69), after which benchmark performance flattened. This makes 5.03 BPE the sweet spot.
| Task | Metric | 4.85 | 5.03† | 5.25 | 5.43 | 7.19 |
|---|---|---|---|---|---|---|
| Coding | | | | | | |
| SciCode | pass@1 (avg-16), subtask acc | 43.82 | 43.88 | 43.45 | 43.27 | 43.44 |
| Scientific Reasoning | | | | | | |
| GPQA Diamond | pass@1 (avg-32), sym. correct | 84.66 | 84.33 | 84.75 | 84.12 | 84.52 |
| HLE | pass@1, judge correct | 24.24 | 24.84 | 25.00 | 24.98 | 25.44 |
| CritPt | pass@1 (avg-8), accuracy | 3.04 | 3.93 | 5.18 | 4.82 | 4.46 |
| General | | | | | | |
| AA-Omniscience | pass@1 (avg-20), judge correct | 29.21 | 29.75 | 29.18 | 29.29 | 29.00 |
| | pass@1 (avg-20), non-hallucination | 54.13 | 51.59 | 51.84 | 51.70 | 52.81 |
| IFBench | pass@1 (avg-8), avg. score | 79.34 | 79.26 | 79.83 | 79.53 | 79.83 |
| Long Context | | | | | | |
| AA-LCR | pass@1 (avg-16), judge correct | 62.25 | 64.69 | 64.19 | 64.94 | 65.00 |

Table 3. Accuracy compared to effective bits-per-element (the numeric column headers are quantization bits-per-element) from the Nemotron 3 Ultra paper (https://arxiv.org/abs/2606.15007)

How we quantized Nemotron 3 Ultra to NVFP4 with Model Optimizer

Unlike Nemotron 3 Super 120B, Nemotron 3 Ultra is a 550B model, so it benefits significantly from parallelizing the quantization process. For this reason, we support two quantization paths. Both paths are powered by NVIDIA Model Optimizer.

| Metric | Hugging Face Transformers | Megatron-LM |
|---|---|---|
| Compute | 4 × B300 | 16 × B300; Expert parallelism = data parallelism = 16 |
| Model loading time | 40 min | < 2 min |
| Model loading and calibration time | 85 min | 9 min |
| Export | 42 min | 33 min |
| Total time | 120 min | 45 min |

Table 4. Quantization time comparing Hugging Face Transformers to Megatron-LM

Quantizing Nemotron 3 Ultra to NVFP4 follows the NVIDIA ModelOpt post-training quantization (PTQ) pipeline in NVIDIA Megatron-LM (https://github.com/NVIDIA/Megatron-LM/tree/main/examples/post_training/modelopt).
With the parallel route, the pretrained checkpoint is first converted to Megatron-LM format and then quantized with a single call to quantize.sh, passing an NVFP4 quantization config as the recipe. On the backend, Megatron-LM shards the model across GPUs with expert and data parallelism (EP = DP = 16 on 16×B300s), so the calibration forward pass runs distributed across all devices. This reduces load and calibration from ~85 minutes to ~9.

Calibration runs nemotron-post-training-dataset-v2 to fit the per-block scales, and the precision policy is entirely config-driven. Select it by passing a config to quantize.sh: either a built-in name (e.g., NVFP4_DEFAULT_CFG, FP8_DEFAULT_CFG) or a YAML recipe path, which is what ultimately gets handed to mtq.quantize(model, config, forward_loop) to install the quantizers and run calibration.

Try four-over-six scaling with NVIDIA Model Optimizer:

    HF_MODEL_CKPT=nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16

    # Step 1: Quantize to NVFP4
    TP=4 \
    MLM_MODEL_SAVE=/tmp/Nemotron-3-Ultra_quant \
    ./quantize.sh nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16 huggingface/models/nvidia/Nemotron-3-Ultra-550B-A55B/ptq/nvfp4-4o6

    # Step 2: Export the quantized checkpoint
    PP=1 \
    MLM_MODEL_CKPT=/tmp/Nemotron-3-Ultra_quant \
    EXPORT_DIR=/tmp/Nemotron-3-Ultra_NVFP4_46_HF \
    ./export.sh nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16

NVFP4_FOUR_OVER_SIX_CFG support for four-over-six is landing in NVIDIA Model Optimizer 0.46. The Nemotron 3 Ultra recipe for four-over-six is available on GitHub (https://github.com/NVIDIA/Model-Optimizer/blob/db5497e2b1a5ab15a65aca9c2f157a56d5d6a276/modelopt_recipes/huggingface/models/nvidia/Nemotron-3-Ultra-550B-A55B/ptq/nvfp4-46-max.yaml). Four-over-six works on weights and falls back to default NVFP4 on activations.

Customizing quantization configs

NVIDIA Model Optimizer is built to be customizable with different quantization configs.
The built-in NVFP4 configs range from NVFP4_DEFAULT_CFG, which quantizes broadly, to more selective presets like NVFP4_MLP_ONLY_CFG, NVFP4_EXPERTS_ONLY_CFG, and NVFP4_OMLP_ONLY_CFG that restrict FP4 to the MLP and expert layers while keeping the sensitive attention projections in higher precision.

Under the hood, a config is an ordered list of rules matched against module-name patterns, and mtq.quantize applies them. Weight quantization is governed by rules targeting the weight_quantizer pattern, where you set the format (for NVFP4, E2M1 elements with 16-wide blocks and E4M3 block scales), while activation quantization is governed by separate rules on the input_quantizer pattern. Since the two are independent, you can quantize weights only or weights and activations together, and you can carve out exceptions for specific modules by appending rules that disable them.

For anything beyond the built-in presets, you can write a full YAML recipe and load it with --recipe, which then fully defines the quant config. The following Nemotron 3 Ultra recipe applies NVFP4 with four-over-six to the routed-expert weights, keeps the shared experts and Mamba projections in FP8, uses an FP8 KV cache, and leaves everything else in BF16. The complete recipe ships with NVIDIA Model Optimizer's recipe library: nvfp4-4o6.yaml (https://github.com/NVIDIA/Model-Optimizer/blob/main/modelopt_recipes/huggingface/models/nvidia/Nemotron-3-Ultra-550B-A55B/ptq/nvfp4-4o6.yaml)

    # Nemotron 3 Ultra NVFP4 mixed-precision recipe with Four-Over-Six (4/6)
    # Example recipe for HuggingFace models; for a Megatron-compatible recipe, see the full recipe link
    quantize:
      algorithm:
        method: mse
        fp8_scale_sweep: false
        start_multiplier: 1.0  # M=6 (keep amax)
        stop_multiplier: 1.5   # M=4 (amax x 6/4)
        step_size: 0.5         # candidates: 1.0, 1.5
      quant_cfg:
        # Disable everything by default; later rules re-enable specific modules.
        - quantizer_name: '*'
          enable: false
        # MoE routed experts - NVFP4 W4A4, block 16, e4m3 block scale.
        # 4/6 adaptive block scaling on weights only; not activations
        # HF names: backbone.layers.*.mixer.experts.*.{up,down}_proj
        - quantizer_name: '*mixer.experts.*weight_quantizer'
          enable: true
          cfg:
            block_sizes: {-1: 16, type: static, scale_bits: e4m3, four_over_six: true}
            num_bits: e2m1
        - quantizer_name: '*mixer.experts.*input_quantizer'
          enable: true
          cfg:
            block_sizes: {-1: 16, type: dynamic, scale_bits: e4m3}
            num_bits: e2m1
        # Shared experts + Mamba in/out proj - FP8 per-tensor (weights+activations).
        - quantizer_name: '*mixer.shared_experts*'
          enable: true
          cfg: {num_bits: e4m3, axis: null}
        - quantizer_name: '*mixer.in_proj*'
          enable: true
          cfg: {num_bits: e4m3, axis: null}
        - quantizer_name: '*mixer.out_proj*'
          enable: true
          cfg: {num_bits: e4m3, axis: null}
        # KV cache - FP8.
        - quantizer_name: '*[kv]_bmm_quantizer'
          enable: true
          cfg: {num_bits: e4m3}

While we walked through this on Nemotron 3 Ultra, the same pipeline works with any Hugging Face model checkpoint. Simply point Model Optimizer at a model card from the Hub or a local path, pick a config (a built-in preset or your own recipe), and run the same quantize and export steps.

    import modelopt.torch.quantization as mtq
    from modelopt.torch.export import export_hf_checkpoint
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained("