# Creating the NVIDIA Nemotron 3 Ultra NVFP4 Checkpoint with NVIDIA Model Optimizer

> Source: <https://developer.nvidia.com/blog/creating-the-nvidia-nemotron-3-ultra-nvfp4-checkpoint-with-nvidia-model-optimizer/>
> Published: 2026-06-26 16:00:35+00:00

As context windows grow longer, moving large model weights efficiently becomes critical to performance. A common way to address this is quantization, an optimization technique that compresses model weights into a smaller data format. One quantization format is NVFP4, an innovative 4-bit floating-point format introduced with the NVIDIA Blackwell architecture.

That’s the approach behind our new Nemotron 3 Ultra NVFP4 checkpoint: we quantized the model into NVFP4 using NVIDIA Model Optimizer. The result is a model that achieves up to 5.9x higher inference throughput than the GLM-5.1 754B FP4 model on decode-heavy workloads while matching BF16 accuracy across nearly every benchmark, as shown in Figure 1.

While the performance benefits of NVFP4 are well understood, the process of producing a high-quality NVFP4 checkpoint is not. This post walks through how we quantized Nemotron 3 Ultra (550B) to NVFP4 with NVIDIA Model Optimizer, and shows developers how to generate the best quantized checkpoints for their own models.

## The Nemotron 3 Ultra NVFP4 checkpoint

A common misconception is that every layer of an NVFP4 checkpoint is stored in NVFP4. As Table 1 shows, this isn’t the case: different layers are quantized to different precision formats, chosen according to each layer’s sensitivity to quantization and its impact on model accuracy. After NVFP4 quantization, the Nemotron 3 Ultra model shrinks from 1,121 GB in BF16 down to 352.3 GB, a 3.2x reduction. The payoff is substantial, cutting the hardware footprint in half.

| Layer/operator | BF16 baseline | Quantized checkpoint precision |
| --- | --- | --- |
| Embedding, Output classification layer, MTP layers | BF16 | BF16 |
| MoE routed experts | BF16 | NVFP4 |
| MoE shared experts | BF16 | FP8 per-tensor |
| Mamba mixer linears | BF16 | FP8 per-tensor |
| Attention linears | BF16 | BF16 |
| Latent MoE | BF16 | BF16 |
| Mamba conv1d | BF16 | BF16 |
| KV cache | BF16 | FP8 |
| Mamba SSM cache | FP32 | FP16 with stochastic rounding |

*Table 1. BF16 baseline compared to the quantized checkpoint precision for each layer/operator, from the Nemotron 3 Ultra paper*

A key innovation of the Nemotron 3 Ultra NVFP4 checkpoint is that a single checkpoint can run on both NVIDIA Hopper and Blackwell GPUs. It achieves this by adapting the execution format to the hardware it runs on. On Hopper, which lacks native FP4 Tensor Cores, the serving framework automatically switches to W4A16. On Blackwell, it uses native W4A4.

While W8A8 (8-bit weights, 8-bit activations) seems like the obvious Hopper choice, its larger memory footprint leaves too little headroom to fit Multi-Token Prediction (MTP). We found that MTP fits only alongside W4A16 (4-bit weights, 16-bit activations), so W4A16 matches or beats W8A8 across the board. Read the full [Nemotron 3 Ultra technical report (Section 4.6)](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Ultra-Technical-Report.pdf) to learn more.

## How we found the optimal NVFP4 checkpoint

Finding an optimal NVFP4 checkpoint takes several iterations. This section walks through how we arrived at ours.

### The challenge of quantizing at FP4

With FP4 quantization, there are only 8 non-negative magnitudes (0, 0.5, 1, 1.5, 2, 3, 4, and 6), plus a sign bit, to represent an entire block of weights. We need to determine how to map the original range of values onto this grid. This is controlled by a *scale*, essentially a multiplier that determines the granularity of the representation. Choosing a poor scale means we either waste precision on small values or clip large values, both of which hurt model quality. So how should we choose the optimal scale factor? There are several approaches.

### Max scaling

Here, we set the scale so that the largest value in the block maps to the maximum representable FP4 value. This preserves the highest-magnitude value in the block, but when a single large weight outlier is present, max scaling compresses every other value in the block into a narrow range and can flush those values to zero. This information loss may adversely affect accuracy.
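To make the outlier effect concrete, here is a minimal pure-Python sketch of max scaling on a single block. It is an illustration only, not Model Optimizer code: the helper names `round_to_fp4` and `quantize_block_max` are hypothetical, the scale is kept in full precision rather than stored in FP8, and ties round toward the smaller magnitude for simplicity.

```python
# FP4 (E2M1) magnitudes; a sign bit covers negative values.
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def round_to_fp4(x):
    """Round x to the nearest FP4 magnitude, keeping its sign.

    Simplification: ties resolve to the smaller magnitude.
    """
    mag = min(FP4_GRID, key=lambda g: abs(abs(x) - g))
    return -mag if (x < 0 and mag > 0) else mag


def quantize_block_max(block):
    """Max (absmax) scaling: the largest magnitude in the block maps to 6."""
    amax = max(abs(v) for v in block)
    if amax == 0:
        return [0.0 for _ in block], 0.0
    scale = amax / 6.0
    dequantized = [round_to_fp4(v / scale) * scale for v in block]
    return dequantized, scale


# Without an outlier, every value lands exactly on the grid.
clean, clean_scale = quantize_block_max([0.25, -0.5, 1.0, 1.5])
print(clean, clean_scale)      # [0.25, -0.5, 1.0, 1.5] 0.25

# One large outlier stretches the scale and flushes the rest to zero.
skewed, skewed_scale = quantize_block_max([0.25, -0.5, 1.0, 48.0])
print(skewed, skewed_scale)    # [0.0, 0.0, 0.0, 48.0] 8.0
```

The second block differs from the first only in its last element, yet the three small weights are lost entirely, which is the failure mode the alternative calibration strategies try to avoid.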

**Try it with NVIDIA Model Optimizer:**

```
import modelopt.torch.quantization as mtq

# W4A4: weights + activations to NVFP4 (default, max scaling)
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop=forward_loop)
```

Max scaling (also called `absmax`, since the scale is set entirely by the block’s absolute maximum) is the simplest option, but that sensitivity to outliers makes it rarely the best one.

This is exactly the gap we hit on our prior model, NVIDIA Nemotron 3 Super: naive `absmax` NVFP4 PTQ left an accuracy gap, so the team evaluated a range of alternative calibration strategies that don’t let a single outlier dictate the scale, from mean squared error (MSE)-based weight scaling to [GPTQ](https://arxiv.org/abs/2210.17323), an efficient method that uses second-order information to encode weights.

| Algorithm | Details | MMLU-Pro | GPQA | LiveCodeBench | AA-LCR |
| --- | --- | --- | --- | --- | --- |
| BF16 | N/A | 83.49 | 79.92 | 72.907 | 53.00 |
| Default NVFP4 PTQ (Baseline algorithm) | Static per-tensor scales are computed using max-value calibration; per-block scales are computed dynamically from block maximum values. | 82.99 | 79.29 | 70.18 | 55.50 |
| Weight per-block scales minimizing MSE | Weight per-block scales are swept to minimize per-block MSE. | 83.31 | 79.92 | 71.37 | 56.75 |
| Weight per-block scales to minimize output MSE | Weight per-block scales are swept independently to minimize GEMM output MSE. | 83.05 | 78.98 | 71.00 | 57.06 |
| GPTQ | GPTQ (Frantar et al., 2023) is used for weight quantization. | 83.11 | 80.05 | 69.79 | 57.87 |

*Table 2. Experiment results for Nemotron 3 Super quantization. The team tried several quantization methods and evaluated the accuracy change across four tasks. For more information, see the* [Nemotron 3 Super paper](https://arxiv.org/abs/2604.12374)

### Mean squared error scaling

Another approach is mean squared error (MSE) scaling, which searches for the scale that minimizes average reconstruction error across the whole block.

However, lower MSE does not always translate to better model accuracy. In our Nemotron 3 Ultra experiments, MSE calibration reduced per-tensor weight error by 27.1% relative to four-over-six scaling (described later in this post), yet produced no consistent improvement on downstream benchmarks.
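The idea of sweeping for the lowest-error scale can be sketched in a few lines of pure Python. This is a toy illustration under stated assumptions, not the Model Optimizer implementation: the helper names are hypothetical, the candidate multipliers (0.50 to 1.50 of the max-scaling scale) are an arbitrary choice, and the scale is not rounded to FP8.

```python
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def round_to_fp4(x):
    """Nearest FP4 magnitude with sign; values beyond 6 clip to 6."""
    mag = min(FP4_GRID, key=lambda g: abs(abs(x) - g))
    return -mag if (x < 0 and mag > 0) else mag


def block_mse(block, scale):
    """Mean squared reconstruction error of a block at a given scale."""
    dequantized = [round_to_fp4(v / scale) * scale for v in block]
    return sum((v - d) ** 2 for v, d in zip(block, dequantized)) / len(block)


def mse_scale(block, multipliers=None):
    """Pick the candidate scale with the lowest block MSE.

    Candidates are multiples of the max-scaling scale (amax / 6).
    The default sweep range is an assumption made for this sketch.
    """
    if multipliers is None:
        multipliers = [i / 20 for i in range(10, 31)]  # 0.50, 0.55, ..., 1.50
    max_scale = max(abs(v) for v in block) / 6.0
    if max_scale == 0:
        return 1.0
    candidates = [max_scale * m for m in multipliers]
    return min(candidates, key=lambda s: block_mse(block, s))


# Seven mid-range weights and one larger value.
block = [2.7] * 7 + [6.0]
max_scale = max(abs(v) for v in block) / 6.0
best_scale = mse_scale(block)

max_error = block_mse(block, max_scale)
best_error = block_mse(block, best_scale)
print(f"max scaling MSE: {max_error:.4f}, swept MSE: {best_error:.4f}")
```

Because a multiplier of 1.0 is among the candidates, the swept scale can never do worse than max scaling on this per-block metric. As the text notes, that does not guarantee better downstream accuracy.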

**Try MSE-based scaling with NVIDIA Model Optimizer:**

```
model = mtq.quantize(model, mtq.NVFP4_W4A4_WEIGHT_MSE_FP8_SWEEP_CFG, forward_loop=forward_loop)
```

For our previous model, NVIDIA Nemotron 3 Super, the final quantization recipe combined MSE-based block scaling for weights with a per-tensor FP8 sweep and dynamic max-based scaling for activations. Combining MSE weights with the FP8 activation sweep gave the best accuracy-to-size tradeoff of everything we tried, and it became our optimal NVFP4 configuration for Super.

Max scaling and MSE scaling each choose a single scale for the block, but neither pays attention to where on the FP4 grid the error comes from. For Nemotron 3 Ultra, we used a scaling method that chooses the range based on the error introduced by the widest gap in the grid.

### Four-over-six scaling

Recall that NVFP4 can represent only 8 magnitudes: 0, 0.5, 1, 1.5, 2, 3, 4, and 6. Notice that after 4, the next value jumps straight to 6. Any weight that falls in that gap is rounded aggressively to either 4 or 6, sometimes incurring over 13% error on a single value.

Four-over-six addresses this: each block of weights independently chooses between scaling to a maximum of M=4 or M=6, picking whichever minimizes reconstruction error. Four-over-six works on weights and falls back to the default NVFP4 on activations.

**When M=6 wins with block: [2, 4, 5.9, 6]**

Scaling to M=6 (a scale factor of 1) maps 2, 4, and 6 exactly onto FP4 grid points, and only 5.9 rounds to 6 at negligible cost. Scaling to M=4 (a scale factor of 1.5) pushes 2 to 2.25 and 4 to 4.5 after dequantization, introducing error.

**When M=4 wins with block: [10, 20, 30, 40]**

Scaling to M=6 gives a block scale of 40/6 ≈ 6.67, which becomes 6.5 once it is stored as an E4M3 value. That maps 30 to 4.62, which rounds down to 4, a 13% error. Scaling to M=4 instead maps 10, 20, 30, and 40 exactly onto 1, 2, 3, and 4, with zero rounding error across the entire block. MSE: 4.33 vs 0.0.
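The two worked examples can be reproduced with a short pure-Python sketch. It is a simplified model rather than the Model Optimizer kernel: the function names are hypothetical, the per-tensor global scale is assumed to be 1, the E4M3 rounding ignores subnormals, and FP4 ties round toward the smaller magnitude.

```python
import math

FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def round_to_fp4(x):
    """Nearest FP4 magnitude with sign; values beyond 6 clip to 6."""
    mag = min(FP4_GRID, key=lambda g: abs(abs(x) - g))
    return -mag if (x < 0 and mag > 0) else mag


def round_to_e4m3(x):
    """Round a positive scale to 3 mantissa bits (simplified E4M3, max 448)."""
    if x <= 0:
        return 0.0
    _, exp = math.frexp(x)              # x = m * 2**exp with 0.5 <= m < 1
    step = math.ldexp(1.0, exp - 4)     # spacing of values with 3 mantissa bits
    return min(round(x / step) * step, 448.0)


def block_mse(block, m):
    """MSE when the block maximum is mapped to FP4 value m (4 or 6)."""
    scale = round_to_e4m3(max(abs(v) for v in block) / m)
    if scale == 0:
        return 0.0
    dequantized = [round_to_fp4(v / scale) * scale for v in block]
    return sum((v - d) ** 2 for v, d in zip(block, dequantized)) / len(block)


def four_over_six(block):
    """Return (M, mse) for whichever of M=6 or M=4 reconstructs the block best."""
    return min(((m, block_mse(block, m)) for m in (6, 4)), key=lambda t: t[1])


print(four_over_six([2, 4, 5.9, 6]))      # M=6 wins
print(four_over_six([10, 20, 30, 40]))    # (4, 0.0)
print(round(block_mse([10, 20, 30, 40], 6), 2))   # 4.33
```

Under these assumptions the sketch selects M=6 for the first block and M=4 for the second, matching the MSE figures quoted above.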

Four-over-six was used to set the FP4 routed-expert weight scales in Nemotron 3 Ultra: the global per-tensor weight scale is raised by 1.75x, and each microblock picks either the M=4 or the M=6 grid. Across all 49,152 projection weights in the model’s 48 MoE expert layers, it cut the median reconstruction MSE by 16.4% compared to standard max calibration. It also delivered the best downstream result in the balanced 5.03-BPE setting: 98.5% median recovery relative to BF16, ahead of max (96.8%) and MSE (98.4%).

**Try four-over-six with NVIDIA Model Optimizer:**

```
model = mtq.quantize(model, mtq.NVFP4_FOUR_OVER_SIX_CFG, forward_loop=forward_loop)
```

`NVFP4_FOUR_OVER_SIX_CFG` will be released in the upcoming NVIDIA Model Optimizer 0.46 in July. View the Nemotron 3 Ultra PTQ example.

### Bits-per-element

Effective bits-per-element (BPE) is the average number of bits used to store each weight of the model. A model with all BF16 weights uses 16 effective bits-per-element, while a half-FP8, half-BF16 model uses only 12. NVFP4 adds per-block and per-tensor scaling overhead: each 16-element block carries an 8-bit E4M3 scale, which brings the minimum to 4 + 8/16 = 4.5 effective bits-per-element. The per-tensor scale’s 32 bits are amortized across the full tensor and are assumed to be negligible in the overall BPE calculation.
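The arithmetic behind these figures is easy to check. The snippet below is a small sketch with a hypothetical helper, `effective_bpe`, that takes weight fractions and their storage cost. The last line is only a rough cross-check that assumes the checkpoint sizes quoted earlier (1,121 GB in BF16, 352.3 GB quantized) are dominated by weights.

```python
def effective_bpe(parts):
    """Weighted average bits per element.

    parts: iterable of (fraction_of_weights, bits_per_element) pairs.
    """
    total = sum(fraction for fraction, _ in parts)
    return sum(fraction * bits for fraction, bits in parts) / total


BF16_BITS = 16
FP8_BITS = 8
# 4-bit element plus one 8-bit E4M3 scale shared by each 16-element block.
NVFP4_BITS = 4 + 8 / 16

all_nvfp4 = effective_bpe([(1.0, NVFP4_BITS)])
half_and_half = effective_bpe([(0.5, FP8_BITS), (0.5, BF16_BITS)])
print(all_nvfp4)        # 4.5
print(half_and_half)    # 12.0

# Rough cross-check from checkpoint sizes: BF16 stores 16 bits per weight,
# so the size ratio scales that figure down.
size_based_bpe = 16 * 352.3 / 1121
print(round(size_based_bpe, 2))   # 5.03
```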

The goal is to search for the quantization configuration that pushes effective BPE as low as possible without sacrificing accuracy. This is tricky because layers are not equally robust. Some are sensitive to quantization and must stay in higher precision, which raises the effective BPE. Since each layer can be quantized at a different level or left unquantized, the number of possible combinations grows exponentially, making an exhaustive search impractical and a smarter strategy necessary.

NVIDIA Model Optimizer AutoQuantize (`mtq.auto_quantize`) automates this search. Instead of a fixed config, you give it a target bit budget (for example, `auto_quantize_bits=4.8`) and a list of candidate formats, such as `NVFP4_DEFAULT_CFG` and `FP8_DEFAULT_CFG`. It then scores each layer’s sensitivity and searches for the per-layer format assignment that meets the budget at the best accuracy, keeping the most sensitive layers in the higher-precision format or skipping them entirely.

``` python
import modelopt.torch.quantization as mtq

model, search_state = mtq.auto_quantize(
    model,
    constraints={"auto_quantize_bits": 4.8},
    quantization_formats=["NVFP4_DEFAULT_CFG", "FP8_DEFAULT_CFG"],
    data_loader=calib_dataloader,
    forward_step=forward_step,
    loss_func=loss_func,
)
```
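To give a feel for what a budget-constrained search does, here is a toy greedy heuristic in pure Python. It is explicitly not the algorithm AutoQuantize uses: the layer groups, parameter counts, and sensitivity scores are all made up for illustration, and the function `assign_formats` is hypothetical. It starts every layer group in the cheapest format and repeatedly upgrades the group that recovers the most loss per extra bit, as long as the budget allows.

```python
# Storage cost per weight for each candidate format.
FORMAT_BITS = {"NVFP4": 4.5, "FP8": 8.0, "BF16": 16.0}
LADDER = ["NVFP4", "FP8", "BF16"]   # cheapest to most expensive


def assign_formats(layers, budget_bits):
    """Greedy per-layer format assignment under an effective-BPE budget.

    layers: {name: (num_params, {format: estimated_loss_increase})}
    Returns (choice, effective_bpe).
    """
    choice = {name: LADDER[0] for name in layers}
    total = sum(n for n, _ in layers.values())

    def current_bpe():
        return sum(layers[k][0] * FORMAT_BITS[f] for k, f in choice.items()) / total

    while True:
        best = None
        for name, (n, loss) in layers.items():
            idx = LADDER.index(choice[name])
            if idx + 1 == len(LADDER):
                continue                      # already at highest precision
            cur, nxt = LADDER[idx], LADDER[idx + 1]
            gain = loss[cur] - loss[nxt]      # loss recovered by upgrading
            cost = n * (FORMAT_BITS[nxt] - FORMAT_BITS[cur])
            if gain <= 0 or current_bpe() + cost / total > budget_bits:
                continue                      # no benefit, or over budget
            if best is None or gain / cost > best[0]:
                best = (gain / cost, name, nxt)
        if best is None:
            return choice, current_bpe()
        choice[best[1]] = best[2]


# Hypothetical layer groups: (relative parameter count, made-up sensitivities).
LAYERS = {
    "routed_experts": (90, {"NVFP4": 0.02, "FP8": 0.01, "BF16": 0.0}),
    "attention": (5, {"NVFP4": 0.50, "FP8": 0.10, "BF16": 0.0}),
    "mamba": (5, {"NVFP4": 0.20, "FP8": 0.05, "BF16": 0.0}),
}

choice, bits = assign_formats(LAYERS, budget_bits=5.5)
print(choice, bits)
```

With these invented numbers, the large but robust expert group stays in NVFP4 while the small, sensitive groups are promoted, which is the general shape of the trade-off described above.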

To find the right bits-per-element for Nemotron 3 Ultra, we swept over five operating points ranging from 4.85 to 7.19 effective bits-per-element, comparing accuracy over several benchmarks in Table 3. The key signal came from AA-LCR: going from 4.85 to 5.03 BPE improved the score by 2.4 points, after which it flattened. This makes 5.03 BPE the sweet spot.

| Task | Metric | 4.85 BPE | 5.03† BPE | 5.25 BPE | 5.43 BPE | 7.19 BPE |
| --- | --- | --- | --- | --- | --- | --- |
| **Coding** | | | | | | |
| SciCode | pass@1 (avg-16), subtask acc | 43.82 | 43.88 | 43.45 | 43.27 | 43.44 |
| **Scientific Reasoning** | | | | | | |
| GPQA Diamond | pass@1 (avg-32), sym. correct | 84.66 | 84.33 | 84.75 | 84.12 | 84.52 |
| HLE | pass@1, judge correct | 24.24 | 24.84 | 25.00 | 24.98 | 25.44 |
| CritPt | pass@1 (avg-8), accuracy | 3.04 | 3.93 | 5.18 | 4.82 | 4.46 |
| **General** | | | | | | |
| AA-Omniscience | pass@1 (avg-20), judge correct | 29.21 | 29.75 | 29.18 | 29.29 | 29.00 |
| | pass@1 (avg-20), non-hallucination | 54.13 | 51.59 | 51.84 | 51.70 | 52.81 |
| IFBench | pass@1 (avg-8), avg. score | 79.34 | 79.26 | 79.83 | 79.53 | 79.83 |
| **Long Context** | | | | | | |
| AA-LCR | pass@1 (avg-16), judge correct | 62.25 | 64.69 | 64.19 | 64.94 | 65.00 |

*Table 3. Accuracy compared to effective bits-per-element from the* [Nemotron 3 Ultra paper](https://arxiv.org/abs/2606.15007)

## How we quantized Nemotron 3 Ultra to NVFP4 with Model Optimizer

Unlike Nemotron 3 Super 120B, Nemotron 3 Ultra is a 550B model, so it benefits significantly from parallelizing the quantization process. For this reason, we support two quantization paths.

Both paths are powered by NVIDIA Model Optimizer.

| Metric | Hugging Face Transformers | Megatron-LM |
| --- | --- | --- |
| Compute | 4 × B300 | 16 × B300; expert parallelism = data parallelism = 16 |
| Model loading time | 40 min | < 2 min |
| Model loading and calibration time | 85 min | 9 min |
| Export | 42 min | 33 min |
| Total time | 120 min | 45 min |

*Table 4. Quantization time comparing Hugging Face Transformers to Megatron-LM*

Quantizing Nemotron 3 Ultra to NVFP4 follows the [NVIDIA ModelOpt post-training quantization (PTQ) pipeline in NVIDIA Megatron-LM](https://github.com/NVIDIA/Megatron-LM/tree/main/examples/post_training/modelopt). With the parallel route, the pretrained checkpoint is first converted to Megatron-LM format and then quantized with a single call to `quantize.sh`, passing an NVFP4 quantization config as the recipe. On the backend, Megatron-LM shards the model across GPUs with expert and data parallelism (`EP = DP = 16` on 16×B300s), so the calibration forward pass runs distributed across all devices. This reduces load and calibration from ~85 minutes to ~9.

Calibration runs on `nemotron-post-training-dataset-v2` to fit the per-block scales, and the precision policy is entirely config-driven. Select it by passing a config to `quantize.sh`: either a built-in name (for example, `NVFP4_DEFAULT_CFG` or `FP8_DEFAULT_CFG`) or a YAML recipe path. That config is what ultimately gets handed to `mtq.quantize(model, config, forward_loop)` to install the quantizers and run calibration.

**Try four-over-six scaling with NVIDIA Model Optimizer:**

```
HF_MODEL_CKPT=nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16

# Step 1 — Quantize to NVFP4
TP=4 \
MLM_MODEL_SAVE=/tmp/Nemotron-3-Ultra_quant \
./quantize.sh nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16 huggingface/models/nvidia/Nemotron-3-Ultra-550B-A55B/ptq/nvfp4-4o6

# Step 2 — Export the quantized checkpoint
PP=1 \
MLM_MODEL_CKPT=/tmp/Nemotron-3-Ultra_quant \
EXPORT_DIR=/tmp/Nemotron-3-Ultra_NVFP4_46_HF \
./export.sh nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16
```

Four-over-six support (`NVFP4_FOUR_OVER_SIX_CFG`) is landing in NVIDIA Model Optimizer 0.46. The Nemotron 3 Ultra recipe for four-over-six is [available on GitHub](https://github.com/NVIDIA/Model-Optimizer/blob/db5497e2b1a5ab15a65aca9c2f157a56d5d6a276/modelopt_recipes/huggingface/models/nvidia/Nemotron-3-Ultra-550B-A55B/ptq/nvfp4-46-max.yaml). Four-over-six works on weights and falls back to default NVFP4 on activations.

## Customizing quantization configs

NVIDIA Model Optimizer is built to be customizable with different quantization configs. The built-in NVFP4 configs range from `NVFP4_DEFAULT_CFG`, which quantizes broadly, to more selective presets like `NVFP4_MLP_ONLY_CFG`, `NVFP4_EXPERTS_ONLY_CFG`, and `NVFP4_OMLP_ONLY_CFG` that restrict FP4 to the MLP and expert layers while keeping the sensitive attention projections in higher precision.

Under the hood, a config is an ordered list of rules matched against module-name patterns, and `mtq.quantize()` applies them. Weight quantization is governed by rules targeting the `*weight_quantizer` pattern, where you set the format (for NVFP4, `E2M1` elements with 16-wide blocks and `E4M3` block scales), while activation quantization is governed by separate rules on the `*input_quantizer` pattern.

Since the two are independent, you can quantize weights only or weights and activations together, and you can carve out exceptions for specific modules by appending rules that disable them. For anything beyond the built-in presets, you can write a full YAML recipe and load it with `--recipe`, which then fully defines the quant config.

The following Nemotron 3 Ultra recipe applies NVFP4 with four-over-six to the routed-expert weights, keeps the shared experts and Mamba projections in FP8, uses an FP8 KV cache, and leaves everything else in BF16. The complete recipe ships with NVIDIA Model Optimizer’s recipe library: [nvfp4-4o6.yaml](https://github.com/NVIDIA/Model-Optimizer/blob/main/modelopt_recipes/huggingface/models/nvidia/Nemotron-3-Ultra-550B-A55B/ptq/nvfp4-4o6.yaml)

```
# Nemotron 3 Ultra NVFP4 mixed-precision recipe with Four-Over-Six (4/6)
# Example recipe for HuggingFace models, for Megatron-compatible recipe see the full recipe link

quantize:
  algorithm:
    method: mse
    fp8_scale_sweep: false
    start_multiplier: 1.0   # M=6 (keep amax)
    stop_multiplier: 1.5    # M=4 (amax x 6/4)
    step_size: 0.5          # candidates [1.0, 1.5]

  quant_cfg:
    # Disable everything by default; later rules re-enable specific modules.
    - quantizer_name: '*'
      enable: false

    # MoE routed experts -> NVFP4 W4A4, block 16, e4m3 block scale.
    # 4/6 adaptive block scaling on weights only; not activations
    # HF names: backbone.layers.*.mixer.experts.*.{up,down}_proj
    - quantizer_name: '*mixer.experts.*weight_quantizer'
      enable: true
      cfg:
        block_sizes: {-1: 16, type: static, scale_bits: e4m3, four_over_six: true}
        num_bits: e2m1
    - quantizer_name: '*mixer.experts.*input_quantizer'
      enable: true
      cfg:
        block_sizes: {-1: 16, type: dynamic, scale_bits: e4m3}
        num_bits: e2m1

    # Shared experts + Mamba in/out_proj -> FP8 per-tensor (weights+activations).
    - quantizer_name: '*mixer.shared_experts*'
      enable: true
      cfg: {num_bits: e4m3, axis: null}
    - quantizer_name: '*mixer.in_proj*'
      enable: true
      cfg: {num_bits: e4m3, axis: null}
    - quantizer_name: '*mixer.out_proj*'
      enable: true
      cfg: {num_bits: e4m3, axis: null}

    # KV cache -> FP8.
    - quantizer_name: '*[kv]_bmm_quantizer'
      enable: true
      cfg: {num_bits: e4m3}
```
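The "disable everything, then re-enable specific modules" pattern in this recipe can be sketched with Python's standard `fnmatch` module. This is only a model of the behavior under two assumptions: that patterns are shell-style wildcards and that the last matching rule wins. The quantizer names in the example are hypothetical, and the real matching logic in Model Optimizer may differ in detail.

```python
from fnmatch import fnmatchcase

# (pattern, enable) pairs in the same order as the recipe's quant_cfg list.
RULES = [
    ("*", False),
    ("*mixer.experts.*weight_quantizer", True),
    ("*mixer.experts.*input_quantizer", True),
    ("*mixer.shared_experts*", True),
    ("*mixer.in_proj*", True),
    ("*mixer.out_proj*", True),
    ("*[kv]_bmm_quantizer", True),
]


def is_enabled(quantizer_name, rules=RULES):
    """Apply rules in order; the last pattern that matches decides."""
    enabled = None
    for pattern, enable in rules:
        if fnmatchcase(quantizer_name, pattern):
            enabled = enable
    return enabled


# Hypothetical quantizer names, for illustration only.
examples = [
    "backbone.layers.3.mixer.experts.7.up_proj.weight_quantizer",
    "backbone.layers.3.mixer.shared_experts.up_proj.weight_quantizer",
    "backbone.layers.5.mixer.q_proj.weight_quantizer",
    "backbone.layers.5.mixer.k_bmm_quantizer",
    "lm_head.weight_quantizer",
]
for name in examples:
    print(f"{name}: {'quantized' if is_enabled(name) else 'left in BF16'}")
```

Under these assumptions, the routed-expert, shared-expert, and KV-cache quantizers are switched on, while the attention projection and output head fall through to the catch-all rule and stay unquantized.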

While we walked through this on Nemotron 3 Ultra, **the same pipeline works with any Hugging Face model checkpoint.** Simply point Model Optimizer at a model card from the Hub or a local path, pick a config (a built-in preset or your own recipe), and run the same quantize and export steps.

``` python
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("<your-hf-model-card>")

# Calibrate + quantize with the config of your choice
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)

# Export a unified HF checkpoint for TRT-LLM / vLLM / SGLang
export_hf_checkpoint(model, export_dir="<export_path>")
```
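The snippets in this post pass a `forward_loop` without defining it. It is a callable that receives the model and runs calibration data through it. A minimal sketch follows; the factory name `make_forward_loop` and the batch layout are assumptions for illustration, and a stand-in callable replaces the real model so the example runs without a GPU.

```python
def make_forward_loop(calib_batches):
    """Build a forward_loop(model) that feeds each calibration batch to the model.

    calib_batches: iterable of dicts of keyword arguments for the model call,
    for example {"input_ids": ...}. With a real PyTorch model you would
    typically move tensors to the model's device and run under torch.no_grad().
    """
    def forward_loop(model):
        for batch in calib_batches:
            model(**batch)

    return forward_loop


# Stand-in for a model: records what it was called with.
seen_batches = []


def fake_model(**kwargs):
    seen_batches.append(kwargs)


batches = [{"input_ids": [1, 2, 3]}, {"input_ids": [4, 5]}]
forward_loop = make_forward_loop(batches)
forward_loop(fake_model)
print(len(seen_batches))   # 2
```

In real use, the returned `forward_loop` would be passed as the third argument to `mtq.quantize`, as in the snippet above.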

**Try the one-click launcher**

To simplify deployment, the [Model Optimizer launcher](https://github.com/NVIDIA/Model-Optimizer/tree/main/tools/launcher) automates the entire Ultra PTQ and export workflow. After completing the setup steps in the launcher README, the workflow can be launched via the [Nemotron 3 Ultra YAML recipe](https://github.com/NVIDIA/Model-Optimizer/blob/main/tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_lm_ptq.yaml) with a single command from a local machine:

```
uv run launch.py --yaml examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_lm_ptq.yaml --yes
```

Once launched, the workflow handles the remaining quantization and export steps automatically, assuming access to a Slurm cluster with sufficient GPU resources. This example was validated on four nodes, each equipped with four NVIDIA Blackwell GPUs.

For smaller-scale deployments, a PTQ example is also available for Nemotron 3 Super:

```
uv run launch.py --yaml examples/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16
```

## Get started

This process can be reproduced using the full recipe available in the open-source NVIDIA Model Optimizer GitHub repository. Model Optimizer is a community-driven project, and contributions are encouraged. Issues can be filed to report bugs or request features, the project roadmap can be reviewed for upcoming work, and pull requests can be submitted to contribute improvements. See `CONTRIBUTING.md` for contribution guidelines and getting-started information.

Learn more with the following resources:

- [NVIDIA Model Optimizer GitHub](https://github.com/NVIDIA/Model-Optimizer)
- [NVIDIA Model Optimizer LLM PTQ Documentation](https://github.com/NVIDIA/Model-Optimizer/tree/47a33db9b639c5d25baaa3e49526f9703491eb08/examples/llm_ptq)
- [NVIDIA Model Optimizer with Megatron-LM](https://github.com/NVIDIA/Megatron-LM/tree/main/examples/post_training/modelopt)
- [Nemotron 3 Ultra Technical Report](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Ultra-Technical-Report.pdf)
- [Nemotron 3 Ultra NVFP4 Checkpoint](https://build.nvidia.com/nvidia/nemotron-3-ultra-550b-a55b/modelcard)

## Acknowledgments

*This work would not have been possible without the close collaboration between the NVIDIA Model Optimizer team and the Nemotron team. We thank the engineers across both teams who contributed to the quantization pipeline, evaluation infrastructure, and model training. Special thanks to the Megatron-LM team for enabling distributed quantization at scale, and to the Nemotron team for the benchmark suite used to validate the FP4 recipes. We also thank the broader NVIDIA Research and Applied Deep Learning teams for their continued support and feedback throughout this project.*

*In particular, we thank Asma Kuriparambil Thekkumpate, Jenny Chen, and Jinhang Choi for leading the implementation of the NVFP4 quantization on Nemotron 3 Ultra.*
