{"slug": "creating-the-nvidia-nemotron-3-ultra-nvfp4-checkpoint-with-nvidia-model", "title": "Creating the NVIDIA Nemotron 3 Ultra NVFP4 Checkpoint with NVIDIA Model Optimizer", "summary": "NVIDIA released the Nemotron 3 Ultra NVFP4 checkpoint, a quantized model that achieves up to 5.9x higher inference throughput than GLM-5.1 754B FP4 on decode-heavy workloads while matching BF16 accuracy. The checkpoint uses mixed-precision quantization with NVFP4 for MoE experts and FP8 for other layers, reducing model size from 1,121 GB to 352.3 GB and halving hardware requirements. It runs on both Hopper and Blackwell architectures by converting weight formats automatically.", "body_md": "As context windows grow longer, moving large model weights efficiently becomes critical to performance. A common way to address this is quantization, an optimization technique that compresses model weights into a smaller data format. One quantization format is NVFP4, a 4-bit floating-point format introduced with the NVIDIA Blackwell architecture.\n\nThat’s the approach behind our new Nemotron 3 Ultra NVFP4 checkpoint: we quantized the model into NVFP4 using NVIDIA Model Optimizer. The result is a model that achieves up to 5.9x higher inference throughput than the GLM-5.1 754B FP4 model on decode-heavy workloads while matching BF16 accuracy across nearly every benchmark, as shown in Figure 1.\n\nWhile the performance benefits of NVFP4 are well understood, the process of producing a high-quality NVFP4 checkpoint is not. This post walks through how we quantized Nemotron 3 Ultra (550B) to NVFP4 with NVIDIA Model Optimizer, and shows developers how to generate the best quantized checkpoints for their own models.\n\n## The Nemotron 3 Ultra NVFP4 checkpoint\n\nA common misconception is that every layer of an NVFP4 checkpoint is stored in NVFP4. 
As Table 1 shows, this isn’t the case: different layers are quantized to different precision formats, chosen according to each layer’s sensitivity to quantization and its impact on model accuracy. After NVFP4 quantization, the Nemotron 3 Ultra model shrinks from 1,121 GB in BF16 down to 352.3 GB, a 3.2x reduction. The payoff is substantial, cutting the hardware footprint in half.\n\n| Layer/operator | BF16 baseline | Quantized checkpoint precision |\n|---|---|---|\n| Embedding, Output classification layer, MTP layers | BF16 | BF16 |\n| MoE routed experts | BF16 | NVFP4 |\n| MoE shared experts | BF16 | FP8 per-tensor |\n| Mamba mixer linears | BF16 | FP8 per-tensor |\n| Attention linears | BF16 | BF16 |\n| Latent MoE | BF16 | BF16 |\n| Mamba conv1d | BF16 | BF16 |\n| KV cache | BF16 | FP8 |\n| Mamba SSM cache | FP32 | FP16 with stochastic rounding |\n\n*Table 1. BF16 baseline compared to the quantized checkpoint precision for each layer/operator, from the Nemotron 3 Ultra paper*\n\nA key innovation of the Nemotron 3 Ultra NVFP4 checkpoint is that a single checkpoint can run on both NVIDIA Hopper and Blackwell. It achieves this by converting the weight format to match the hardware it runs on. On Hopper, which lacks native FP4 tensor cores, the serving framework automatically switches to W4A16. On Blackwell, it uses native W4A4.\n\nWhile W8A8 (8-bit weights, 8-bit activations) seems like the obvious Hopper choice, its larger memory footprint leaves too little headroom to fit Multi-Token Prediction (MTP). We found MTP could only fit alongside W4A16 (4-bit weights, 16-bit activations), so W4A16 matches or beats W8A8 across the board. Read the full [Nemotron 3 Ultra technical report (Section 4.6)](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Ultra-Technical-Report.pdf) to learn more.\n\n## How we found the optimal NVFP4 checkpoint\n\nFinding an optimal NVFP4 checkpoint takes several iterations. 
This section walks through how we arrived at our NVFP4 checkpoint.\n\n### The challenge of quantizing at FP4\n\nWith FP4 quantization, there are only 8 non-negative magnitudes (0, 0.5, 1, 1.5, 2, 3, 4, and 6) available to represent every weight in a block. We need to determine how to map the original range of values onto this grid. This is controlled by a *scale*, essentially a multiplier that determines the granularity of the representation. Choosing a poor scale means we either waste precision on small values or clip large values, both of which hurt model quality. So how should we choose the optimal scale factor? There are several approaches.\n\n#### Max scaling\n\nHere, we set the scale so the largest value in the block maps to the maximum representable FP4 value. However, in the presence of a single large weight outlier, max scaling compresses every other value in the block into a narrow range, which can end up flushing these values to zero. This information loss may adversely affect accuracy. 
Max scaling preserves the highest magnitude value in the block, with a potential side effect of flushing other values to zero.\n\n**Try it with NVIDIA Model Optimizer:**\n\n```\nimport modelopt.torch.quantization as mtq\n\n# W4A4: weights + activations to NVFP4 (default, max scaling)\nmodel = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop=forward_loop)\n```\n\nMax scaling (also called `absmax`, since the scale is set entirely by the block’s absolute maximum) is the simplest option, but its sensitivity to outliers means it is rarely the best one.\n\nThis is exactly the gap we hit on our prior model, NVIDIA Nemotron 3 Super: naive `absmax` NVFP4 PTQ left an accuracy gap, so the team evaluated a range of alternative calibration strategies that don’t let a single outlier dictate the scale, from mean squared error (MSE)-based weight scaling to [GPTQ](https://arxiv.org/abs/2210.17323), an efficient method that uses second-order information to encode weights.\n\n| Algorithm | Details | MMLU-Pro | GPQA | LiveCodeBench | AA-LCR |\n|---|---|---|---|---|---|\n| BF16 | N/A | 83.49 | 79.92 | 72.907 | 53.00 |\n| Default NVFP4 PTQ (Baseline algorithm) | Static per-tensor scales are computed using max-value calibration; per-block scales are computed dynamically from block maximum values. | 82.99 | 79.29 | 70.18 | 55.50 |\n| Weight per-block scales minimizing MSE | Weight per-block scales are swept to minimize per-block MSE. | 83.31 | 79.92 | 71.37 | 56.75 |\n| Weight per-block scales to minimize output MSE | Weight per-block scales are swept independently to minimize GEMM output MSE. | 83.05 | 78.98 | 71.00 | 57.06 |\n| GPTQ | GPTQ (Frantar et al., 2023) is used for weight quantization. | 83.11 | 80.05 | 69.79 | 57.87 |\n\n*Table 2. Experiment results for Nemotron 3 Super quantization. The team tried several quantization methods and evaluated the accuracy change across four tasks. 
For more information, see the* [Nemotron 3 Super paper](https://arxiv.org/abs/2604.12374).\n\n### Mean squared error scaling\n\nAnother approach is mean squared error (MSE) scaling, which searches for the scale that minimizes average reconstruction error across the whole block.\n\nHowever, lower MSE does not always translate to better model accuracy. MSE calibration reduced per-tensor weight error by 27.1% over four-over-six scaling in our Nemotron 3 Ultra experiments, yet produced no consistent improvement on downstream benchmarks.\n\n**Try MSE-based scaling with NVIDIA Model Optimizer:**\n\n```\nmodel = mtq.quantize(model, mtq.NVFP4_W4A4_WEIGHT_MSE_FP8_SWEEP_CFG, forward_loop=forward_loop)\n```\n\nFor our previous model, NVIDIA Nemotron 3 Super, the final quantization recipe combined MSE-based block scaling for weights with a per-tensor FP8 sweep and dynamic max-based scaling for activations. Combining MSE weights with the FP8 activation sweep gave the best accuracy-to-size tradeoff of everything we tried, and it became our optimal NVFP4 configuration for Super.\n\nMax and MSE scaling both pick a scale for the block as a whole, but neither pays attention to where on the FP4 grid the error comes from. For Nemotron 3 Ultra, we used a scaling method that chooses the range based on the error introduced by the gap in the grid.\n\n### Four-over-six scaling\n\nRecall that NVFP4 can only represent 8 non-negative magnitudes: 0, 0.5, 1, 1.5, 2, 3, 4, and 6. Notice that after 4, the next value jumps straight to 6. Any weight that falls in that range gets rounded aggressively to either 4 or 6, sometimes incurring over 13% error on a single value.\n\nFour-over-six addresses this by letting each block of weights independently choose between scaling to a maximum of M=4 or M=6, picking whichever minimizes reconstruction error. 
Four-over-six works on weights and falls back to the default NVFP4 on activations.\n\n**When M=6 wins with block: [2, 4, 5.9, 6]**\n\nScaling to M=6 gives a block scale of 1, so 2, 4, and 6 map exactly onto FP4 grid points and only 5.9 rounds to 6 at negligible cost. Scaling to M=4 (a block scale of 1.5) reconstructs 2 as 2.25 and 4 as 4.5, introducing error.\n\n**When M=4 wins with block: [10, 20, 30, 40]**\n\nScaling to M=6 calls for a block scale of 40/6 ≈ 6.67, which is stored as 6.5 once rounded to E4M3 (assuming a per-tensor scale of 1). That maps 30 to 4.62, which rounds down to 4 and reconstructs as 26, a 13% error. Scaling to M=4 instead uses a block scale of 10 and maps 10, 20, 30, and 40 exactly onto 1, 2, 3, and 4, with zero rounding error across the entire block. MSE: 4.33 vs 0.0.\n\nFour-over-six was used to set the FP4 routed-expert weight scales in Nemotron 3 Ultra, raising the global per-tensor weight scale by 1.75x, with each microblock picking the M=4 or M=6 grid. Across all 49,152 projection weights in the model’s 48 MoE expert layers, it cut the median reconstruction MSE by 16.4% compared to standard max calibration, and delivered the best downstream result in the balanced 5.03-BPE setting: 98.5% median recovery relative to BF16, ahead of max (96.8%) and MSE (98.4%).\n\n**Try four-over-six with NVIDIA Model Optimizer:**\n\n```\nmodel = mtq.quantize(model, mtq.NVFP4_FOUR_OVER_SIX_CFG, forward_loop=forward_loop)\n```\n\n`NVFP4_FOUR_OVER_SIX_CFG` will be released in the upcoming NVIDIA Model Optimizer 0.46 in July. View the Nemotron 3 Ultra PTQ example.\n\n### Bits-per-element\n\nEffective bits-per-element (BPE) refers to the average number of bits used to store each weight of the model. A model with all BF16 weights uses 16 effective bits-per-element, while a half-FP8, half-BF16 model uses only 12. NVFP4 adds per-block and per-tensor scaling overhead, bringing its minimum to 4.5 effective bits-per-element (4 bits per element plus one 8-bit E4M3 scale per 16-element block, or 4 + 8/16). 
The per-tensor scale’s 32 bits are amortized across the full tensor and are assumed to be negligible in the overall BPE calculation.\n\nThe goal is to search for the quantization configuration that pushes effective BPE as low as possible without sacrificing accuracy. This is tricky because layers are not equally robust. Some are sensitive to quantization and must stay in higher precision, which raises the effective BPE. Since each layer can be quantized at a different level or left unquantized, the number of possible combinations grows exponentially, making an exhaustive search impractical and a smarter strategy necessary.\n\nNVIDIA Model Optimizer AutoQuantize (`mtq.auto_quantize`) automates this search. Instead of a fixed config, you give it a target bit budget (for example `auto_quantize_bits=4.8`) and a list of candidate formats, such as `NVFP4_DEFAULT_CFG` and `FP8_DEFAULT_CFG`. It then scores each layer’s sensitivity and searches for the per-layer format assignment that meets the budget at the best accuracy, keeping the most sensitive layers in the higher-precision format or leaving them unquantized.\n\n``` python\nimport modelopt.torch.quantization as mtq\n\nmodel, search_state = mtq.auto_quantize(\n    model,\n    constraints={\"auto_quantize_bits\": 4.8},\n    quantization_formats=[\"NVFP4_DEFAULT_CFG\", \"FP8_DEFAULT_CFG\"],\n    data_loader=calib_dataloader,\n    forward_step=forward_step,\n    loss_func=loss_func,\n)\n```\n\nTo find the right bits-per-element for Nemotron 3 Ultra, we swept over five operating points ranging from 4.85 to 7.19 effective bits-per-element, comparing accuracy over several benchmarks in Table 3. The key signal came from AA-LCR, where going from 4.85 to 5.03 improved the benchmark by 2.4 points, and benchmark performance then flattened beyond 5.03. 
This makes 5.03 BPE the sweet spot.\n\n| Task | Metric | 4.85 | 5.03† | 5.25 | 5.43 | 7.19 |\n|---|---|---|---|---|---|---|\n| **Coding** | | | | | | |\n| SciCode | pass@1 (avg-16), subtask acc | 43.82 | 43.88 | 43.45 | 43.27 | 43.44 |\n| **Scientific Reasoning** | | | | | | |\n| GPQA Diamond | pass@1 (avg-32), sym. correct | 84.66 | 84.33 | 84.75 | 84.12 | 84.52 |\n| HLE | pass@1, judge correct | 24.24 | 24.84 | 25.00 | 24.98 | 25.44 |\n| CritPt | pass@1 (avg-8), accuracy | 3.04 | 3.93 | 5.18 | 4.82 | 4.46 |\n| **General** | | | | | | |\n| AA-Omniscience | pass@1 (avg-20), judge correct | 29.21 | 29.75 | 29.18 | 29.29 | 29.00 |\n| AA-Omniscience | pass@1 (avg-20), non-hallucination | 54.13 | 51.59 | 51.84 | 51.70 | 52.81 |\n| IFBench | pass@1 (avg-8), avg. score | 79.34 | 79.26 | 79.83 | 79.53 | 79.83 |\n| **Long Context** | | | | | | |\n| AA-LCR | pass@1 (avg-16), judge correct | 62.25 | 64.69 | 64.19 | 64.94 | 65.00 |\n\n*Table 3. Accuracy compared to effective bits-per-element (columns give the quantization level in bits-per-element), from the* [Nemotron 3 Ultra paper](https://arxiv.org/abs/2606.15007)\n\n## How we quantized Nemotron 3 Ultra to NVFP4 with Model Optimizer\n\nUnlike Nemotron 3 Super 120B, Nemotron 3 Ultra is a 550B model, so it benefits significantly from parallelizing the quantization process. For this reason, we support two quantization paths, both powered by NVIDIA Model Optimizer.\n\n| Metric | Hugging Face Transformers | Megatron-LM |\n|---|---|---|\n| Compute | 4 × B300 | 16 × B300; Expert parallelism = data parallelism = 16 |\n| Model loading time | 40 min | < 2 min |\n| Model loading and calibration time | 85 min | 9 min |\n| Export | 42 min | 33 min |\n| Total time | 120 min | 45 min |\n\n*Table 4. Quantization time comparing Hugging Face Transformers to Megatron-LM*\n\nQuantizing Nemotron 3 Ultra to NVFP4 follows the [NVIDIA ModelOpt post-training quantization (PTQ) pipeline in NVIDIA Megatron-LM](https://github.com/NVIDIA/Megatron-LM/tree/main/examples/post_training/modelopt). 
With the parallel route, the pretrained checkpoint is first converted to Megatron-LM format and then quantized with a single call to `quantize.sh`, passing an NVFP4 quantization config as the recipe. On the backend, Megatron-LM shards the model across GPUs with expert and data parallelism (`EP = DP = 16` on 16×B300s), so the calibration forward pass runs distributed across all devices. This reduces load and calibration from ~85 minutes to ~9.\n\nCalibration runs on `nemotron-post-training-dataset-v2` to fit the per-block scales, and the precision policy is entirely config-driven. Select it by passing a config to `quantize.sh`: either a built-in name (e.g., `NVFP4_DEFAULT_CFG`, `FP8_DEFAULT_CFG`) or a YAML recipe path. That config is what ultimately gets handed to `mtq.quantize(model, config, forward_loop)` to install the quantizers and run calibration.\n\n**Try four-over-six scaling with NVIDIA Model Optimizer:**\n\n```\nHF_MODEL_CKPT=nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16\n\n# Step 1: Quantize to NVFP4\nTP=4 \\\nMLM_MODEL_SAVE=/tmp/Nemotron-3-Ultra_quant \\\n./quantize.sh nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16 huggingface/models/nvidia/Nemotron-3-Ultra-550B-A55B/ptq/nvfp4-4o6\n\n# Step 2: Export the quantized checkpoint\nPP=1 \\\nMLM_MODEL_CKPT=/tmp/Nemotron-3-Ultra_quant \\\nEXPORT_DIR=/tmp/Nemotron-3-Ultra_NVFP4_46_HF \\\n./export.sh nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16\n```\n\n`NVFP4_FOUR_OVER_SIX_CFG` support is landing in NVIDIA Model Optimizer 0.46. The Nemotron 3 Ultra recipe for four-over-six is [available on GitHub](https://github.com/NVIDIA/Model-Optimizer/blob/db5497e2b1a5ab15a65aca9c2f157a56d5d6a276/modelopt_recipes/huggingface/models/nvidia/Nemotron-3-Ultra-550B-A55B/ptq/nvfp4-46-max.yaml). 
Four-over-six works on weights and falls back to default NVFP4 on activations.\n\n## Customizing quantization configs\n\nNVIDIA Model Optimizer is built to be customizable with different quantization configs. The built-in NVFP4 configs range from `NVFP4_DEFAULT_CFG`, which quantizes broadly, to more selective presets like `NVFP4_MLP_ONLY_CFG`, `NVFP4_EXPERTS_ONLY_CFG`, and `NVFP4_OMLP_ONLY_CFG` that restrict FP4 to the MLP and expert layers while keeping the sensitive attention projections in higher precision.\n\nUnder the hood, a config is an ordered list of rules matched against module-name patterns, and `mtq.quantize()` applies them. Weight quantization is governed by rules targeting the `*weight_quantizer` pattern, where you set the format (for NVFP4, `E2M1` elements with 16-wide blocks and `E4M3` block scales), while activation quantization is governed by separate rules on the `*input_quantizer` pattern.\n\nSince the two are independent, you can quantize weights only or weights and activations together, and you can carve out exceptions for specific modules by appending rules that disable them. For anything beyond the built-in presets, you can write a full YAML recipe and load it with `--recipe`, which then fully defines the quant config.\n\nThe following Nemotron 3 Ultra recipe applies NVFP4 with four-over-six to the routed-expert weights, keeps the shared experts and Mamba projections in FP8, uses an FP8 KV cache, and leaves everything else in BF16. 
The complete recipe ships with NVIDIA Model Optimizer’s recipe library: [nvfp4-4o6.yaml](https://github.com/NVIDIA/Model-Optimizer/blob/main/modelopt_recipes/huggingface/models/nvidia/Nemotron-3-Ultra-550B-A55B/ptq/nvfp4-4o6.yaml)\n\n```\n# Nemotron 3 Ultra NVFP4 mixed-precision recipe with Four-Over-Six (4/6)\n# Example recipe for HuggingFace models, for Megatron-compatible recipe see the full recipe link\n\nquantize:\n  algorithm:\n    method: mse\n    fp8_scale_sweep: false\n    start_multiplier: 1.0   # M=6 (keep amax)\n    stop_multiplier: 1.5    # M=4 (amax x 6/4)\n    step_size: 0.5          # candidates [1.0, 1.5]\n\n  quant_cfg:\n    # Disable everything by default; later rules re-enable specific modules.\n    - quantizer_name: '*'\n      enable: false\n\n    # MoE routed experts -> NVFP4 W4A4, block 16, e4m3 block scale.\n    # 4/6 adaptive block scaling on weights only; not activations\n    # HF names: backbone.layers.*.mixer.experts.*.{up,down}_proj\n    - quantizer_name: '*mixer.experts.*weight_quantizer'\n      enable: true\n      cfg:\n        block_sizes: {-1: 16, type: static, scale_bits: e4m3, four_over_six: true}\n        num_bits: e2m1\n    - quantizer_name: '*mixer.experts.*input_quantizer'\n      enable: true\n      cfg:\n        block_sizes: {-1: 16, type: dynamic, scale_bits: e4m3}\n        num_bits: e2m1\n\n    # Shared experts + Mamba in/out_proj -> FP8 per-tensor (weights+activations).\n    - quantizer_name: '*mixer.shared_experts*'\n      enable: true\n      cfg: {num_bits: e4m3, axis: null}\n    - quantizer_name: '*mixer.in_proj*'\n      enable: true\n      cfg: {num_bits: e4m3, axis: null}\n    - quantizer_name: '*mixer.out_proj*'\n      enable: true\n      cfg: {num_bits: e4m3, axis: null}\n\n    # KV cache -> FP8.\n    - quantizer_name: '*[kv]_bmm_quantizer'\n      enable: true\n      cfg: {num_bits: e4m3}\n```\n\nWhile we walked through this on Nemotron 3 Ultra, **the same pipeline works with any Hugging Face model checkpoint.** 
Simply point Model Optimizer at a model card from the Hub or a local path, pick a config (a built-in preset or your own recipe), and run the same quantize and export steps.\n\n``` python\nimport modelopt.torch.quantization as mtq\nfrom modelopt.torch.export import export_hf_checkpoint\nfrom transformers import AutoModelForCausalLM\n\nmodel = AutoModelForCausalLM.from_pretrained(\"<your-hf-model-card>\")\n\n# Calibrate + quantize with the config of your choice\n# forward_loop(model) is your own calibration loop: it runs calibration batches through the model\nmodel = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)\n\n# Export a unified HF checkpoint for TRT-LLM / vLLM / SGLang\nexport_hf_checkpoint(model, export_dir=\"<export_path>\")\n```\n\n**Try the one-click launcher**\n\nTo simplify deployment, the [Model Optimizer launcher](https://github.com/NVIDIA/Model-Optimizer/tree/main/tools/launcher) automates the entire Ultra PTQ and export workflow. After completing the setup steps in the launcher README, the workflow can be launched via the [Nemotron 3 Ultra YAML recipe](https://github.com/NVIDIA/Model-Optimizer/blob/main/tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_lm_ptq.yaml) with a single command from a local machine:\n\n```\nuv run launch.py --yaml examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_lm_ptq.yaml --yes\n```\n\nOnce launched, the workflow handles the remaining quantization and export steps automatically, assuming access to a Slurm cluster with sufficient GPU resources. This example was validated on four nodes, each equipped with four NVIDIA Blackwell GPUs.\n\nFor smaller-scale deployments, a PTQ example is also available for Nemotron 3 Super:\n\n```\nuv run launch.py --yaml examples/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16\n```\n\n## Get started\n\nThis process can be reproduced using the full recipe available in the open-source NVIDIA Model Optimizer GitHub repository. Model Optimizer is a community-driven project, and contributions are encouraged. 
Issues can be filed to report bugs or request features, the project roadmap can be reviewed for upcoming work, and pull requests can be submitted to contribute improvements. See `CONTRIBUTING.md` for contribution guidelines and getting-started information.\n\nLearn more with the following resources:\n\n- [NVIDIA Model Optimizer GitHub](https://github.com/NVIDIA/Model-Optimizer)\n- [NVIDIA Model Optimizer LLM PTQ Documentation](https://github.com/NVIDIA/Model-Optimizer/tree/47a33db9b639c5d25baaa3e49526f9703491eb08/examples/llm_ptq)\n- [NVIDIA Model Optimizer with Megatron-LM](https://github.com/NVIDIA/Megatron-LM/tree/main/examples/post_training/modelopt)\n- [Nemotron 3 Ultra Technical Report](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Ultra-Technical-Report.pdf)\n- [Nemotron 3 Ultra NVFP4 Checkpoint](https://build.nvidia.com/nvidia/nemotron-3-ultra-550b-a55b/modelcard)\n\n## Acknowledgments\n\n*This work would not have been possible without the close collaboration between the NVIDIA Model Optimizer team and the Nemotron team. We thank the engineers across both teams who contributed to the quantization pipeline, evaluation infrastructure, and model training. Special thanks to the Megatron-LM team for enabling distributed quantization at scale, and to the Nemotron team for the benchmark suite used to validate the FP4 recipes. We also thank the broader NVIDIA Research and Applied Deep Learning teams for their continued support and feedback throughout this project.*\n\n*In particular, we thank Asma Kuriparambil Thekkumpate, Jenny Chen, and Jinhang Choi for leading the implementation of the NVFP4 quantization on Nemotron 3 Ultra.*
", "url": "https://wpnews.pro/news/creating-the-nvidia-nemotron-3-ultra-nvfp4-checkpoint-with-nvidia-model", "canonical_source": "https://developer.nvidia.com/blog/creating-the-nvidia-nemotron-3-ultra-nvfp4-checkpoint-with-nvidia-model-optimizer/", "published_at": "2026-06-26 16:00:35+00:00", "updated_at": "2026-06-26 16:16:45.647403+00:00", "lang": "en", "topics": ["large-language-models", "ai-infrastructure", "ai-research", "ai-products"], "entities": ["NVIDIA", "Nemotron 3 Ultra", "NVFP4", "NVIDIA Model Optimizer", "Hopper", "Blackwell", "GLM-5.1"], "alternates": {"html": "https://wpnews.pro/news/creating-the-nvidia-nemotron-3-ultra-nvfp4-checkpoint-with-nvidia-model", "markdown": "https://wpnews.pro/news/creating-the-nvidia-nemotron-3-ultra-nvfp4-checkpoint-with-nvidia-model.md", "text": "https://wpnews.pro/news/creating-the-nvidia-nemotron-3-ultra-nvfp4-checkpoint-with-nvidia-model.txt", "jsonld": "https://wpnews.pro/news/creating-the-nvidia-nemotron-3-ultra-nvfp4-checkpoint-with-nvidia-model.jsonld"}}