Model Quantization: Turn FP8 Checkpoints into High-Performance Inference Engines with NVIDIA TensorRT

NVIDIA released a workflow for converting FP8-quantized CLIP model checkpoints into TensorRT engines, enabling faster inference and higher GPU throughput for production deployment. The process involves exporting the quantized checkpoint to ONNX format and compiling it into a TensorRT engine, with the resulting FP8 engine delivering measurable speed improvements over the FP16 baseline.

Converting a quantized checkpoint into an NVIDIA TensorRT engine bridges the gap between model optimization and production deployment, enabling faster inference, higher throughput, and more efficient GPU utilization at scale. In a previous post (https://developer.nvidia.com/blog/model-quantization-post-training-quantization-using-nvidia-model-optimizer/), we produced a high-quality FP8-quantized Contrastive Language-Image Pretraining (CLIP) checkpoint with NVIDIA TensorRT Model Optimizer. This post picks up where we left off, walking through how to export the checkpoint to ONNX and compile it into an NVIDIA TensorRT engine ready for production inference. We also profile the resulting FP8 TensorRT engine against the FP16 baseline to measure the real-world speedup the quantized model delivers.

Figure 1 shows the five stages of a typical end-to-end quantization workflow.

This is the standard pipeline for deploying a quantized CLIP model. Quantized LLMs follow a different path through TensorRT-LLM (https://docs.nvidia.com/tensorrt-llm/index.html), which is covered in this tutorial (https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/llm_ptq#exporting-checkpoints).

Export model to ONNX format

The first step is to export the ModelOpt checkpoint to ONNX. The following pseudo-code does this for the FP8-quantized CLIP checkpoint using a built-in helper from ModelOpt (the export targets ONNX opset 20+, where FP8 QuantizeLinear/DequantizeLinear is fully supported).
It folds each weight-side quantize-then-dequantize (Q/DQ) pair into an FP8-stored DQ-only chain, noticeably shrinking the ONNX file. In principle, native torch.onnx.export works too, but it requires us to write a custom conversion script.

```python
import torch
from transformers import CLIPModel, CLIPTokenizer
from transformers.models.clip.modeling_clip import CLIPAttention

import modelopt.torch.opt as mto
import modelopt.torch.quantization as mtq
from modelopt.torch._deploy.utils import OnnxBytes, get_onnx_bytes_and_metadata
from modelopt.torch.quantization.plugins.diffusion.diffusers import _QuantAttention

# NOTE: modelopt_ckpt and model_ckpt are placeholders that this snippet does not
# define; set them before running.


# Thin wrappers expose a single forward() to the ONNX exporter
class TextEncoder(torch.nn.Module):
    def __init__(self, m):
        super().__init__()
        self.m = m

    def forward(self, x):
        return self.m.get_text_features(x)


class ImageEncoder(torch.nn.Module):
    def __init__(self, m):
        super().__init__()
        self.m = m

    def forward(self, x):
        return self.m.get_image_features(x)


def prepare_for_fp8_onnx_export(model):
    # (1) Turn on FP8 attention fusion (off by default, lost on reload).
    # (2) Clear CLIP's float scale; the exporter chokes on it.
    for _, mod in model.named_modules():
        if isinstance(mod, _QuantAttention):
            mod._disable_fp8_mha = False
        if isinstance(mod, CLIPAttention) and getattr(mod, "scale", None) is not None:
            mod.scale = None


def export(wrapper, dummy, axis_name, out_name):
    """ModelOpt's exporter folds Q+DQ on weights into FP8-stored DQ-only chains
    and rewrites TRT custom ops to native ONNX QDQ, so the output is TRT-ready."""
    onnx_bytes, _ = get_onnx_bytes_and_metadata(
        model=wrapper,
        dummy_input=(dummy,),
        model_name=out_name,
        dynamic_axes={axis_name: {0: "batch"}},
        onnx_opset=20,
        weights_dtype="fp16",
    )
    OnnxBytes.from_bytes(onnx_bytes).write_to_disk("./onnx_output", clean_dir=False)


# Restore the FP8-quantized CLIPModel from the ModelOpt checkpoint
mto.enable_huggingface_checkpointing()
mtq.QuantModuleRegistry.register({CLIPAttention: "CLIPAttention"})(_QuantAttention)
model = (
    CLIPModel.from_pretrained(
        modelopt_ckpt, attn_implementation="sdpa", torch_dtype=torch.float16
    )
    .eval()
    .cuda()
)
prepare_for_fp8_onnx_export(model)

# Export the text encoder to ONNX
tok = CLIPTokenizer.from_pretrained(model_ckpt)
dummy_text = tok(
    ["a photo of a cat"], return_tensors="pt", padding="max_length", max_length=77
)["input_ids"].cuda()
export(TextEncoder(model), dummy_text, "text_input", "text_clip_fp8")

# Export the image encoder to ONNX
dummy_image = torch.randn(16, 3, 224, 224, dtype=torch.float16).cuda()
export(ImageEncoder(model), dummy_image, "image_input", "image_clip_fp8")
```

| Model component | FP8 ModelOpt checkpoint | FP16 Hugging Face checkpoint | Size reduction |
|---|---|---|---|
| CLIP text encoder ONNX | 156 MB | 237 MB | ~34% |
| CLIP image encoder ONNX | 292 MB | 582 MB | ~50% |

Table 1. CLIP ONNX model size: FP8 vs FP16

Table 1 compares the ONNX file sizes of the FP8 ModelOpt checkpoint export against the original FP16 Hugging Face checkpoint export. The FP8 checkpoint export produces smaller ONNX files: ~34% smaller for the text encoder and ~50% smaller for the image encoder.

Note that shrinking the ONNX file is a convenience, not a requirement. TensorRT folds the weight-side Q node into the FP8 weight at engine-build time.
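
To make the idea of folding a weight-side Q node concrete, here is a minimal pure-Python sketch. It is not ModelOpt or TensorRT code: `round_to_e4m3`, `quantize`, and `dequantize` are hypothetical helpers written for illustration, the per-tensor scale is an assumption, and FP8 E4M3 rounding is simulated on Python floats. The point is only that Q applied to a constant weight can be computed once ahead of time, leaving a DQ-only chain whose output matches the unfolded Q/DQ pair.

```python
import math

E4M3_MAX = 448.0      # largest finite FP8 E4M3 value
E4M3_MIN_EXP = -6     # smallest normal exponent; smaller values are subnormal
MANTISSA_BITS = 3     # explicit mantissa bits in E4M3


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest FP8 E4M3 value (saturating), as a Python float."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E4M3_MAX)
    # frexp gives mag = m * 2**e with 0.5 <= m < 1, so the binade exponent is e - 1
    exp = max(math.frexp(mag)[1] - 1, E4M3_MIN_EXP)
    step = 2.0 ** (exp - MANTISSA_BITS)  # spacing of representable values here
    return sign * min(round(mag / step) * step, E4M3_MAX)


def quantize(values, scale):
    """Q node: divide by the scale and round to FP8."""
    return [round_to_e4m3(v / scale) for v in values]


def dequantize(values, scale):
    """DQ node: multiply the FP8 values by the scale."""
    return [v * scale for v in values]


weights = [0.30, -1.97, 0.004, 2.5]
scale = max(abs(v) for v in weights) / E4M3_MAX  # assumed per-tensor scale

# Unfolded graph: the higher-precision weight passes through Q then DQ at run time.
unfolded = dequantize(quantize(weights, scale), scale)

# Folded graph: Q is precomputed once; only FP8 values and the scale are stored.
stored_fp8 = quantize(weights, scale)
folded = dequantize(stored_fp8, scale)

# One byte per FP8 weight versus two bytes per FP16 weight.
fp8_bytes, fp16_bytes = len(stored_fp8) * 1, len(weights) * 2
```

Because the weights are constants, `folded` and `unfolded` are identical; only the stored representation changes. Activation-side Q nodes cannot be folded this way, since their inputs are not known until run time.
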
The ModelOpt ONNX exporter performs this folding earlier, on the ONNX side, to keep the on-disk file smaller.

We can inspect the exported ONNX file with NVIDIA Nsight Deep Learning Designer (https://developer.nvidia.com/nsight-dl-designer), an efficient tool for ONNX model editing, performance profiling, and TensorRT engine building. Figure 2 shows a portion of the exported ONNX graph visualized in Nsight Deep Learning Designer. We can see that the graph now contains QuantizeLinear/DequantizeLinear (Q/DQ) nodes, marking the FP8 boundaries. During engine building, TensorRT fuses these nodes with adjacent layers to optimize inference performance. This fusion eliminates unnecessary quantize-then-dequantize transitions, enabling the use of optimized FP8 kernels for computation.

Profile ONNX model with TensorRT

With the FP8 ONNX model exported, the next step is to pass it to TensorRT and measure how fast it runs. Before we begin, make sure TensorRT is downloaded and installed by following this tutorial (https://docs.nvidia.com/deeplearning/tensorrt/latest/installing-tensorrt/installing.html#installing-tensorrt). Once ready, we will use trtexec, the TensorRT command-line wrapper (https://github.com/NVIDIA/TensorRT/tree/main/samples/trtexec), to benchmark the ONNX model with the following command:

Set up the TensorRT environment export PATH=