The ONNX PTQ Toolkit provides tools for optimizing ONNX (Open Neural Network Exchange) models through post-training quantization (PTQ). It is aimed at developers who want to reduce model size and speed up inference while preserving the accuracy of their neural networks when deployed with TensorRT.
Quantization is an effective model optimization technique that compresses your models. Quantization with Model Optimizer can compress model size by 2x-4x, speeding up inference while preserving model quality.
Model Optimizer enables highly performant quantization formats including NVFP4, FP8, INT8, INT4 and supports advanced algorithms such as AWQ and Double Quantization with easy-to-use Python APIs.
| Section | Description | Link | Docs |
|---|---|---|---|
| Pre-Requisites | Required & optional packages to use this technique | Link | |
| Getting Started | Learn how to optimize your models using PTQ to reduce precision and improve inference efficiency | Link | docs |
| Support Matrix | View the LLM models supported for ONNX export | Link | |
| PyTorch to ONNX | Example scripts demonstrating how to quantize with PyTorch and then convert to ONNX | Link | |
| Advanced Features | Examples demonstrating the use of advanced ONNX quantization features | Link | |
| Pre-Quantized Checkpoints | Ready to deploy Hugging Face pre-quantized checkpoints | Link | |
| Resources | Extra links to relevant resources | Link | |
Please use the TensorRT docker image (e.g., nvcr.io/nvidia/tensorrt:25.08-py3) or visit our installation docs for more information.
Set the following environment variables inside the TensorRT docker.
```bash
export CUDNN_LIB_DIR=/usr/lib/x86_64-linux-gnu/
export LD_LIBRARY_PATH="${CUDNN_LIB_DIR}:${LD_LIBRARY_PATH}"
```

Also follow the installation steps below to upgrade to the latest version of Model Optimizer and install example-specific dependencies.
Install Model Optimizer with onnx dependencies using pip from PyPI and install the requirements for the example:
```bash
# Quoting the package spec keeps shells such as zsh from expanding the brackets.
pip install -U "nvidia-modelopt[onnx]"
pip install -r requirements.txt
```

For TensorRT Compiler framework workloads:
Install the latest TensorRT from here.
Most of the examples in this doc use vit_base_patch16_224.onnx as the input model. The model can be downloaded with the following script:
```bash
python download_example_onnx.py \
    --vit \
    --onnx_save_path=vit_base_patch16_224.onnx \
    --fp16 # <Optional, if the desired output ONNX precision is FP16>
```

Calibration data is a representative subset of your training or validation dataset used during quantization to determine the optimal scale factors for converting floating-point values to lower precision formats (INT8, FP8, INT4). This data helps maintain model accuracy after quantization by analyzing the distribution of activations throughout the network.
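As a rough illustration of what calibration computes, the sketch below derives a symmetric INT8 scale from the largest absolute value in a toy calibration set (the idea behind max calibration) and measures the round-trip error. The function names and sample values are illustrative assumptions, not ModelOpt APIs, and the toolkit's own calibrators are more involved.

```python
def max_calibration_scale(samples, num_bits=8):
    """Symmetric scale that maps the largest |x| to the largest integer level."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT8
    amax = max(abs(x) for x in samples)
    return amax / qmax if amax > 0 else 1.0


def quantize(x, scale, num_bits=8):
    """Round to the nearest integer level and clip to the symmetric range."""
    qmax = 2 ** (num_bits - 1) - 1
    q = round(x / scale)
    return max(-qmax, min(qmax, q))


def dequantize(q, scale):
    """Map an integer level back to a floating-point value."""
    return q * scale


# Toy "activations" standing in for values observed on calibration data.
calib = [-1.27, -0.5, 0.0, 0.31, 0.9]
scale = max_calibration_scale(calib)
roundtrip = [dequantize(quantize(x, scale), scale) for x in calib]
max_error = max(abs(a - b) for a, b in zip(calib, roundtrip))
print(f"scale={scale:.4f}, max round-trip error={max_error:.6f}")
```

Values inside the calibrated range are reproduced to within half a quantization step, while values outside it are clipped, which is why the calibration set should be representative of real inputs.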
First, prepare some calibration data. TensorRT recommends calibration data size to be at least 500 for CNN and ViT models. The following command picks up 500 images from the tiny-imagenet dataset and converts them to a numpy-format calibration array. Reduce the calibration data size for resource constrained environments.
```bash
python image_prep.py \
    --calibration_data_size=500 \
    --output_path=calib.npy \
    --fp16 # <Optional, if the input ONNX is in FP16 precision>
```

For INT4 quantization, it is recommended to set `--calibration_data_size=64`.
The model can be quantized as an FP8, INT8 or INT4 model using either the CLI or Python API. For FP8 and INT8 quantization, you have a choice between max and entropy calibration algorithms. For INT4 quantization, awq_clip or rtn_dq algorithms can be chosen.
For NVFP4 and MXFP8 ONNX, see the PyTorch to ONNX section.
Minimum opset requirements: int8 (13+), fp8 (21+), int4 (21+). ModelOpt will automatically upgrade lower opset versions to meet these requirements.
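The constraints above can be checked before launching a quantization run. The following is a minimal sketch, assuming only the mode/method pairings and minimum opsets stated in this section and the CLI flags used in this document; the helper names (`check_config`, `needs_opset_upgrade`, `build_command`) are hypothetical and not part of ModelOpt.

```python
# Constraints restated from the text above; helper names are hypothetical.
ALLOWED_METHODS = {
    "fp8": {"max", "entropy"},
    "int8": {"max", "entropy"},
    "int4": {"awq_clip", "rtn_dq"},
}
MIN_OPSET = {"int8": 13, "fp8": 21, "int4": 21}


def check_config(quantize_mode, calibration_method):
    """Raise ValueError unless the mode/method pair is one listed above."""
    if quantize_mode not in ALLOWED_METHODS:
        raise ValueError(f"unsupported quantize_mode: {quantize_mode!r}")
    allowed = ALLOWED_METHODS[quantize_mode]
    if calibration_method not in allowed:
        raise ValueError(
            f"{quantize_mode} expects one of {sorted(allowed)}, "
            f"got {calibration_method!r}"
        )


def needs_opset_upgrade(quantize_mode, model_opset):
    """True if the model's opset is below the minimum for this mode."""
    return model_opset < MIN_OPSET[quantize_mode]


def build_command(onnx_path, quantize_mode, calibration_method,
                  calibration_data, output_path):
    """Assemble the CLI invocation as an argument list for subprocess.run."""
    check_config(quantize_mode, calibration_method)
    return [
        "python", "-m", "modelopt.onnx.quantization",
        f"--onnx_path={onnx_path}",
        f"--quantize_mode={quantize_mode}",
        f"--calibration_data={calibration_data}",
        f"--calibration_method={calibration_method}",
        f"--output_path={output_path}",
    ]


cmd = build_command(
    "vit_base_patch16_224.onnx", "int8", "max",
    "calib.npy", "vit_base_patch16_224.quant.onnx",
)
print(" ".join(cmd))
print("opset upgrade expected:", needs_opset_upgrade("int8", 13))
```

Building the command as a list rather than a single string avoids shell-quoting issues if it is later passed to `subprocess.run`.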
Using the CLI:

```bash
python -m modelopt.onnx.quantization \
    --onnx_path=vit_base_patch16_224.onnx \
    --quantize_mode=<fp8|int8|int4> \
    --calibration_data=calib.npy \
    --calibration_method=<max|entropy|awq_clip|rtn_dq> \
    --output_path=vit_base_patch16_224.quant.onnx
```

Using the Python API:

```python
from modelopt.onnx.quantization import quantize

quantize(
    onnx_path="vit_base_patch16_224.onnx",
    quantize_mode="int8",  # fp8, int8, int4 etc.
    calibration_data="calib.npy",
    calibration_method="max",  # max, entropy, awq_clip, rtn_dq etc.
    output_path="vit_base_patch16_224.quant.onnx",
)
```

The following evaluation requires the val directory of the ImageNet dataset. Alternatively, you can prepare it from this Hugging Face dataset. Once you have it, the quantized ONNX ViT model can be evaluated on the ImageNet dataset as follows:
```bash
python evaluate.py \
    --onnx_path=<path to classification model> \
    --imagenet_path=<path to the ImageNet dataset> \
    --engine_precision=stronglyTyped \
    --model_name=vit_base_patch16_224
```

This script converts the quantized ONNX model to a TensorRT engine and does the evaluation with that engine. Finally, the evaluation result will be reported as follows:
```
The top1 accuracy of the model is <accuracy score between 0-100%>
The top5 accuracy of the model is <accuracy score between 0-100%>
Inference latency of the model is <X> ms
```

This example demonstrates how to quantize a timm vision model using MXFP8, INT4 or NVFP4 precision formats, and then export it to ONNX. The script leverages the ModelOpt toolkit for both quantization and ONNX export.
Opset 20 is used to export the torch models to ONNX.
- Loads a pretrained timm torch model (default: ViT-Base).
- Quantizes the torch model to MXFP8, INT4 or NVFP4 using ModelOpt.
- Exports the quantized model to ONNX.
- Postprocesses the ONNX model to be compatible with TensorRT.
- Saves the final ONNX model.
```bash
python torch_quant_to_onnx.py \
    --timm_model_name=vit_base_patch16_224 \
    --quantize_mode=<mxfp8|nvfp4|int4_awq> \
    --onnx_save_path=<path to save the exported ONNX model>
```

If the input model is of type image classification, use the following script to evaluate it.
Note: TensorRT 10.11 or later is required to evaluate the MXFP8 or NVFP4 ONNX models.
```bash
python evaluate.py \
    --onnx_path=<path to the exported ONNX model> \
    --imagenet_path=<path to the ImageNet dataset> \
    --engine_precision=stronglyTyped \
    --model_name=vit_base_patch16_224
```

| Model | FP16 | INT4 | FP8 | NVFP4 |
|---|---|---|---|---|
| Llama-3-8B-Instruct | ✅ | ✅ | ✅ | ✅ |
| Llama3.1-8B | ✅ | ✅ | ✅ | ✅ |
| Llama3.2-3B | ✅ | ✅ | ✅ | ✅ |
| Qwen2-0.5B-Instruct | ✅ | ✅ | ✅ | ✅ |
| Qwen2-1.5B-Instruct | ✅ | ✅ | ✅ | ✅ |
| Qwen2-7B-Instruct | ✅ | ✅ | ✅ | ✅ |
| Qwen2.5-0.5B-Instruct | ✅ | ✅ | ✅ | ✅ |
| Qwen2.5-1.5B-Instruct | ✅ | ✅ | ✅ | ✅ |
| Qwen2.5-3B-Instruct | ✅ | ✅ | ✅ | ✅ |
| Qwen2.5-7B-Instruct | ✅ | ✅ | ✅ | ✅ |
Per node calibration is a memory optimization feature designed to reduce memory consumption during quantization of large ONNX models. Instead of running inference over the entire network at once, this feature processes the model node-by-node, which can significantly reduce peak memory usage and prevent out-of-memory (OOM) errors.
When per node calibration is enabled, the quantization process:
- Decomposes the model: Splits the original ONNX model into multiple single-node sub-models
- Manages dependencies: Tracks input/output dependencies between nodes to ensure correct execution order
- Processes sequentially: Runs calibration on each node individually using a topological processing order
- Manages memory: Automatically cleans up intermediate results and manages reference counting to minimize memory usage
- Aggregates results: Combines calibration data from all nodes to produce the final quantized model
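The steps above can be sketched on a toy graph. This is a simplified illustration of node-by-node execution with reference counting, not ModelOpt's implementation: the graph representation (tuples of name, inputs, output, function), the helper names, and the use of an absolute-maximum statistic are all assumptions made for the example.

```python
from collections import deque


def topological_order(nodes):
    """Order nodes so each runs after its producers (Kahn's algorithm)."""
    by_name = {name: (name, inputs, output, fn) for name, inputs, output, fn in nodes}
    producer = {output: name for name, _, output, _ in nodes}
    indegree = {name: 0 for name in by_name}
    consumers = {name: [] for name in by_name}
    for name, inputs, _, _ in nodes:
        for tensor in inputs:
            if tensor in producer:  # graph inputs have no producer node
                indegree[name] += 1
                consumers[producer[tensor]].append(name)
    ready = deque(name for name, degree in indegree.items() if degree == 0)
    order = []
    while ready:
        name = ready.popleft()
        order.append(by_name[name])
        for consumer in consumers[name]:
            indegree[consumer] -= 1
            if indegree[consumer] == 0:
                ready.append(consumer)
    if len(order) != len(nodes):
        raise ValueError("graph contains a cycle")
    return order


def calibrate_per_node(nodes, graph_inputs):
    """Run one node at a time, freeing tensors once no consumer remains."""
    order = topological_order(nodes)
    refcount = {}  # remaining uses of each tensor
    for _, inputs, _, _ in order:
        for tensor in inputs:
            refcount[tensor] = refcount.get(tensor, 0) + 1
    live = dict(graph_inputs)  # tensors currently held in memory
    amax = {}  # per-tensor calibration statistic
    peak_live = len(live)
    for _, inputs, output, fn in order:
        result = fn(*[live[tensor] for tensor in inputs])
        live[output] = result
        amax[output] = max(abs(value) for value in result)
        peak_live = max(peak_live, len(live))
        for tensor in inputs:
            refcount[tensor] -= 1
            if refcount[tensor] == 0:
                del live[tensor]  # no later node needs this tensor
    return amax, peak_live


# Toy graph, deliberately listed out of execution order.
nodes = [
    ("add", ["a", "b"], "c", lambda a, b: [p + q for p, q in zip(a, b)]),
    ("double", ["x"], "a", lambda x: [2 * v for v in x]),
    ("negate_half", ["a"], "b", lambda a: [-0.5 * v for v in a]),
]
amax, peak_live = calibrate_per_node(nodes, {"x": [1.0, -3.0, 2.0]})
print("per-tensor amax:", amax)
print("peak live tensors:", peak_live)
```

In this toy case the graph touches four tensors in total, but no more than three are held at once because the input `x` is released as soon as its only consumer has run.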
Per node calibration is particularly beneficial for:
- Large models that cause OOM errors during standard calibration
- Memory-constrained environments where GPU memory is limited
- Models with complex architectures that have high intermediate memory requirements
To enable per node calibration, add the --calibrate_per_node flag to your quantization command:
```bash
python -m modelopt.onnx.quantization \
    --onnx_path=vit_base_patch16_224.onnx \
    --quantize_mode=<int8|fp8> \
    --calibration_data=calib.npy \
    --calibrate_per_node \
    --output_path=vit_base_patch16_224.quant.onnx
```

Note: Per node calibration is not available for INT4 quantization methods (`awq_clip`, `rtn_dq`).
Quantizing an ONNX model that contains custom ops backed by a TensorRT plugin requires TensorRT 10+ and ORT>=1.20. For proper usage, please make sure that the paths to libcudnn*.so and TensorRT lib/ are in the LD_LIBRARY_PATH env variable and that the tensorrt python package is installed.
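These environment requirements can be checked up front. The sketch below is one possible check, assuming the libraries are discoverable by file name in the directories listed in `LD_LIBRARY_PATH`; the helper `find_library_dirs` is hypothetical, and the patterns carry a trailing `*` so that versioned names such as `libnvinfer.so.10` also match.

```python
import fnmatch
import importlib.util
import os


def find_library_dirs(pattern, search_path):
    """Return directories on search_path holding a file that matches pattern."""
    hits = []
    for directory in search_path.split(os.pathsep):
        if not directory or not os.path.isdir(directory):
            continue
        if any(fnmatch.fnmatch(name, pattern) for name in os.listdir(directory)):
            hits.append(directory)
    return hits


ld_path = os.environ.get("LD_LIBRARY_PATH", "")
for pattern in ("libcudnn*.so*", "libnvinfer.so*"):
    dirs = find_library_dirs(pattern, ld_path)
    print(f"{pattern}: {dirs if dirs else 'not found on LD_LIBRARY_PATH'}")

# find_spec returns None when a top-level package is not installed.
has_tensorrt = importlib.util.find_spec("tensorrt") is not None
print("tensorrt python package installed:", has_tensorrt)
```

The check only confirms that matching files exist; it does not verify library versions.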
Please see the sample example below.
Step 1: Obtain the sample ONNX model and TensorRT plugin from TensorRT-Custom-Plugin-Example.
1.1. Change directory to TensorRT-Custom-Plugin-Example:
```bash
cd /path/to/TensorRT-Custom-Plugin-Example
```

1.2. Compile the TensorRT plugin:

```bash
cmake -B build \
    -DNVINFER_LIB=$TRT_LIBPATH/libnvinfer.so.10 \
    -DNVINFER_PLUGIN_LIB=$TRT_LIBPATH/libnvinfer_plugin.so.10 \
    -DNVONNXPARSER_LIB=$TRT_LIBPATH/libnvonnxparser.so.10 \
    -DCMAKE_CXX_STANDARD_INCLUDE_DIRECTORIES=/usr/include/x86_64-linux-gnu
cmake --build build --config Release --parallel
```

This generates a plugin in `TensorRT-Custom-Plugin-Example/build/src/plugins/IdentityConvIPluginV2IOExt/libidentity_conv_iplugin_v2_io_ext.so`.
1.3. Create the ONNX file.
```bash
python scripts/create_identity_neural_network.py
```

This generates the `identity_neural_network.onnx` model in `TensorRT-Custom-Plugin-Example/data/identity_neural_network.onnx`.
Step 2: Quantize the ONNX model. We will be using the libidentity_conv_iplugin_v2_io_ext.so plugin for this example.
```bash
python -m modelopt.onnx.quantization \
    --onnx_path=/path/to/identity_neural_network.onnx \
    --trt_plugins=/path/to/libidentity_conv_iplugin_v2_io_ext.so
```

Step 3: Deploy the quantized model with TensorRT.
```bash
trtexec --onnx=/path/to/identity_neural_network.quant.onnx \
    --staticPlugins=/path/to/libidentity_conv_iplugin_v2_io_ext.so
```

- Ready-to-deploy checkpoints that can be exported to ONNX format (if supported as per the Support Matrix) [🤗 Hugging Face - Nvidia TensorRT Model Optimizer Collection]
There are many quantization schemes supported in the example scripts:
- The FP8 format is available on Hopper and Ada GPUs with CUDA compute capability greater than or equal to 8.9.
- INT4 AWQ is an INT4 weight-only quantization and calibration method. It is particularly effective for low-batch inference, where latency is dominated by weight loading time rather than by computation. For low-batch inference, INT4 AWQ can give lower latency than FP8/INT8 and lower accuracy degradation than INT8.
- NVFP4 is one of the new FP4 formats supported by NVIDIA Blackwell GPUs and demonstrates good accuracy compared with other 4-bit alternatives. NVFP4 can be applied to both model weights and activations, providing the potential for both a significant increase in math throughput and reductions in memory footprint and memory bandwidth usage compared to the FP8 data format on Blackwell.
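The trade-offs above can be made concrete with a little arithmetic. The sketch below estimates nominal weight storage per format and applies the FP8 compute-capability threshold stated above. It is a back-of-the-envelope illustration: the helper names are hypothetical, the 8-billion parameter count is an assumed round number, and the estimate ignores scale factors and other metadata that quantized formats store alongside the weights.

```python
# Nominal bits per weight; per-block or per-channel scale factors are ignored.
BITS_PER_WEIGHT = {"fp16": 16, "fp8": 8, "int8": 8, "int4": 4, "nvfp4": 4}


def weight_footprint_gib(num_params, fmt):
    """Nominal weight storage in GiB for a model with num_params weights."""
    return num_params * BITS_PER_WEIGHT[fmt] / 8 / 2**30


def supports_fp8(compute_capability):
    """FP8 needs CUDA compute capability >= 8.9, per the note above."""
    return tuple(compute_capability) >= (8, 9)


num_params = 8_000_000_000  # assumed round number for illustration
baseline = weight_footprint_gib(num_params, "fp16")
for fmt in ("fp16", "fp8", "int8", "int4", "nvfp4"):
    size = weight_footprint_gib(num_params, fmt)
    print(f"{fmt:>5}: {size:6.2f} GiB ({baseline / size:.0f}x smaller than fp16)")

print("FP8 on compute capability 8.6:", supports_fp8((8, 6)))
print("FP8 on compute capability 8.9:", supports_fp8((8, 9)))
```

Because low-batch inference is dominated by reading weights from memory, the smaller footprint of a 4-bit format translates fairly directly into less data to load per generated token, which is the intuition behind the INT4 AWQ latency claim.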