ONNX Post-training quantization (PTQ)

This ONNX PTQ Toolkit provides a suite of tools for optimizing ONNX (Open Neural Network Exchange) models through quantization. It is aimed at developers who want to reduce model size and speed up inference with minimal loss of accuracy when deploying their neural networks with TensorRT.

Quantization is an effective model optimization technique that compresses your models. Quantization with Model Optimizer can compress model size by 2x-4x, speeding up inference while preserving model quality.
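
As a rough sanity check of the 2x-4x figure, the sketch below estimates weight storage at different bit widths. It assumes an FP16 baseline, counts weights only, and ignores scale factors and any layers left unquantized, so real savings will be somewhat smaller. The parameter count is an illustrative value of roughly ViT-Base size, and the helper names are hypothetical.

```python
# Bits per weight for each format (weights only; scale-factor overhead is ignored).
BITS_PER_WEIGHT = {"fp16": 16, "fp8": 8, "int8": 8, "int4": 4, "nvfp4": 4}


def weight_size_mb(num_params: int, fmt: str) -> float:
    """Approximate weight storage in megabytes (10**6 bytes)."""
    return num_params * BITS_PER_WEIGHT[fmt] / 8 / 1e6


def compression_ratio(fmt: str, baseline: str = "fp16") -> float:
    """How many times smaller the weights are compared with the baseline format."""
    return BITS_PER_WEIGHT[baseline] / BITS_PER_WEIGHT[fmt]


num_params = 86_000_000  # assumption: illustrative count, roughly ViT-Base sized
for fmt in BITS_PER_WEIGHT:
    size = weight_size_mb(num_params, fmt)
    print(f"{fmt:>6}: {size:7.1f} MB ({compression_ratio(fmt):.0f}x vs fp16)")
```

With an FP32 baseline the same arithmetic gives 4x for 8-bit formats and 8x for 4-bit formats.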

Model Optimizer enables highly performant quantization formats, including NVFP4, FP8, INT8, and INT4, and supports advanced algorithms such as AWQ and Double Quantization through easy-to-use Python APIs.

Section	Description	Link	Docs
Pre-Requisites	Required & optional packages to use this technique	Link
Getting Started	Learn how to optimize your models using PTQ to reduce precision and improve inference efficiency	Link	docs
Support Matrix	View the ONNX export supported LLM models	Link
PyTorch to ONNX	Example scripts demonstrating how to quantize with PyTorch and then convert to ONNX	Link
Advanced Features	Examples demonstrating the use of advanced ONNX quantization features	Link
Pre-Quantized Checkpoints	Ready to deploy Hugging Face pre-quantized checkpoints	Link
Resources	Extra links to relevant resources	Link

Pre-Requisites

Docker

Please use the TensorRT docker image (e.g., nvcr.io/nvidia/tensorrt:25.08-py3) or visit our installation docs for more information.

Set the following environment variables inside the TensorRT docker.

export CUDNN_LIB_DIR=/usr/lib/x86_64-linux-gnu/
export LD_LIBRARY_PATH="${CUDNN_LIB_DIR}:${LD_LIBRARY_PATH}"

Also follow the installation steps below to upgrade to the latest version of Model Optimizer and install example-specific dependencies.

Local Installation

Install Model Optimizer with onnx dependencies using pip from PyPI and install the requirements for the example:

pip install -U nvidia-modelopt[onnx]
pip install -r requirements.txt

For TensorRT Compiler framework workloads:

Install the latest TensorRT from here.

Getting Started

Prepare the example model

Most of the examples in this doc use vit_base_patch16_224.onnx as the input model. The model can be downloaded with the following script:

python download_example_onnx.py \
    --vit \
    --onnx_save_path=vit_base_patch16_224.onnx \
    --fp16 # <Optional, if the desired output ONNX precision is FP16>

Prepare calibration data

Calibration data is a representative subset of your training or validation dataset used during quantization to determine the optimal scale factors for converting floating-point values to lower precision formats (INT8, FP8, INT4). This data helps maintain model accuracy after quantization by analyzing the distribution of activations throughout the network.
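
To make "scale factor" concrete, here is a minimal NumPy sketch of symmetric per-tensor INT8 quantization with max calibration: the scale is the largest absolute value seen in the calibration data divided by 127. It illustrates the general idea only and is not ModelOpt's implementation; the function names are hypothetical and the random array stands in for real activations.

```python
import numpy as np


def max_calibration_scale(calib_activations: np.ndarray, num_bits: int = 8) -> float:
    """Symmetric per-tensor scale from the largest absolute calibration value."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT8
    amax = float(np.abs(calib_activations).max())
    return amax / qmax


def fake_quantize(x: np.ndarray, scale: float, num_bits: int = 8) -> np.ndarray:
    """Quantize to integers, then dequantize back to float to expose the rounding error."""
    qmax = 2 ** (num_bits - 1) - 1
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale


rng = np.random.default_rng(0)
calib = rng.normal(size=(500, 16)).astype(np.float32)  # stand-in for real activations
scale = max_calibration_scale(calib)
error = float(np.abs(fake_quantize(calib, scale) - calib).max())
print(f"scale={scale:.5f}, max abs rounding error={error:.5f}")
```

Because every calibration value lies inside the representable range, the rounding error stays within half a quantization step. If the calibration set misses the large activations seen at inference time, those values are clipped instead, which is why the data needs to be representative.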

TensorRT recommends a calibration set of at least 500 samples for CNN and ViT models. The following command selects 500 images from the tiny-imagenet dataset and converts them to a NumPy calibration array. Reduce the calibration data size in resource-constrained environments.

python image_prep.py \
    --calibration_data_size=500 \
    --output_path=calib.npy \
    --fp16 # <Optional, if the input ONNX is in FP16 precision>

For INT4 quantization, it is recommended to set --calibration_data_size=64.
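
If a 500-sample calibration file already exists, a smaller file for INT4 can be derived from it rather than regenerated. The sketch below assumes the first axis of the saved array indexes samples, which should be verified against the actual output of image_prep.py. The helper name and the calib_64.npy file name are hypothetical, and the demo uses a small synthetic array in a temporary directory.

```python
import os
import tempfile

import numpy as np


def subsample_calibration(src_path, dst_path, num_samples, seed=0):
    """Save a random subset of calibration samples (assumes axis 0 indexes samples)."""
    data = np.load(src_path)
    if num_samples > data.shape[0]:
        raise ValueError(f"requested {num_samples} samples, file has {data.shape[0]}")
    rng = np.random.default_rng(seed)
    # Sample without replacement and keep the original ordering.
    idx = np.sort(rng.choice(data.shape[0], size=num_samples, replace=False))
    subset = data[idx]
    np.save(dst_path, subset)
    return subset.shape


# Demo on a synthetic stand-in; real calibration arrays are much larger.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "calib.npy")
    dst = os.path.join(tmp, "calib_64.npy")
    np.save(src, np.zeros((500, 3, 8, 8), dtype=np.float32))
    demo_shape = subsample_calibration(src, dst, 64)
    print(demo_shape)
```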

Quantize ONNX Model to FP8, INT8 or INT4

The model can be quantized as an FP8, INT8 or INT4 model using either the CLI or Python API. For FP8 and INT8 quantization, you have a choice between max and entropy calibration algorithms. For INT4 quantization, awq_clip or rtn_dq algorithms can be chosen.

For NVFP4 and MXFP8 ONNX, see the PyTorch to ONNX section.

Minimum opset requirements: int8 (13+), fp8 (21+), int4 (21+). ModelOpt will automatically upgrade lower opset versions to meet these requirements.
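
The opset rule above can be written down as a small lookup. This is only a sketch of the stated requirement, with a hypothetical helper name; how ModelOpt performs the upgrade internally is not shown here.

```python
# Minimum ONNX opset per quantization mode, as listed above.
MIN_OPSET = {"int8": 13, "fp8": 21, "int4": 21}


def target_opset(current_opset: int, quantize_mode: str) -> int:
    """Opset after quantization: unchanged if already high enough, else the minimum."""
    return max(current_opset, MIN_OPSET[quantize_mode])


# With the onnx package installed, the current default-domain opset can be read with:
#   model = onnx.load("vit_base_patch16_224.onnx")
#   current = next(o.version for o in model.opset_import if o.domain in ("", "ai.onnx"))
for mode in MIN_OPSET:
    print(f"opset 17 model, {mode}: target opset {target_opset(17, mode)}")
```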

Option 1: Command-line interface

python -m modelopt.onnx.quantization \
    --onnx_path=vit_base_patch16_224.onnx \
    --quantize_mode=<fp8|int8|int4> \
    --calibration_data=calib.npy \
    --calibration_method=<max|entropy|awq_clip|rtn_dq> \
    --output_path=vit_base_patch16_224.quant.onnx
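
The mode and method pairings described above are easy to mix up when scripting several runs. The following sketch, with a hypothetical helper name, checks the pairing before assembling the same command line as an argument list. It uses only the flags shown above and does not run anything itself.

```python
import shlex

# Calibration methods per quantization mode, as described in the text above.
VALID_METHODS = {
    "fp8": {"max", "entropy"},
    "int8": {"max", "entropy"},
    "int4": {"awq_clip", "rtn_dq"},
}


def build_quantize_command(onnx_path, quantize_mode, calibration_method,
                           calibration_data, output_path):
    """Return the CLI invocation as an argument list, rejecting mismatched pairs."""
    if quantize_mode not in VALID_METHODS:
        raise ValueError(f"unsupported quantize_mode: {quantize_mode!r}")
    allowed = VALID_METHODS[quantize_mode]
    if calibration_method not in allowed:
        raise ValueError(
            f"{calibration_method!r} is not valid for {quantize_mode}; "
            f"choose one of {sorted(allowed)}"
        )
    return [
        "python", "-m", "modelopt.onnx.quantization",
        f"--onnx_path={onnx_path}",
        f"--quantize_mode={quantize_mode}",
        f"--calibration_data={calibration_data}",
        f"--calibration_method={calibration_method}",
        f"--output_path={output_path}",
    ]


cmd = build_quantize_command(
    "vit_base_patch16_224.onnx", "int8", "entropy",
    "calib.npy", "vit_base_patch16_224.quant.onnx",
)
print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True) to launch the quantization.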

Option 2: Python API

from modelopt.onnx.quantization import quantize

quantize(
    onnx_path="vit_base_patch16_224.onnx",
    quantize_mode="int8",       # fp8, int8, int4 etc.
    calibration_data="calib.npy",
    calibration_method="max",   # max, entropy, awq_clip, rtn_dq etc.
    output_path="vit_base_patch16_224.quant.onnx",
)

Evaluate the quantized ONNX model

The following evaluation requires the val directory of the ImageNet dataset. Alternatively, you can prepare it from this Hugging Face dataset. Once you have it, the quantized ONNX ViT model can be evaluated on the ImageNet dataset as follows:

python evaluate.py \
    --onnx_path=<path to classification model> \
    --imagenet_path=<path to the ImageNet dataset> \
    --engine_precision=stronglyTyped \
    --model_name=vit_base_patch16_224

This script converts the quantized ONNX model to a TensorRT engine and evaluates that engine. The results are reported as follows:

The top1 accuracy of the model is <accuracy score between 0-100%>
The top5 accuracy of the model is <accuracy score between 0-100%>
Inference latency of the model is <X> ms
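
For reference, top-k accuracy is the percentage of samples whose true label is among the k highest-scoring classes. The NumPy sketch below shows that definition on a tiny hand-made example. It is not the code in evaluate.py, and since the toy data has only four classes it uses k=2 where the report above uses k=5.

```python
import numpy as np


def topk_accuracy(logits: np.ndarray, labels: np.ndarray, k: int) -> float:
    """Percentage of samples whose true label is among the k highest-scoring classes."""
    # argpartition puts the k largest entries of each row in the last k slots (unordered).
    topk = np.argpartition(logits, -k, axis=1)[:, -k:]
    hits = (topk == labels[:, None]).any(axis=1)
    return 100.0 * float(hits.mean())


logits = np.array([
    [0.1, 0.7, 0.2, 0.0],  # highest score: class 1
    [0.5, 0.1, 0.3, 0.1],  # highest score: class 0, runner-up: class 2
    [0.3, 0.2, 0.1, 0.5],  # highest score: class 3, runner-up: class 0
])
labels = np.array([1, 2, 2])

top1 = topk_accuracy(logits, labels, k=1)
top2 = topk_accuracy(logits, labels, k=2)
print(f"top1={top1:.2f}%, top2={top2:.2f}%")
```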

Torch quantization to ONNX example for MXFP8, INT4 or NVFP4 precision

This example demonstrates how to quantize a timm vision model using MXFP8, INT4 or NVFP4 precision formats, and then export it to ONNX. The script leverages the ModelOpt toolkit for both quantization and ONNX export.

Opset 20 is used to export the torch models to ONNX.

What it does

Loads a pretrained timm torch model (default: ViT-Base).
Quantizes the torch model to MXFP8, INT4 or NVFP4 using ModelOpt.
Exports the quantized model to ONNX.
Postprocesses the ONNX model to be compatible with TensorRT.
Saves the final ONNX model.

Usage

python torch_quant_to_onnx.py \
    --timm_model_name=vit_base_patch16_224 \
    --quantize_mode=<mxfp8|nvfp4|int4_awq> \
    --onnx_save_path=<path to save the exported ONNX model>

Evaluation

If the input model is an image classification model, use the following script to evaluate it.

Note: TensorRT 10.11 or later is required to evaluate the MXFP8 or NVFP4 ONNX models.

python evaluate.py \
    --onnx_path=<path to the exported ONNX model> \
    --imagenet_path=<path to the ImageNet dataset> \
    --engine_precision=stronglyTyped \
    --model_name=vit_base_patch16_224

ONNX Export Supported LLM Models

Model	FP16	INT4	FP8	NVFP4
Llama-3-8B-Instruct	✅	✅	✅	✅
Llama3.1-8B	✅	✅	✅	✅
Llama3.2-3B	✅	✅	✅	✅
Qwen2-0.5B-Instruct	✅	✅	✅	✅
Qwen2-1.5B-Instruct	✅	✅	✅	✅
Qwen2-7B-Instruct	✅	✅	✅	✅
Qwen2.5-0.5B-Instruct	✅	✅	✅	✅
Qwen2.5-1.5B-Instruct	✅	✅	✅	✅
Qwen2.5-3B-Instruct	✅	✅	✅	✅
Qwen2.5-7B-Instruct	✅	✅	✅	✅

Advanced Features

Per node calibration of ONNX models

Per node calibration is a memory optimization feature designed to reduce memory consumption during quantization of large ONNX models. Instead of running inference over the entire network at once, this feature processes the model node-by-node, which can significantly reduce peak memory usage and prevent out-of-memory (OOM) errors.

How it works

When per node calibration is enabled, the quantization process:

Decomposes the model: Splits the original ONNX model into multiple single-node sub-models
Manages dependencies: Tracks input/output dependencies between nodes to ensure correct execution order
Processes sequentially: Runs calibration on each node individually using a topological processing order
Manages memory: Automatically cleans up intermediate results and manages reference counting to minimize memory usage
Aggregates results: Combines calibration data from all nodes to produce the final quantized model
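
The steps above can be mimicked on a toy graph. The sketch below is plain Python and NumPy, not ONNX and not ModelOpt's implementation: each node is a small function, nodes run one at a time in topological order, a reference count frees each intermediate tensor once its last consumer has run, and the per-tensor maximum absolute values are collected as the "calibration result". All names are hypothetical.

```python
from graphlib import TopologicalSorter

import numpy as np

# Toy graph: node name -> (function, input tensor names, output tensor name).
NODES = {
    "matmul": (lambda x: x @ np.full((4, 4), 0.5), ["input"], "t0"),
    "relu": (lambda x: np.maximum(x, 0.0), ["t0"], "t1"),
    "scale": (lambda x: x * 3.0, ["t0"], "t2"),
    "add": (lambda a, b: a + b, ["t1", "t2"], "output"),
}


def calibrate_per_node(nodes, graph_inputs):
    """Run one node at a time, record amax per output, free tensors no longer needed."""
    producer = {out: name for name, (_, _, out) in nodes.items()}
    # Node-level dependencies derived from the tensor names.
    deps = {name: {producer[t] for t in ins if t in producer}
            for name, (_, ins, _) in nodes.items()}
    # Reference count: how many node inputs still need each tensor.
    refcount = {}
    for _, ins, _ in nodes.values():
        for t in ins:
            refcount[t] = refcount.get(t, 0) + 1

    live = dict(graph_inputs)  # tensors currently held in memory
    amax = {}
    peak_live = len(live)
    for name in TopologicalSorter(deps).static_order():
        fn, ins, out = nodes[name]
        live[out] = fn(*(live[t] for t in ins))
        amax[out] = float(np.abs(live[out]).max())
        peak_live = max(peak_live, len(live))
        for t in ins:
            refcount[t] -= 1
            if refcount[t] == 0:
                del live[t]  # last consumer has run; release the tensor
    return amax, peak_live


amax, peak_live = calibrate_per_node(NODES, {"input": np.array([[1.0, 2.0, 3.0, -4.0]])})
print(amax)
print(f"peak live tensors: {peak_live} of 5 total")
```

In this toy run at most three of the five tensors are alive at the same time, which is the effect per node calibration relies on to lower peak memory.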

When to use per node calibration

Per node calibration is particularly beneficial for:

Large models that cause OOM errors during standard calibration
Memory-constrained environments where GPU memory is limited
Models with complex architectures that have high intermediate memory requirements

Usage

To enable per node calibration, add the --calibrate_per_node flag to your quantization command:

python -m modelopt.onnx.quantization \
    --onnx_path=vit_base_patch16_224.onnx \
    --quantize_mode=<int8|fp8> \
    --calibration_data=calib.npy \
    --calibrate_per_node \
    --output_path=vit_base_patch16_224.quant.onnx

Note: Per node calibration is not available for INT4 quantization methods (awq_clip, rtn_dq).

Quantize an ONNX model with custom op

This feature requires TensorRT 10+ and ORT>=1.20. For proper usage, please make sure that the paths to libcudnn*.so and TensorRT lib/ are in the LD_LIBRARY_PATH env variable and that the tensorrt python package is installed.
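
A quick way to check the library-path requirement is to scan the directories listed in LD_LIBRARY_PATH for the expected shared objects. The sketch below is a hypothetical helper that only inspects that variable, so libraries found through other loader mechanisms are reported as missing. The glob patterns, which also match versioned file names such as libnvinfer.so.10, are assumptions.

```python
import glob
import os


def find_libs(patterns, search_path=None):
    """Map each glob pattern to matching files in an LD_LIBRARY_PATH-style string."""
    if search_path is None:
        search_path = os.environ.get("LD_LIBRARY_PATH", "")
    dirs = [d for d in search_path.split(os.pathsep) if d]
    return {
        pattern: sorted(
            match for d in dirs for match in glob.glob(os.path.join(d, pattern))
        )
        for pattern in patterns
    }


found = find_libs(["libcudnn*.so*", "libnvinfer.so*"])
for pattern, matches in found.items():
    status = "ok" if matches else "MISSING"
    print(f"{pattern}: {status}")
```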

Please see the sample example below.

Step 1: Obtain the sample ONNX model and TensorRT plugin from TensorRT-Custom-Plugin-Example.

1.1. Change directory to TensorRT-Custom-Plugin-Example:

cd /path/to/TensorRT-Custom-Plugin-Example

1.2. Compile the TensorRT plugin:

cmake -B build \
    -DNVINFER_LIB=$TRT_LIBPATH/libnvinfer.so.10 \
    -DNVINFER_PLUGIN_LIB=$TRT_LIBPATH/libnvinfer_plugin.so.10 \
    -DNVONNXPARSER_LIB=$TRT_LIBPATH/libnvonnxparser.so.10 \
    -DCMAKE_CXX_STANDARD_INCLUDE_DIRECTORIES=/usr/include/x86_64-linux-gnu

cmake --build build --config Release --parallel

This generates a plugin in TensorRT-Custom-Plugin-Example/build/src/plugins/IdentityConvIPluginV2IOExt/libidentity_conv_iplugin_v2_io_ext.so

1.3. Create the ONNX file.

python scripts/create_identity_neural_network.py

This generates the identity_neural_network.onnx model in TensorRT-Custom-Plugin-Example/data/identity_neural_network.onnx

Step 2: Quantize the ONNX model. We will be using the libidentity_conv_iplugin_v2_io_ext.so plugin for this example.

python -m modelopt.onnx.quantization \
    --onnx_path=/path/to/identity_neural_network.onnx \
    --trt_plugins=/path/to/libidentity_conv_iplugin_v2_io_ext.so

Step 3: Deploy the quantized model with TensorRT.

trtexec --onnx=/path/to/identity_neural_network.quant.onnx \
    --staticPlugins=/path/to/libidentity_conv_iplugin_v2_io_ext.so

Pre-Quantized Checkpoints

Ready-to-deploy checkpoints that can be exported to ONNX format (if supported as per the Support Matrix) [🤗 Hugging Face - Nvidia TensorRT Model Optimizer Collection]

Resources

Technical Resources

There are many quantization schemes supported in the example scripts:

The FP8 format is available on the Hopper and Ada GPUs with CUDA compute capability greater than or equal to 8.9.
INT4 AWQ is an INT4 weight-only quantization and calibration method. It is particularly effective for low-batch inference, where latency is dominated by weight loading time rather than computation time. For low-batch inference, INT4 AWQ can give lower latency than FP8/INT8 and lower accuracy degradation than INT8.
NVFP4 is one of the new FP4 formats supported by NVIDIA Blackwell GPUs and demonstrates good accuracy compared with other 4-bit alternatives. It can be applied to both model weights and activations, offering the potential for a significant increase in math throughput and reductions in memory footprint and memory bandwidth usage compared to the FP8 data format on Blackwell.
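
The FP8 hardware requirement above reduces to a comparison on the (major, minor) compute capability. The sketch below uses a hypothetical helper name, and the GPU entries are a few well-known examples included for illustration only.

```python
def supports_fp8(compute_capability) -> bool:
    """FP8 needs CUDA compute capability 8.9 or higher, per the note above."""
    return tuple(compute_capability) >= (8, 9)


# With PyTorch installed and a GPU present, the capability of device 0 can be read with:
#   torch.cuda.get_device_capability(0)  # returns (major, minor)
examples = {"A100": (8, 0), "L40S": (8, 9), "H100": (9, 0)}
for name, cc in examples.items():
    print(f"{name} (sm_{cc[0]}{cc[1]}): FP8 supported = {supports_fp8(cc)}")
```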
