Quantization is an effective model optimization technique that compresses your models. Quantization with Model Optimizer can compress model size by 2x-4x, speeding up inference while preserving model quality.
Model Optimizer enables highly performant quantization formats including NVFP4, FP8, INT8, INT4 and supports advanced algorithms such as SmoothQuant, AWQ, SVDQuant, and Double Quantization with easy-to-use Python APIs.
This section focuses on post-training quantization (PTQ), a technique that reduces model precision after training to improve inference efficiency without requiring retraining.
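As a rough illustration of where a 2x-4x size reduction comes from, the sketch below estimates weight storage at different bit widths. It is back-of-the-envelope arithmetic under stated assumptions: the 8B parameter count is hypothetical, `weight_footprint_gib` is an illustrative helper (not a Model Optimizer API), the baseline is assumed to be 16-bit weights, and scaling factors and other metadata are ignored.

```python
def weight_footprint_gib(num_params: int, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB, ignoring scales and other metadata."""
    return num_params * bits_per_weight / 8 / 2**30


num_params = 8_000_000_000  # hypothetical 8B-parameter model (assumption)
baseline = weight_footprint_gib(num_params, 16)  # assumed 16-bit baseline

for name, bits in [("FP16/BF16", 16), ("FP8/INT8", 8), ("INT4/NVFP4", 4)]:
    size = weight_footprint_gib(num_params, bits)
    print(f"{name:>10}: {size:6.2f} GiB ({baseline / size:.0f}x vs. 16-bit)")
```

Real checkpoints will be somewhat larger than this estimate, because quantized formats also store scaling factors and some layers may stay in higher precision.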
| Section | Description | Link | Docs |
|---|---|---|---|
| Pre-Requisites | Required & optional packages to use this technique | [Link] | |
| Getting Started | Learn how to optimize your models using PTQ to reduce precision and improve inference efficiency | [Link] | [docs] |
| Support Matrix | View the support matrix to see quantization compatibility and feature availability across different models | [Link] | |
| AutoQuantize | Automatically chooses layers/precisions for mixed precision quantization to improve the trade-off between inference performance and accuracy | [Link] | [docs] |
| Real Quant | Real Quant compresses model weights in a low-precision format to reduce memory requirements of quantization. | [Link] | |
| Framework Scripts | Example scripts demonstrating quantization techniques for optimizing Hugging Face / NeMo / Megatron-LM models | [Link] | |
| Evaluate Accuracy | Evaluate your model's accuracy! | [Link] | |
| Exporting Checkpoints | Export to Hugging Face Unified Checkpoint and deploy on TRT-LLM/vLLM/SGLang | [Link] | [docs] |
| Pre-Quantized Checkpoints | Ready to deploy Hugging Face pre-quantized checkpoints | [Link] | |
| Resources | Extra links to relevant resources | [Link] | |
For Hugging Face models, please use the TensorRT-LLM docker image (e.g., nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc2.post2).
For NeMo models, use the NeMo container (e.g., nvcr.io/nvidia/nemo:25.07).
Visit our installation docs for more information.
Also follow the installation steps below to upgrade to the latest version of Model Optimizer and install example-specific dependencies.
For Hugging Face models, install Model Optimizer with hf dependencies using pip from PyPI and install the requirements for the example:
pip install -U nvidia-modelopt[hf]
pip install -r requirements.txt

For TensorRT-LLM deployment, please use the TensorRT-LLM docker image or follow their installation docs. Similarly, for vLLM or SGLang deployment, please use their installation docs.
With the simple API below, you can use Model Optimizer to quantize your model. Model Optimizer achieves this by converting your model to the desired precision and then using a small dataset (typically 128-512 samples) to calibrate the quantization scaling factors. The accuracy of PTQ is typically robust across different choices of calibration data; by default, Model Optimizer uses cnn_dailymail. You can try other datasets by modifying calib_set.
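To make the calibration step concrete, here is a minimal sketch of what calibrating a scaling factor can look like, assuming plain max calibration for a symmetric INT8 quantizer. The function names and sample values are hypothetical; Model Optimizer's own calibrators operate on tensors, and algorithms such as SmoothQuant and AWQ do more than this.

```python
def calibrate_amax(batches):
    """Track the largest absolute value seen across all calibration batches."""
    amax = 0.0
    for batch in batches:
        amax = max(amax, max(abs(x) for x in batch))
    return amax


def fake_quantize_int8(x, amax):
    """Round x onto the symmetric INT8 grid defined by amax, then dequantize."""
    scale = amax / 127.0                       # one INT8 step in real units
    q = max(-127, min(127, round(x / scale)))  # quantize and clip to [-127, 127]
    return q * scale


# Made-up activation values standing in for a calibration set (assumption).
calib_batches = [[0.1, -0.5, 2.0], [1.5, -2.54, 0.3]]
amax = calibrate_amax(calib_batches)

for value in (0.3, 1.234, 10.0):
    print(value, "->", fake_quantize_int8(value, amax))
```

Values inside the calibrated range are rounded to the nearest grid point, while values beyond amax are clipped, which is why the calibration samples should resemble the data seen at inference time.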
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM

# Set up the model
model = AutoModelForCausalLM.from_pretrained("...")

# Simplified example: set up a calibration data loader with the desired calib_size
calib_set = get_dataloader(num_samples=calib_size)

# Define a forward loop that runs the calibration set through the model
def forward_loop(model):
    for batch in calib_set:
        model(batch)

# PTQ with in-place replacement to quantized modules
model = mtq.quantize(model, mtq.INT8_SMOOTHQUANT_CFG, forward_loop)

Once your model is quantized, you can export it to a checkpoint for easy deployment.
We provide two APIs to export the quantized model:
- Unified Hugging Face checkpoints, which can be deployed on TensorRT-LLM (PyTorch and C++ backends), vLLM and SGLang.
- (Legacy) TensorRT-LLM checkpoints, a format that works with TensorRT-LLM C++ backend only.
import torch
from modelopt.torch.export import export_hf_checkpoint

with torch.inference_mode():
    export_hf_checkpoint(
        model,  # The quantized model.
        export_dir,  # The directory where the exported files will be stored.
    )

You can specify the inference-time TP and PP sizes, and the export API will organize the weights to fit the target GPUs.
import torch
from modelopt.torch.export import export_tensorrt_llm_checkpoint

with torch.inference_mode():
    export_tensorrt_llm_checkpoint(
        model,  # The quantized model.
        decoder_type,  # The type of the model, e.g. gpt, gptj, or llama.
        dtype,  # The exported weights data type.
        export_dir,  # The directory where the exported files will be stored.
        inference_tensor_parallel,  # The number of GPUs used for tensor parallelism at inference time.
        inference_pipeline_parallel,  # The number of GPUs used for pipeline parallelism at inference time.
        use_nfs_workspace,  # If exporting in a multi-node setup, specify a shared directory like NFS for cross-node communication.
    )

After the TensorRT-LLM checkpoint export, you can use the trtllm-build command to build the engines from the exported checkpoints. Please check the TensorRT-LLM Build API documentation for reference.
Please reference our framework scripts and our docs for more details.
| Model | fp8 | int8_sq | int4_awq | w4a8_awq <sup>1</sup> | nvfp4 <sup>5</sup> |
|---|---|---|---|---|---|
| LLAMA 3.x | ✅ | ❌ | ✅ | ✅<sup>3</sup> | ✅ |
| LLAMA 4 <sup>6</sup> | ✅ | ❌ | ❌ | ❌ | ✅ |
| Mixtral | ✅ | ❌ | ✅<sup>2</sup> | ❌ | ✅ |
| Phi-3,4 | ✅ | ✅ | ✅ | ✅<sup>3</sup> | - |
| Phi-3.5 MOE | ✅ | ❌ | ❌ | ❌ | - |
| Llama-Nemotron Super | ✅ | ❌ | ❌ | ❌ | ✅ |
| Llama-Nemotron Ultra | ✅ | ❌ | ❌ | ❌ | ❌ |
| Gemma 3 | ✅<sup>2</sup> | - | ✅ | - | - |
| QWen 2, 2.5 <sup>4</sup> | ✅ | ✅ | ✅ | ✅ | ✅ |
| QWen3 MOE <sup>6</sup> | ✅ | - | - | - | ✅ |
| QwQ | ✅ | - | - | - | ✅ |
| T5 | ✅ | ✅ | ✅ | ✅ | - |
| Whisper | ✅ | ❌ | ❌ | ❌ | - |
This is a subset of the supported models. For the full list, please check the TensorRT-LLM support matrix.

1. The w4a8_awq format is an experimental quantization scheme that may result in a higher accuracy penalty.
2. For some models, only exporting quantized checkpoints is supported.
3. W4A8_AWQ is available on some models only.
4. For some models, KV cache quantization may result in a higher accuracy penalty.
5. A selected set of popular models is tested internally; the actual list of supported models may be longer. NVFP4 inference requires Blackwell GPUs and TensorRT-LLM v0.17 or later.
6. Some models currently support export to HF format only.
The accuracy loss after PTQ varies with the model and the quantization method, and it is usually more significant when the base model is small. If the accuracy after PTQ does not meet your requirements, try either disabling KV cache quantization by modifying hf_ptq.py or using QAT instead.
You can also create your own custom config using this guide.
Please refer to the NeMo 2.0 PTQ documentation for supported models.
AutoQuantize (mtq.auto_quantize) is a PTQ algorithm which quantizes a model by searching for the best quantization format per-layer while meeting performance constraints specified by the user. AutoQuantize streamlines the trade-off of model accuracy and performance.
Currently AutoQuantize supports only auto_quantize_bits as the performance constraint (for both weight-only
quantization and weight & activation quantization). auto_quantize_bits constraint specifies the effective number of bits for the quantized model.
You may specify an auto_quantize_bits constraint such as 4.8 for mixed precision quantization using NVFP4_DEFAULT_CFG & FP8_DEFAULT_CFG.
AutoQuantize will automatically quantize highly sensitive layers in FP8_DEFAULT_CFG while keeping less sensitive layers in NVFP4_DEFAULT_CFG (and even skip quantization for any extremely sensitive layers) so that
the final mixed precision quantized model has an effective bit width of 4.8. This model would give better accuracy than a model quantized with the vanilla NVFP4_DEFAULT_CFG configuration, since the more aggressive NVFP4_DEFAULT_CFG quantization is not applied to the highly sensitive layers.
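The arithmetic behind an effective-bits target such as 4.8 can be sketched as follows. This assumes that effective bits are a parameter-weighted average of per-layer bit widths, counts NVFP4 as 4 bits and FP8 as 8 bits, and ignores scaling-factor overhead and unquantized layers; Model Optimizer's actual accounting may differ. Both helper functions and the layer sizes are hypothetical.

```python
def effective_bits(layers):
    """Parameter-weighted average bit width over (num_params, bits) pairs."""
    total_params = sum(n for n, _ in layers)
    return sum(n * bits for n, bits in layers) / total_params


def high_precision_fraction(target_bits, low_bits, high_bits):
    """Fraction of parameters that may use high_bits while averaging target_bits."""
    if not low_bits <= target_bits <= high_bits:
        raise ValueError("target must lie between the two format bit widths")
    return (target_bits - low_bits) / (high_bits - low_bits)


# Hypothetical model: 80% of the parameters in a 4-bit format, 20% in an 8-bit format.
layers = [(800, 4), (200, 8)]
print("effective bits:", effective_bits(layers))
print("8-bit share for a 4.8-bit target:",
      high_precision_fraction(4.8, low_bits=4, high_bits=8))
```

Under these assumptions, a 4.8-bit budget leaves room for roughly one fifth of the parameters in the 8-bit format, and a target below the smallest format's bit width cannot be met at all.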
Here is an example of using the AutoQuantize algorithm (see the auto_quantize API for more details):
import modelopt.torch.quantization as mtq

# Define the model & calibration dataloader
model = ...
calib_dataloader = ...

# Define the forward_step function.
# forward_step should take the model and data as input and return the output
def forward_step(model, data):
    output = model(data)
    return output

# Define a loss function that takes the model output and data as input and returns the loss
def loss_func(output, data):
    loss = ...
    return loss

# Perform AutoQuantize
model, search_state_dict = mtq.auto_quantize(
    model,
    constraints={"auto_quantize_bits": 4.8},
    # supported quantization formats are listed in `modelopt.torch.quantization.config.choices`
    quantization_formats=["NVFP4_DEFAULT_CFG", "FP8_DEFAULT_CFG"],
    data_loader=calib_dataloader,
    forward_step=forward_step,
    loss_func=loss_func,
    # ... (remaining arguments elided)
)

AutoQuantize can be performed for Hugging Face LLMs such as Llama-3 as shown below:
export HF_PATH=<the downloaded LLaMA checkpoint from the Hugging Face hub, or simply the model card>
# --auto_quantize_bits specifies the constraint for `AutoQuantize`
# --quant specifies the formats to be searched for `AutoQuantize`
# NOTE: auto_quantize_bits cannot be lower than the number of bits for the smallest quantization format in --quant
scripts/huggingface_example.sh --type llama --model $HF_PATH --quant w4a8_awq,fp8 --auto_quantize_bits 4.8 --tp [1|2|4|8] --calib_batch_size 4

The above example performs AutoQuantize, searching over the formats given by --quant w4a8_awq,fp8: layers that are less sensitive to quantization are quantized with w4a8_awq, while more sensitive layers use fp8 or are kept unquantized, such that the effective number of bits is 4.8 (specified by --auto_quantize_bits 4.8).
The example scripts above also accept a --tasks flag that customizes which tasks the script runs. The allowed tasks, defined in the script parser, are quant, mmlu, lm_eval, and livecodebench. Specify multiple tasks as a comma-separated list. Some tasks, such as mmlu, can take a long time to run. To run lm_eval tasks, also pass the --lm_eval_tasks flag with a comma-separated list of lm_eval tasks (listed here).
If a GPU out-of-memory error is reported when running the scripts, try editing the scripts and reducing the max batch size to save GPU memory.
NOTE: AutoQuantize requires backpropagation of the model. Models without backpropagation support (e.g., Llama-4) will not work with AutoQuantize.
When working with large language models, memory constraints can be a significant challenge. ModelOpt provides a workflow for initializing HF models with compressed weights across multiple GPUs to dramatically reduce memory usage. Check --low_memory_mode option in hf_ptq.py for more details.
import modelopt.torch.quantization as mtq
from modelopt.torch.quantization.plugins import init_quantized_weights
from transformers import AutoModelForCausalLM, AutoConfig

# Step 1: Initialize the model with compressed weights
with init_quantized_weights(mtq.NVFP4_DEFAULT_CFG):
    model = AutoModelForCausalLM.from_pretrained(ckpt_path)

# Step 2: Calibrate the model
mtq.calibrate(model, algorithm="max", forward_loop=calibrate_loop)

Hugging Face Example Script
For LLM models like Llama-3:
# Install model specific pip dependencies if needed
export HF_PATH=<the downloaded LLaMA checkpoint from the Hugging Face hub, or simply the model card>
scripts/huggingface_example.sh --model $HF_PATH --quant [fp8|nvfp4|int8_sq|int4_awq|w4a8_awq] --tp [1|2|4|8]

By default, trust_remote_code is set to false. If model calibration and evaluation require it, turn it on with --trust_remote_code.
If Hugging Face model calibration fails on a multi-GPU system due to mismatched tensor placement, try setting CUDA_VISIBLE_DEVICES to expose fewer GPUs.
FP8 calibration over a large model with limited GPU memory is not recommended but is possible with the accelerate package. Tune the device_map setting in example_utils.py if needed for model loading; note that the calibration process can be slow.
Hugging Face models trained with modelopt.torch.speculative can be used as regular Hugging Face models in PTQ. Note: there is a known issue with Hugging Face models loaded across multiple GPUs for inference (i.e., "Expected all tensors to be on the same device, but found at least two devices..."). If you encounter this error during PTQ of speculative decoding models, try reducing the number of GPUs used.
Calibration uses left padding_side for the Hugging Face tokenizer by default, as it usually leads to lower accuracy loss. The exported tokenizer files restore the default padding_side.
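To illustrate what the padding_side setting changes, the sketch below pads a batch of token ID sequences on either side. `pad_batch` is a hypothetical helper and the token IDs are made up; with a Hugging Face tokenizer, the equivalent setting is `tokenizer.padding_side = "left"`.

```python
def pad_batch(sequences, pad_id, padding_side="left"):
    """Pad variable-length token ID lists to a common length."""
    max_len = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        pad = [pad_id] * (max_len - len(seq))
        # Left padding keeps the real tokens at the end of each row.
        padded.append(pad + list(seq) if padding_side == "left" else list(seq) + pad)
    return padded


batch = [[11, 12, 13], [21]]  # made-up token IDs (assumption)
print("left: ", pad_batch(batch, pad_id=0, padding_side="left"))
print("right:", pad_batch(batch, pad_id=0, padding_side="right"))
```

With left padding, the last position of every row holds a real token rather than a pad token, which is generally the layout decoder-only models see at generation time.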
If a GPU OOM error occurs during model quantization despite sufficient memory, setting the --use_seq_device_map flag can help. This enforces sequential device mapping, distributing the model across GPUs and utilizing up to 80% of each GPU's memory.
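The 80% cap can be pictured as a per-GPU memory budget. The sketch below is only an illustration: `cap_max_memory` is a hypothetical helper and the free-memory figures are made up. In Hugging Face Transformers, a dictionary of this shape can be passed as the `max_memory` argument of `from_pretrained` together with `device_map="sequential"`; whether the example script does exactly this is an assumption.

```python
def cap_max_memory(free_bytes_per_gpu, percent=80):
    """Return a per-GPU budget equal to `percent` of each GPU's free memory."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    # Integer arithmetic avoids floating-point rounding on byte counts.
    return {gpu: free * percent // 100 for gpu, free in free_bytes_per_gpu.items()}


GIB = 2**30
free_memory = {0: 80 * GIB, 1: 80 * GIB, 2: 40 * GIB}  # made-up values (assumption)
budget = cap_max_memory(free_memory)

for gpu, limit in budget.items():
    print(f"GPU {gpu}: up to {limit / GIB:.1f} GiB")
```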
You can add --low_memory_mode to the command to lower the memory requirements of the PTQ process. With this mode, the script will compress model weights to low precision before calibration. This mode is only supported for FP8 and NVFP4 with max calibration.
PTQ for DeepSeek shows how to quantize the DeepSeek model with FP4 and export to TensorRT-LLM.
NeMo 2.0 framework PTQ and TensorRT-LLM deployment examples are maintained in the NeMo GitHub repo. Please refer to the NeMo PTQ documentation for more details.
Megatron-LM framework PTQ and TensorRT-LLM deployment examples are maintained in the Megatron-LM GitHub repo. Please refer to the examples here.
A list of accuracy validation benchmarks is provided in the llm_eval directory. Currently, MMLU and MT-Bench are supported in this example by specifying the --tasks flag when running the scripts mentioned above. For MT-Bench, the task only runs the answer generation stage. Please follow fastchat to get the evaluation judge score.
The benchmark_suite.py script is used as a fast performance benchmark. For details, please refer to the TensorRT-LLM documentation.
This example also covers the lm_evaluation_harness, MMLU, and HumanEval accuracy benchmarks, whose details can be found here. The supported lm_eval evaluation tasks are listed here.
Model Optimizer provides two paths to export the quantized model:
- Unified Hugging Face checkpoints, which can be deployed on TensorRT-LLM (PyTorch and C++ backends), vLLM and SGLang.
- (Legacy) TensorRT-LLM checkpoints, a format that works with TensorRT-LLM C++ backend only.
The unified checkpoint<sup>1</sup> format design reflects two key characteristics: 1. The layer structures and tensor names remain aligned with the original Hugging Face checkpoint, and 2. The same checkpoint can be deployed across multiple inference frameworks without modification. A unified checkpoint can be exported using the following commands:
1. Unified checkpoint export currently does not support sparsity. Speculative decoding is only supported in unified checkpoint export. For legacy deployment, the exported unified checkpoint then needs a TensorRT-LLM checkpoint converter (e.g., this) to convert it and build the TensorRT engine(s) for deployment. Alternatively, call the TensorRT-LLM LLM API to deploy the unified checkpoints; see the examples here.
import torch
from modelopt.torch.export import export_hf_checkpoint

with torch.inference_mode():
    export_hf_checkpoint(
        model,  # The quantized model.
        export_dir,  # The directory where the exported files will be stored.
    )

python hf_ptq.py --pyt_ckpt_path <huggingface_model_card> --qformat fp8 --export_path <quantized_ckpt_path> --trust_remote_code

Hugging Face framework Script
Alternatively, the framework script huggingface_example.sh also supports quantization and export:

scripts/huggingface_example.sh --model <huggingface_model_card> --quant fp8

# Deploy the exported checkpoint with TensorRT-LLM
from tensorrt_llm import LLM

llm_fp8 = LLM(model="<the exported model path>")
print(llm_fp8.generate(["What's the age of the earth? "]))

# Deploy the exported checkpoint with vLLM
from vllm import LLM

llm_fp8 = LLM(model="<the exported model path>", quantization="modelopt")
print(llm_fp8.generate(["What's the age of the earth? "]))

# Deploy the exported checkpoint with SGLang
import sglang as sgl

llm_fp8 = sgl.Engine(model_path="<the exported model path>", quantization="modelopt")
print(llm_fp8.generate(["What's the age of the earth? "]))

| Model | Quant format | TRT-LLM | vLLM | SGLang |
|---|---|---|---|---|
| LLAMA 3.x | FP8 | ✅ | ✅ | ✅ |
| LLAMA 3.x | FP4 | ✅ | ✅ | ✅ |
| LLAMA 4 | FP8 | ✅ | - | ✅ |
| LLAMA 4 | FP4 | ✅ | - | - |
| DS-R1 | FP8 | ✅ | ✅ | ✅ |
| DS-R1 | FP4 | ✅ | ✅ | ✅ |
| DS-V3 | FP8 | ✅ | ✅ | ✅ |
| DS-V3 | FP4 | ✅ | ✅ | ✅ |
| QWen3 | FP8 | ✅ | ✅ | ✅ |
| QWen3 | FP4 | ✅ | ✅ | - |
| QWen3 MoE | FP8 | ✅ | ✅ | ✅ |
| QWen3 MoE | FP4 | ✅ | - | - |
| QWen2.5 | FP8 | ✅ | ✅ | ✅ |
| QWen2.5 | FP4 | ✅ | ✅ | - |
| QwQ-32B | FP8 | ✅ | ✅ | ✅ |
| QwQ-32B | FP4 | ✅ | ✅ | - |
| Mixtral 8x7B | FP8 | ✅ | ✅ | ✅ |
| Mixtral 8x7B | FP4 | ✅ | - | - |
You can specify the inference-time TP and PP sizes, and the export API will organize the weights to fit the target GPUs.

import torch
from modelopt.torch.export import export_tensorrt_llm_checkpoint

with torch.inference_mode():
    export_tensorrt_llm_checkpoint(
        model,  # The quantized model.
        decoder_type,  # The type of the model, e.g. gpt, gptj, or llama.
        dtype,  # The exported weights data type.
        export_dir,  # The directory where the exported files will be stored.
        inference_tensor_parallel,  # The number of GPUs used for tensor parallelism at inference time.
        inference_pipeline_parallel,  # The number of GPUs used for pipeline parallelism at inference time.
        use_nfs_workspace,  # If exporting in a multi-node setup, specify a shared directory like NFS for cross-node communication.
    )

After the TensorRT-LLM checkpoint export, you can use the trtllm-build command to build the engines from the exported checkpoints. Please check the TensorRT-LLM Build API documentation for reference.
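As a conceptual picture of what organizing weights for TP and PP means, the sketch below splits a toy weight matrix across tensor-parallel ranks and assigns layers to pipeline stages. It is an illustration only: all three helpers are hypothetical, and the actual layout is determined by the export API and TensorRT-LLM.

```python
def required_gpus(tp_size, pp_size):
    """Total GPUs needed when combining tensor and pipeline parallelism."""
    return tp_size * pp_size


def split_columns(weight, tp_size):
    """Split a weight matrix (list of rows) column-wise into tp_size shards."""
    num_cols = len(weight[0])
    if num_cols % tp_size != 0:
        raise ValueError("column count must be divisible by tp_size")
    width = num_cols // tp_size
    return [[row[rank * width:(rank + 1) * width] for row in weight]
            for rank in range(tp_size)]


def assign_layers(num_layers, pp_size):
    """Assign contiguous, equally sized layer ranges to pipeline stages."""
    if num_layers % pp_size != 0:
        raise ValueError("layer count must be divisible by pp_size")
    per_stage = num_layers // pp_size
    return [range(stage * per_stage, (stage + 1) * per_stage)
            for stage in range(pp_size)]


weight = [[1, 2, 3, 4], [5, 6, 7, 8]]  # toy 2x4 weight matrix
print("GPUs needed:", required_gpus(tp_size=2, pp_size=2))
print("TP shards:", split_columns(weight, tp_size=2))
print("PP stages:", [list(r) for r in assign_layers(num_layers=8, pp_size=2)])
```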
- Ready-to-deploy checkpoints [🤗 Hugging Face - Nvidia TensorRT Model Optimizer Collection]
- Deployable on TensorRT-LLM, vLLM and SGLang
- More models coming soon!
There are many quantization schemes supported in the example scripts:
- The FP8 format is available on Hopper and Ada GPUs with CUDA compute capability greater than or equal to 8.9.
- INT8 SmoothQuant, developed by MIT HAN Lab and NVIDIA, is designed to reduce both the GPU memory footprint and the latency of LLM inference.
- INT4 AWQ is an INT4 weight-only quantization and calibration method. It is particularly effective for low-batch inference, where latency is dominated by weight loading time rather than computation time. For low-batch inference, INT4 AWQ can give lower latency than FP8/INT8 and lower accuracy degradation than INT8.
- W4A8 AWQ is an extension of INT4 AWQ quantization that also uses FP8 for activations to gain additional speedup.
- NVFP4 is one of the new FP4 formats supported by NVIDIA Blackwell GPUs and demonstrates good accuracy compared with other 4-bit alternatives. NVFP4 can be applied to both model weights and activations, providing the potential for both a significant increase in math throughput and reductions in memory footprint and memory bandwidth usage compared to the FP8 data format on Blackwell.
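The compute-capability requirement for FP8 noted in the list above can be checked with a simple comparison. This is a minimal sketch: `supports_fp8` is a hypothetical helper, and the capabilities are hard-coded so that it runs without a GPU. With PyTorch installed, `torch.cuda.get_device_capability()` returns the same kind of `(major, minor)` tuple for the current device.

```python
def supports_fp8(compute_capability):
    """Return True if a (major, minor) compute capability is at least 8.9."""
    # Comparing tuples avoids float pitfalls when parsing versions like "8.10".
    return tuple(compute_capability) >= (8, 9)


examples = [
    ("A100 (Ampere)", (8, 0)),
    ("L40S (Ada)", (8, 9)),
    ("H100 (Hopper)", (9, 0)),
]
for name, capability in examples:
    status = "FP8 available" if supports_fp8(capability) else "FP8 not available"
    print(f"{name}: {status}")
```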