Skip to content

Latest commit

 

History

History
executable file
·
428 lines (294 loc) · 25.3 KB

File metadata and controls

executable file
·
428 lines (294 loc) · 25.3 KB

Post-training quantization (PTQ)

Quantization is an effective model optimization technique that compresses your models. Quantization with Model Optimizer can compress model size by 2x-4x, speeding up inference while preserving model quality.

Model Optimizer enables highly performant quantization formats including NVFP4, FP8, INT8, INT4 and supports advanced algorithms such as SmoothQuant, AWQ, SVDQuant, and Double Quantization with easy-to-use Python APIs.

This section focuses on Post-training quantization, a technique that reduces model precision after training to improve inference efficiency without requiring retraining.

Section Description Link Docs
Pre-Requisites Required & optional packages to use this technique [Link]
Getting Started Learn how to optimize your models using PTQ to reduce precision and improve inference efficiency [Link] [docs]
Support Matrix View the support matrix to see quantization compatibility and feature availability across different models [Link]
AutoQuantize Automatically chooses layers/precisions for mixed precision quantization to enhanced inference performance and accuracy tradeoffs [Link] [docs]
Real Quant Real Quant compresses model weights in a low-precision format to reduce memory requirements of quantization. [Link]
Framework Scripts Example scripts demonstrating quantization techniques for optimizing Hugging Face / NeMo / Megatron-LM models [Link]
Evaluate Accuracy Evaluate your model's accuracy! [Link]
Exporting Checkpoints Export to Hugging Face Unified Checkpoint and deploy on TRT-LLM/vLLM/SGLang [Link] [docs]
Pre-Quantized Checkpoints Ready to deploy Hugging Face pre-quantized checkpoints [Link]
Resources Extra links to relevant resources [Link]

Pre-Requisites

Docker

For Hugging Face models, please use the TensorRT-LLM docker image (e.g., nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc2.post2). For NeMo models, use the NeMo container (e.g., nvcr.io/nvidia/nemo:25.07). Visit our installation docs for more information.

Also follow the installation steps below to upgrade to the latest version of Model Optimizer and install example-specific dependencies.

Local Installation

For Hugging Face models, install Model Optimizer with hf dependencies using pip from PyPI and install the requirements for the example:

pip install -U nvidia-modelopt[hf]
pip install -r requirements.txt

For TensorRT-LLM deployment, please use the TensorRT-LLM docker image or follow their installation docs. Similarly, for vLLM or SGLang deployment, please use their installation docs.

Getting Started

1. Quantize (Post Training Quantization)

With the simple API below, you can very easily use Model Optimizer to quantize your model. Model Optimizer achieves this by converting the precision of your model to the desired precision, and then using a small dataset (typically 128-512 samples) to calibrate the quantization scaling factors. The accuracy of PTQ is typically robust across different choices of calibration data, by default Model Optimizer uses cnn_dailymail. Users can try other datasets by easily modifying the calib_set.

import modelopt.torch.quantization as mtq

# Setup the model
model = AutoModelForCausalLM.from_pretrained("...")

# Simplified example set up a calibration data loader with the desired calib_size
calib_set = get_dataloader(num_samples=calib_size)

# Prepare the calibration set and define a forward loop
def forward_loop(model):
    for batch in calib_set:
        model(batch)

# PTQ with in-place replacement to quantized modules
model = mtq.quantize(model, mtq.INT8_SMOOTHQUANT_CFG, forward_loop)

2. Export Quantized Model

Once your model is quantized, you can now export that model to a checkpoint for easy deployment.
We provide two APIs to export the quantized model:

  • Unified Hugging Face checkpoints, which can be deployed on TensorRT-LLM (Pytorch and C++ backends), vLLM and SGLang.
  • (Legacy) TensorRT-LLM checkpoints, a format that works with TensorRT-LLM C++ backend only.

Unified Hugging Face Checkpoints

from modelopt.torch.export import export_hf_checkpoint

with torch.inference_mode():
    export_hf_checkpoint(
        model,  # The quantized model.
        export_dir,  # The directory where the exported files will be stored.
    )

(Legacy) TensorRT-LLM Checkpoints

The user can specify the inference time TP and PP size and the export API will organize the weights to fit the target GPUs.

from modelopt.torch.export import export_tensorrt_llm_checkpoint

with torch.inference_mode():
    export_tensorrt_llm_checkpoint(
        model,  # The quantized model.
        decoder_type,  # The type of the model, e.g gpt, gptj, or llama.
        dtype,  # The exported weights data type.
        export_dir,  # The directory where the exported files will be stored.
        inference_tensor_parallel,  # The number of GPUs used in the inference time tensor parallel.
        inference_pipeline_parallel,  # The number of GPUs used in the inference time pipeline parallel.
        use_nfs_workspace,  # If exporting in a multi-node setup, please specify a shared directory like NFS for cross-node communication.
    )

After the TensorRT-LLM checkpoint export, you can use the trtllm-build build command to build the engines from the exported checkpoints. Please check the TensorRT-LLM Build API documentation for reference.

Please reference our framework scripts and our docs for more details.

Support Matrix

Hugging Face Supported Models

Model fp8 int8_sq int4_awq w4a8_awq1 nvfp45
LLAMA 3.x 3
LLAMA 4 6
Mixtral 2
Phi-3,4 3 -
Phi-3.5 MOE -
Llama-Nemotron Super
Llama-Nemotron Ultra
Gemma 3 2 - - -
QWen 2, 2.5 4
QWen3 MOE 6 - - -
QwQ - - -
T5 -
Whisper -

This is a subset of the models supported. For the full list please check the TensorRT-LLM support matrix

1.The w4a8_awq is an experimental quantization scheme that may result in a higher accuracy penalty.
2.For some models, there is only support for exporting quantized checkpoints.
3.W4A8_AWQ is only available on some models but not all
4.For some models, KV cache quantization may result in a higher accuracy penalty.
5.A selective set of the popular models are internally tested. The actual model support list may be longer. NVFP4 inference requires Blackwell GPUs and TensorRT-LLM v0.17 or later
6.Some models currently support export to HF format only.

The accuracy loss after PTQ may vary depending on the actual model and the quantization method. Different models may have different accuracy loss and usually the accuracy loss is more significant when the base model is small. If the accuracy after PTQ is not meeting the requirement, please try either modifying hf_ptq.py and disabling the KV cache quantization or using the QAT instead.

You can also create your own custom config using this guide.

NeMo Supported Models

Please refer to the NeMo 2.0 PTQ documentation for supported models.

AutoQuantize

AutoQuantize (mtq.auto_quantize) is a PTQ algorithm which quantizes a model by searching for the best quantization format per-layer while meeting performance constraints specified by the user. AutoQuantize streamlines the trade-off of model accuracy and performance.

Currently AutoQuantize supports only auto_quantize_bits as the performance constraint (for both weight-only quantization and weight & activation quantization). auto_quantize_bits constraint specifies the effective number of bits for the quantized model.

You may specify an auto_quantize_bits constraint such as 4.8 for mixed precision quantization using NVFP4_DEFAULT_CFG & FP8_DEFAULT_CFG. AutoQuantize will automatically quantize highly sensitive layers in FP8_DEFAULT_CFG while keeping less sensitive layers in NVFP4_DEFAULT_CFG (and even skip quantization for any extremely sensitive layers) so that the the final mixed precision quantized model has an effective quantized bits of 4.8. This model would give a better accuracy than the model quantized with vanilla NVFP4_DEFAULT_CFG configuration since the more aggressive NVFP4_DEFAULT_CFG quantization was not applied for the highly sensitive layers.

Here is an example usage for AutoQuantize algorithm (Please see auto_quantize API for more details):

    import modelopt.torch.quantization as mtq

    # Define the model & calibration dataloader
    model = ...
    calib_dataloader = ...

    # Define forward_step function.
    # forward_step should take the model and data as input and return the output
    def forward_step(model, data):
        output =  model(data)
        return output

    # Define loss function which takes the model output and data as input and returns the loss
    def loss_func(output, data):
        loss = ...
        return loss


    # Perform AutoQuantize
    model, search_state_dict = mtq.auto_quantize(
        model,
        constraints = {"auto_quantize_bits": 4.8},
        # supported quantization formats are listed in `modelopt.torch.quantization.config.choices`
        quantization_formats = ["NVFP4_DEFAULT_CFG", "FP8_DEFAULT_CFG"]
        data_loader = calib_dataloader,
        forward_step=forward_step,
        loss_func=loss_func,
        ...
        )

AutoQuantize for Hugging Face models

AutoQuantize can be performed for Huggingface LLM models like Llama-3 as shown below:

Script

export HF_PATH=<the downloaded LLaMA checkpoint from the Hugging Face hub, or simply the model card>
# --auto_quantize_bits specifies the constraint for `AutoQuantize`
# --quant specifies the formats to be searched for `AutoQuantize`
# NOTE: auto_quantize_bits cannot be lower than the number of bits for the smallest quantization format in --quant
scripts/huggingface_example.sh --type llama --model $HF_PATH --quant w4a8_awq,fp8 --auto_quantize_bits 4.8 --tp [1|2|4|8]  --calib_batch_size 4

The above example perform AutoQuantize where the less quantization accuracy sensitive layers are quantized with w4a8_awq (specified by --quant w4a8_awq) and the more sensitive layers are kept un-quantized such that the effective bits is 4.8 (specified by --auto_quantize_bits 4.8).

The example scripts above also have an additional flag --tasks, where the actual tasks run in the script can be customized. The allowed tasks are quant,mmlu,lm_eval,livecodebench specified in the script parser. The tasks combo can be specified with a comma-separated task list. Some tasks like mmlu can take a long time to run. To run lm_eval tasks, please also specify the --lm_eval_tasks flag with comma separated lm_eval tasks here.

If GPU out-of-memory error is reported running the scripts, please try editing the scripts and reducing the max batch size to save GPU memory.

NOTE: AutoQuantize requires backpropagation of the model. Models without backpropagation support (e.g., Llama-4) will not work with AutoQuantize.

Real Quant

When working with large language models, memory constraints can be a significant challenge. ModelOpt provides a workflow for initializing HF models with compressed weights across multiple GPUs to dramatically reduce memory usage. Check --low_memory_mode option in hf_ptq.py for more details.

import modelopt.torch.quantization as mtq
from modelopt.torch.quantization.plugins import init_quantized_weights
from transformers import AutoModelForCausalLM, AutoConfig

# Step 1: Initialize the model with compressed weights
with init_quantized_weights(mtq.NVFP4_DEFAULT_CFG):
    model = AutoModelForCausalLM.from_pretrained(ckpt_path)

# Step 2: Calibrate the model
mtq.calibrate(model, algorithm="max", forward_loop=calibrate_loop)

Framework Scripts

Hugging Face Example Script

For LLM models like Llama-3:

# Install model specific pip dependencies if needed

export HF_PATH=<the downloaded LLaMA checkpoint from the Hugging Face hub, or simply the model card>
scripts/huggingface_example.sh --model $HF_PATH --quant [fp8|nvfp4|int8_sq|int4_awq|w4a8_awq] --tp [1|2|4|8]

By default trust_remote_code is set to false. Please turn it on if model calibration and eval requires it using --trust_remote_code.

If the Huggingface model calibration fails on a multi-GPU system due to mismatched tensor placement, please try setting CUDA_VISIBLE_DEVICES to a smaller number.

FP8 calibration over a large model with limited GPU memory is not recommended but possible with the accelerate package. Please tune the device_map setting in example_utils.py if needed for model loading and the calibration process can be slow.

Huggingface models trained with modelopt.torch.speculative can be used as regular Huggingface models in PTQ. Note: there is a known issue with Huggingface models loaded across multiple GPUs for inference (i.e., "Expected all tensors to be on the same device, but found at least two devices..."). When encountered this error in PTQ of speculative decoding models, try reducing the number of GPUs used.

Calibration by default uses left padding_side for the Huggingface tokenizer as it usually leads to lower accuracy loss. The exported tokenizer files restores the default padding_side.

If a GPU OOM error occurs during model quantization despite sufficient memory, setting the --use_seq_device_map flag can help. This enforces sequential device mapping, distributing the model across GPUs and utilizing up to 80% of each GPU's memory.

You can add --low_memory_mode to the command to lower the memory requirements of the PTQ process. With this mode, the script will compress model weights to low precision before calibration. This mode is only supported for FP8 and NVFP4 with max calibration.

Deepseek R1

PTQ for DeepSeek shows how to quantize the DeepSeek model with FP4 and export to TensorRT-LLM.

NeMo Example Script

NeMo 2.0 framework PTQ and TensorRT-LLM deployment examples are maintained in the NeMo GitHub repo. Please refer to the NeMo PTQ documentation for more details.

Megatron-LM Example Script

Megatron-LM framework PTQ and TensorRT-LLM deployment examples are maintained in the Megatron-LM GitHub repo. Please refer to the examples here.

Evaluate Accuracy

TensorRT-LLM Validation

A list of accuracy validation benchmarks are provided in the llm_eval directory. Right now MMLU, and MTbench are supported in this example by specifying the --tasks flag running the scripts mentioned above. For MTBench, the task only runs the answer generation stage. Please follow fastchat to get the evaluation judge score.

The benchmark_suite.py script is used as a fast performance benchmark. For details, please refer to the TensorRT-LLM documentation

This example also covers the lm_evaluation_harness, MMLU and the human eval accuracy benchmarks, whose details can be found here. The supported lm_eval evaluation tasks are listed here

Exporting Checkpoints

Model Optimizer supports provide two paths to export the quantized model:

  • Unified Hugging Face checkpoints, which can be deployed on TensorRT-LLM (Pytorch and C++ backends), vLLM and SGLang.
  • (Legacy) TensorRT-LLM checkpoints, a format that works with TensorRT-LLM C++ backend only.

The unified checkpoint1 format design reflects two key characteristics: 1. The layer structures and tensor names remain aligned with the original Hugging Face checkpoint, and 2. The same checkpoint can be deployed across multiple inference frameworks without modification. A unified checkpoint can be exported using the following commands:

1.Unified checkpoint export currently does not support sparsity. Speculative decoding is only supported in unified checkpoint export. For legacy deployment, exported unified checkpoint then needs a TensorRT-LLM checkpoint converter (e.g., this) to convert and build the TensorRT engine(s) for deployment. Alternatively, call TensorRT-LLM LLM-API to deploy the unified checkpoints e.g., check examples here.

API

from modelopt.torch.export import export_hf_checkpoint

with torch.inference_mode():
    export_hf_checkpoint(
        model,  # The quantized model.
        export_dir,  # The directory where the exported files will be stored.
    )

Quantize and Export

python hf_ptq.py --pyt_ckpt_path <huggingface_model_card> --qformat fp8 --export_path <quantized_ckpt_path> --trust_remote_code

Hugging Face framework Script

Alternatively, the framework script huggingface_example.sh also supports quantize and export:

scripts/huggingface_example.sh --model <huggingface_model_card> --quant fp8

Deployment


TRT-LLM

from tensorrt_llm import LLM

llm_fp8 = LLM(model="<the exported model path>")
print(llm_fp8.generate(["What's the age of the earth? "]))

vLLM

from vllm import LLM

llm_fp8 = LLM(model="<the exported model path>", quantization="modelopt")
print(llm_fp8.generate(["What's the age of the earth? "]))

SGLang

import sglang as sgl

llm_fp8 = sgl.Engine(model_path="<the exported model path>", quantization="modelopt")
print(llm_fp8.generate(["What's the age of the earth? "]))

Unified HF Checkpoint Deployment Model Support Matrix

Model Quant format TRT-LLM vLLM SGLang
LLAMA 3.x FP8
LLAMA 3.x FP4
LLAMA 4 FP8 -
LLAMA 4 FP4 - -
DS-R1 FP8
DS-R1 FP4
DS-V3 FP8
DS-V3 FP4
QWen3 FP8
QWen3 FP4 -
QWen3 MoE FP8
QWen3 MoE FP4 - -
QWen2.5 FP8
QWen2.5 FP4 -
QwQ-32B FP8
QwQ-32B FP4 -
Mixtral 8x7B FP8
Mixtral 8x7B FP4 - -

(Legacy) TensorRT-LLM Checkpoints

The user can specify the inference time TP and PP size and the export API will organize the weights to fit the target GPUs.

from modelopt.torch.export import export_tensorrt_llm_checkpoint

with torch.inference_mode():
    export_tensorrt_llm_checkpoint(
        model,  # The quantized model.
        decoder_type,  # The type of the model, e.g gpt, gptj, or llama.
        dtype,  # The exported weights data type.
        export_dir,  # The directory where the exported files will be stored.
        inference_tensor_parallel,  # The number of GPUs used in the inference time tensor parallel.
        inference_pipeline_parallel,  # The number of GPUs used in the inference time pipeline parallel.
        use_nfs_workspace,  # If exporting in a multi-node setup, please specify a shared directory like NFS for cross-node communication.
    )

Build the TensorRT-LLM engines

After the TensorRT-LLM checkpoint export, you can use the trtllm-build build command to build the engines from the exported checkpoints. Please check the TensorRT-LLM Build API documentation for reference.

Pre-Quantized Checkpoints

Resources

Technical Resources

There are many quantization schemes supported in the example scripts:

  1. The FP8 format is available on the Hopper and Ada GPUs with CUDA compute capability greater than or equal to 8.9.

  2. The INT8 SmoothQuant, developed by MIT HAN Lab and NVIDIA, is designed to reduce both the GPU memory footprint and inference latency of LLM inference.

  3. The INT4 AWQ is an INT4 weight only quantization and calibration method. INT4 AWQ is particularly effective for low batch inference where inference latency is dominated by weight loading time rather than the computation time itself. For low batch inference, INT4 AWQ could give lower latency than FP8/INT8 and lower accuracy degradation than INT8.

  4. The W4A8 AWQ is an extension of the INT4 AWQ quantization that it also uses FP8 for activation for more speed up and acceleration.

  5. The NVFP4 is one of the new FP4 formats supported by NVIDIA Blackwell GPU and demonstrates good accuracy compared with other 4-bit alternatives. NVFP4 can be applied to both model weights as well as activations, providing the potential for both a significant increase in math throughput and reductions in memory footprint and memory bandwidth usage compared to the FP8 data format on Blackwell.