This folder includes popular third-party benchmarks for LLM accuracy evaluation. The following instructions show how to evaluate a Model Optimizer quantized LLM with these benchmarks, including through a TensorRT-LLM deployment.
LM-Eval-Harness provides a unified framework to test generative language models on a large number of different evaluation tasks.
The supported eval tasks are here.
- For models which fit on a single GPU:

  ```sh
  python lm_eval_hf.py --model hf --model_args pretrained=<HF model folder or model card> --tasks <comma separated tasks> --batch_size 4
  ```

- With model-sharding (for models which require multiple GPUs):

  ```sh
  python lm_eval_hf.py --model hf --model_args pretrained=<HF model folder or model card>,parallelize=True --tasks <comma separated tasks> --batch_size 4
  ```

- For data-parallel evaluation with model-sharding:
  With the following command, each copy of the model is sharded across `total_num_of_available_gpus / num_copies_of_your_model` GPUs, with a data-parallelism degree of `num_copies_of_your_model` (e.g., with 8 GPUs and 2 model copies, each copy spans 4 GPUs):
  ```sh
  accelerate launch --multi_gpu --num_processes <num_copies_of_your_model> \
      lm_eval_hf.py --model hf \
      --tasks <comma separated tasks> \
      --model_args pretrained=<HF model folder or model card>,parallelize=True \
      --batch_size 4
  ```

- For simulated quantization with any of the default quantization formats:
  Multi-GPU evaluation without data-parallelism:

  ```sh
  # MODELOPT_QUANT_CFG: Choose from [INT8_SMOOTHQUANT_CFG|FP8_DEFAULT_CFG|NVFP4_DEFAULT_CFG|INT4_AWQ_CFG|W4A8_AWQ_BETA_CFG|MXFP8_DEFAULT_CFG]
  python lm_eval_hf.py --model hf \
      --tasks <comma separated tasks> \
      --model_args pretrained=<HF model folder or model card>,parallelize=True \
      --quant_cfg <MODELOPT_QUANT_CFG> \
      --batch_size 4
  ```
  NOTE: `MXFP8_DEFAULT_CFG` is one of the OCP Microscaling Formats (MX Formats) family, which defines a set of block-wise dynamic quantization formats. The specifications can be found in the official documentation. Currently we support all MX formats for simulated quantization, including `MXFP8 (E5M2, E4M3)`, `MXFP6 (E3M2, E2M3)`, `MXFP4`, and `MXINT8`. However, only `MXFP8 (E4M3)` is in our example configurations; users can create their own configurations for other MX formats by simply modifying the `num_bits` field in `MXFP8_DEFAULT_CFG`, as sketched below.
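  For instance, here is a minimal sketch of deriving an MXFP4 config this way, assuming ModelOpt's dict-style quantizer configs where a floating-point `num_bits` is an `(exponent, mantissa)` pair (the import path and the FP4 E2M1 encoding are our assumptions, not taken from this README):

  ```python
  import copy

  # Assumed import path for ModelOpt's predefined quantization configs.
  from modelopt.torch.quantization.config import MXFP8_DEFAULT_CFG

  # Start from the MXFP8 example config and change only the num_bits field.
  MXFP4_CFG = copy.deepcopy(MXFP8_DEFAULT_CFG)
  for pattern, quantizer_cfg in MXFP4_CFG["quant_cfg"].items():
      if isinstance(quantizer_cfg, dict) and "num_bits" in quantizer_cfg:
          # FP4 E2M1 (assumed encoding); block-scaling settings are inherited from MXFP8.
          quantizer_cfg["num_bits"] = (2, 1)
  ```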
  NOTE: ModelOpt's Triton kernels provide faster NVFP4 simulated quantization. For details, please see the installation guide.
  For data-parallel evaluation, launch with `accelerate launch --multi_gpu --num_processes <num_copies_of_your_model>` (as shown earlier).
- For simulated optimal per-layer quantization with `auto_quantize`:

  Multi-GPU evaluation without data-parallelism:
  ```sh
  # MODELOPT_QUANT_CFG_TO_SEARCH: Comma-separated formats to search, chosen from [W4A8_AWQ_BETA_CFG|FP8_DEFAULT_CFG|NVFP4_DEFAULT_CFG|NONE]
  # EFFECTIVE_BITS: Effective bits constraint for auto_quantize
  # Example settings for an optimally quantized model with W4A8 & FP8 and effective bits set to 4.8:
  # MODELOPT_QUANT_CFG_TO_SEARCH=W4A8_AWQ_BETA_CFG,FP8_DEFAULT_CFG
  # EFFECTIVE_BITS=4.8
  python lm_eval_hf.py --model hf \
      --tasks <comma separated tasks> \
      --model_args pretrained=<HF model folder or model card>,parallelize=True \
      --quant_cfg <MODELOPT_QUANT_CFG_TO_SEARCH> \
      --auto_quantize_bits <EFFECTIVE_BITS> \
      --batch_size 4
  ```

  For data-parallel evaluation, launch with `accelerate launch --multi_gpu --num_processes <num_copies_of_your_model>` (as shown earlier).
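  As a hedged illustration of the effective-bits budget used above (assuming it is a weights-only average; the weight shares below are made-up numbers), mixing 4-bit and 8-bit weight formats can hit a fractional target such as 4.8:

  ```python
  # Hypothetical split of the model's weights between the two searched formats.
  share_w4a8 = 0.8  # fraction of weights assigned W4A8 (4-bit weights)
  share_fp8 = 0.2   # fraction of weights assigned FP8 (8-bit weights)

  effective_bits = share_w4a8 * 4 + share_fp8 * 8
  print(round(effective_bits, 2))  # 4.8 -- matches EFFECTIVE_BITS=4.8 above
  ```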
- If evaluating T5 models:

  - use `--model hf-seq2seq` instead.
  - use, for example:

    ```sh
    # MODELOPT_QUANT_CFG: Choose from [INT8_SMOOTHQUANT_CFG|FP8_DEFAULT_CFG|NVFP4_DEFAULT_CFG|INT4_AWQ_CFG|W4A8_AWQ_BETA_CFG|MXFP8_DEFAULT_CFG]
    python lm_eval_hf.py --model hf-seq2seq --model_args pretrained=t5-small --quant_cfg=<MODELOPT_QUANT_CFG> --tasks <comma separated tasks> --batch_size 4
    ```

If `trust_remote_code` needs to be true, please append the `--trust_remote_code` flag to the command.
To evaluate a quantized checkpoint with the TensorRT-LLM backend:

```sh
python lm_eval_tensorrt_llm.py --model trt-llm --model_args tokenizer=<HF model folder>,checkpoint_dir=<Quantized checkpoint dir> --tasks <comma separated tasks> --batch_size <max batch size>
```

Massive Multitask Language Understanding (MMLU). A score (0-1, higher is better) will be printed at the end of the benchmark.
Download data:

```sh
mkdir -p data
wget https://people.eecs.berkeley.edu/~hendrycks/data.tar -O data/mmlu.tar
tar -xf data/mmlu.tar -C data && mv data/data data/mmlu
```

Evaluate the Hugging Face model:

```sh
python mmlu.py --model_name causal --model_path <HF model folder or model card>
```

Evaluate with simulated quantization:

```sh
# MODELOPT_QUANT_CFG: Choose from [INT8_SMOOTHQUANT_CFG|FP8_DEFAULT_CFG|NVFP4_DEFAULT_CFG|INT4_AWQ_CFG|W4A8_AWQ_BETA_CFG|MXFP8_DEFAULT_CFG]
python mmlu.py --model_name causal --model_path <HF model folder or model card> --quant_cfg MODELOPT_QUANT_CFG
```

Evaluate with `auto_quantize`:

```sh
# MODELOPT_QUANT_CFG_TO_SEARCH: Comma-separated formats to search, chosen from [W4A8_AWQ_BETA_CFG|FP8_DEFAULT_CFG|NVFP4_DEFAULT_CFG|NONE]
# EFFECTIVE_BITS: Effective bits constraint for auto_quantize
# Example settings for an optimally quantized model with W4A8 & FP8 and effective bits set to 4.8:
# MODELOPT_QUANT_CFG_TO_SEARCH=W4A8_AWQ_BETA_CFG,FP8_DEFAULT_CFG
# EFFECTIVE_BITS=4.8
python mmlu.py --model_name causal --model_path <HF model folder or model card> --quant_cfg $MODELOPT_QUANT_CFG_TO_SEARCH --auto_quantize_bits $EFFECTIVE_BITS --batch_size 4
```

Evaluate a quantized TensorRT-LLM checkpoint:

```sh
python mmlu.py --model_name causal --model_path <HF model folder or model card> --checkpoint_dir <Quantized checkpoint dir>
```

MT-Bench. The responses are generated using FastChat.
Generate responses with the Hugging Face model:

```sh
bash run_fastchat.sh -h <HF model folder or model card>
```

With simulated quantization:

```sh
# MODELOPT_QUANT_CFG: Choose from [INT8_SMOOTHQUANT_CFG|FP8_DEFAULT_CFG|INT4_AWQ_CFG|W4A8_AWQ_BETA_CFG]
bash run_fastchat.sh -h <HF model folder or model card> --quant_cfg MODELOPT_QUANT_CFG
```

With a quantized TensorRT-LLM checkpoint:

```sh
bash run_fastchat.sh -h <HF model folder or model card> <Quantized checkpoint dir>
```

The responses to the MT-Bench questions will be stored under `data/mt_bench/model_answer`.
The quality of the responses can be judged using `llm_judge` from the FastChat repository. Please refer to llm_judge to compute the final MT-Bench score.
LiveCodeBench is a holistic and contamination-free evaluation benchmark of LLMs for code that continuously collects new problems over time.
We support running LiveCodeBench against a locally running, OpenAI-API-compatible server. For example, a quantized TensorRT-LLM checkpoint or engine can be loaded using the `trtllm-serve` command. Once the local server is up, you can optionally smoke-test the endpoint (see the sketch below) and then run LiveCodeBench with the following command:
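A minimal sketch of such a smoke test, assuming the server listens on `localhost:8000` and the `openai` Python package is installed (the model name and prompt are placeholders):

```python
# Query the locally served OpenAI-compatible endpoint once before benchmarking.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")  # local servers typically ignore the key
response = client.chat.completions.create(
    model="<custom defined model name>",  # must match the name the server registered
    messages=[{"role": "user", "content": "Write a Python function that reverses a string."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```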
```sh
bash run_livecodebench.sh <custom defined model name> <prompt batch size in parallel> <max output tokens> <local model server port>
```

Simple Evals is a lightweight library for evaluating language models, published by OpenAI. This eval includes the "simpleqa", "mmlu", "math", "gpqa", "mgsm", "drop", and "humaneval" benchmarks.
Similarly, we support running Simple Evals against a locally running, OpenAI-API-compatible server. Once the local server is up, the following command can be used to run Simple Evals:
```sh
bash run_simple_eval.sh <custom defined model name> <comma separated eval names> <max output tokens> <local model server port>
```

An example of a customized quantization config is shown in quantization_utils.py. It allows users to test the accuracy of a custom method without modifying the whole deployment framework (e.g., TensorRT-LLM, vLLM, SGLang). Users can disable quantization of specific layers to debug the cause of an accuracy drop, or explore a promising new quantization method; a hypothetical config of this shape is sketched below.
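A minimal sketch of such a custom config (the shape is modeled on ModelOpt's dict-style FP8 configs; the wildcard patterns and the choice of disabled layers are hypothetical, not taken from quantization_utils.py):

```python
# Hypothetical custom config: FP8 (E4M3) everywhere, with the lm_head and the
# first decoder layer left unquantized to isolate an accuracy drop.
MY_QUANT_CONFIG = {
    "quant_cfg": {
        "*weight_quantizer": {"num_bits": (4, 3), "axis": None},
        "*input_quantizer": {"num_bits": (4, 3), "axis": None},
        "*lm_head*": {"enable": False},    # skip the output projection
        "*layers.0.*": {"enable": False},  # skip a specific layer while debugging
        "default": {"enable": False},
    },
    "algorithm": "max",
}
```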
Then evaluate with:

```sh
python lm_eval_hf.py --model hf \
    --tasks <comma separated tasks> \
    --model_args pretrained=<HF model folder or model card>,parallelize=True \
    --quant_cfg MY_QUANT_CONFIG \
    --batch_size 4
```

The `run_lm_eval_vllm.sh` script provides a convenient way to run evaluations using the lm-evaluation-harness library against a model served with vLLM's OpenAI-compatible API endpoint.
This is useful for evaluating quantized models deployed with vLLM, or any model served via its OpenAI API interface. More importantly, new models that are not yet natively supported by vLLM or Transformers, but are Transformers-compatible, can still be evaluated through vLLM's endpoint. To be Transformers-compatible, a model needs to satisfy the following (see the sketch after this list):

- The model directory must have the correct structure (e.g., `config.json` is present).
- `config.json` must contain `auto_map.AutoModel`.
- Customization should be done in the base model (e.g., in `MyModel`, not `MyModelForCausalLM`).
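A minimal sketch of checking these requirements on a local model directory (the path and module/class names are hypothetical):

```python
# Verify a local model directory is Transformers-compatible as described above.
import json
from pathlib import Path

model_dir = Path("<your_model_dir>")  # hypothetical path
config = json.loads((model_dir / "config.json").read_text())

# config.json must contain an auto_map entry pointing at the base model, e.g.:
#   "auto_map": {"AutoModel": "modeling_my_model.MyModel", ...}
assert "auto_map" in config, "config.json is missing auto_map"
assert "AutoModel" in config["auto_map"], "auto_map must define AutoModel"
```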
- Install vLLM: Follow the installation instructions at https://docs.vllm.ai/en/latest/getting_started/installation.html.
- Start the vLLM OpenAI-compatible server: In a separate terminal, launch the vLLM server with your desired model. For example:

  ```sh
  # Example using vLLM's built-in server
  vllm serve <your_model_name_or_path> \
      --port 8000 \
      --tensor-parallel-size <tp_size>  # Adjust as needed
  ```

  Replace `<your_model_name_or_path>` with the actual model identifier (e.g., `Qwen/Qwen3-30B-A3B`) and adjust `--port` and `--tensor-parallel-size` if necessary. You may also need to disable vLLM v1 with `export VLLM_USE_V1=0` if you encounter issues.

  To serve a ModelOpt quantized model, add `--quantization modelopt`, for example:

  ```sh
  # Example using vLLM's built-in server
  vllm serve nvidia/Llama-3.1-8B-Instruct-FP8 \
      --quantization modelopt \
      --port 8000 \
      --tensor-parallel-size <tp_size>  # Adjust as needed
  ```
  To generate a quantized model such as `nvidia/Llama-3.1-8B-Instruct-FP8`, please refer to the instructions here. Note that ModelOpt quantized model support in vLLM is currently limited; we are working on expanding the supported models and quantization formats. An optional offline sanity check is sketched below.
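  A minimal sketch of that sanity check, assuming vLLM's offline `LLM` Python API (the prompt is a placeholder; this mirrors the serve command above):

  ```python
  # Load the ModelOpt-quantized checkpoint offline and generate once.
  from vllm import LLM, SamplingParams

  llm = LLM(model="nvidia/Llama-3.1-8B-Instruct-FP8", quantization="modelopt")
  outputs = llm.generate(["What is 2 + 2?"], SamplingParams(max_tokens=16))
  print(outputs[0].outputs[0].text)
  ```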
- Make the script executable (if not already):

  ```sh
  chmod +x run_lm_eval_vllm.sh
  ```
- Run the evaluation script from the `examples/llm_eval` directory:

  ```sh
  ./run_lm_eval_vllm.sh <model_name> [port] [task]
  ```

  - `<model_name>`: The name of the model being served (this is passed to `lm_eval`, e.g., `Qwen/Qwen3-30B-A3B`).
  - `[port]`: (Optional) The port the vLLM server is listening on. Defaults to `8000`. Note that it must match the port used when launching the server.
  - `[task]`: (Optional) The `lm_eval` task(s) to run. Defaults to `mmlu`. Can be a single task or a comma-separated list (e.g., `"mmlu,hellaswag"`).
- Evaluate Qwen3-30B-A3B on MMLU (default task and port):

  ```sh
  # Assumes vLLM server running with Qwen/Qwen3-30B-A3B on port 8000
  ./run_lm_eval_vllm.sh Qwen/Qwen3-30B-A3B
  ```

- Evaluate a model on Hellaswag using port 8001:

  ```sh
  # Assumes vLLM server running with <model_name> on port 8001
  ./run_lm_eval_vllm.sh <model_name> 8001 hellaswag
  ```

- Evaluate on multiple tasks:

  ```sh
  # Assumes vLLM server running with <model_name> on port 8000
  ./run_lm_eval_vllm.sh <model_name> 8000 "arc_easy,winogrande"
  ```