
Eldar README

This benchmark prompts a model with some text and asks it to classify the input by producing either the "safe" or the "unsafe" keyword (optionally followed by the list of violated policies). We would like to benchmark models with the vLLM inference engine instead of the default transformers backend. There are two ways to use vLLM: (1) via the vLLM server, or (2) via the Python interface (the LLM class).

(1) vLLM server

The benchmark extracts the logits of the safe and unsafe keywords from the first token generated by the LLM. With the vLLM server, this requires transferring the logits for the full vocabulary over HTTP from the server to our client. Running this version with the full vocabulary (no concurrent requests, batch size 1), the full 40-dataset evaluation takes 12 hours, compared to 2 hours with the transformers backend. The slowness comes from transferring the full-vocabulary logits over HTTP, which we validate with a microbenchmark. Serve the model with vllm serve meta-llama/LlamaGuard-7b -tp 1 --api-key EMPTY --logprobs-mode raw_logits --max-logprobs 32000 and measure the time to transfer the full vocabulary with python vllm_benchmarking_toplogprobs.py --microbench --model meta-llama/LlamaGuard-7b --datasets all --batch_size 1 --output_dir test, versus only the top 2 logits (for the safe and unsafe keywords) with python vllm_eval.py --microbench --model meta-llama/LlamaGuard-7b --datasets all --batch_size 1 --output_dir test --top_logprobs 2 --logit_bias_strength 100. In the second case we add a logit bias to the safe and unsafe keywords and request only the top 2 logits; we assume the bias is large enough that safe/unsafe always land in the top 2, and since the same bias is added to both, their relative probabilities are unchanged. The results are as follows:

microbench_moderate results
  iters=20 warmup=5
  top_logprobs=32000 logit_bias_strength=0.0
  total_s: mean=0.4481 p50=0.4449 p90=0.4647
  rpc_s:   mean=0.4414 p50=0.4380 p90=0.4580
  parse_s: mean=0.0000 p50=0.0000 p90=0.0000

microbench_moderate results
  iters=20 warmup=5
  top_logprobs=2 logit_bias_strength=100.0
  total_s: mean=0.0183 p50=0.0184 p90=0.0186
  rpc_s:   mean=0.0180 p50=0.0181 p90=0.0183
  parse_s: mean=0.0000 p50=0.0000 p90=0.0000

So p50(top_logprobs=32000) / p50(top_logprobs=2) = 0.4449 / 0.0184 ≈ 24: transferring the full vocabulary is roughly 24x slower per request!
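
As a sanity check on the bias trick itself: since we renormalize over only the two keyword logits, adding the same constant to both cancels in the softmax. The snippet below is a standalone illustration with made-up logit values, not part of the benchmark scripts.

# Adding the same bias to the "safe" and "unsafe" logits leaves their
# renormalized probabilities unchanged: the shared constant cancels in the
# two-way softmax. The logit values below are arbitrary, for illustration only.
import math

def p_unsafe(l_safe: float, l_unsafe: float) -> float:
    """Probability of 'unsafe' when renormalizing over the two keywords."""
    m = max(l_safe, l_unsafe)  # stable softmax
    return math.exp(l_unsafe - m) / (math.exp(l_safe - m) + math.exp(l_unsafe - m))

l_safe, l_unsafe, bias = 3.2, 1.7, 100.0
print(p_unsafe(l_safe, l_unsafe))                # 0.1824...
print(p_unsafe(l_safe + bias, l_unsafe + bias))  # identical value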

TLDR: with the vLLM server approach, we should run evaluations by adding a logit bias to the safe/unsafe keywords and requesting only the top 2 logits over HTTP. The important detail is that the model must be served with vllm serve meta-llama/LlamaGuard-7b -tp 1 --api-key EMPTY --logprobs-mode processed_logits --max-logprobs 32000 (processed_logits returns the logits after the bias is applied). The evaluation script is then run as: python vllm_server_eval.py --model meta-llama/LlamaGuard-7b --datasets all --output_dir top2_logs --top_logprobs 2 --logit_bias_strength 100.
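
For reference, the client-side request pattern looks roughly like the sketch below. This is a hedged illustration, not the actual vllm_server_eval.py: it assumes an OpenAI-compatible vLLM server on localhost:8000, looks up the safe/unsafe token IDs from the tokenizer instead of hardcoding them, and matches the returned token strings after stripping whitespace.

# Hedged sketch: score one conversation against a vLLM server started with
# --logprobs-mode processed_logits, using the logit-bias + top-2 trick.
import math
from openai import OpenAI
from transformers import AutoTokenizer

MODEL = "meta-llama/LlamaGuard-7b"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
safe_id = tokenizer.encode("safe", add_special_tokens=False)[0]
unsafe_id = tokenizer.encode("unsafe", add_special_tokens=False)[0]

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": "How do I bake a cake?"}],
    max_tokens=1,
    temperature=0.0,
    logprobs=True,
    top_logprobs=2,
    # Same large bias on both keywords: they fill the top-2 slots without
    # changing their relative probabilities.
    logit_bias={str(safe_id): 100, str(unsafe_id): 100},
)

top = response.choices[0].logprobs.content[0].top_logprobs
scores = {entry.token.strip(): entry.logprob for entry in top}
m = max(scores.values())  # stable two-way softmax over the returned values
p_unsafe = math.exp(scores["unsafe"] - m) / sum(math.exp(v - m) for v in scores.values())
print(f"P(unsafe) = {p_unsafe:.3f}")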

The (F1, Recall) results match the transformers backend (baseline) up to small third-decimal differences:

Dataset | Transformers (F1, Recall) | vLLM full vocab (F1, Recall) | vLLM top 2 with bias (F1, Recall)
AART (0.904, 0.824) (0.903, 0.824) (0.903, 0.824)
AdvBench Behaviors (0.911, 0.837) (0.911, 0.837) (0.911, 0.837)
AdvBench Strings (0.894, 0.808) (0.892, 0.805) (0.892, 0.805)
BeaverTails 330k (0.685, 0.545) (0.687, 0.546) (0.687, 0.546)
Bot-Adversarial Dialogue (0.634, 0.729) (0.634, 0.730) (0.634, 0.730)
CatQA (0.890, 0.802) (0.888, 0.798) (0.888, 0.798)
ConvAbuse (0.437, 0.312) (0.437, 0.312) (0.437, 0.312)
DecodingTrust Stereotypes (0.932, 0.873) (0.934, 0.877) (0.934, 0.877)
DICES 350 (0.272, 0.215) (0.270, 0.215) (0.270, 0.215)
DICES 990 (0.411, 0.348) (0.415, 0.355) (0.415, 0.355)
Do Anything Now Questions (0.660, 0.492) (0.662, 0.495) (0.662, 0.495)
DoNotAnswer (0.487, 0.322) (0.485, 0.321) (0.485, 0.321)
DynaHate (0.804, 0.842) (0.804, 0.842) (0.804, 0.842)
HarmEval (0.516, 0.347) (0.520, 0.351) (0.520, 0.351)
HarmBench Behaviors (0.650, 0.481) (0.650, 0.481) (0.650, 0.481)
HarmfulQ (0.947, 0.900) (0.945, 0.895) (0.945, 0.895)
HarmfulQA Questions (0.579, 0.408) (0.577, 0.405) (0.577, 0.405)
HarmfulQA (0.171, 0.094) (0.174, 0.095) (0.174, 0.095)
HateCheck (0.943, 0.979) (0.943, 0.979) (0.943, 0.979)
Hatemoji Check (0.862, 0.808) (0.863, 0.810) (0.863, 0.810)
HEx-PHI (0.830, 0.710) (0.830, 0.710) (0.830, 0.710)
I-CoNa (0.956, 0.916) (0.956, 0.916) (0.956, 0.916)
I-Controversial (0.947, 0.900) (0.947, 0.900) (0.947, 0.900)
I-MaliciousInstructions (0.883, 0.790) (0.883, 0.790) (0.883, 0.790)
I-Physical-Safety (0.147, 0.080) (0.147, 0.080) (0.147, 0.080)
JBB Behaviors (0.789, 0.730) (0.789, 0.730) (0.789, 0.730)
MaliciousInstruct (0.901, 0.820) (0.901, 0.820) (0.901, 0.820)
MITRE (0.296, 0.174) (0.296, 0.174) (0.296, 0.174)
NicheHazardQA (0.511, 0.343) (0.508, 0.340) (0.508, 0.340)
OpenAI Moderation Dataset (0.745, 0.686) (0.747, 0.690) (0.747, 0.690)
ProsocialDialog (0.518, 0.360) (0.519, 0.360) (0.519, 0.360)
SafeText (0.134, 0.073) (0.139, 0.076) (0.139, 0.076)
SimpleSafetyTests (0.925, 0.860) (0.925, 0.860) (0.925, 0.860)
StrongREJECT Instructions (0.908, 0.831) (0.908, 0.831) (0.908, 0.831)
TDCRedTeaming (0.889, 0.800) (0.889, 0.800) (0.889, 0.800)
TechHazardQA (0.777, 0.635) (0.777, 0.636) (0.777, 0.636)
Toxic Chat (0.558, 0.434) (0.565, 0.439) (0.565, 0.439)
ToxiGen (0.783, 0.761) (0.782, 0.764) (0.782, 0.764)
XSTest (0.817, 0.825) (0.814, 0.820) (0.814, 0.820)

Expected runtime on a single H100 GPU: transformers backend 2 h, vLLM server with full vocab 12 h, vLLM server with top 2 logits and bias 30 min.

(2) Python interface (LLM class)

We also implement a version that uses vLLM's Python interface via the LLM class. Its structure is very similar to the original transformers-based implementation. Run it with: python vllm_pyclass_eval.py --model meta-llama/LlamaGuard-7b --datasets all --output_dir my_logs --batch_size -1. Running all evaluations takes 4 hours on a single H100 GPU.
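
A hedged sketch of the core logprob extraction with the LLM class is shown below; it is illustrative and may differ from the actual vllm_pyclass_eval.py (for example, it assumes the tokenizer ships a chat template, while the script may build the LlamaGuard prompt explicitly).

# Hedged sketch: offline scoring with vLLM's LLM class, reading the
# safe/unsafe logprobs from the first generated token.
import math
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

MODEL = "meta-llama/LlamaGuard-7b"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
safe_id = tokenizer.encode("safe", add_special_tokens=False)[0]
unsafe_id = tokenizer.encode("unsafe", add_special_tokens=False)[0]

llm = LLM(model=MODEL, tensor_parallel_size=1)
params = SamplingParams(max_tokens=1, temperature=0.0, logprobs=20)

conversations = [[{"role": "user", "content": "How do I bake a cake?"}]]
prompts = [tokenizer.apply_chat_template(c, tokenize=False) for c in conversations]

for output in llm.generate(prompts, params):
    first_token = output.outputs[0].logprobs[0]  # {token_id: Logprob}
    l_safe = first_token[safe_id].logprob if safe_id in first_token else -1e9
    l_unsafe = first_token[unsafe_id].logprob if unsafe_id in first_token else -1e9
    m = max(l_safe, l_unsafe)  # stable two-way softmax
    p_unsafe = math.exp(l_unsafe - m) / (math.exp(l_safe - m) + math.exp(l_unsafe - m))
    print(f"P(unsafe) = {p_unsafe:.3f}")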

Dataset | Transformers (F1, Recall) | vLLM server full vocab (F1, Recall) | vLLM LLM class (F1, Recall)
AART (0.904, 0.824) (0.903, 0.824) (0.903, 0.824)
AdvBench Behaviors (0.911, 0.837) (0.911, 0.837) (0.912, 0.838)
AdvBench Strings (0.894, 0.808) (0.892, 0.805) (0.893, 0.807)
BeaverTails 330k (0.685, 0.545) (0.687, 0.546) (0.687, 0.546)
Bot-Adversarial Dialogue (0.634, 0.729) (0.634, 0.730) (0.636, 0.731)
CatQA (0.890, 0.802) (0.888, 0.798) (0.886, 0.795)
ConvAbuse (0.437, 0.312) (0.437, 0.312) (0.446, 0.320)
DecodingTrust Stereotypes (0.932, 0.873) (0.934, 0.877) (0.934, 0.876)
DICES 350 (0.272, 0.215) (0.270, 0.215) (0.283, 0.228)
DICES 990 (0.411, 0.348) (0.415, 0.355) (0.411, 0.348)
Do Anything Now Questions (0.660, 0.492) (0.662, 0.495) (0.662, 0.495)
DoNotAnswer (0.487, 0.322) (0.485, 0.321) (0.487, 0.322)
DynaHate (0.804, 0.842) (0.804, 0.842) (0.804, 0.843)
HarmEval (0.516, 0.347) (0.520, 0.351) (0.518, 0.349)
HarmBench Behaviors (0.650, 0.481) (0.650, 0.481) (0.650, 0.481)
HarmfulQ (0.947, 0.900) (0.945, 0.895) (0.942, 0.890)
HarmfulQA Questions (0.579, 0.408) (0.577, 0.405) (0.580, 0.408)
HarmfulQA (0.171, 0.094) (0.174, 0.095) (0.175, 0.096)
HateCheck (0.943, 0.979) (0.943, 0.979) (0.943, 0.979)
Hatemoji Check (0.862, 0.808) (0.863, 0.810) (0.863, 0.810)
HEx-PHI (0.830, 0.710) (0.830, 0.710) (0.830, 0.710)
I-CoNa (0.956, 0.916) (0.956, 0.916) (0.956, 0.916)
I-Controversial (0.947, 0.900) (0.947, 0.900) (0.947, 0.900)
I-MaliciousInstructions (0.883, 0.790) (0.883, 0.790) (0.883, 0.790)
I-Physical-Safety (0.147, 0.080) (0.147, 0.080) (0.147, 0.080)
JBB Behaviors (0.789, 0.730) (0.789, 0.730) (0.785, 0.730)
MaliciousInstruct (0.901, 0.820) (0.901, 0.820) (0.901, 0.820)
MITRE (0.296, 0.174) (0.296, 0.174) (0.296, 0.174)
NicheHazardQA (0.511, 0.343) (0.508, 0.340) (0.511, 0.343)
OpenAI Moderation Dataset (0.745, 0.686) (0.747, 0.690) (0.748, 0.690)
ProsocialDialog (0.518, 0.360) (0.519, 0.360) (0.519, 0.361)
SafeText (0.134, 0.073) (0.139, 0.076) (0.134, 0.073)
SimpleSafetyTests (0.925, 0.860) (0.925, 0.860) (0.925, 0.860)
StrongREJECT Instructions (0.908, 0.831) (0.908, 0.831) (0.908, 0.831)
TDCRedTeaming (0.889, 0.800) (0.889, 0.800) (0.889, 0.800)
TechHazardQA (0.777, 0.635) (0.777, 0.636) (0.778, 0.636)
Toxic Chat (0.558, 0.434) (0.565, 0.439) (0.566, 0.439)
ToxiGen (0.783, 0.761) (0.782, 0.764) (0.783, 0.764)
XSTest (0.817, 0.825) (0.814, 0.820) (0.814, 0.820)

Llama-Guard-4-12B

We now report (F1, Recall) scores for the newer model meta-llama/Llama-Guard-4-12B with both the transformers and vLLM backends. Transformers results are obtained by running python transformers_mm_eval.py; vLLM results by serving the model with vllm serve meta-llama/Llama-Guard-4-12B -tp 1 --api-key EMPTY --logprobs-mode processed_logits --max-logprobs 10 --max-model-len 131072 and running python vllm_server_mm_eval.py --model meta-llama/Llama-Guard-4-12B --datasets all --output_dir output_dir/Llama-Guard-4-12B --top_logprobs 10 --logit_bias_strength 0.0. In vllm_server_mm_eval.py we cannot use the logit-bias trick from the section above, because Llama-Guard-4-12B tends not to produce safe/unsafe as the very first token; it often emits formatting tokens such as \n\n first. Instead, we generate at most 5 tokens and iterate through them until we find the first occurrence of safe/unsafe, which we then use to compute the probabilities for F1/Recall.
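
A hedged sketch of this scan is shown below (text-only for simplicity, even though the script handles multimodal inputs); details may differ from the actual vllm_server_mm_eval.py.

# Hedged sketch: generate up to 5 tokens, walk them in order, and score the
# first one that is "safe" or "unsafe" (no logit bias, so request the top 10).
import math
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="meta-llama/Llama-Guard-4-12B",
    messages=[{"role": "user", "content": "How do I bake a cake?"}],
    max_tokens=5,     # the model may emit formatting tokens such as "\n\n" first
    temperature=0.0,
    logprobs=True,
    top_logprobs=10,
)

p_unsafe = None
for position in response.choices[0].logprobs.content:
    candidates = {c.token.strip(): c.logprob for c in position.top_logprobs}
    if "safe" in candidates or "unsafe" in candidates:
        l_safe = candidates.get("safe", float("-inf"))
        l_unsafe = candidates.get("unsafe", float("-inf"))
        m = max(l_safe, l_unsafe)  # stable two-way softmax
        p_unsafe = math.exp(l_unsafe - m) / (
            math.exp(l_safe - m) + math.exp(l_unsafe - m)
        )
        break  # the first safe/unsafe occurrence decides the score

print(p_unsafe)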

Dataset | Transformers (F1, Recall) | vLLM server (F1, Recall)
AART (0.874, 0.776) (0.874, 0.776)
AdvBench Behaviors (0.964, 0.931) (0.964, 0.931)
AdvBench Strings (0.830, 0.709) (0.830, 0.709)
BeaverTails 330k (0.732, 0.591) (0.732, 0.591)
Bot-Adversarial Dialogue (0.513, 0.377) (0.513, 0.376)
CatQA (0.932, 0.873) (0.932, 0.873)
ConvAbuse (0.237, 0.148) (0.241, 0.148)
DecodingTrust Stereotypes (0.589, 0.418) (0.591, 0.419)
DICES 350 (0.118, 0.063) (0.118, 0.063)
DICES 990 (0.219, 0.135) (0.219, 0.135)
Do Anything Now Questions (0.746, 0.595) (0.746, 0.595)
DoNotAnswer (0.549, 0.378) (0.546, 0.376)
DynaHate (0.604, 0.481) (0.603, 0.481)
HarmEval (0.560, 0.389) (0.560, 0.389)
HarmBench Behaviors (0.961, 0.925) (0.959, 0.922)
HarmfulQ (0.860, 0.755) (0.860, 0.755)
HarmfulQA Questions (0.588, 0.416) (0.588, 0.416)
HarmfulQA (0.375, 0.231) (0.374, 0.231)
HateCheck (0.782, 0.667) (0.782, 0.667)
Hatemoji Check (0.626, 0.475) (0.625, 0.474)
HEx-PHI (0.964, 0.930) (0.966, 0.933)
I-CoNa (0.833, 0.713) (0.837, 0.719)
I-Controversial (0.596, 0.425) (0.596, 0.425)
I-MaliciousInstructions (0.824, 0.700) (0.824, 0.700)
I-Physical-Safety (0.493, 0.340) (0.493, 0.340)
JBB Behaviors (0.866, 0.870) (0.860, 0.860)
MaliciousInstruct (0.953, 0.910) (0.953, 0.910)
MITRE (0.664, 0.497) (0.663, 0.495)
NicheHazardQA (0.460, 0.299) (0.460, 0.299)
OpenAI Moderation Dataset (0.740, 0.789) (0.739, 0.787)
ProsocialDialog (0.426, 0.275) (0.427, 0.276)
SafeText (0.375, 0.257) (0.372, 0.254)
SimpleSafetyTests (0.985, 0.970) (0.985, 0.970)
StrongREJECT Instructions (0.910, 0.836) (0.910, 0.836)
TDCRedTeaming (0.947, 0.900) (0.947, 0.900)
TechHazardQA (0.759, 0.611) (0.758, 0.610)
Toxic Chat (0.433, 0.519) (0.433, 0.519)
ToxiGen (0.459, 0.315) (0.460, 0.315)
XSTest (0.834, 0.780) (0.834, 0.780)


GuardBench

πŸ”₯ News

  • [October 9, 2025] GuardBench now supports four additional datasets: JBB Behaviors, NicheHazardQA, HarmEval, and TechHazardQA. It also now lets you choose which metrics to show at the end of the evaluation. Supported metrics are: precision (Precision), recall (Recall), f1 (F1), mcc (Matthews Correlation Coefficient), auprc (AUPRC), sensitivity (Sensitivity), specificity (Specificity), g_mean (G-Mean), fpr (False Positive Rate), fnr (False Negative Rate).

⚑️ Introduction

GuardBench is a Python library for evaluating guardrail models, i.e., LLMs fine-tuned to detect unsafe content in human-AI interactions. GuardBench provides a common interface to 40 evaluation datasets, which are downloaded and converted into a standardized format for improved usability. It also lets you quickly compare results and export LaTeX tables for scientific publications. GuardBench's benchmarking pipeline can also be used with custom datasets.

GuardBench was presented at EMNLP 2024. The related paper is available here.

GuardBench has a public leaderboard available on HuggingFace.

You can find the list of supported datasets here. A few of them require authorization. Please read this.

If you use GuardBench to evaluate guardrail models for your scientific publications, please consider citing our work.

✨ Features

πŸ”Œ Requirements

python>=3.10

πŸ’Ύ Installation

pip install guardbench

πŸ’‘ Usage

from guardbench import benchmark

def moderate(
    conversations: list[list[dict[str, str]]],  # MANDATORY!
    # additional `kwargs` as needed
) -> list[float]:
    # do moderation here and return one unsafe probability per conversation
    return [0.0 for _ in conversations]  # placeholder: replace with real scores

benchmark(
    moderate=moderate,  # User-defined moderation function
    model_name="My Guardrail Model",
    batch_size=1,              # Default value
    datasets="all",            # Default value
    metrics=["f1", "recall"],  # Default value
    # Note: you can pass additional `kwargs` for `moderate`
)
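
A toy moderate implementation is shown below, just to make the expected input/output format concrete: it assumes each conversation turn is a dict with role and content keys and returns one unsafe probability per conversation. The keyword heuristic is purely illustrative, not a real guardrail model.

from guardbench import benchmark

BLOCKLIST = ("bomb", "weapon", "stolen")

def moderate(conversations: list[list[dict[str, str]]]) -> list[float]:
    # One unsafe probability per conversation, in the same order.
    scores = []
    for conversation in conversations:
        text = " ".join(turn["content"].lower() for turn in conversation)
        scores.append(1.0 if any(word in text for word in BLOCKLIST) else 0.0)
    return scores

benchmark(
    moderate=moderate,
    model_name="Keyword Baseline",
    datasets="all",
    metrics=["f1", "recall"],
)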

πŸ“– Examples

πŸ“š Documentation

Browse the documentation for more details.

πŸ† Leaderboard

You can find GuardBench's leaderboard here. If you want to submit your results, please contact us.

πŸ‘¨β€πŸ’» Authors

  • Elias Bassani (European Commission - Joint Research Centre)

πŸŽ“ Citation

@inproceedings{guardbench,
    title = "{G}uard{B}ench: A Large-Scale Benchmark for Guardrail Models",
    author = "Bassani, Elias  and
      Sanchez, Ignacio",
    editor = "Al-Onaizan, Yaser  and
      Bansal, Mohit  and
      Chen, Yun-Nung",
    booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.emnlp-main.1022",
    doi = "10.18653/v1/2024.emnlp-main.1022",
    pages = "18393--18409",
}

🎁 Feature Requests

Would you like to see other features implemented? Please open a feature request.

πŸ“„ License

GuardBench is provided as open-source software licensed under EUPL v1.2.
