Prompt Guard

Prompt Guard is a classifier model by Meta, trained on a large corpus of attacks, capable of detecting both explicitly malicious prompts (jailbreaks) as well as data that contains injected inputs (prompt injections). Upon analysis, it returns one or more of the following verdicts, along with a confidence score for each:

LABEL_0: benign (non-malicious input)
LABEL_1: malicious (prompt injection or jailbreak attempt)

Note: Prompt Guard 1 produced BENIGN, INJECTION, and JAILBREAK as output labels, but Prompt Guard 2 has shifted from a multi-label classifier to a binary classifier with LABEL_0 and LABEL_1 labels only.

This repository contains a Streamlit app for testing Prompt Guard. Note that you'll need an HuggingFace access token to access the model. For a more detailed writeup, see this blog post.

Here's a sample response by Prompt Guard upon detecting a prompt injection attempt.

Here's a sample response by Prompt Guard upon detecting a jailbreak attempt.

Name		Name	Last commit message	Last commit date
Latest commit History 57 Commits
.github		.github
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
prompt-guard-injection.png		prompt-guard-injection.png
prompt-guard-jailbreak.png		prompt-guard-jailbreak.png
requirements.txt		requirements.txt
streamlit_app.py		streamlit_app.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Prompt Guard

About

Uh oh!

Sponsor this project

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

Prompt Guard

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Sponsor this project

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages