This repository contains experiments related to testing and analyzing large language models (LLMs) in various scenarios. The scripts and results are organized into two main folders.
The primary focus is on:
- Probing the robustness of LLMs through jailbreak attack experiments.
- Exploring the use of LLMs as input classifiers, evaluating their performance and resource overhead.
🧠 These experiments lay the groundwork for future work, particularly in building Multi-Agent Systems (MAS).
The `notebooks` folder contains Jupyter notebooks with:
- Scripts for attacking LLMs – exploring model behavior in adversarial or manipulation-prone contexts.
- Scripts using LLMs and BERT-base models as input classifiers – experiments where models classify input data based on provided prompts.
⚠️ Note: Some notebooks may appear as "Invalid Notebook" on GitHub due to a missing `state` key in `metadata.widgets`, but they work fine when downloaded or opened in Google Colab.
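To illustrate the prompt-based classification setup, here is a minimal sketch. The prompt template and the `parse_label` helper are illustrative assumptions, not the exact prompts used in the notebooks:

```python
# Sketch of using an LLM as a binary input classifier via a prompt.
# PROMPT_TEMPLATE and parse_label are hypothetical examples, not the
# actual prompts from the experiments.

PROMPT_TEMPLATE = (
    "You are a safety classifier. Answer with exactly one word, "
    "SAFE or UNSAFE, for the following user input:\n\n{user_input}"
)

def build_prompt(user_input: str) -> str:
    """Fill the classification prompt with the input to be judged."""
    return PROMPT_TEMPLATE.format(user_input=user_input)

def parse_label(model_output: str) -> str:
    """Map the model's free-text reply to a canonical label.

    Ambiguous replies default to UNSAFE (fail-closed), which is a
    common choice when the classifier guards another system.
    """
    text = model_output.strip().upper()
    if text.startswith("SAFE"):
        return "SAFE"
    if text.startswith("UNSAFE"):
        return "UNSAFE"
    return "UNSAFE"  # anything else is treated as unsafe
```

The model call itself is omitted: the same template/parse pair works with any chat API or local model, which is what makes it easy to swap LLMs in and out of the classifier role.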
The second folder contains the results of the experiments, including:
- Performance metrics (e.g., accuracy, precision, recall, F1-score) for various models used as input classifiers.
- Hardware overhead data (e.g., memory and time consumption) related to running LLMs as classifiers.
⚠️ Note: Outputs generated during LLM attack experiments have been intentionally excluded to avoid sharing potentially harmful or sensitive content. Only non-harmful logs and summaries remain.
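The reported metrics and timing figures can be recomputed from raw predictions; a minimal sketch in pure Python (no scikit-learn dependency assumed), with a small timing helper for the overhead measurements:

```python
import time

def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for binary labels.

    Counts TP/FP/FN/TN and applies the standard definitions;
    ratios with a zero denominator are reported as 0.0.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    accuracy = (tp + tn) / len(y_true) if y_true else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

Memory overhead is framework-specific (e.g. GPU memory has to be read from the runtime, not from Python's allocator), so only the wall-clock side is sketched here.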
All experiments were conducted on Google Colab Pro, primarily to ensure longer session stability.
⚠️ Note: Although Colab Pro was used, the same NVIDIA T4 GPU is also available for free in the standard version of Google Colab with a regular Google account, so all experiments can be reproduced without a Pro subscription.
To run the notebooks locally:

```shell
git clone https://github.com/kolorowyksiaze/DefJailbreakMAS.git
cd DefJailbreakMAS/notebooks
jupyter notebook
```