llm-adaptive-attacks/README.md at main · tml-epfl/llm-adaptive-attacks #957
Labels
AI-Agents
Autonomous AI agents using LLMs
AI-Chatbots
Topics related to advanced chatbot platforms integrating multiple AI models
Algorithms
Sorting, Learning or Classifying. All algorithms go here.
alignment
aligning llms and ais
Git-Repo
Source code repository like gitlab or gh
Papers
Research papers
source-code
Code snippets
Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks
Maksym Andriushchenko (EPFL), Francesco Croce (EPFL), Nicolas Flammarion (EPFL)
Paper: https://arxiv.org/abs/2404.02151
Updates
jailbreak_artifacts
folder) in a convenient form following the JailbreakBench format.0.2.36
). We note that we performed our experiments with version0.2.23
where the system prompt is inserted. Thus, to reproduce our experiments, it's important to use the older version of FastChat (i.e.,pip install fschat==0.2.23
). The full story is here: Question about the system prompt used for llama-2 tml-epfl/llm-adaptive-attacks#2 (thanks to Zhaorui Yang for reporting this issue).Abstract
We show that even the most recent safety-aligned LLMs are not robust to simple adaptive jailbreaking attacks. First, we demonstrate how to successfully leverage access to logprobs for jailbreaking: we initially design an adversarial prompt template (sometimes adapted to the target LLM), and then we apply random search on a suffix to maximize the target logprob (e.g., of the token ``Sure''), potentially with multiple restarts. In this way, we achieve nearly 100% attack success rate---according to GPT-4 as a judge---on GPT-3.5/4, Llama-2-Chat-7B/13B/70B, Gemma-7B, and R2D2 from HarmBench that was adversarially trained against the GCG attack. We also show how to jailbreak all Claude models---that do not expose logprobs---via either a transfer or prefilling attack with 100% success rate. In addition, we show how to use random search on a restricted set of tokens for finding trojan strings in poisoned models---a task that shares many similarities with jailbreaking---which is the algorithm that brought us the first place in the SaTML'24 Trojan Detection Competition. The common theme behind these attacks is that adaptivity is crucial: different models are vulnerable to different prompting templates (e.g., R2D2 is very sensitive to in-context learning prompts), some models have unique vulnerabilities based on their APIs (e.g., prefilling for Claude), and in some settings it is crucial to restrict the token search space based on prior knowledge (e.g., for trojan detection).
Main results
The table below summarizes our main evaluations:
The key element for query efficiency and high success rates for many models is self-transfer (i.e., reusing a successful adversarial string found by random search on a single behavior as an initialization for all behaviors - see the
adv_init
variable inmain.py
) as clearly evidenced by these plots:Getting started
This repository is partially based on the excellent repository from the PAIR paper (https://github.com/patrickrchao/JailbreakingLLMs).
To get started, install dependencies:
For experiments on GPT and Claude models, make sure you have the API key stored in
OPENAI_API_KEY
andANTHROPIC_API_KEY
respectively. For this, you can run:Similarly, some HuggingFace models, like Llama-2-Chat, require an API key, which can be set in
HF_TOKEN
.Run experiments
All experiments used in the paper are collected in the bash files in the
experiments
folder. There are 3 main files:main.py
: runs random search on all models with logprobs (HuggingFace and GPT models),main_claude_transfer.py
: runs a transfer attack on Claude models with a single adversarial suffix obtained on GPT-4,main_claude_prefilling.py
: runs a prefilling attack on Claude models.Get attack logs
The logs of all attacks can be found here in the
attack_logs
folder. The logs also include all intermediate and final adversarial suffixes.Trojan detection code
For the code used to obtain the 1st place in the SatML'24 Trojan Detection Competition, see https://github.com/fra31/rlhf-trojan-competition-submission.
Citation
If you find this work useful in your own research, please consider citing it:
License
This codebase is released under MIT License.
Suggested labels
None
The text was updated successfully, but these errors were encountered: