Welcome to the official repository for Siren, a project aimed at understanding and mitigating harmful behaviors in large language models (LLMs). This repository contains the resources for reproducing the experiments described in our work.
The `train` directory contains:
- A randomly selected subset of data samples.
- Configuration `.yaml` files for SFT (Supervised Fine-Tuning) and DPO (Direct Preference Optimization) training, prepared using LLaMA-Factory.
The `data` directory provides the HarmfulBehaviors benchmark dataset, a standardized resource for evaluating LLM jailbreak scenarios.
The `siren` directory contains the code for reproducing our experiments.
To use Siren, follow these steps:
- You may first need to download LLaMA-3-8B into a local directory, following the instructions for `meta-llama/Meta-Llama-3-8B-Instruct`.

```bash
git clone https://github.com/YiyiyiZhao/siren.git
cd siren
pip install -r requirements.txt
```
Adaptors with names like `sft_llama3_decop_1` correspond to the first post-training stage (SFT only). You therefore need to first merge the SFT adaptors with the public base LLMs (e.g., LLaMA-3); the DPO post-training stage is then applied to the merged models.
```bash
cd siren
```

- Make sure to modify lines 42 to 52 of the `merge_sft_base.py` script, replacing the placeholder paths with the actual paths where you want to save the merged model (SFT merged with the base model).
- Make sure to modify line 27 of the `merge_sft_base.py` script, replacing the placeholder path with the actual path where `meta-llama/Meta-Llama-3-8B-Instruct` is downloaded.

```bash
python merge_sft_dpo.py
```
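Conceptually, merging an SFT LoRA adaptor folds the low-rank update back into the frozen base weight, so no extra matrix multiplication is needed at inference time. The following is a minimal NumPy sketch of that arithmetic with toy dimensions; it is an illustration of the idea, not the actual merge script:

```python
import numpy as np

rng = np.random.default_rng(0)

d, r = 8, 2        # hidden dimension and LoRA rank (toy sizes)
alpha = 4          # LoRA scaling numerator

W = rng.normal(size=(d, d))   # frozen base weight
A = rng.normal(size=(r, d))   # LoRA down-projection
B = rng.normal(size=(d, r))   # LoRA up-projection

# Merging folds the adapter update into the base weight:
# W_merged = W + (alpha / r) * B @ A
W_merged = W + (alpha / r) * (B @ A)

# The merged weight gives the same output as base + adapter applied separately.
x = rng.normal(size=(d,))
assert np.allclose(W_merged @ x, W @ x + (alpha / r) * (B @ (A @ x)))
print("merge verified")
```

In practice, libraries such as PEFT perform this merge for every adapted layer and then save the result as a standalone model checkpoint.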
Once you have the trained SFT and DPO adapters, update the paths in `utils/generate_Queries_sft_dpo.py`, lines 11 to 30:
```python
self.base_model_dict = {
    "llama3": {"decop_1": "Your-Merged-Model-Path/sft_llama3/decop_1",
               "decop_2": "Your-Merged-Model-Path/sft_llama3/decop_2",
               "combined": "Your-Merged-Model-Path/sft_llama3/combined"},
    "mistral": {"decop_1": "Your-Merged-Model-Path/sft_mistral/decop_1",
                "decop_2": "Your-Merged-Model-Path/sft_mistral/decop_2",
                "combined": "Your-Merged-Model-Path/sft_mistral/combined"},
    "qwen": {"decop_1": "Your-Merged-Model-Path/sft_qwen/decop_1",
             "decop_2": "Your-Merged-Model-Path/sft_qwen/decop_2",
             "combined": "Your-Merged-Model-Path/sft_qwen/combined"},
}
self.adaptor_path_dict = {
    "llama3": {
        "decop_1": "Yiyiyi/sft_dpo_llama3_decop_1",
        "decop_2": "Yiyiyi/sft_dpo_llama3_decop_1",
        "combined": "Yiyiyi/sft_dpo_llama3_decop_1"
    },
    "mistral": {
        "decop_1": "Yiyiyi/sft_dpo_llama3_decop_1",
        "decop_2": "Yiyiyi/sft_dpo_llama3_decop_1",
        "combined": "Yiyiyi/sft_dpo_llama3_decop_1"
    },
    "qwen": {
        "decop_1": "Yiyiyi/sft_dpo_llama3_decop_1",
        "decop_2": "Yiyiyi/sft_dpo_llama3_decop_1",
        "combined": "Yiyiyi/sft_dpo_llama3_decop_1"
    }
}
```
Set the API keys for querying external LLMs in `utils/remote_llm_response.py`, lines 10 to 22:
```python
self.gpt_api_keys = os.environ.get("OPENAI_API_KEYS")
self.gpt_api_key_list = self.gpt_api_keys.split(',')
self.gpt_base_url = os.environ.get("OPENAI_BASE_URL")
self.claude_api_keys = os.environ.get("ANTHROPIC_API_KEYS")
self.claude_api_key_list = self.claude_api_keys.split(',')
self.claude_base_url = os.environ.get("ANTHROPIC_BASE_URL")
self.gemini_api_keys = os.environ.get("GEMINI_API_KEYS")
self.gemini_api_key_list = self.gemini_api_keys.split(',')
self.gemini_base_url = os.environ.get("GEMINI_BASE_URL")
```
In our experiments, each remote LLM had more than one API key. You can modify this part of `utils/remote_llm_response.py` to fit your own setup.
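The keys are supplied as comma-separated environment variables. A minimal sketch of how multiple keys can be split and rotated round-robin (the key values here are hypothetical placeholders, and `cycle` is one possible rotation strategy, not necessarily what the repository uses):

```python
import os
from itertools import cycle

# Hypothetical placeholder keys; set the real values in your shell instead.
os.environ["OPENAI_API_KEYS"] = "sk-key-a,sk-key-b,sk-key-c"

key_list = os.environ["OPENAI_API_KEYS"].split(",")
key_iter = cycle(key_list)  # round-robin over keys to spread rate limits

# Each request takes the next key in the rotation.
first_three = [next(key_iter) for _ in range(3)]
print(first_three)  # ['sk-key-a', 'sk-key-b', 'sk-key-c']
```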
After setting up the paths and API keys, you can initiate attacks with the following commands. See `commands.sh` and `commands_remote.sh` for more examples.
```bash
CUDA_VISIBLE_DEVICES=0,1 python main_sft_dpo.py --attacker_model llama3 --adaptor decop_1 --victim_model mistral > sft_dpo_llama3_1_mistral.log 2>&1 &
CUDA_VISIBLE_DEVICES=1,2 python main_sft_dpo.py --attacker_model mistral --adaptor combined --victim_model llama3 > sft_dpo_mistral_com_llama3.log 2>&1 &
```

or

```bash
CUDA_VISIBLE_DEVICES=0 python main_sft_dpo.py --attacker_model qwen --adaptor decop_2 --victim_model claude-3-5-haiku-20241022 > sft_dpo_qwen_2_claude.log 2>&1 &
CUDA_VISIBLE_DEVICES=4 python main_sft_dpo.py --attacker_model llama3 --adaptor decop_1 --victim_model gpt-4o-2024-08-06 > sft_dpo_llama3_1_gpt.log 2>&1 &
CUDA_VISIBLE_DEVICES=1 python main_sft_dpo.py --attacker_model mistral --adaptor decop_2 --victim_model gemini-1.5-pro-latest > sft_dpo_mistral_2_ge.log 2>&1 &
```
After you have the interaction records, you can calculate the ASR (Attack Success Rate) with:

```bash
python cal_asr.py
```
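ASR is the fraction of attack attempts judged successful. A minimal sketch of the computation, assuming a hypothetical record schema with a boolean `jailbroken` field (the actual `cal_asr.py` may use a different format and success criterion):

```python
def attack_success_rate(records):
    """records: list of dicts with a boolean 'jailbroken' field (assumed schema)."""
    if not records:
        return 0.0
    # ASR = number of successful jailbreaks / total attempts
    return sum(r["jailbroken"] for r in records) / len(records)

# Toy example: 3 of 4 attempts succeed.
records = [
    {"jailbroken": True},
    {"jailbroken": False},
    {"jailbroken": True},
    {"jailbroken": True},
]
print(f"ASR: {attack_success_rate(records):.2%}")  # ASR: 75.00%
```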