Note
@TheRootOf3 wrote the following because of an urgent need to write some text in a paper-like style at 2 am. May be complete trash. Reading at your own risk. The following text is provided "as is", without warranty of any kind...
Reliably removing undesired content from the outputs of a Language Model is one of the key challenges in the area of Large Language Models. Unlearning high-level tasks is useful in tackling varying forms of bias coming from pre-training datasets and allows a relatively inexpensive method for model alignment. Furthermore, the significant scale of pre-training datasets, makes it difficult to remove all unwanted personally identifiable information (PII) at the stage of data pre-processing. The unlearning paradigm offers an alternative to applying additional output filters and is much cheaper in comparison to re-training the entire model.
While there have been multiple methods proposed, we show that while they are capable of successfully unlearning particular concepts, some of these methods also significantly decrease the overall performance of a model. In our work, we discuss the importance of a robust unlearning benchmark, consider both task-specific and overall model performance.
We aim to evaluate Language Model unlearning with respect to the performance of an unlearned model and the efficiency of the unlearning process itself.
@TheRootOf3
The idea to evaluate how well the model unlearned task A:
- Have a benchmark for task A e.g. sentiment analysis.
- Have a set of benchmarks measuring the overall performance of a model.
- Compare the performance of the unlearned model with the original model on all tasks. Hope for the drop in performance on task A and similar performance on the rest of tasks.
We can use the existing Open LLM Leaderboard from huggingface. It has the following tasks:
- AI2 Reasoning Challenge (25-shot) - a set of grade-school science questions.
- HellaSwag (10-shot) - a test of commonsense inference, which is easy for humans (~95%) but challenging for SOTA models.
- MMLU (5-shot) - a test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.
- TruthfulQA (0-shot) - a test to measure a model's propensity to reproduce falsehoods commonly found online. Note: TruthfulQA is technically a 6-shot task in the Harness because each example is prepended with 6 Q/A pairs, even in the 0-shot setting.
- Winogrande (5-shot) - an adversarial and difficult Winograd benchmark at scale, for commonsense reasoning.
- GSM8k (5-shot) - diverse grade school math word problems to measure a model's ability to solve multi-step mathematical reasoning problems.
We can either submit our models to the dashboard (nay :/) and get them evaluated there or run the evaluation harness locally (yay!).
https://github.com/EleutherAI/lm-evaluation-harness provides all we need to evaluate our models -> we can evaluate directly using huggingface transformers library!
To run:
- Follow the installation steps on their github.
- Have a directory with a model AND a tokenizer.
- Run the following:
lm_eval --model hf \
--model_args pretrained=/path/to/model \
--tasks arc_challenge,hellaswag,truthfulqa,mmlu,winogrande,gsm8k \
--device cpu \
--batch_size auto \
--trust_remote_code True \
--output_path eval_results \
--use_cache "./.cache"
A couple of notes:
- choose the right device (look docs) - there is a problem with
mps
on Apple Silicon - I didn't manage to get it running because of dtypes of weights :/ - batch size can be auto calculated (max possible) or set by a user.
- For testing you can use a flag
--limit
, which allows to specify what fraction (float 0-1) of benchmarks you want to use. In general this is only useful for testing because evaluation must be done on full benchmark datasets.
I added eval_framework_tasks/evaluate_models.sh
, which gives a easy way to evaluate a given model. The script allows to evaluate a single model and takes two parameters:
- The name of the experiment (This is when you want to evaluate multiple models within an experiment)
- The name of the model from
/cs/student/projects1/2020/aszablew/SNLP/SNLP_GCW/llm_unlearn_ucl/snlp-unlearned-models/models/
you want to evaluate.
Example:
./evaluate_models.sh experiment1 opt1.3b_unlearned
@Davidyz
Branch david/relearn
implements the use relearning time as an evaluation metric for LLM unlearning.
This approach attempts to retrain the unlearned model on the forgetting dataset and compare the time (number of iterations/samples) before the model reaches the original performance before unlearning.
Therefore, this benchmark requires the original model (before unlearning) for computing the baseline loss.
python relearn/relearn.py --original_model <original_model> --unlearned_model <unlearned_model> --batch_size 3
It might be necessary to adjust the learning rate so that the rate of re-learning shows significant differences.
Use the script ./relearn/run_relearn.sh
to evaluate checkpoints of a model in
batch. --use_lora
is kept because turning lora off alternates the behaviour of
relearning.
@Willmish
To run unlearning on any dataset (currently WIP), added : --unlearning_dataset
param.
To use custom datasets you need to define a corresponding create_pku_dataloader_from_dataset()
function in utils.py.
You can use below, but do change the cache dir and names accordingly!
python3 unlearn_harm.py --unlearning_dataset=math_qa --model_name=facebook/opt-1.3b --model_save_dir=models/opt1.3b_unlearned_mathqa --log_file=logs/opt-1.3b-unlearn-mathqa.log --lr=1e-4 --cache_dir=/cs/student/projects1/2020/sduchnie/SNLP_GCW/.cache --use_quantized=True
python3 unlearn_harm.py --unlearning_dataset=PKU-Alignment/PKU-SafeRLHF --model_name=facebook/opt-1.3b --model_save_dir=models/opt1.3b_unlearned_harmful --log_file=logs/opt-1.3b-unlearn-harmful.log --lr=1e-4 --cache_dir=/cs/student/projects1/2020/sduchnie/SNLP_GCW/.cache --use_quantized=True