Code and data for the paper "Exploiting Biased Models to De-bias Text: A Gender-Fair Rewriting Model"
First, set up a new virtual environment and install all necessary dependencies:
bash utils/create_venv.sh
We release the source and target files for our automatic evaluations as well as all model outputs in the automatic_evaluation/data folder. The files are named as in Table 1 and Table 2 in the paper.
To compute the WER, call the following script with the corresponding paths to the HYPOTHESIS and REFERENCE files:
python automatic_evaluation/evaluate.py -r REFERENCE -h1 HYPOTHESIS
To compare two models and compute statistical significance, call the script like this:
python automatic_evaluation/evaluate.py -r REFERENCE -h1 HYPOTHESIS_1 -h2 HYPOTHESIS_2
The scripts/extract.sh script uses grep filters to identify lines that contain gender-fair forms in a given folder of files. Note that if you extract from OSCAR, a single line is actually a document rather than an individual sentence. Sentence splitting and more fine-grained filtering of gender-fair forms will be done in the next step.
Variables you need to configure:
- DATA_FOLDER: where the original data is stored
- OUTPUT_FOLDER: where you want the extracted lines to be stored
- LANG: the language of interest, choices: de, en and en-fw (for reproducing Forward Augmentation for English)
- FILETYPE: whether the files are txt or gz files (for OSCAR)
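For example, an invocation for German OSCAR shards could look like this (the paths are placeholders, and this assumes the variables are read from the environment; if extract.sh sets them at the top of the script instead, edit them there and simply run bash scripts/extract.sh):
DATA_FOLDER=/path/to/oscar-de \
OUTPUT_FOLDER=/path/to/extracted-de \
LANG=de \
FILETYPE=gz \
bash scripts/extract.sh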
Next, you will use scripts/prepare.sh to create biased pseudo source segments for the gender-fair target segments you extracted in the step above. The script will output parallel files of segments that contain gender-fair forms (ending in gf.src and gf.trg) and parallel files of segments that only contain non-gendered forms (source and target are copies of each other, ending in ngf.src and ngf.trg).
Variables you need to configure:
- PATH_TO_VIRTUAL_ENVIRONMENT: path to your venv
- DATA_FOLDER: where the extracted data is stored
- LANG: the language of interest, choices: de, en and en-fw (for reproducing Forward Augmentation for English)
- CREATION_TYPE: how you want to create sources, either rule-based or round-trip
- FILETYPE: whether the files are txt, json (for LM output) or jsonl files (for OSCAR)
The script automatically creates concatenated training files of all files in the folder, called train.gf.src, train.gf.trg, train.ngf.src and train.ngf.trg, and saves them in the repository directory.
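For example, to create rule-based pseudo sources for the German data extracted above (placeholder paths; as before, this assumes the variables are read from the environment, otherwise edit them in the script):
PATH_TO_VIRTUAL_ENVIRONMENT=/path/to/venv \
DATA_FOLDER=/path/to/extracted-de \
LANG=de \
CREATION_TYPE=rule-based \
FILETYPE=txt \
bash scripts/prepare.sh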
The scripts/filter.sh script will deduplicate and filter the parallel data so that we can train our gender-fair rewriting models on cleaner data.
Variables you need to configure:
- PATH_TO_VIRTUAL_ENVIRONMENT: path to your venv
- LANG: the language of interest, choices: de and en (also use en for en-fw)
You can now use train.filtered.gf.src and train.filtered.gf.trg as the training data for your gender-fair rewriting model. You may want to add some data that does not contain gendered forms from train.filtered.ngf.src and train.filtered.ngf.trg before you start the training. A ratio of 70-30 gendered to non-gendered was used for all experiments in the paper.
We provide our Sockeye training configuration in scripts/train.sh and our decoding script in scripts/decode.sh.
If you want to create additional training data, you can prompt large language models (as we did for singular forms in our paper). To reproduce the results in our paper, download GerPT2-large from HuggingFace:
git clone https://huggingface.co/benjamin/gerpt2-large
Now you can generate more content based on the German seed nouns in data/de_seeds.json:
SEED=0 # change this to get different results each time you run the generations
python3 scripts/generate_with_lm.py -d data/de_seeds.json -o lm_generations_$SEED.json -s $SEED
You can then run prepare.sh as usual on the new data, but don't forget to set the filetype to 'json'.
If you want to fine-tune the EN-DE machine translation model on gender-tagged segments based on pair forms, first download the training data from HuggingFace:
git clone https://huggingface.co/datasets/wmt19
Now, create a tagged version of the data using:
python scripts/tag_dataset.py
Finally, you can fine-tune the already downloaded checkpoint of the wmt19-en-de model on the tagged data wmt19-tagged using the script scripts/finetune.sh. In the paper, we fine-tune our model for 50k steps.
Variables you need to configure:
- PATH_TO_TRANSFORMERS: path to your local installation of the Transformers library
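As a sketch (assuming the variable is read from the environment; otherwise edit it in the script):
PATH_TO_TRANSFORMERS=/path/to/transformers bash scripts/finetune.sh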
Data:
- OSCAR: CC BY 4.0
- WMT19 shared task data: unknown
- Sun et al. 2021 testsets: Apache 2.0
- Vanmassenhove et al. 2021 testsets: could not find license
- Diesner-Mayer and Seidel 2022 testsets: could not find license
Code:
- Sockeye: Apache 2.0
- OpusFilter: MIT License
- SentencePiece: Apache 2.0
- Transformers: Apache 2.0
- JiWER: Apache 2.0
Models:
- WMT19 de-en MT model: Apache 2.0
- WMT19 en-de MT model: Apache 2.0
- GerPT2 large LM model: MIT License
If you use this code or data, please cite our paper:
@inproceedings{amrhein-etal-2023-exploiting,
title = "Exploiting Biased Models to De-bias Text: A Gender-Fair Rewriting Model",
author = {Amrhein, Chantal and
Schottmann, Florian and
Sennrich, Rico and
L{\"a}ubli, Samuel},
booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2023",
address = "Toronto, Canada",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.acl-long.246",
doi = "10.18653/v1/2023.acl-long.246",
pages = "4486--4506",
}