Code and data for the paper "Exploiting Biased Models to De-bias Text: A Gender-Fair Rewriting Model"
First, set up a new virtual environment and install all necessary dependencies:
bash utils/create_venv.sh
We release the source and target files for our automatic evaluations as well as all model outputs in the automatic_evaluation/data folder. The files are named as in Table 1 and Table 2 in the paper.
To compute the WER, call the following script with the corresponding paths to the HYPOTHESIS and REFERENCE files:
python automatic_evaluation/evaluate.py -r REFERENCE -h1 HYPOTHESIS
To compare two models and compute statistical significance, call the script like this:
python automatic_evaluation/evaluate.py -r REFERENCE -h1 HYPOTHESIS_1 -h2 HYPOTHESIS_2
The scripts/extract.sh script uses grep filters to identify lines that contain gender-fair forms in a given folder of files. Note that if you extract from OSCAR, a single line is actually a document rather than an individual sentence. Sentence splitting and more fine-grained filtering of gender-fair forms will be done in the next step.
Variables you need to configure:
- DATA_FOLDER: where the original data is stored
- OUTPUT_FOLDER: where you want the extracted lines to be stored
- LANG: the language of interest, choices: de, en and en-fw (for reproducing Forward Augmentation for English)
- FILETYPE: whether the files are txt or gz files (for OSCAR)
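For example, an invocation for German OSCAR shards could look like this (the paths are placeholders, and this assumes the variables are read from the environment; if extract.sh sets them at the top of the script instead, edit them there and simply run bash scripts/extract.sh):
DATA_FOLDER=/path/to/oscar-de \
OUTPUT_FOLDER=/path/to/extracted-de \
LANG=de \
FILETYPE=gz \
bash scripts/extract.sh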
Next, you will use scripts/prepare.sh to create biased pseudo source segments for the gender-fair target segments you extracted in the step above. The script will output parallel files of segments that contain gender-fair forms (ending in gf.src and gf.trg) and parallel files of segments that only contain non-gendered forms (source and target are copies of each other, ending in ngf.src and ngf.trg).
Variables you need to configure:
- PATH_TO_VIRTUAL_ENVIRONMENT: path to your venv
- DATA_FOLDER: where the extracted data is stored
- LANG: the language of interest, choices: de, en and en-fw (for reproducing Forward Augmentation for English)
- CREATION_TYPE: how you want to create sources, either rule-based or round-trip
- FILETYPE: whether the files are txt, json (for LM output) or jsonl files (for OSCAR)
The script automatically creates concatenated training files of all files in the folder, called train.gf.src, train.gf.trg, train.ngf.src and train.ngf.trg, and saves them in the repository directory.
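For example, to create rule-based pseudo sources for the German data extracted above (placeholder paths; as before, this assumes the variables are read from the environment, otherwise edit them in the script):
PATH_TO_VIRTUAL_ENVIRONMENT=/path/to/venv \
DATA_FOLDER=/path/to/extracted-de \
LANG=de \
CREATION_TYPE=rule-based \
FILETYPE=txt \
bash scripts/prepare.sh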
The scripts/filter.sh script will deduplicate and filter the parallel data so that we can train our gender-fair rewriting models on cleaner data.
Variables you need to configure:
- PATH_TO_VIRTUAL_ENVIRONMENT: path to your venv
- LANG: the language of interest, choices: de and en (also use en for en-fw)
You can now use train.filtered.gf.src and train.filtered.gf.trg as the training data for your gender-fair rewriting model. You may want to add some data that does not contain gendered forms from train.filtered.ngf.src and train.filtered.ngf.trg before you start the training. A ratio of 70-30 gendered to non-gendered was used for all experiments in the paper.
We provide our Sockeye training configuration in scripts/train.sh and our decoding script in scripts/decode.sh.
If you want to create additional training data, you can prompt large language models (as we did for singular forms in our paper). To reproduce the results in our paper, download GerPT2-large from HuggingFace:
git clone https://huggingface.co/benjamin/gerpt2-large
Now you can generate more content based on the German seed nouns in data/de_seeds.json:
SEED=0 # change this to get different results each time you run the generations
python3 scripts/generate_with_lm.py -d data/de_seeds.json -o lm_generations_$SEED.json -s $SEED
You can then run prepare.sh as usual on the new data, but don't forget to set the filetype to 'json'.
If you want to fine-tune the EN-DE machine translation model on gender-tagged segments based on pair forms, first download the training data from HuggingFace:
git clone https://huggingface.co/datasets/wmt19
Now, create a tagged version of the data using:
python scripts/tag_dataset.py
Finally, you can fine-tune the already downloaded checkpoint of the wmt19-en-de model on the tagged data wmt19-tagged using the script scripts/finetune.sh. In the paper, we fine-tune our model for 50k steps.
Variables you need to configure:
- PATH_TO_TRANSFORMERS: path to your local installation of the Transformers library
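As a sketch (assuming the variable is read from the environment; otherwise edit it in the script):
PATH_TO_TRANSFORMERS=/path/to/transformers bash scripts/finetune.sh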
Data:
- OSCAR: CC BY 4.0
- WMT19 shared task data: unknown
- Sun et al. 2021 testsets: Apache 2.0
- Vanmassenhove et al. 2021 testsets: could not find license
- Diesner-Mayer and Seidel 2022 testsets: could not find license
Code:
- Sockeye: Apache 2.0
- OpusFilter: MIT License
- SentencePiece: Apache 2.0
- Transformers: Apache 2.0
- JiWER: Apache 2.0
Models:
- WMT19 de-en MT model: Apache 2.0
- WMT19 en-de MT model: Apache 2.0
- GerPT2 large LM model: MIT License
If you use this code or data, please cite our paper:
@inproceedings{amrhein-etal-2023-exploiting,
title = "Exploiting Biased Models to De-bias Text: A Gender-Fair Rewriting Model",
author = {Amrhein, Chantal and
Schottmann, Florian and
Sennrich, Rico and
L{\"a}ubli, Samuel},
booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2023",
address = "Toronto, Canada",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.acl-long.246",
doi = "10.18653/v1/2023.acl-long.246",
pages = "4486--4506",
}