
Federated Domain-adaptive Pre-training

This repository is the implementation of FDAPT: Federated Domain-adaptive Pre-training for Language Models.

Setup

  1. Create a Python environment (e.g. conda create -n fdapt and conda activate fdapt)

  2. Install the dependencies: pip install -r requirements.txt

Data Preparation

Pre-training data

Download

cd data
wget https://huggingface.co/datasets/ccdv/pubmed-summarization/resolve/main/train.zip
wget https://huggingface.co/datasets/ccdv/pubmed-summarization/resolve/main/val.zip
unzip train.zip
unzip val.zip
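
If you only want to inspect the corpus, the same data can also be pulled through the Hugging Face datasets library. This is a convenience sketch only (split.py below expects the unzipped files from the wget step); ccdv/pubmed-summarization is a script-based dataset, so recent datasets versions may need trust_remote_code:

from datasets import load_dataset

# Pull the PubMed summarization corpus directly from the Hub.
dataset = load_dataset("ccdv/pubmed-summarization", trust_remote_code=True)
print(dataset["train"][0]["article"][:200])  # peek at the first article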

Split

# --noniid_type selects from: num, voc, len
python3 split.py \
    --num_clients 8 \
    --output_dir 'noniid_voc/' \
    --noniid_type 'voc'
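
For intuition, a quantity-skew split (the "num" option) can be approximated as below. This is an illustrative sketch, not the repository's split.py, which also implements the vocabulary ("voc") and length ("len") skews:

import numpy as np

def noniid_num_split(n_docs, num_clients=8, alpha=0.5, seed=42):
    """Assign document indices to clients with Dirichlet quantity skew."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_docs)
    # Dirichlet proportions give the clients unequal shard sizes.
    proportions = rng.dirichlet(alpha * np.ones(num_clients))
    cut_points = (np.cumsum(proportions)[:-1] * n_docs).astype(int)
    return np.split(indices, cut_points)

shards = noniid_num_split(10_000)
print([len(s) for s in shards])  # eight shards of very different sizes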

Pre-processing

python3 get_dataloader.py \
    --model_name_or_path "distilbert-base-cased" \
    --per_device_train_batch_size 8 \
    --output_dir "data_dir/" \
    --cache_dir "cache/" \
    --num_clients 8 \
    --dataset_dir "noniid_voc/"
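
Under the hood this step presumably tokenizes each client shard and builds masked-LM dataloaders. A minimal sketch of that idea with Transformers (the actual get_dataloader.py may differ in its details):

from torch.utils.data import DataLoader
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
texts = ["Aspirin inhibits platelet aggregation.",
         "BRCA1 mutations increase cancer risk."]

# Tokenize, then let the collator handle padding and 15% random masking
# (the standard BERT-style MLM objective).
encodings = tokenizer(texts, truncation=True, max_length=128)
samples = [{"input_ids": ids} for ids in encodings["input_ids"]]
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

loader = DataLoader(samples, batch_size=8, collate_fn=collator)
batch = next(iter(loader))
print(batch["input_ids"].shape, batch["labels"].shape)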

Downstream task data

# Download datasets
cd downstream-task
bash download.sh

Federated Domain-adaptive Pre-training

python3 fdabert.py \
    --model_name_or_path "distilbert-base-cased" \
    --fed_dir_data 'data/data_dir/' \
    --checkpointing_steps=epoch \
    --output_dir "saved_models/" \
    --cache_dir "cache/" \
    --num_train_epochs 1 \
    --num_rounds 15 \
    --num_clients 8 \
    --do_freeze True
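
fdabert.py runs the federated loop on Flower. The sketch below is not the repository's code; it only illustrates the two core mechanics: the example-count-weighted FedAvg aggregation applied after each communication round, and one plausible reading of --do_freeze (here, freezing the embedding layer; the exact layers frozen are defined in fdabert.py):

from transformers import AutoModelForMaskedLM

def fedavg(client_state_dicts, client_num_examples):
    """Weight each client's parameters by its number of training examples."""
    total = sum(client_num_examples)
    return {
        key: sum(sd[key].float() * (n / total)
                 for sd, n in zip(client_state_dicts, client_num_examples))
        for key in client_state_dicts[0]
    }

model = AutoModelForMaskedLM.from_pretrained("distilbert-base-cased")
for param in model.distilbert.embeddings.parameters():
    param.requires_grad = False  # example: embeddings stay fixed during local steps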

Evaluate on Downstream Tasks

Named Entity Recognition

cd downstream-task/named-entity-recognition

python3 finetune.py \
    --model_name_or_path  "saved_models_path" \
    --tokenizer_name "distilbert-base-cased" \
    --dataset_name "ncbi_disease" \
    --output_dir "results/" \
    --do_train \
    --do_eval \
    --do_predict \
    --num_train_epochs 20 \
    --per_gpu_train_batch_size 8 \
    --cache_dir "cache/" \
    --seed 42 \
    --overwrite_output_dir \
    --save_strategy 'epoch' \
    --evaluation_strategy 'epoch' \
    --load_best_model_at_end \
    --metric_for_best_model 'eval_f1' \
    --greater_is_better True

Note: --dataset_name can be one of "ncbi_disease", "ghadeermobasher/BC5CDR-Chemical-Disease", "drAbreu/bc4chemd_ner", "bc2gm_corpus", "linnaeus", or "species_800".
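
The best checkpoint is selected by entity-level F1 (--metric_for_best_model 'eval_f1'). Token-classification scripts of this kind typically compute it with seqeval; a quick sketch of the metric, assuming seqeval is installed:

from seqeval.metrics import f1_score

# A span only counts when both its boundary and its entity type match.
gold = [["O", "B-Disease", "I-Disease", "O"]]
pred = [["O", "B-Disease", "I-Disease", "O"]]
print(f1_score(gold, pred))  # 1.0: the single disease span matches exactly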

Question Answering

cd downstream-task/question-answering

export DATA_DIR=../datasets/QA/BioASQ
export data_name=BioASQ-test-factoid-7b.json 
export gold_name=7B_golden.json
export OFFICIAL_DIR=./scripts/bioasq_eval

# Train
python3 run_factoid.py \
    --model_type distilbert \
    --tokenizer_name 'distilbert-base-cased' \
    --model_name_or_path  "saved_models_path" \
    --do_train \
    --train_file "${DATA_DIR}/BioASQ-train-factoid-7b.json" \
    --per_gpu_train_batch_size 8 \
    --learning_rate 5e-5 \
    --num_train_epochs 10 \
    --seed 42 \
    --output_dir "results/" \
    --overwrite_output_dir \
    --cache_dir 'cache'

# Evaluate 
python run_factoid.py \
    --model_type distilbert \
    --tokenizer_name 'distilbert-base-cased' \
    --model_name_or_path "saved_models_path" \
    --do_eval \
    --predict_file ${DATA_DIR}/${data_name} \
    --golden_file ${DATA_DIR}/${gold_name} \
    --per_gpu_eval_batch_size 8 \
    --seed 42 \
    --official_eval_dir ${OFFICIAL_DIR} \
    --output_dir "results/" \
    --overwrite_output_dir \
    --cache_dir 'cache'
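
Before running the official BioASQ evaluation, the fine-tuned checkpoint can be smoke-tested with a quick extractive-QA query. Illustrative only; the model path is assumed to be the --output_dir from the training step above:

from transformers import pipeline

qa = pipeline("question-answering",
              model="results/",  # assumed: --output_dir from training above
              tokenizer="distilbert-base-cased")
out = qa(question="What does aspirin inhibit?",
         context="Aspirin irreversibly inhibits cyclooxygenase enzymes.")
print(out["answer"], out["score"])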

Relation Extraction

cd downstream-task/relation-extraction
bash preprocess.sh
pip install scikit-learn
pip install pandas

export SAVE_DIR=./output
export DATA="euadr" # euadr or gad
export SPLIT="1" # 1,2,3,4,5,6,7,8,9,10
export DATA_DIR=../datasets/RE/${DATA}/${SPLIT}
export ENTITY=${DATA}-${SPLIT}

# Train 
python run_re.py \
    --task_name SST-2 \
    --config_name distilbert-base-cased \
    --tokenizer_name distilbert-base-cased \
    --data_dir ${DATA_DIR} \
    --model_name_or_path  "saved_models_path" \
    --num_train_epochs 10 \
    --per_device_train_batch_size 8 \
    --seed 42 \
    --do_train \
    --do_predict \
    --learning_rate 5e-5 \
    --output_dir ${SAVE_DIR}/${ENTITY} \
    --overwrite_output_dir \
    --cache_dir "cache" \
    --overwrite_cache

# Evaluate
python ./scripts/re_eval.py --output_path=${SAVE_DIR}/${ENTITY}/test_results.txt --answer_path=${DATA_DIR}/test.tsv
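
re_eval.py compares the predicted labels in test_results.txt against the gold labels in test.tsv; at its core this is binary-classification precision/recall/F1. A sketch of the computation with scikit-learn (file formats assumed; re_eval.py defines the official numbers):

from sklearn.metrics import precision_recall_fscore_support

# Toy labels: 1 = relation present, 0 = no relation.
gold = [1, 0, 1, 1, 0]
pred = [1, 0, 0, 1, 0]
p, r, f1, _ = precision_recall_fscore_support(gold, pred, average="binary")
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")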

Acknowledgement

We would like to thank the authors of Flower, Hugging Face, and BioBERT for providing access to their code.

Citation

If you find this repo useful, please cite our work.

@inproceedings{jiang2023fdapt,
  title={FDAPT: Federated Domain-adaptive Pre-training for Language Models},
  author={Lekang Jiang and Filip Svoboda and Nicholas Donald Lane},
  booktitle={International Workshop on Federated Learning in the Age of Foundation Models in Conjunction with NeurIPS 2023},
  year={2023}
}
