This repository contains a BERT-based implementation of fine-tuning for multi-label classification, where the goal is to classify a document into one or more classes/labels. Multi-label classification is a generalization of the multiclass problem, so the same implementation can also be used for the multiclass case.
The usage is similar to other BERT fine-tuning setups, such as those described in BERT's repository. To train a model, run_classifier.py can be called:
```
python run_classifier.py --task_name=$TASK \
  --do_train=true \
  --do_predict=false \
  --data_dir=$DATASET_DIR \
  --vocab_file=$BERT_DIR/vocab.txt \
  --bert_config_file=$BERT_DIR/bert_config.json \
  --init_checkpoint=$BERT_DIR/bert_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=32 \
  --learning_rate=2e-5 \
  --num_train_epochs=10.0 \
  --output_dir=$OUTPUT_DIR \
  --add_dense=false
```
To predict samples, run_classifier.py can be called in the following way:
```
python run_classifier.py --task_name=$TASK \
  --do_train=false \
  --do_predict=true \
  --data_dir=$DATASET_DIR \
  --vocab_file=$BERT_DIR/vocab.txt \
  --bert_config_file=$BERT_DIR/bert_config.json \
  --init_checkpoint=$BERT_DIR/bert_model.ckpt \
  --max_seq_length=128 \
  --output_dir=$OUTPUT_DIR \
  --add_dense=false \
  --cut_off=0.5 \
  --cutoff_type=static
```
In both cases, --task_name identifies the task to be executed: the corresponding Processor class is called if it exists (a sketch of such a processor is given after the dataset list below). The --add_dense parameter adds a dense layer on top of the logits from BERT's output layer. The final activation is a sigmoid, so the output is a vector in which position i holds the probability of the document belonging to class i. The --cut_off parameter sets the probability threshold above which a class is assigned, and the cut-off can be applied statically, via the cut_off value itself, or dynamically, learned from a dev set: --cutoff_type accepts static or dynamic accordingly. If no value is passed to --cutoff_type, the predictor outputs the raw probabilities.
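As a minimal sketch of how the sigmoid probabilities and the cut-off are meant to interact (the function names below are illustrative and do not correspond to the actual code in run_classifier.py), assuming the model already produced a matrix of per-class probabilities:

```python
import numpy as np
from sklearn.metrics import f1_score

def apply_static_cutoff(probabilities, cut_off=0.5):
    """Turn per-class sigmoid probabilities into 0/1 label assignments."""
    return (probabilities >= cut_off).astype(int)

def learn_dynamic_cutoff(dev_probabilities, dev_labels,
                         candidates=np.arange(0.05, 0.95, 0.05)):
    """Pick the cut-off that maximizes micro-F1 on a dev set (illustrative)."""
    best_cut, best_f1 = 0.5, -1.0
    for cut in candidates:
        f1 = f1_score(dev_labels, (dev_probabilities >= cut).astype(int),
                      average="micro")
        if f1 > best_f1:
            best_cut, best_f1 = cut, f1
    return best_cut

# Example: three samples, four classes.
probs = np.array([[0.91, 0.12, 0.55, 0.03],
                  [0.20, 0.81, 0.49, 0.77],
                  [0.05, 0.10, 0.98, 0.60]])
print(apply_static_cutoff(probs, cut_off=0.5))
# [[1 0 1 0]
#  [0 1 0 1]
#  [0 0 1 1]]
```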
To test this method, I used four datasets available in the literature:
- Movie Lens: the ML100k dataset from GroupLens. The title and summary of a movie are used to predict its genres. Task name is movielens.
- SE0714: the SE0714 dataset from DeepMoji. The text is used to predict the emoji related to it. Task name is se0714.
- PsychExpEmoji: the PsychExp dataset from DeepMoji. The text is used to predict the emoji related to it. Task name is psychexp.
- Toxic: the Toxic Comments dataset from a Kaggle competition, used to predict the toxicity classes of a comment. For Toxic, only the probabilities of a comment belonging to each class are predicted. Task name is toxic.
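Each task name above maps to a Processor class in run_classifier.py, following the DataProcessor convention of the original BERT code. Below is a minimal sketch of what a processor for a new multi-label task could look like; the class name, file names, and label set are hypothetical, and the exact interface may differ from this repository's implementation:

```python
import csv
import os

class MyMultiLabelProcessor:
    """Illustrative processor: reads TSV files with a text column and a
    comma-separated label column, e.g. "some document\taction,comedy"."""

    def get_labels(self):
        # Fixed label vocabulary; position i corresponds to output unit i.
        return ["action", "comedy", "drama", "horror"]

    def get_train_examples(self, data_dir):
        return self._read_examples(os.path.join(data_dir, "train.tsv"))

    def get_dev_examples(self, data_dir):
        return self._read_examples(os.path.join(data_dir, "dev.tsv"))

    def get_test_examples(self, data_dir):
        return self._read_examples(os.path.join(data_dir, "test.tsv"))

    def _read_examples(self, path):
        examples = []
        with open(path, encoding="utf-8") as f:
            for i, row in enumerate(csv.reader(f, delimiter="\t")):
                text, labels = row[0], row[1].split(",")
                # Multi-hot target vector, one position per label.
                target = [1 if label in labels else 0
                          for label in self.get_labels()]
                examples.append({"guid": f"example-{i}",
                                 "text": text,
                                 "labels": target})
        return examples
```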
The implementation was tested on these datasets, and the best results obtained are shown below. Per-class results are omitted due to space limitations. For the dynamic cut-off type, the dataset was split into train, dev and test sets. In all cases, BERT-Base uncased was used for training and testing.
Dataset | Max Length | Training Epochs | Cut-off type | Add Dense-layer? | AUC | Hamming Loss | F1 |
---|---|---|---|---|---|---|---|
Movie Lens | 128 | 4 | Dynamic | No | 0.89554 | 0.09231 | 0.67236 |
SE0714 | 128 | 4 | Dynamic | Yes | 0.93413 | 0.0573 | 0.67855 |
PsychExpEmoji | 140 | 4 | Dynamic | No | 0.9227 | 0.08031 | 0.73281 |
Toxic | 140 | 2 | NA | No | 0.98606 | NA | NA |
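The metrics in the table can be computed with standard scikit-learn utilities; the sketch below shows one way to do so from the ground-truth labels, the predicted probabilities, and the thresholded predictions (the variable names and the dummy values are illustrative, and the averaging mode used in the results above is not guaranteed to match this example):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, hamming_loss, f1_score

# y_true: multi-hot ground-truth matrix; y_prob: sigmoid probabilities;
# y_pred: labels obtained by applying the cut-off (dummy data shown).
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_prob = np.array([[0.9, 0.2, 0.7], [0.1, 0.8, 0.3], [0.6, 0.7, 0.4]])
y_pred = (y_prob >= 0.5).astype(int)

print("AUC (macro):", roc_auc_score(y_true, y_prob, average="macro"))
print("Hamming loss:", hamming_loss(y_true, y_pred))
print("F1 (micro):", f1_score(y_true, y_pred, average="micro"))
```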