Please check the original repo vladsandulescu/hatefulmemes in case there are new changes.
While I was cleaning up the code and documenting my best leaderboard solution, I discovered
I had made a mistake when training the model. Specifically, after the first epoch the model
kept being trained in eval mode, i.e. with dropout disabled and batchnorm using its running statistics.
This is obviously a mistake and you shouldn't do it. It turns out that a single UNITER
model might be on par with my best leaderboard solution involving 12 models with paired-attention,
and the paired-attention itself seems to have little to no effect. As a consequence I now expect an ensemble of plain UNITER models to outperform
my best leaderboard solution. Imagine my unpleasant surprise finding this after the competition ended.
Oh well, so if you want to do it the right way, make sure to include model.train() here, before returning the test results, similar to the validation part.
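For reference, here is a minimal sketch of the fix in a generic PyTorch evaluation routine. The function and argument names are made up for illustration and do not match the actual code in train_hm.py; the only point is the model.train() call after evaluation.

import torch

def validate_and_test(model, val_loader, test_loader, compute_metrics):
    """Illustrative only: not the actual code in this repo."""
    model.eval()                      # dropout off, batchnorm uses running stats
    with torch.no_grad():
        val_results = compute_metrics(model, val_loader)
        test_results = compute_metrics(model, test_loader)
    model.train()                     # the missing call: switch back to training mode
    return val_results, test_results  # training then resumes with dropout/batchnorm active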
NOTE: The rest of the documentation below reflects the state of the solution as it was when the competition ended. Since the organizers need to reproduce my solution, I will not push the post-competition fix just yet.
My best scoring solution to the Hateful Memes: Phase 2 challenge consists of an ensemble built on a single UNITER model architecture (averaging over probabilities) [paper] [code], which I have adapted for this competition. I customized the paired-attention approach from the UNITER paper to include image captions inferred with the Im2txt implementation of the Show and Tell: Lessons learned from the 2015 MSCOCO Image Captioning Challenge paper. Although it didn't improve my results by a lot (around 1% AUC), I really liked the simplicity of it. I extracted ROI boxes and image features by adapting the Bottom-up Attention with Detectron2 work, which is itself a PyTorch adaptation of the original Caffe-based bottom-up-attention approach.
In a nutshell, the paired-attention approach was used in the UNITER paper to pair images for the NLVR2 task. There the model input was a pair [[img1, txt], [img2, txt]], i.e. the text is repeated for each image. I basically just turned that on its head and used [[img, txt1], [img, txt2]], where the first text is the supplied meme text and the second text is the caption obtained by running Im2txt inference on the image.
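As a rough illustration of that flip (this is not the actual UNITER data loader code, just a sketch of how the two input pairs are formed):

def build_paired_input(img_features, meme_text, caption_text):
    # Sketch only: the real pairing happens inside the adapted UNITER data loader.
    # UNITER's NLVR2 setup pairs two images with one repeated text:
    #   [[img1, txt], [img2, txt]]
    # Here the same idea is flipped: one image paired with two texts,
    # the meme text and the Im2txt caption.
    return [
        (img_features, meme_text),     # pair 1: image regions + supplied meme text
        (img_features, caption_text),  # pair 2: same image regions + inferred caption
    ]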
The overall goal was to get familiar with the multimodal SOTA models out there rather than to focus too much on fancier ensembling or stacking; I find it much more fun to try to improve a single architecture. I also haven't spent much time tuning the hyperparameters; you will notice they are quite similar to the ones the UNITER authors used.
A lot of info up to here, let's take it step by step. If you want to replicate my solution, here's what you need to do.
Please read the instructions in their entirety before starting the process, to ensure you have at least some overview before you begin. It is also very useful to read the installation instructions from the original repos I have used, to get an even better overview. In my experience, running somebody else's scripts on your own instance setup rarely works out of the box the first time.
- Ubuntu 16.04, CUDA 9.0, GCC 5.4.0
- Anaconda 3
- Python 3.6.x (pandas and jupyterlab needed)
The project consists of three sub-projects which have been adapted for this task:
- Bottom-up Attention with Detectron2
- Im2txt: image captioning inference
- UNITER: UNiversal Image-TExt Representation Learning
I strongly recommend creating a separate conda environment for each. For this there is a script in the conda folder of each sub-project. They are all independent, so each should work as a standalone project, requiring only a path to the data folder. There is of course a flow, with py-bottom-up-attention and Im2txt acting as feature extractors for UNITER down the line, but I bundled them into one package mostly for convenience, to document the solution in one place. Once you have cloned the repo you should see this structure; among other things, there are three scripts:
hatefulmemes
├── Im2txt
│ ├── conda
│ │ ├── init_im2txt_ubuntu.sh
│ ├── ...
├── py-bottom-up-attention
│ ├── conda
│ │ ├── init_bua_ubuntu.sh
│ ├── ...
└── UNITER
├── conda
│ ├── init_uniter_ubuntu.sh
└── ...
Please pay attention to the /path/to/ and anaconda3 paths in the scripts. You need to change these to the location of your cloned repo and of your conda installation respectively. Afterwards, run the scripts to create each of the conda environments.
If you prefer not to, or you cannot make it work using the scripts above, you can also follow the installation instructions from the original repos. That's where my scripts come from anyway.
Grab the HM dataset for phase 2 and place the unzipped files inside a data folder, for example under the project root (on the same level as Im2txt, py-bottom-up-attention and UNITER).
Grab the dev_seen_unseen.jsonl file and place it in the same folder as the other jsonl files provided by the organizers.
NOTE: Since no good data should go to waste, I have merged the dev_seen and dev_unseen sets. This increased the size of the validation set to 640 samples, instead of 500. Also the final predictions on the test_unseen set have been produced by training the UNITER model on train+dev_seen_unseen.
If you prefer not to download the dev_seen_unseen.jsonl file directly, you can also build it yourself by running the notebooks/ph2_merge_dev_seen_unseen.ipynb notebook.
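The notebook essentially concatenates the two dev sets and drops duplicate memes; a rough pandas equivalent (not the notebook verbatim, and assuming the standard id field in the jsonl files) looks like this:

import pandas as pd

data_dir = "/path/to/data"

dev_seen = pd.read_json(f"{data_dir}/dev_seen.jsonl", lines=True)
dev_unseen = pd.read_json(f"{data_dir}/dev_unseen.jsonl", lines=True)

# The two dev sets overlap, so keep each meme id only once.
merged = pd.concat([dev_seen, dev_unseen], ignore_index=True).drop_duplicates(subset="id")
merged.to_json(f"{data_dir}/dev_seen_unseen.jsonl", orient="records", lines=True)

print(len(merged))  # should be around 640 samples, as noted above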
For py-bottom-up-attention, the required pretrained model weights should be downloaded automatically when running the feature extraction code further below.
Otherwise please follow the original instructions and download the 10-100 boxes original model weights. Or take the shortcut below, but please always check the original link since the author might have made changes in the meantime.
wget --no-check-certificate http://nlp.cs.unc.edu/models/faster_rcnn_from_caffe_attr_original.pkl -P ~/.torch/fvcore_cache/models/
For Im2txt follow the original instructions, or take the shortcut below, but please always check the original link since the author might have made changes in the meantime.
"Download inceptionv3 finetuned parameters over 1M
and you will get 4 files, and make sure to put them all into this path
Im2txt/im2txt/model/Hugh/train/
- newmodel.ckpt-2000000.data-00000-of-00001
- newmodel.ckpt-2000000.index
- newmodel.ckpt-2000000.meta
- checkpoint "
For UNITER, follow the original instructions (more specifically, you need the pretrained UNITER-large model), or take the shortcut below to download the pretrained model, but please always check the original link since the author might have made changes in the meantime.
NOTE: the path to the pretrained model has to match the path in your training config file.
wget https://convaisharables.blob.core.windows.net/uniter/pretrained/uniter-large.pt -P /path/to/UNITER/storage/pretrained/
You can download the already extracted bboxes + features files here. Otherwise you can extract them yourself using the following code from py-bottom-up-attention. The code extracts the features and places the tsv file for all the images in the imgfeat folder.
This will take about one hour to execute (NVidia V100 GPU). However, note that you will implicitly
shuffle the training set by doing so and UNITER's sampler will cause training batches to be different
than my training batches. This means that even if you replicate the environment precisely, there might
still be small differences in results. But since the final step is an average over probabilities, the
overall differences on the test_unseen set should be negligible.
conda activate bua
python demo/detectron2_mscoco_proposal_maxnms_hm.py \
    --split img \
    --data_path /path/to/data/ \
    --output_path /path/to/data/imgfeat/ \
    --output_type tsv \
    --min_boxes 10 --max_boxes 100
Split up the tsv file into each of the sets by joining with the jsonl files. This will create another set of tsv files, which will be ingested by UNITER.
python demo/hm.py --split img --split_json_file train.jsonl --d2_file_suffix d2_10-100_vg --data_path /path/to/data/ --output_path /path/to/data/imgfeat/
python demo/hm.py --split img --split_json_file dev_seen_unseen.jsonl --d2_file_suffix d2_10-100_vg --data_path /path/to/data/ --output_path /path/to/data/imgfeat/
python demo/hm.py --split img --split_json_file test_unseen.jsonl --d2_file_suffix d2_10-100_vg --data_path /path/to/data/ --output_path /path/to/data/imgfeat/
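For intuition only, the splitting step amounts to filtering the combined feature tsv by the image ids listed in each jsonl split, along these lines; the column names (img_id) and exact output locations are assumptions here, and the authoritative logic lives in demo/hm.py:

import pandas as pd

data_dir = "/path/to/data"
output_dir = f"{data_dir}/imgfeat"

# Combined features produced by the previous step.
all_feats = pd.read_csv(f"{output_dir}/d2_10-100_vg/tsv/img.tsv", sep="\t")

for split in ["train", "dev_seen_unseen", "test_unseen"]:
    meta = pd.read_json(f"{data_dir}/{split}.jsonl", lines=True)
    # "img" entries look like "img/01235.png"; strip folder and extension to get the id.
    ids = meta["img"].str.replace("img/", "", regex=False).str.replace(".png", "", regex=False)
    subset = all_feats[all_feats["img_id"].astype(str).isin(set(ids))]
    subset.to_csv(f"{output_dir}/data_{split}_d2_10-100_vg.tsv", sep="\t", index=False)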
This is how your data folder structure should look now:
hatefulmemes
├── data
│ ├── img (all the memes, .png files)
│ ├── train.jsonl
│ ├── dev_seen.jsonl
│ ├── dev_unseen.jsonl
│ ├── dev_seen_unseen.jsonl
│ ├── test_unseen.jsonl
│ ├── imgfeat/d2_10-100_vg/tsv/img.tsv
│ ├── data_train_d2_10-100_vg.tsv
│ ├── data_dev_seen_unseen_d2_10-100_vg.tsv
│ ├── data_test_unseen_d2_10-100_vg.tsv
│ ├── ...
├── Im2txt
├── py-bottom-up-attention
└── UNITER
Download the already inferred captions file here and place it under the data/im2txt folder. Otherwise, run the following code yourself. This will take about one hour on a multicore, CPU-only machine.
conda activate im2txt
python im2txt/run_inference.py \
    --checkpoint_path="im2txt/model/Hugh/train/newmodel.ckpt-2000000" \
    --vocab_file="im2txt/data/Hugh/word_counts.txt" \
    --input_files="/path/to/data/img/*.png"
Take the output csv file df_ph2.csv and remember to place it under the data/im2txt folder. This is how your data folder structure should look now:
hatefulmemes
├── data
│ ├── im2txt/df_ph2.csv
│ ├── ...
│ ├── img (all the memes, .png files)
│ ├── train.jsonl
│ ├── dev_seen.jsonl
│ ├── dev_unseen.jsonl
│ ├── dev_seen_unseen.jsonl
│ ├── test_unseen.jsonl
│ ├── imgfeat/d2_10-100_vg/tsv/img.tsv
│ ├── data_train_d2_10-100_vg.tsv
│ ├── data_dev_seen_unseen_d2_10-100_vg.tsv
│ ├── data_test_unseen_d2_10-100_vg.tsv
│ ├── ...
├── Im2txt
├── py-bottom-up-attention
└── UNITER
Training just one UNITER model on both train+dev_seen_unseen takes about one hour on an NVIDIA V100 GPU with 32GB RAM. If you only have 16GB RAM on your GPU, you have to modify train_batch_size and gradient_accumulation_steps to keep the effective batch size the same. E.g. if you reduce train_batch_size by half to 3328, gradient_accumulation_steps needs to be doubled to 2.
Also the current implementation does not support distributed training.
NOTE: If you modify the config file however, you will of course get slightly different results, but the overall difference should be negligible when averaging over probabilities output from multiple UNITER models.
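If you do need to halve the batch size for a 16GB GPU, the adjustment boils down to keeping train_batch_size * gradient_accumulation_steps constant. A small sketch, assuming both keys live at the top level of the training config JSON (adjust if they are nested differently):

import json

config_path = "config/ph2_uniter_seeds/train-hm-large-pa-1gpu-hpc_0.json"

with open(config_path) as f:
    cfg = json.load(f)

effective = cfg["train_batch_size"] * cfg.get("gradient_accumulation_steps", 1)

cfg["train_batch_size"] //= 2                                   # e.g. 6656 -> 3328
cfg["gradient_accumulation_steps"] = cfg.get("gradient_accumulation_steps", 1) * 2

# Effective batch size is unchanged, so results should stay close to the reported ones.
assert cfg["train_batch_size"] * cfg["gradient_accumulation_steps"] == effective

with open(config_path, "w") as f:
    json.dump(cfg, f, indent=4)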
To replicate my leaderboard solution, you need to train 12 UNITER models with different seeds. Why 12? Well, because the number has a certain something to it: there were 12 monkeys, there are 12 full lunations of the moon in a year, 12 years in a full cycle of Jupiter, and the reasons could go on and on I guess.
The final probabilities should be the average over the probabilities from the 12 model ensemble.
The only difference between these UNITER models is simply the seed. Training is done, as mentioned before, on both train+dev_seen_unseen, and every valid_steps steps inference is performed on test_unseen, writing the results to a csv file.
conda activate uniter
python train_hm.py --config config/ph2_uniter_seeds/train-hm-large-pa-1gpu-hpc_0.json
python train_hm.py --config config/ph2_uniter_seeds/train-hm-large-pa-1gpu-hpc_24.json
python train_hm.py --config config/ph2_uniter_seeds/train-hm-large-pa-1gpu-hpc_42.json
python train_hm.py --config config/ph2_uniter_seeds/train-hm-large-pa-1gpu-hpc_77.json
python train_hm.py --config config/ph2_uniter_seeds/train-hm-large-pa-1gpu-hpc_2018.json
python train_hm.py --config config/ph2_uniter_seeds/train-hm-large-pa-1gpu-hpc_12345.json
python train_hm.py --config config/ph2_uniter_seeds/train-hm-large-pa-1gpu-hpc_32768.json
python train_hm.py --config config/ph2_uniter_seeds/train-hm-large-pa-1gpu-hpc_54321.json
python train_hm.py --config config/ph2_uniter_seeds/train-hm-large-pa-1gpu-hpc_10101010.json
python train_hm.py --config config/ph2_uniter_seeds/train-hm-large-pa-1gpu-hpc_20200905.json
python train_hm.py --config config/ph2_uniter_seeds/train-hm-large-pa-1gpu-hpc_55555555.json
python train_hm.py --config config/ph2_uniter_seeds/train-hm-large-pa-1gpu-hpc_2147483647.json
NOTE: You should only care about the test_results_1140_rank0_final.csv files; the rest are just intermediate results.
Use the notebooks/ph2_leaderboard.ipynb notebook to generate the final leaderboard results, but make sure to change the UNITER output paths first.
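The ensembling step itself is simple; below is a rough equivalent of what the notebook does. The column names id and proba, the hard-label threshold and the seed_* output folders are assumptions here, so check your actual output files before relying on it.

import glob
import pandas as pd

result_files = glob.glob("/path/to/output/seed_*/test_results_1140_rank0_final.csv")

# Average the per-model probabilities for each meme id.
runs = [pd.read_csv(f).set_index("id")["proba"] for f in result_files]
avg_proba = pd.concat(runs, axis=1).mean(axis=1)

submission = pd.DataFrame({
    "id": avg_proba.index,
    "proba": avg_proba.values,
    "label": (avg_proba.values >= 0.5).astype(int),  # hard label derived from the averaged probability
})
submission.to_csv("ensemble_test_unseen.csv", index=False)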
You can download the checkpoints for all of the 12 models and then just run inference to get the predictions without training the models.
In order to run inference on the test set, you need to have two json files in place next to a checkpoint: /path/to/train_dir/log/model.json and /path/to/train_dir/log/hps.json. These should match the default pretrained UNITER-large config/uniter-large.json file and your UNITER training config file respectively, e.g. config/ph2_uniter_seeds/train-hm-large-pa-1gpu-hpc_0.json for the model using seed 0.
You can simply copy and rename the two files to model.json and hps.json (see the sketch after the folder tree below), but do keep in mind that each checkpoint uses a different seed, and hence a different training config.
NOTE: If you don't download the checkpoints and prefer to train the model yourself, you don't need to copy these files, since the training part will already copy them in the correct folder.
The structure of the checkpoint and log folders should look like this:
├── seed_0
│ ├── ckpt
│ │ ├── model_step_1140.pt
│ ├── log
│ │ ├── model.json (*NOTE* content matches config/uniter-large.json)
│ │ ├── hps.json (*NOTE* content matches config/ph2_uniter_seeds/train-hm-large-pa-1gpu-hpc_0.json)
│ ├── ...
├── seed_24
│ ├── ckpt
│ │ ├── model_step_1140.pt
│ ├── log
│ │ ├── model.json (*NOTE* content matches config/uniter-large.json)
│ │ ├── hps.json (*NOTE* content matches config/ph2_uniter_seeds/train-hm-large-pa-1gpu-hpc_24.json)
├── ...
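Here is a small sketch of that copy step for all 12 seeds, assuming the checkpoint folders follow the seed_<n> layout above and the repo paths match the earlier commands:

import shutil
from pathlib import Path

seeds = [0, 24, 42, 77, 2018, 12345, 32768, 54321,
         10101010, 20200905, 55555555, 2147483647]

uniter_root = Path("/path/to/UNITER")
train_root = Path("/path/to/train_dir")  # parent folder holding the seed_<n> directories

for seed in seeds:
    log_dir = train_root / f"seed_{seed}" / "log"
    log_dir.mkdir(parents=True, exist_ok=True)
    # model.json must match the pretrained UNITER-large config ...
    shutil.copy(uniter_root / "config" / "uniter-large.json", log_dir / "model.json")
    # ... and hps.json must match the training config for that particular seed.
    seed_cfg = uniter_root / "config" / "ph2_uniter_seeds" / f"train-hm-large-pa-1gpu-hpc_{seed}.json"
    shutil.copy(seed_cfg, log_dir / "hps.json")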
Run inference on the test_unseen set.
python inf_hm.py --root_path ./ --dataset_path /path/to/data \
--test_image_set test --train_dir /path/to/train_dir \
--ckpt 1140 --output_dir /path/to/output \
--fp16
If you find this code useful for your research, please consider citing:
@article{sandulescu2020detecting,
title={Detecting Hateful Memes Using a Multimodal Deep Ensemble},
author={Sandulescu, Vlad},
journal={arXiv preprint arXiv:2012.13235},
year={2020}
}