- The CSV files used to compute the scores in the Vilio paper are now available here
- Thanks to the initiative by katrinc, here are two notebooks for running pure inference with Vilio on any meme you want :)
- Just adapt the example input dataset / input model to use a different meme / pretrained model 🥶
- GPU: https://www.kaggle.com/muennighoff/vilioexample-nb
- CPU: https://www.kaggle.com/muennighoff/vilioexample-nb-cpu
Vilio aims to replicate the organization of huggingface's transformers repo at: https://github.com/huggingface/transformers
- /bash: Shell files to reproduce hateful memes results
- /data: By default, directory for loading in data & saving checkpoints
- /ernie-vil: Ernie-vil sub-repository written in PaddlePaddle
- /fts_lmdb: Scripts for handling .lmdb extracted features
- /fts_tsv: Scripts for handling .tsv extracted features
- /notebooks: Jupyter Notebooks for demonstration & reproducibility
- /py-bottom-up-attention: Sub-repository for tsv feature extraction forked & adapted from here
- src/vilio: All implemented models (also see below for a quick overview of models)
- /utils: Pandas & ensembling scripts for data handling
- entry.py files: Scripts used to access the models and apply model-specific data preparation
- pretrain.py files: Same purpose as the entry files, but for pre-training; point of entry for pre-training
- hm.py: Training code for the hateful memes challenge; main point of entry
- param.py: Args for running hm.py
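
To give a feel for what the ensembling scripts in /utils are for, here is a minimal sketch of simple probability averaging over several prediction CSVs. It is not the code from this repo: the function name `average_ensemble`, the column names `id`, `proba` and `label`, and the 0.5 threshold are all assumptions made for illustration, and it uses only the Python standard library rather than pandas so that it runs without extra dependencies.

```python
import csv
import io
from collections import defaultdict


def average_ensemble(csv_files, id_col="id", proba_col="proba", threshold=0.5):
    """Average per-id probabilities over several prediction CSVs.

    Hypothetical helper: column names and threshold are assumptions,
    not taken from the Vilio code base.
    """
    scores = defaultdict(list)
    for handle in csv_files:
        # Each file is expected to have one row per meme id.
        for row in csv.DictReader(handle):
            scores[row[id_col]].append(float(row[proba_col]))

    rows = []
    for meme_id, probas in scores.items():
        mean = sum(probas) / len(probas)
        # Binarize the averaged probability into a hard label.
        rows.append({id_col: meme_id, proba_col: mean, "label": int(mean >= threshold)})
    return rows


# Two tiny in-memory CSVs stand in for per-model prediction files.
model_a = io.StringIO("id,proba\n1,0.9\n2,0.2\n")
model_b = io.StringIO("id,proba\n1,0.7\n2,0.4\n")
ensemble = average_ensemble([model_a, model_b])

for row in ensemble:
    print(row)
```

Plain averaging is only the simplest option; for the ensembling actually used to reach the reported scores, follow the scripts in /utils and SCORE_REPRO.md.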
Follow SCORE_REPRO.md for reproducing performance on the Hateful Memes Task.
Follow GETTING_STARTED.md for using the framework for your own task.
See the paper at: https://arxiv.org/abs/2012.07788
🥶 Vilio currently provides the following architectures with the outlined language transformers:
- E - ERNIE-VIL (ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph)
- D - DeVLBERT (DeVLBert: Learning Deconfounded Visio-Linguistic Representations)
- O - OSCAR (Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks)
- U - UNITER (UNITER: UNiversal Image-TExt Representation Learning)
- V - VisualBERT (VisualBERT: A Simple and Performant Baseline for Vision and Language)
- X - LXMERT (LXMERT: Learning Cross-Modality Encoder Representations from Transformers)
- Clean up import statements & Python paths, and find a better way to integrate transformers (right now all import statements only work when run from the main folder)
- Enable loading and running models just via import statements (and not having to clone the repo)
- Find a way to better include ERNIE-VIL in this repo (PaddlePaddle to Torch?)
- Move tokenization in entry files to model-specific tokenization similar to transformers
The code heavily borrows from the following repositories, thanks for their great work:
- https://github.com/huggingface/transformers
- https://github.com/facebookresearch/mmf
- https://github.com/airsplay/lxmert
@article{muennighoff2020vilio,
title={Vilio: State-of-the-art visio-linguistic models applied to hateful memes},
author={Muennighoff, Niklas},
journal={arXiv preprint arXiv:2012.07788},
year={2020}
}