Transformer Encoder Reasoning and Alignment Network (TERAN)

Updates

  • 🔥 09/2022: The extension of this work (ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval) has been published in the proceedings of CBMI 2022. Check out the code and paper!

Introduction

Code for the cross-modal visual-linguistic retrieval method from "Fine-grained Visual Textual Alignment for Cross-modal Retrieval using Transformer Encoders", accepted for publication in ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) [Pre-print PDF].

This work is an extension of our previous approach TERN, accepted at ICPR 2020.

This repo is built on top of VSE++ and TERN.

[Figure: Fine-grained Alignment for Precise Matching]

[Figure: Retrieval]

Setup

  1. Clone the repo and move into it:
git clone https://github.com/mesnico/TERAN
cd TERAN
  2. Set up the Python environment using conda (a quick sanity check is sketched after these commands):
conda env create --file environment.yml
conda activate teran
export PYTHONPATH=.
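
As an optional sanity check (assuming the conda environment provides PyTorch, which this codebase builds on), you can verify that the environment is active:

# should point to the Python interpreter of the teran environment
which python

# should print the installed PyTorch version without errors
python -c "import torch; print(torch.__version__)"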

Get the data

Data and pretrained models can be downloaded from this OneDrive link (see the steps below to understand which files you need):

  1. Download the data folder, containing the annotations, the splits by Karpathy et al., and the precomputed ROUGE-L and SPICE relevances for both the COCO and Flickr30K datasets, then extract it:
tar -xvf data.tgz
  2. Download the bottom-up features for both COCO and Flickr30K. We use the code by Anderson et al. to extract them. The following commands extract them under data/coco/ and data/f30k/ (a quick check of the resulting layout is sketched after the commands). If you prefer another location, be sure to adjust the configuration file accordingly.
# for MS-COCO
tar -xvf features_36_coco.tgz -C data/coco

# for Flickr30k
tar -xvf features_36_f30k.tgz -C data/f30k
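
After extraction, you can check that the features ended up where the default configuration expects them (the coco path below is the default mentioned in the Evaluate section; the f30k folder name is assumed to be analogous):

# default feature locations assumed by the provided configurations
ls data/coco/features_36
ls data/f30k/features_36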

Evaluate

Extract our pre-trained TERAN models:

tar -xvf TERAN_pretrained_models.tgz

Then, issue the following commands to evaluate a given model on the 1k (5-fold cross-validation) or 5k test sets.

python3 test.py pretrained_models/[model].pth --size 1k
python3 test.py pretrained_models/[model].pth --size 5k

Please note that if you changed any default paths (e.g., the features are in a folder other than data/coco/features_36), you will need to pass the --config option with a YAML configuration file containing the correct paths.
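
For example (assuming you have created such a configuration file; the file name below is only a placeholder), the evaluation command becomes:

python3 test.py pretrained_models/[model].pth --size 1k --config configs/[config].yaml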

Train

To train the model with a given TERAN configuration, issue the following command:

python3 train.py --config configs/[config].yaml --logger_name runs/teran

runs/teran is the directory where the output files (TensorBoard logs, checkpoints) will be stored during this training session.
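
Training progress can be monitored with TensorBoard by pointing it at the same directory (assuming TensorBoard is installed in the environment):

tensorboard --logdir runs/teran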

Visualization

WIP

Reference

If you find this code useful, please cite the following paper:

@article{messina2021fine,
  title={Fine-grained visual textual alignment for cross-modal retrieval using transformer encoders},
  author={Messina, Nicola and Amato, Giuseppe and Esuli, Andrea and Falchi, Fabrizio and Gennaro, Claudio and Marchand-Maillet, St{\'e}phane},
  journal={ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM)},
  volume={17},
  number={4},
  pages={1--23},
  year={2021},
  publisher={ACM New York, NY}
}

License

Apache License 2.0