This repository contains the implementation of our EMNLP 2021 paper, Relation-aware Video Reading Comprehension for Temporal Language Grounding (arxiv paper).
Our pre-trained models are available at SJTU jbox, baiduyun (passcode: xmc0), or Google Drive.
Clone the repository and move to folder:
git clone https://github.com/Huntersxsx/RaNet.git
cd RaNet
To use this source code, you need Python 3.7+ and a few Python 3 packages:
- pytorch 1.1.0
- torchvision 0.3.0
- torchtext
- easydict
- terminaltables
- tqdm
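As a quick sanity check before training, the sketch below reports which of the packages listed above cannot be found in the current environment. The helper name `find_missing` is hypothetical and not part of this repository, and the script does not verify version numbers; it only assumes the usual import names (for example, pytorch is imported as `torch`).

```python
import importlib.util

# Import names for the packages listed above ("pytorch" is imported as "torch").
REQUIRED_MODULES = [
    "torch",
    "torchvision",
    "torchtext",
    "easydict",
    "terminaltables",
    "tqdm",
]


def find_missing(modules):
    """Return the module names that cannot be located in this environment.

    Hypothetical helper for illustration; it does not import the modules,
    it only asks the import system whether each one can be found.
    """
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages were found.")
```

Running the script from the repository root prints the names of any packages that still need to be installed.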
We use the data offered by 2D-TAN, and the extracted features can be found at Box.
The folder structure should be as follows:
.
├── checkpoints
│ ├── best
│ │ ├── TACoS
│ │ ├── ActivityNet
│ │ └── Charades
├── data
│ ├── TACoS
│ │ ├── tall_c3d_features.hdf5
│ │ └── ...
│ ├── ActivityNet
│ │ ├── sub_activitynet_v1-3.c3d.hdf5
│ │ └── ...
│ ├── Charades-STA
│ │ ├── charades_vgg_rgb.hdf5
│ │ └── ...
│
├── experiments
│
├── lib
│ ├── core
│ ├── datasets
│ └── models
│
└── moment_localization
Please download the visual features from the Box drive and save them to the data/ folder.
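To confirm the features landed in the right place, a small check like the following may help. It is only a sketch: the helper `missing_features` is hypothetical, and it looks only for the three feature files named explicitly in the folder structure above, not for any other files hidden behind the `...` entries.

```python
from pathlib import Path

# Feature files named explicitly in the folder structure above.
EXPECTED_FEATURES = [
    "data/TACoS/tall_c3d_features.hdf5",
    "data/ActivityNet/sub_activitynet_v1-3.c3d.hdf5",
    "data/Charades-STA/charades_vgg_rgb.hdf5",
]


def missing_features(repo_root="."):
    """Return the expected feature files that are absent under repo_root.

    Hypothetical helper for illustration; paths are relative to the
    repository root.
    """
    root = Path(repo_root)
    return [rel for rel in EXPECTED_FEATURES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_features(".")
    if missing:
        print("Missing feature files:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All listed feature files are in place.")
```

If you only plan to run one dataset, it is fine for the other datasets' files to be reported as missing.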
Use the following commands for training:
- For TACoS dataset, run:
sh run_tacos.sh
- For ActivityNet-Captions dataset, run:
sh run_activitynet.sh
- For Charades-STA dataset, run:
sh run_charades.sh
Our trained models are provided at SJTU jbox, baiduyun (passcode: xmc0), or Google Drive. Please download them to the checkpoints/best/ folder.
Use the following commands for testing:
- For TACoS dataset, run:
sh test_tacos.sh
- For ActivityNet-Captions dataset, run:
sh test_activitynet.sh
- For Charades-STA dataset, run:
sh test_charades.sh
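If you prefer to launch the training and testing scripts from Python (for example, from a larger experiment driver), a thin wrapper along these lines could work. The functions `build_command` and `run` are hypothetical and not part of this repository; the wrapper only maps a dataset name and a mode to the shell scripts named above and assumes it is run from the repository root.

```python
import subprocess

# Shell scripts named in the training and testing instructions above.
SCRIPTS = {
    ("tacos", "train"): "run_tacos.sh",
    ("activitynet", "train"): "run_activitynet.sh",
    ("charades", "train"): "run_charades.sh",
    ("tacos", "test"): "test_tacos.sh",
    ("activitynet", "test"): "test_activitynet.sh",
    ("charades", "test"): "test_charades.sh",
}


def build_command(dataset, mode):
    """Return the argument list for the script matching dataset and mode."""
    key = (dataset.lower(), mode.lower())
    if key not in SCRIPTS:
        raise ValueError("unknown dataset/mode combination: %r" % (key,))
    return ["sh", SCRIPTS[key]]


def run(dataset, mode):
    """Run the selected script and raise if it exits with a non-zero status."""
    return subprocess.run(build_command(dataset, mode), check=True)


if __name__ == "__main__":
    # Print the command only; call run("tacos", "test") to execute it.
    print(" ".join(build_command("tacos", "test")))
```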
TACoS | Rank1@0.3 | Rank1@0.5 | Rank5@0.3 | Rank5@0.5 |
---|---|---|---|---|
RaNet | 43.34 | 33.54 | 67.33 | 55.09 |
ActivityNet | Rank1@0.5 | Rank1@0.7 | Rank5@0.5 | Rank5@0.7 |
---|---|---|---|---|
RaNet | 45.59 | 28.67 | 75.93 | 62.97 |
Charades (VGG) | Rank1@0.5 | Rank1@0.7 | Rank5@0.5 | Rank5@0.7 |
---|---|---|---|---|
RaNet | 43.87 | 26.83 | 86.67 | 54.22 |
Charades (I3D) | Rank1@0.5 | Rank1@0.7 | Rank5@0.5 | Rank5@0.7 |
---|---|---|---|---|
RaNet | 60.40 | 39.65 | 89.57 | 64.54 |
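For readers new to the metric, the sketch below shows how a "Rank n@m" score is commonly computed in temporal grounding work: the percentage of queries for which at least one of the top-n predicted moments overlaps the ground-truth moment with a temporal IoU of at least m. This is an illustrative assumption, not the repository's evaluation code; the function names are hypothetical, and the actual evaluation may differ in details such as whether the threshold comparison is strict.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments given in seconds."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    # When the segments overlap, the union equals the span they cover together.
    # When they do not, the intersection is zero, so the result is zero anyway.
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return intersection / union if union > 0 else 0.0


def rank_n_at_m(predictions, ground_truths, n, m):
    """Percentage of queries with a hit among the top-n predictions.

    predictions: one list of (start, end) segments per query, sorted by
                 decreasing confidence.
    ground_truths: one (start, end) segment per query.
    A hit means at least one top-n segment has IoU >= m (assumed convention).
    """
    hits = 0
    for preds, gt in zip(predictions, ground_truths):
        if any(temporal_iou(p, gt) >= m for p in preds[:n]):
            hits += 1
    return 100.0 * hits / len(ground_truths)


if __name__ == "__main__":
    # Toy data: two queries with two ranked predictions each.
    predictions = [[(0, 10), (5, 15)], [(0, 1), (20, 30)]]
    ground_truths = [(5, 15), (20, 30)]
    print(rank_n_at_m(predictions, ground_truths, n=1, m=0.5))
    print(rank_n_at_m(predictions, ground_truths, n=2, m=0.5))
```

In the toy data, neither top-1 prediction reaches IoU 0.5, while both queries have a matching segment within the top 2, which illustrates why Rank5 scores in the tables are higher than Rank1 scores at the same threshold.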
We greatly appreciate the 2D-TAN repository, the G-TAD repository, and the CCNet repository. Please remember to cite the papers:
@inproceedings{gao2021relation,
title={Relation-aware Video Reading Comprehension for Temporal Language Grounding},
author={Gao, Jialin and Sun, Xin and Xu, Mengmeng and Zhou, Xi and Ghanem, Bernard},
booktitle={Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing},
pages={3978--3988},
year={2021}
}
@InProceedings{2DTAN_2020_AAAI,
author = {Zhang, Songyang and Peng, Houwen and Fu, Jianlong and Luo, Jiebo},
title = {Learning 2D Temporal Adjacent Networks for Moment Localization with Natural Language},
booktitle = {AAAI},
year = {2020}
}
@InProceedings{Xu_2020_CVPR,
author = {Xu, Mengmeng and Zhao, Chen and Rojas, David S. and Thabet, Ali and Ghanem, Bernard},
title = {G-TAD: Sub-Graph Localization for Temporal Action Detection},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
url={https://openaccess.thecvf.com/content_CVPR_2020/papers/Xu_G-TAD_Sub-Graph_Localization_for_Temporal_Action_Detection_CVPR_2020_paper.pdf},
month = {June},
year = {2020}
}
@INPROCEEDINGS{9009011,
author={Huang, Zilong and Wang, Xinggang and Huang, Lichao and Huang, Chang and Wei, Yunchao and Liu, Wenyu},
booktitle={2019 IEEE/CVF International Conference on Computer Vision (ICCV)},
title={CCNet: Criss-Cross Attention for Semantic Segmentation},
year={2019},
pages={603--612},
doi={10.1109/ICCV.2019.00069}
}