This project is the PyTorch implementation of *Not All Frames Are Equal: Weakly-Supervised Video Grounding with Contextual Similarity and Visual Clustering Losses* (CVPR 2019).
Video Grounding Definition: Given a video segment and its language description, the aim is to localize the objects queried by the description in the video.
Note: this repository only provides the implementation of the Finite Class Training mode for the YouCookII dataset.
- Python >= 3.6
- PyTorch >= 0.4.0 (<1.0.0)
git clone https://github.com/jshi31/NAFAE.git
- torchtext: used to obtain GloVe features; install it from the torchtext repository (see the sketch after this list)
- opencv
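As a quick check that torchtext can fetch GloVe vectors for word features, here is a minimal sketch; the vector set (`6B`) and dimension (300) are illustrative choices, not necessarily the variant this repo uses:

```python
# Minimal torchtext GloVe check; '6B'/300d are illustrative,
# not necessarily the variant this repo uses.
from torchtext.vocab import GloVe

glove = GloVe(name='6B', dim=300)        # downloads the vectors on first use
vec = glove.vectors[glove.stoi['cook']]  # 300-d embedding for 'cook'
print(vec.shape)                         # torch.Size([300])
```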
- Please download the dataset from YouCookII to prepare the YouCookII data. We only need the raw_videos folder, whose path is denoted as $RAW_VIDEO_DIR.
Note: please ensure that you have downloaded all 2000 videos. If some videos are missing, please contact the authors to get them.
- Parse videos into frames:
cd $ROOT/data/YouCookII
python genframes.py --video_dir $RAW_VIDEO_DIR
The generated frames are stored in sampled_frames_splnum-1, under the same parent folder as $RAW_VIDEO_DIR. Then build a soft link to the project directory:
ln -s $PATH_TO_sampled_frames_splnum-1 $ROOT/data/YouCookII/
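To sanity-check the extracted frames before moving on, here is a small sketch that counts frames per video. It assumes each video gets its own subdirectory of image files, which may differ from the actual layout genframes.py produces:

```python
# Count extracted frames per video (assumed layout: one
# subdirectory of image files per video; adjust as needed).
import os

frame_root = 'data/YouCookII/sampled_frames_splnum-1'
for vid in sorted(os.listdir(frame_root)):
    vid_dir = os.path.join(frame_root, vid)
    if os.path.isdir(vid_dir):
        n = len([f for f in os.listdir(vid_dir) if f.endswith(('.jpg', '.png'))])
        print(f'{vid}: {n} frames')
```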
- Test the dataloader:
python $ROOT/lib/datasets/youcook2.py
If no error is reported, the dataloader works correctly.
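Beyond running the file directly, you can also smoke-test the dataset with a standard PyTorch DataLoader. The class name YouCook2Dataset and its constructor below are hypothetical stand-ins for whatever lib/datasets/youcook2.py actually defines:

```python
# Hypothetical smoke test; the dataset class name and its
# constructor arguments are placeholders, not the repo's real API.
from torch.utils.data import DataLoader
from lib.datasets.youcook2 import YouCook2Dataset  # hypothetical name

dataset = YouCook2Dataset(split='train')           # hypothetical signature
loader = DataLoader(dataset, batch_size=2, shuffle=True, num_workers=2)
batch = next(iter(loader))
print(type(batch))  # inspect one batch to confirm loading works
```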
Create the directory $ROOT/models/vgg16/pretrain/.
We use Faster R-CNN with a VGG16 backbone pretrained on Visual Genome for region proposals. Download the VGG16 model and put it at $ROOT/models/vgg16/pretrain/faster_rcnn_gnome.pth
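To verify the download is intact before compiling anything, you can load the checkpoint on the CPU and peek at its keys; this is a generic PyTorch check, not a step from the original instructions:

```python
# Load the pretrained Faster R-CNN weights on CPU and list a few keys.
import torch

ckpt = torch.load('models/vgg16/pretrain/faster_rcnn_gnome.pth',
                  map_location='cpu')
# Checkpoints are usually dicts; print top-level keys to confirm structure.
keys = ckpt.keys() if isinstance(ckpt, dict) else []
print(list(keys)[:10])
```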
As pointed out by ruotianluo/pytorch-faster-rcnn, choose the right -arch option in the make.sh file to compile the CUDA code:
| GPU model | Architecture |
|---|---|
| TitanX (Maxwell/Pascal) | sm_52 |
| GTX 960M | sm_50 |
| GTX 1080 (Ti) | sm_61 |
| Grid K520 (AWS g2.2xlarge) | sm_30 |
| Tesla K80 (AWS p2.xlarge) | sm_37 |
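If your GPU is not in the table, PyTorch can report its compute capability directly, which maps to the sm_XX value for -arch:

```python
# Query the compute capability of GPU 0 and print the matching -arch flag.
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f'-arch=sm_{major}{minor}')  # e.g. sm_61 for a GTX 1080 (Ti)
```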
Install all the python dependencies using pip:
pip install -r requirements.txt
Compile the CUDA dependencies with the following commands:
cd lib
sh make.sh
This compiles all the modules you need, including NMS, ROI_Pooling, ROI_Align, and ROI_Crop.
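A quick way to confirm the compilation succeeded is to import one of the compiled wrappers from Python. The module path below follows the faster-rcnn.pytorch layout this code appears to be based on, so it is an assumption and may differ here:

```python
# Assumed module path (faster-rcnn.pytorch layout); adjust if this
# repo organizes its compiled extensions differently.
import sys
sys.path.insert(0, 'lib')

from model.nms.nms_wrapper import nms  # raises ImportError if the build failed
print('NMS extension imported successfully')
```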
Train the model:
./train.sh
Evaluate on the test set:
./test_model.sh
Evaluate on the validation set:
./eval_model.sh
Please set checksession, checkepoch, and checkbatch to the same values as in the training setting.
- Visualize groundings: specify train_vis_freq and val_vis_freq as $n so that the detected results are visualized in $ROOT/output every $n batches.
- Visualize the training curve:
tensorboard --logdir runs
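The curves come from event files written under runs/. If you want to log additional scalars yourself, here is a minimal sketch with tensorboardX, the logger typically paired with pre-1.0 PyTorch; the tag name is illustrative:

```python
# Write a scalar curve under runs/ so `tensorboard --logdir runs` picks it up.
from tensorboardX import SummaryWriter

writer = SummaryWriter()  # defaults to a runs/ subdirectory
for step in range(100):
    writer.add_scalar('demo/loss', 1.0 / (step + 1), step)  # illustrative tag
writer.close()
```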
To reproduce the results in our paper, download the Final Model ([gcloud] or [baidu cloud, password: nhq8]) and put it into $ROOT/output/models/vgg16/YouCookII/. Then run:
./test_model.sh
./eval_model.sh
| | macro box accuracy (%) | macro query accuracy (%) |
|---|---|---|
| val | 39.48 | 41.23 |
| test | 40.62 | 42.36 |
If you find this paper or repository helpful, please cite:
@inproceedings{shi2019not,
title={Not All Frames Are Equal: Weakly-Supervised Video Grounding With Contextual Similarity and Visual Clustering Losses},
author={Shi, Jing and Xu, Jia and Gong, Boqing and Xu, Chenliang},
booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
pages={10444--10452},
year={2019}
}