The Wisdom of Crowds: Temporal Progressive Attention for Early Action Prediction

[Project page 🌐] [ArXiv preprint 📃] [Video 🎞️]

This is the code implementation for the CVPR'23 paper The Wisdom of Crowds: Temporal Progressive Attention for Early Action Prediction.

Abstract

Early action prediction deals with inferring the ongoing action from partially-observed videos, typically at the outset of the video. We propose a bottleneck-based attention model that captures the evolution of the action, through progressive sampling over fine-to-coarse scales. Our proposed Temporal Progressive (TemPr) model is composed of multiple attention towers, one for each scale. The predicted action label is based on the collective agreement considering confidences of these towers. Extensive experiments over four video datasets showcase state-of-the-art performance on the task of Early Action Prediction across a range of encoder architectures. We demonstrate the effectiveness and consistency of TemPr through detailed ablations.

Dependencies

Ensure that the following packages are installed on your machine:

  • adaPool (version >= 0.2)
  • coloredlogs (version >= 14.0)
  • dataset2database (version >= 1.1)
  • einops (version >= 0.4.0)
  • ffmpeg-python (version >= 0.2.0)
  • imgaug (version >= 0.4.0)
  • opencv-python (version >= 4.2.0.32)
  • ptflops (version >= 0.6.8)
  • torch (version >= 1.9.0)
  • torchinfo (version >= 1.5.4)
  • youtube-dl (version >= 2020.3.24)

You can install the packages available on PyPI with the command below:

$ pip install coloredlogs dataset2database einops ffmpeg-python imgaug opencv-python ptflops torch torchinfo torchvision youtube-dl

and compile the adaPool package as:

$ git clone https://github.com/alexandrosstergiou/adaPool.git && cd adaPool/pytorch && make install
--- (optional) ---
$ make test

Datasets

A custom format is used for the train/val label files of each dataset:

label youtube_id/id time_start(optional) time_end(optional) split

Label files in this format can be generated with the scripts provided in labels.
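As a rough illustration of the label format above, the sketch below parses a single entry. It assumes comma-separated fields (the splits are stored as .csv files) and that the optional time_start/time_end fields are either left empty or omitted; `LabelEntry` and `parse_label_line` are hypothetical names and are not part of the repo.

```python
import csv
from dataclasses import dataclass
from typing import Optional


@dataclass
class LabelEntry:
    label: str
    video_id: str
    time_start: Optional[float]
    time_end: Optional[float]
    split: str


def parse_label_line(line: str) -> LabelEntry:
    """Parse 'label,id,time_start,time_end,split' (comma separation is assumed)."""
    fields = [f.strip() for f in next(csv.reader([line.strip()]))]
    if len(fields) == 3:
        # Assumed short form: the optional times are omitted entirely.
        label, video_id, split = fields
        start = end = ""
    elif len(fields) == 5:
        label, video_id, start, end, split = fields
    else:
        raise ValueError(f"expected 3 or 5 fields, got {len(fields)}: {line!r}")
    return LabelEntry(
        label=label,
        video_id=video_id,
        # Empty optional fields are mapped to None.
        time_start=float(start) if start else None,
        time_end=float(end) if end else None,
        split=split,
    )
```

If the actual files use a different delimiter or column order, only the splitting step needs to change.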

We have tested our code over the following datasets:

Videos and image-based datasets

The repo supports two options, depending on how the dataset is stored on disk:

  • Videos stored as video files (e.g. .mp4, .avi)
  • Videos stored as folders containing their frames as image files (e.g. .jpg)

By default, the data are assumed to be in video format; you can override this by setting the use_frames call argument to True/true.

Data directory format

We assume a fixed directory layout with the following structure:

<data>
|
└───<dataset>
        |
        └─── <class_i>
        │     │
        │     │─── <video_id_j>
        │     │         (for datasets w/ videos saved as frames)
        │     │         │
        │     │         │─── frame1.jpg
        │     │         └─── framen.jpg
        │     │    
        │     │─── <video_id_j+1>
        │     │         (for datasets w/ videos saved as frames)
        │     │         │
        │     │         │─── frame1.jpg
        │     │         └─── framen.jpg
       ...   ...
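The layout above can be traversed with a few lines of standard-library Python. The following is only a sketch for checking that a dataset folder matches the expected structure: `iter_videos` and `VIDEO_EXTS` are hypothetical names, and the set of video extensions is an assumption rather than what the repo's loaders accept.

```python
from pathlib import Path
from typing import Iterator, Tuple

VIDEO_EXTS = {".mp4", ".avi"}  # assumed set of video extensions


def iter_videos(dataset_dir: str, use_frames: bool = False) -> Iterator[Tuple[str, str, Path]]:
    """Yield (class_name, video_id, path) for every video under <data>/<dataset>."""
    for class_dir in sorted(Path(dataset_dir).iterdir()):
        if not class_dir.is_dir():
            continue
        for entry in sorted(class_dir.iterdir()):
            if use_frames and entry.is_dir():
                # Frame-based layout: one folder of image files per video.
                yield class_dir.name, entry.name, entry
            elif not use_frames and entry.suffix.lower() in VIDEO_EXTS:
                # Video-based layout: one file per video.
                yield class_dir.name, entry.stem, entry
```

For example, `list(iter_videos("data/UCF-101", use_frames=True))` would list one entry per frame folder, which can then be compared against the ids in the label files.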

Usage

Training on each dataset is driven by the .yaml configuration file of the same name in configs.

You can also use the argument parsers in train.py and inference.py for custom arguments.

Examples

Train on UCF-101 with an observation ratio of 0.3, 3 scales, the movinet backbone initialised from the pretrained UCF-101 checkpoint stored in weights, and 4 GPUs:

python train.py --video_per 0.3 --num_samplers 3 --gpus 0 1 2 3 --precision mixed --dataset UCF-101 --frame_size 224 --batch_size 64 --data_dir data/UCF-101/ --label_dir /labels/UCF-101 --workers 16 --backbone movinet --end_epoch 70 --pretrained_dir weights/UCF-101/movinet_ada_best.pth

Run inference on Something-Something v2 with TemPr and an adaptive ensemble on a single GPU, using the checkpoint file my_chckpt.pth:

python inference.py --config config/inference/smthng-smthng/config.yml --head TemPr_h --pool ada --gpus 0 --pretrained_dir my_chckpt.pth
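To make the observation ratio used in the training example more concrete, the sketch below selects the frames that would be visible for a given ratio. It assumes that the observed part is a prefix of the video (the abstract describes partial observation "typically at the outset of the video") and that the frame count is rounded down; `observed_frame_indices` is a hypothetical helper, not the repo's sampler.

```python
from typing import List


def observed_frame_indices(num_frames: int, ratio: float) -> List[int]:
    """Indices of the frames visible at a given observation ratio.

    Assumption of this sketch: the observed frames form a prefix of the video.
    """
    if num_frames < 1:
        raise ValueError("num_frames must be positive")
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0, 1]")
    # Round down, but always keep at least one frame.
    num_observed = max(1, int(num_frames * ratio))
    return list(range(num_observed))
```

With a 10-frame video and a ratio of 0.5, this returns the first five frame indices; a ratio of 1.0 returns the full video.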

Calling arguments (for both train.py & inference.py)

The following arguments are used and can be included in the parser of any training script.

| Argument name | Functionality |
| --- | --- |
| debug-mode | Boolean for debugging messages. Useful for custom implementations/datasets. |
| dataset | String for the name of the dataset. Used to obtain the respective configurations. |
| data_dir | String for the directory to load data from. |
| label_dir | String for the directory to load the train and val splits from (should be train.csv and val.csv). |
| clip-length | Integer determining the number of frames to be used for each video. |
| clip-size | Tuple for the spatial size (height x width) of each frame. |
| backbone | String for the name of the feature extractor network. |
| accum_grads | Integer for the number of iterations over which gradients are accumulated before the weights are updated. Set to 1 to disable gradient accumulation. |
| use_frames | Boolean flag. When set to True, the dataset directory should contain folders of .jpg images; otherwise, video files are expected. |
| head | String for the name of the attention tower network. Only TemPr_h can currently be used. |
| pool | String for the predictor aggregation method to be used. |
| gpus | List of the GPU ids to be used. |
| pretrained-3d | String for the .pth filepath, for the case that the weights are to be initialised from a previously trained model. A non-strict weight-loading implementation exists that removes certain words from the state_dict keys. |
| config | String for the .yaml configuration file to be used. If arguments that are part of the configuration file are also passed by the user, the user-supplied values take precedence over the YAML ones. |
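The precedence rule described for config (explicitly passed arguments win over YAML values) can be sketched as follows. This is an illustration under assumptions, not the parser used in train.py or inference.py: `merge_config` is a hypothetical helper, only three of the arguments are declared, and a plain dictionary stands in for the loaded YAML file.

```python
import argparse
from typing import Any, Dict, Optional, Sequence


def merge_config(yaml_cfg: Dict[str, Any], argv: Optional[Sequence[str]] = None) -> Dict[str, Any]:
    """Overlay explicitly passed CLI arguments on top of YAML-loaded values."""
    parser = argparse.ArgumentParser()
    # default=None lets us tell "not passed" apart from a real value.
    parser.add_argument("--dataset", type=str, default=None)
    parser.add_argument("--batch_size", type=int, default=None)
    parser.add_argument("--gpus", type=int, nargs="+", default=None)
    args = parser.parse_args(argv)

    merged = dict(yaml_cfg)  # copy, so the YAML values stay untouched
    for key, value in vars(args).items():
        if value is not None:
            merged[key] = value  # a user-supplied value wins over the YAML one
    return merged
```

Calling `merge_config({"dataset": "UCF-101", "batch_size": 32}, ["--batch_size", "64"])` keeps the dataset from the configuration and replaces only the batch size.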

Checkpoints

UCF-101

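| Backbone | $\rho=0.1$ | $\rho=0.2$ | $\rho=0.3$ | $\rho=0.4$ | $\rho=0.5$ | $\rho=0.6$ | $\rho=0.7$ | $\rho=0.8$ | $\rho=0.9$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| x3d | chkp | chkp | chkp | chkp | chkp | chkp | chkp | chkp | chkp |
| movinet | chkp | chkp | chkp | chkp | chkp | chkp | chkp | chkp | chkp |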

SSsub21

| Backbone | $\rho=0.1$ | $\rho=0.2$ | $\rho=0.3$ | $\rho=0.5$ | $\rho=0.7$ | $\rho=0.9$ |
| --- | --- | --- | --- | --- | --- | --- |
| movinet | chkp | chkp | chkp | chkp | chkp | chkp |

Citation

@inproceedings{stergiou2023wisdom,
    title = {The Wisdom of Crowds: Temporal Progressive Attention for Early Action Prediction},
    author = {Stergiou, Alexandros and Damen, Dima},
    booktitle = {IEEE/CVF Computer Vision and Pattern Recognition (CVPR)},
    year = {2023}
}

License

MIT