This repository is the official implementation of Affordance Grounding from Demonstration Video to Target Image:
@inproceedings{afformer,
author = {Joya Chen and Difei Gao and Kevin Qinghong Lin and Mike Zheng Shou},
title = {Affordance Grounding from Demonstration Video to Target Image},
booktitle = {CVPR},
year = {2023},
}
We now support PyTorch 2.0. Other versions should also work.
conda install -y pytorch torchvision pytorch-cuda=11.8 -c pytorch -c nvidia
NOTE: If you want to use PyTorch 2.0, you should install CUDA >= 11.7. See https://pytorch.org/.
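As a quick sanity check after installation, the CUDA requirement in the note above can be verified from Python. This is a minimal sketch, not part of the repository: the helper names `parse_version`, `cuda_is_new_enough`, and `report` are hypothetical, and the only PyTorch attributes used are `torch.__version__` and `torch.version.cuda`.

```python
import re


def parse_version(text):
    """Turn '11.8' or '2.0.1+cu118' into a tuple of ints, dropping any '+local' suffix."""
    numbers = []
    for piece in text.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        numbers.append(int(match.group()))
    return tuple(numbers)


def cuda_is_new_enough(cuda_version, minimum="11.7"):
    """Return True if cuda_version (a string, or None for CPU-only builds) is >= minimum."""
    if cuda_version is None:
        return False
    return parse_version(cuda_version) >= parse_version(minimum)


def report():
    """Describe the installed PyTorch build, or say that PyTorch is missing."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    cuda = torch.version.cuda  # None when PyTorch was built without CUDA
    status = "ok" if cuda_is_new_enough(cuda) else "too old or missing"
    return f"torch {torch.__version__}, CUDA {cuda} ({status})"


if __name__ == "__main__":
    print(report())
```

Note that `torch.version.cuda` reports the CUDA version PyTorch was built against, which may differ from the toolkit installed system-wide.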
We use PyTorch Lightning 2.0 as the training and inference engine.
pip install lightning jsonargparse[signatures] --upgrade
We use the memory-efficient attention in xformers. Currently, PyTorch 2.0 does not support memory-efficient attention with relative positional encoding (see pytorch/issues/96099). We will update this repo when PyTorch supports it.
pip install triton --upgrade
pip install --pre xformers
We borrow some implementations from timm and detectron2.
pip install timm opencv-python av imageio --upgrade
python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'
- Download the OPRA dataset from https://github.com/kuanfang/opra. Due to copyright issues, you may need to download the original videos from YouTube.
- We have uploaded our organized annotation JSON files to datasets/opra/annotations. Your datasets file tree should now be:
datasets
└── opra
    ├── annotations
    │   ├── test.json
    │   └── train.json
    ├── clips
    │   ├── aocom
    │   ├── appliances
    │   ├── bestkitchenreview
    │   ├── cooking
    │   ├── eguru
    │   └── seattle
    └── images
        ├── aocom
        ├── appliances
        ├── bestkitchenreview
        ├── cooking
        ├── eguru
        └── seattle
- We are working on organizing EPIC-Hotspot and AssistQ Buttons. They will be released as soon as possible.
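Before training, it can help to confirm that the OPRA layout listed above is in place. The following is a minimal sketch using only the standard library; the helper names `expected_paths` and `missing_paths` are hypothetical and not part of this repository, and the check only looks at the directories and annotation files named in the tree, not at their contents.

```python
from pathlib import Path

# Sub-folders listed under both clips/ and images/ in the file tree.
SOURCES = ["aocom", "appliances", "bestkitchenreview", "cooking", "eguru", "seattle"]


def expected_paths(root="datasets"):
    """List every path named in the documented OPRA file tree."""
    opra = Path(root) / "opra"
    paths = [opra / "annotations" / name for name in ("train.json", "test.json")]
    for group in ("clips", "images"):
        paths.extend(opra / group / source for source in SOURCES)
    return paths


def missing_paths(root="datasets"):
    """Return the expected paths that do not exist under root."""
    return [path for path in expected_paths(root) if not path.exists()]


if __name__ == "__main__":
    missing = missing_paths()
    if missing:
        print("Missing entries:")
        for path in missing:
            print(f"  {path}")
    else:
        print("OPRA layout looks complete.")
```

Run it from the repository root so that the relative `datasets` folder resolves correctly.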
Hint: If you are new to LightningCLI, we recommend reading its documentation first. It will help you make better use of these commands.
- You don't need to manually download the pre-trained encoder weights; torchvision will download them automatically. See torchvision.models.detection.fasterrcnn_resnet50_fpn_v2 for details.
- Train Afformer with the ResNet-50-FPN encoder:
python main.py fit --config configs/opra/r50fpn.yaml --trainer.devices 8 --data.batch_size_per_gpu 2
- The training log is saved in outputs/ by default. You can launch TensorBoard to monitor this folder:
tensorboard --logdir outputs/ --port 2333
# Then you can see real-time losses, metrics at http://localhost:2333/
- Evaluation runs every 1k iterations during training. You can also evaluate with the validate command. For example:
python main.py validate --config configs/opra/r50fpn.yaml --trainer.devices 8 --data.batch_size_per_gpu 2 --ckpt outputs/opra/r50fpn/lightning_logs/version_0/checkpoints/xxxx.ckpt
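The xxxx.ckpt placeholder in the command above has to be replaced with a real checkpoint file. A minimal sketch for locating one is shown below; the helper `latest_checkpoint` is hypothetical, and it assumes the directory pattern from the example path (lightning_logs/version_*/checkpoints/*.ckpt). It picks the most recently modified file, which is not necessarily the best-performing one.

```python
from pathlib import Path


def latest_checkpoint(run_dir):
    """Return the most recently modified .ckpt file under run_dir, or None if there is none."""
    pattern = "lightning_logs/version_*/checkpoints/*.ckpt"
    checkpoints = sorted(Path(run_dir).glob(pattern), key=lambda p: p.stat().st_mtime)
    return checkpoints[-1] if checkpoints else None


if __name__ == "__main__":
    # Run directory taken from the example validate command.
    print(latest_checkpoint("outputs/opra/r50fpn"))
```

The printed path can then be passed as the checkpoint argument of the validate command.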
- Download the ViTDet-B-COCO weights and put them in the weights/ folder: weights/mask_rcnn_vitdet_b_coco.pkl.
- Train Afformer with the ViTDet-B encoder:
python main.py fit --config configs/opra/vitdet_b.yaml --trainer.devices 8 --data.batch_size_per_gpu 2
- The training log is saved in outputs/ by default. You can launch TensorBoard to monitor this folder:
tensorboard --logdir outputs/ --port 2333
# Then you can see real-time losses, metrics at http://localhost:2333/
- Evaluation runs every 1k iterations during training. You can also evaluate with the validate command. For example:
python main.py validate --config configs/opra/vitdet_b.yaml --trainer.devices 8 --data.batch_size_per_gpu 2 --ckpt outputs/opra/vitdet_b/lightning_logs/version_0/checkpoints/xxxx.ckpt
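The training commands above pass --trainer.devices 8 and --data.batch_size_per_gpu 2. If you have fewer GPUs, one common approach is to keep the global batch size unchanged. The sketch below illustrates the arithmetic under stated assumptions: data-parallel training where each device processes its own batch, and the Lightning Trainer options `accumulate_grad_batches` and `num_nodes` left at their default of 1 unless given. The function names are hypothetical, and whether the provided configs depend on a specific global batch size is not confirmed by this README.

```python
def global_batch_size(devices, batch_size_per_gpu, accumulate_grad_batches=1, num_nodes=1):
    """Samples contributing to one optimizer step under data-parallel training."""
    return devices * batch_size_per_gpu * accumulate_grad_batches * num_nodes


def per_gpu_batch_for(target_global, devices, accumulate_grad_batches=1):
    """Per-GPU batch size that reproduces target_global on a single node."""
    per_step = devices * accumulate_grad_batches
    if target_global % per_step != 0:
        raise ValueError(f"{target_global} is not divisible by {per_step}")
    return target_global // per_step


if __name__ == "__main__":
    target = global_batch_size(devices=8, batch_size_per_gpu=2)
    print("global batch size with 8 GPUs:", target)
    print("per-GPU batch size with 4 GPUs:", per_gpu_batch_for(target, devices=4))
```

Matching the global batch size does not guarantee identical results, and a larger per-GPU batch needs more GPU memory; gradient accumulation trades memory for extra steps.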
python demo.py --config configs/opra/vitdet_b.yaml --weight weights/afformer_vitdet_b_v1.ckpt --video demo/video.mp4 --image demo/image.jpg --output demo/output.gif
- Hint: we carefully fine-tuned a strong ViTDet model that outperforms the results reported in the paper. Download it.
NOTE: A detailed tutorial will be provided as soon as possible.
- Download our trained hand interaction detector weights from this url, then put them in the weights/ folder: weights/hircnn_r50fpnv2_849.pth.
- The video demo of this hand interaction detector:
- Hint: we trained this simple and accurate hand interaction detector on 100DOH plus some Ego datasets. It achieves 84.9 hand+interaction detection AP on the 100DOH test set. For MaskAHand pre-training, these weights are sufficient. We will release the full source code at chenjoya/hircnn as soon as possible.
- Make sure your data preparation follows the Dataset section.
- Run affominer/miner.py. The generated data will be saved to affominer/outputs.
This is done during training. You can set the hyper-parameters in configs/opra/maskahand/pretrain.yaml:
mask_ratio: 1.0
num_masks: 2
distortion_scale: 0.5
num_frames: 32
clip_interval: 16
contact_threshold: 0.99
python main.py fit --config configs/opra/maskahand/pretrain.yaml
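When editing these hyper-parameters, a small sanity check can catch obviously invalid values before a long run starts. The sketch below mirrors the six values listed above in a dataclass. Everything beyond the names and default values is an assumption: the class `MaskAHandParams` and its `problems` method are hypothetical, and the accepted ranges (ratios, scales, and thresholds in [0, 1]; counts as positive integers) are inferred from the parameter names rather than taken from the repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MaskAHandParams:
    # Defaults copied from the values listed for pretrain.yaml.
    mask_ratio: float = 1.0
    num_masks: int = 2
    distortion_scale: float = 0.5
    num_frames: int = 32
    clip_interval: int = 16
    contact_threshold: float = 0.99

    def problems(self):
        """Return human-readable issues; the ranges are assumptions, not repo rules."""
        issues = []
        for name in ("mask_ratio", "distortion_scale", "contact_threshold"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                issues.append(f"{name}={value} is outside [0, 1]")
        for name in ("num_masks", "num_frames", "clip_interval"):
            value = getattr(self, name)
            if not isinstance(value, int) or value < 1:
                issues.append(f"{name}={value} is not a positive integer")
        return issues


if __name__ == "__main__":
    print(MaskAHandParams().problems())
    print(MaskAHandParams(contact_threshold=1.5, num_frames=0).problems())
```

The check is deliberately conservative: it does not model how the training code consumes these values.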
- Fine-tune the MaskAHand pre-trained weights with
python main.py fit --config configs/opra/maskahand/finetune.yaml
- Evaluate the MaskAHand pre-trained weights zero-shot with
python main.py validate --config configs/opra/maskahand/pretrain.yaml
You can refer to demo.py to visualize your model results.
This repository is developed by Joya Chen. Questions and discussions are welcome via joyachen@u.nus.edu.
Thanks to all co-authors of the paper: Difei Gao, Kevin Qinghong Lin, and Mike Shou (my supervisor). We also appreciate the assistance of Dongxing Mao and Jiawei Liu.