TransPose is a human pose estimation model based on a CNN feature extractor, a Transformer encoder, and a prediction head. Given an image, the attention layers of the Transformer can efficiently capture long-range spatial relationships between keypoints and reveal which dependencies the predicted keypoint locations rely on most.
[arxiv 2012.14214] [paper] [demo-notebook]
TransPose: Keypoint Localization via Transformer, Sen Yang, Zhibin Quan, Mu Nie, Wankou Yang, ICCV 2021
We choose two types of CNNs as the backbone candidates: ResNet and HRNet. The derived convolutional blocks are ResNet-Small, HRNet-Small-W32, and HRNet-Small-W48.
Model | Backbone | #Attention layers | d | h | #Heads | #Params | AP (coco val gt bbox) | Download |
---|---|---|---|---|---|---|---|---|
TransPose-R-A3 | ResNet-S | 3 | 256 | 1024 | 8 | 5.2M | 73.8 | model |
TransPose-R-A4 | ResNet-S | 4 | 256 | 1024 | 8 | 6.0M | 75.1 | model |
TransPose-H-S | HRNet-S-W32 | 4 | 64 | 128 | 1 | 8.0M | 76.1 | model |
TransPose-H-A4 | HRNet-S-W48 | 4 | 96 | 192 | 1 | 17.3M | 77.5 | model |
TransPose-H-A6 | HRNet-S-W48 | 6 | 96 | 192 | 1 | 17.5M | 78.1 | model |
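As a rough sanity check on the #Params column, the cost of one extra attention layer can be estimated from d and h alone. The sketch below assumes each layer is parameterized like a standard Transformer encoder layer (four d x d attention projections, a feed-forward network with hidden size h, and two LayerNorms, all with biases). That layout and the helper name `encoder_layer_params` are assumptions for illustration, not something the table states.

```python
def encoder_layer_params(d: int, h: int) -> int:
    """Parameter count of one standard Transformer encoder layer (assumed layout)."""
    attention = 4 * (d * d + d)                # Q, K, V and output projections, with biases
    feed_forward = (d * h + h) + (h * d + d)   # two linear layers: d -> h -> d
    layer_norms = 2 * (2 * d)                  # two LayerNorms, each with scale and shift
    return attention + feed_forward + layer_norms


if __name__ == "__main__":
    for name, d, h in [("TransPose-R", 256, 1024), ("TransPose-H (W48)", 96, 192)]:
        n = encoder_layer_params(d, h)
        print(f"{name}: {n} parameters per layer (~{n / 1e6:.2f}M)")
```

Under that assumption, a layer with d=256 and h=1024 has 789,760 parameters (about 0.8 million), which matches the step from 5.2 to 6.0 between TransPose-R-A3 and TransPose-R-A4. With d=96 and h=192 a layer has 74,784 parameters, so two extra layers add about 0.15 million, consistent after rounding with the step from 17.3 to 17.5 between TransPose-H-A4 and TransPose-H-A6.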
You can directly load the TransPose-R-A4 or TransPose-H-A4 model, with weights pretrained on the COCO train2017 dataset, from Torch Hub:
import torch
tpr = torch.hub.load('yangsenius/TransPose:main', 'tpr_a4_256x192', pretrained=True)
tph = torch.hub.load('yangsenius/TransPose:main', 'tph_a4_256x192', pretrained=True)
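Once a model is loaded, its output still has to be turned into keypoint coordinates. The sketch below assumes the model returns one heatmap per keypoint and decodes each heatmap by taking its peak. The output format, the 64x48 heatmap size, and the helper name `decode_heatmaps` are assumptions for illustration; the example uses synthetic data instead of a real forward pass.

```python
import numpy as np


def decode_heatmaps(heatmaps: np.ndarray):
    """Return (x, y) peak coordinates and peak scores for heatmaps of shape (K, H, W)."""
    num_keypoints, height, width = heatmaps.shape
    flat = heatmaps.reshape(num_keypoints, -1)
    flat_index = flat.argmax(axis=1)             # index of the peak in each flattened map
    scores = flat.max(axis=1)
    ys, xs = np.unravel_index(flat_index, (height, width))
    coords = np.stack([xs, ys], axis=1)          # (K, 2) in heatmap pixels, x first
    return coords, scores


if __name__ == "__main__":
    # Synthetic stand-in for a model output: 17 COCO keypoints on an assumed 64x48 grid.
    heatmaps = np.zeros((17, 64, 48), dtype=np.float32)
    heatmaps[0, 10, 20] = 1.0                    # peak at row y=10, column x=20
    coords, scores = decode_heatmaps(heatmaps)
    print(coords[0], scores[0])
```

The coordinates are in heatmap pixels. To map them back to the input crop, multiply by the ratio between input size and heatmap size, whatever that ratio is for the model in use.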
Model | Input size | FPS* | GFLOPs | AP | AP .5 | AP .75 | AP (M) | AP (L) | AR | AR .5 | AR .75 | AR (M) | AR (L) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
TransPose-R-A3 | 256x192 | 141 | 8.0 | 0.717 | 0.889 | 0.788 | 0.680 | 0.786 | 0.771 | 0.930 | 0.836 | 0.727 | 0.835 |
TransPose-R-A4 | 256x192 | 138 | 8.9 | 0.726 | 0.891 | 0.799 | 0.688 | 0.798 | 0.780 | 0.931 | 0.845 | 0.735 | 0.844 |
TransPose-H-S | 256x192 | 45 | 10.2 | 0.742 | 0.896 | 0.808 | 0.706 | 0.810 | 0.795 | 0.935 | 0.855 | 0.752 | 0.856 |
TransPose-H-A4 | 256x192 | 41 | 17.5 | 0.753 | 0.900 | 0.818 | 0.717 | 0.821 | 0.803 | 0.939 | 0.861 | 0.761 | 0.865 |
TransPose-H-A6 | 256x192 | 38 | 21.8 | 0.758 | 0.901 | 0.821 | 0.719 | 0.828 | 0.808 | 0.939 | 0.864 | 0.764 | 0.872 |
Note:
- We computed the average FPS* by testing 100 samples from the COCO val dataset (with batch size 1) on a single NVIDIA 2080Ti GPU. The FPS may fluctuate between runs.
- We trained the models on different hardware platforms: 1 x RTX2080Ti GPU (TP-R-A4), 4 x TITAN Xp GPUs (TP-H-S, TP-H-A4), and 4 x Tesla P40 GPUs (TP-H-A6).
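The FPS measurement described in the note can be approximated with a small timing loop. This is a minimal sketch, not the script used for the table: `average_fps` and `fake_infer` are hypothetical names, and the warm-up count is an arbitrary choice.

```python
import time
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def average_fps(infer: Callable[[T], object], samples: Iterable[T], warmup: int = 5) -> float:
    """Average frames per second of `infer` over `samples`, one sample per call."""
    samples = list(samples)
    # Warm-up calls are excluded from timing (caches, lazy initialization).
    for sample in samples[:warmup]:
        infer(sample)
    start = time.perf_counter()
    for sample in samples:
        infer(sample)
    # On a GPU, asynchronous kernels should be synchronized before reading the
    # clock (for example with torch.cuda.synchronize()); that step is omitted here.
    elapsed = time.perf_counter() - start
    return len(samples) / elapsed


if __name__ == "__main__":
    # Hypothetical stand-in for a model forward pass on one image.
    def fake_infer(sample):
        time.sleep(0.001)  # pretend inference takes about 1 ms

    print(f"{average_fps(fake_infer, range(100)):.1f} FPS")
```

Because each sample is timed with batch size 1, the result reflects per-image latency and will vary between runs, as the note above says.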
Model | Input size | #Params | GFLOPs | AP | AP .5 | AP .75 | AP (M) | AP (L) | AR | AR .5 | AR .75 | AR (M) | AR (L) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
TransPose-H-S | 256x192 | 8.0M | 10.2 | 0.734 | 0.916 | 0.811 | 0.701 | 0.793 | 0.786 | 0.950 | 0.856 | 0.745 | 0.843 |
TransPose-H-A4 | 256x192 | 17.3M | 17.5 | 0.747 | 0.919 | 0.822 | 0.714 | 0.807 | 0.799 | 0.953 | 0.866 | 0.758 | 0.854 |
TransPose-H-A6 | 256x192 | 17.5M | 21.8 | 0.750 | 0.922 | 0.823 | 0.713 | 0.811 | 0.801 | 0.954 | 0.867 | 0.759 | 0.859 |
Given an input image, a pretrained TransPose model, and the predicted locations, we can visualize the spatial dependencies of the predicted locations, applying a threshold to the attention scores.
[Figures: TransPose-R-A4 with threshold=0.00 and threshold=0.01; TransPose-H-A4 with threshold=0.00 and threshold=0.00075]
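The thresholding step can be sketched as follows. The example assumes the attention weights are available as an (H*W, H*W) matrix over a flattened, row-major feature grid, with one row per query position. That layout and the helper name `dependency_map` are assumptions for illustration; the repository's visualization code may be organized differently.

```python
import numpy as np


def dependency_map(attention: np.ndarray, location, grid_shape, threshold: float = 0.0) -> np.ndarray:
    """Attention paid by the query at `location` (x, y) to every position, thresholded.

    attention: (H*W, H*W) array, row = query position, column = key position.
    """
    height, width = grid_shape
    x, y = location
    query_index = y * width + x                        # row-major flattening of the grid
    scores = attention[query_index].reshape(height, width)
    return np.where(scores >= threshold, scores, 0.0)  # suppress weak dependencies


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    height, width = 4, 3
    logits = rng.normal(size=(height * width, height * width))
    # Row-wise softmax so that every query's attention sums to 1.
    attention = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    print(dependency_map(attention, location=(2, 1), grid_shape=(height, width), threshold=0.05))
```

With threshold=0.0 every position is kept; raising the threshold leaves only the positions the predicted location depends on most, which is the effect shown in the figure panels.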
- Clone this repository. We refer to the cloned directory as ${POSE_ROOT}.
  git clone https://github.com/yangsenius/TransPose.git
- Install PyTorch>=1.6 and torchvision>=0.7 from the official PyTorch website.
- Install the package dependencies. Make sure the Python version is >=3.7.
  pip install -r requirements.txt
- Create the output (trained models and files) and log (TensorBoard logs) directories under ${POSE_ROOT}, then build the libs:
  mkdir output log
  cd ${POSE_ROOT}/lib
  make
- Download the pretrained models from the releases of this repo into the following directory layout:
  ${POSE_ROOT}
  `-- models
      `-- pytorch
          |-- imagenet
          |   |-- hrnet_w32-36af842e.pth
          |   |-- hrnet_w48-8ef0771d.pth
          |   |-- resnet50-19c8e357.pth
          |-- transpose_coco
          |   |-- tp_r_256x192_enc3_d256_h1024_mh8.pth
          |   |-- tp_r_256x192_enc4_d256_h1024_mh8.pth
          |   |-- tp_h_32_256x192_enc4_d64_h128_mh1.pth
          |   |-- tp_h_48_256x192_enc4_d96_h192_mh1.pth
          |   |-- tp_h_48_256x192_enc6_d96_h192_mh1.pth
We follow the steps of HRNet to prepare the COCO train/val/test dataset and the annotations. The detected person results can be downloaded from OneDrive or GoogleDrive. Please download or link them to ${POSE_ROOT}/data/coco/, and arrange them as follows:
${POSE_ROOT}/data/coco/
|-- annotations
|   |-- person_keypoints_train2017.json
|   `-- person_keypoints_val2017.json
|-- person_detection_results
|   |-- COCO_val2017_detections_AP_H_56_person.json
|   `-- COCO_test-dev2017_detections_AP_H_609_person.json
`-- images
    |-- train2017
    |   |-- 000000000009.jpg
    |   |-- ...
    `-- val2017
        |-- 000000000139.jpg
        |-- ...
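Before training or testing, it can help to confirm that the data directory matches the layout above. The following is a minimal sketch of such a check: the helper `missing_coco_entries` is hypothetical, and reading the repository root from a `POSE_ROOT` environment variable is an assumption made for this example only.

```python
import os
from pathlib import Path

# Relative paths taken from the layout above; image folders are checked as directories.
EXPECTED = [
    "annotations/person_keypoints_train2017.json",
    "annotations/person_keypoints_val2017.json",
    "person_detection_results/COCO_val2017_detections_AP_H_56_person.json",
    "person_detection_results/COCO_test-dev2017_detections_AP_H_609_person.json",
    "images/train2017",
    "images/val2017",
]


def missing_coco_entries(coco_root):
    """Return the expected entries that do not exist under `coco_root`."""
    root = Path(coco_root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


if __name__ == "__main__":
    # Assumption: POSE_ROOT is exported in the shell; fall back to the current directory.
    pose_root = os.environ.get("POSE_ROOT", ".")
    missing = missing_coco_entries(Path(pose_root) / "data" / "coco")
    print("All expected entries found." if not missing else "Missing:\n" + "\n".join(missing))
```

The check only looks at the top-level files and folders, not at individual images, so it catches misplaced downloads or broken links rather than incomplete image sets.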
python tools/test.py --cfg experiments/coco/transpose_r/TP_R_256x192_d256_h1024_enc4_mh8.yaml TEST.USE_GT_BBOX True
python tools/train.py --cfg experiments/coco/transpose_r/TP_R_256x192_d256_h1024_enc4_mh8.yaml
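The test command above ends with a `TEST.USE_GT_BBOX True` pair, which overrides a value from the YAML file on the command line. How the repository parses such pairs is not described here, so the following is only an illustrative sketch of the general idea with a plain nested dictionary; `apply_overrides` and the example default value are hypothetical.

```python
import ast


def apply_overrides(config: dict, overrides: list) -> dict:
    """Apply ["A.B", "value", ...] pairs to a nested dict, in place, and return it."""
    if len(overrides) % 2 != 0:
        raise ValueError("overrides must be KEY VALUE pairs")
    for key, raw in zip(overrides[0::2], overrides[1::2]):
        *parents, leaf = key.split(".")
        node = config
        for name in parents:
            node = node[name]                  # KeyError if the section is unknown
        if leaf not in node:
            raise KeyError(f"unknown option: {key}")
        try:
            value = ast.literal_eval(raw)      # "True" -> True, "4" -> 4
        except (ValueError, SyntaxError):
            value = raw                        # keep plain strings as-is
        node[leaf] = value
    return config


if __name__ == "__main__":
    # Hypothetical default; the real defaults live in the YAML and config code.
    config = {"TEST": {"USE_GT_BBOX": False}}
    print(apply_overrides(config, ["TEST.USE_GT_BBOX", "True"]))
```

Rejecting unknown keys, as this sketch does, makes a mistyped option fail loudly instead of silently leaving the default in place.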
Many thanks to these papers and their open-source code: HRNet, DETR, DarkPose.
This repository is released under the MIT LICENSE.
If you find this repository useful, please give it a star 🌟 or consider citing our work:
@inproceedings{yang2021transpose,
title={TransPose: Keypoint Localization via Transformer},
author={Yang, Sen and Quan, Zhibin and Nie, Mu and Yang, Wankou},
booktitle={IEEE/CVF International Conference on Computer Vision (ICCV)},
year={2021}
}