This is a PyTorch implementation of V2V-PoseNet (V2V-PoseNet: Voxel-to-Voxel Prediction Network for Accurate 3D Hand and Human Pose Estimation from a Single Depth Map), largely based on the author's Torch7 implementation.
This repository provides
- V2V-PoseNet core modules (model, voxelization, ...)
- An experiment demo on the MSRA hand pose dataset, which results in about 11 mm mean error.
- An additional Integral Pose loss (or PoseFix loss) implementation, which results in about 10 mm mean error on the same demo.
Requirements:
- PyTorch 0.4.1 or PyTorch 1.0
- Python 3.6
- NumPy
You may need to disable cuDNN for batch normalization, or use plain CUDA without cuDNN. With cuDNN batch normalization in single (float) precision, the model does not train well. My simple experiments show the following:
- cuDNN + float: does NOT work (the loss decreases much more slowly and ends higher)
- cuDNN + float, with cuDNN disabled for batchnorm only: works (the loss decreases faster and ends lower)
- cuDNN + double: works, but is slow
- CUDA only (float or double): works, but uses much more memory
A similar issue is reported in https://github.com/Microsoft/human-pose-estimation.pytorch. As suggested there, disable cuDNN for batch normalization by patching the PyTorch source:
PYTORCH=/path/to/pytorch
# for PyTorch v0.4.0
sed -i "1194s/torch\.backends\.cudnn\.enabled/False/g" ${PYTORCH}/torch/nn/functional.py
# for PyTorch v0.4.1
sed -i "1254s/torch\.backends\.cudnn\.enabled/False/g" ${PYTORCH}/torch/nn/functional.py
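The sed commands depend on an exact line number, which differs between PyTorch versions. The following is a minimal Python sketch of the same substitution that refuses to edit a line which does not contain the expected text. The helper name `patch_line` is hypothetical and not part of this repository or of PyTorch.

```python
from pathlib import Path

# Text that the sed commands above replace with the literal False.
TARGET = "torch.backends.cudnn.enabled"


def patch_line(path, line_number, old=TARGET, new="False"):
    """Replace `old` with `new` on one 1-indexed line.

    Returns True if the file was changed, False if the line did not
    contain `old` (in which case the file is left untouched).
    """
    file_path = Path(path)
    lines = file_path.read_text().splitlines(keepends=True)
    if not 1 <= line_number <= len(lines):
        raise IndexError(f"{path} has no line {line_number}")
    if old not in lines[line_number - 1]:
        # Wrong version or already patched: do nothing.
        return False
    lines[line_number - 1] = lines[line_number - 1].replace(old, new)
    file_path.write_text("".join(lines))
    return True
```

For PyTorch v0.4.1 this would be called as `patch_line("/path/to/pytorch/torch/nn/functional.py", 1254)`. A return value of `False` suggests that the line number does not match the installed version.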
- Clone this repo:
git clone https://github.com/dragonbook/V2V-PoseNet-pytorch.git
cd V2V-PoseNet-pytorch
- Download the MSRA hand dataset and extract it to the directory path/to/msra-hand.
- Download the estimated centers of the MSRA hand dataset, which are required by V2V-PoseNet and provided by the author's implementation, and extract them to the directory path/to/msra-hand-center. Note that this repository already contains a copy of the MSRA hand centers under ./datasets/msra_center.
- Configure data_dir=path/to/msra-hand and center_dir=path/to/msra-hand-center in ./experiments/msra-subject3/main.py, then run the following command to perform training and testing. It trains on the dataset for a few epochs and evaluates on the test set. The test result is saved as test_res.txt, and the fit result on the training data is saved as fit_res.txt.
PYTHONPATH=./ python ./experiments/msra-subject3/main.py
- Configure data_dir=path/to/msra-hand and center_dir=path/to/msra-hand-center in ./experiments/msra-subject3/gen_gt.py, then run it to generate the ground-truth labels train_s3_gt.txt and test_s3_gt.txt.
- Configure pred_file=path/to/test_res.txt and gt_file=path/to/test_s3_gt.txt in ./experiments/msra-subject3/show_acc.py, then run it to plot accuracy and error.
- The resulting plots show that this simple experiment reaches about 11 mm mean error.
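The mean error quoted above is the average Euclidean distance between predicted and ground-truth joints. The sketch below illustrates that metric only. It assumes that each row of both files holds the flattened x y z coordinates of all joints of one frame, separated by whitespace, which is not confirmed by this README. The function names are hypothetical and this is not the code in show_acc.py.

```python
import math


def parse_row(line):
    """Turn 'x1 y1 z1 x2 y2 z2 ...' into a list of (x, y, z) tuples."""
    values = [float(v) for v in line.split()]
    if len(values) % 3 != 0:
        raise ValueError("row length is not a multiple of 3")
    return [tuple(values[i:i + 3]) for i in range(0, len(values), 3)]


def mean_joint_error(pred_rows, gt_rows):
    """Average Euclidean distance over all joints of all frames."""
    if len(pred_rows) != len(gt_rows):
        raise ValueError("prediction and ground truth differ in frame count")
    total, count = 0.0, 0
    for pred_line, gt_line in zip(pred_rows, gt_rows):
        pred, gt = parse_row(pred_line), parse_row(gt_line)
        if len(pred) != len(gt):
            raise ValueError("frames differ in joint count")
        for p, g in zip(pred, gt):
            total += math.sqrt(sum((a - b) ** 2 for a, b in zip(p, g)))
            count += 1
    if count == 0:
        raise ValueError("no joints to evaluate")
    return total / count
```

The rows can be read with `open(path).read().splitlines()`. Because the distance is symmetric, swapping the two arguments does not change the result.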
Additional IntegralPose/PoseFix style loss implementation
V2V-PoseNet's loss is replaced with PoseFix's loss (one-hot heatmap loss + L1 coordinate loss). This variant is implemented independently under the ./integral-pose directory. As before, configure data_dir and center_dir in ./integral-pose/main.py and start training. The result shows about 10 mm mean error.
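The coordinate part of such a loss is usually built on a soft-argmax: a softmax turns the per-joint voxel scores into a probability volume, and the predicted coordinate is the expectation of the voxel index under it. The following pure-Python sketch shows that idea together with an L1 coordinate loss. It is not the code under ./integral-pose, it omits the one-hot heatmap term, and the function names as well as the mean reduction are assumptions.

```python
import math


def soft_argmax_3d(scores):
    """Expected (x, y, z) voxel index under a softmax over a 3D volume.

    `scores[z][y][x]` is a nested list of raw scores for one joint.
    """
    flat = [v for plane in scores for row in plane for v in row]
    peak = max(flat)  # subtract the maximum for numerical stability
    total = sum(math.exp(v - peak) for v in flat)
    ex = ey = ez = 0.0
    for z, plane in enumerate(scores):
        for y, row in enumerate(plane):
            for x, v in enumerate(row):
                p = math.exp(v - peak) / total  # softmax probability
                ex += p * x
                ey += p * y
                ez += p * z
    return ex, ey, ez


def l1_coord_loss(pred, target):
    """Mean absolute difference between two coordinate tuples."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)
```

Unlike a hard argmax, the expectation is differentiable with respect to the scores, which is what allows a coordinate loss to be trained end to end.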
V2V-PoseNet: Voxel-to-Voxel Prediction Network for Accurate 3D Hand and Human Pose Estimation from a Single Depth Map
This is our project repository for the paper, V2V-PoseNet: Voxel-to-Voxel Prediction Network for Accurate 3D Hand and Human Pose Estimation from a Single Depth Map (CVPR 2018).
We, Team SNU CVLAB (Gyeongsik Moon, Juyong Chang, and Kyoung Mu Lee of the Computer Vision Lab, Seoul National University), are the winners of the HANDS2017 Challenge on frame-based 3D hand pose estimation.
Please refer to our paper for details.
If you find our work useful in your research or publication, please cite our work:
[1] Moon, Gyeongsik, Ju Yong Chang, and Kyoung Mu Lee. "V2V-PoseNet: Voxel-to-Voxel Prediction Network for Accurate 3D Hand and Human Pose Estimation from a Single Depth Map." CVPR 2018. [arXiv]
@InProceedings{Moon_2018_CVPR_V2V-PoseNet,
author = {Moon, Gyeongsik and Chang, Juyong and Lee, Kyoung Mu},
title = {V2V-PoseNet: Voxel-to-Voxel Prediction Network for Accurate 3D Hand and Human Pose Estimation from a Single Depth Map},
booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2018}
}
In this repository, we provide
- Our model architecture description (V2V-PoseNet)
- HANDS2017 frame-based 3D hand pose estimation Challenge Results
- Comparison with the previous state-of-the-art methods
- Training code
- Datasets we used (ICVL, NYU, MSRA, ITOP)
- Trained models and estimated results
- 3D hand and human pose estimation examples
Our code is tested under Ubuntu 14.04 and 16.04 environments with Titan X GPUs (12GB VRAM).
Clone this repository into any place you want. You may follow the example below.
makeReposit=/the/directory/as/you/wish
mkdir -p "$makeReposit"; cd "$makeReposit"
git clone https://github.com/mks0601/V2V-PoseNet_RELEASE.git
- The src folder contains Lua script files for the data loader, trainer, tester, and other utilities.
- The data folder contains the data converter, which converts image files to binary files.
To train our model, please run the following command in the src directory:
th rum_me.lua
- There are some optional configurations you can adjust in the config.lua.
- You have to convert the .png images of the ICVL and NYU datasets to .bin files by running the code from the data folder.
- The directory where you have to put the dataset files and computed centers of each frame is defined in src/data/dataset_name/data.lua
- Visualization code is finally uploaded! You have to prepare 'result_pixel.txt' for each dataset. Each row of the result file must contain the pixel coordinates x, y and the depth of all joints (i.e., x1 y1 z1 x2 y2 z2 ...). Then run the pixel2world script, followed by draw_DB.m.
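The row layout described for 'result_pixel.txt' (x1 y1 z1 x2 y2 z2 ...) can be produced with a small writer such as the sketch below. The space separator, the four-decimal precision, and the function names are assumptions, so check them against what the pixel2world script expects before relying on this.

```python
def format_row(joints):
    """Flatten [(x, y, depth), ...] into 'x1 y1 z1 x2 y2 z2 ...'."""
    values = []
    for joint in joints:
        if len(joint) != 3:
            raise ValueError("each joint needs exactly x, y and depth")
        values.extend(joint)
    # Assumed formatting: space-separated, four decimals.
    return " ".join(f"{v:.4f}" for v in values)


def write_result_file(path, frames):
    """Write one row per frame; every frame must have the same joint count."""
    frames = list(frames)
    if len({len(joints) for joints in frames}) > 1:
        raise ValueError("frames have differing joint counts")
    with open(path, "w") as handle:
        for joints in frames:
            handle.write(format_row(joints) + "\n")
```

Here `frames` is a list with one entry per frame, each entry being a list of (x, y, depth) tuples in the joint order of the dataset.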
We trained and tested our model on four 3D hand pose estimation datasets and one 3D human pose estimation dataset.
- ICVL Hand Posture Dataset [link] [paper]
- NYU Hand Pose Dataset [link] [paper]
- MSRA Hand Pose Dataset [link] [paper]
- HANDS2017 Challenge Dataset [link] [paper] [challenge benchmark paper]
- ITOP Human Pose Dataset [link] [paper]
Here we provide the precomputed centers, estimated 3D coordinates and pre-trained models.
The precomputed centers are obtained by training the hand center estimation network from DeepPrior++. Each line holds the 3D world coordinate of the center for one frame. For the ICVL, NYU and MSRA datasets, a frame is considered invalid if its depth map does not exist or does not contain a hand. For the ITOP dataset, a frame is considered invalid if its 'valid' variable is false. All test images are considered valid.
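When reading a center file, the frame index has to stay aligned with the dataset even when some frames are invalid. The text above does not say how invalid frames are marked in the file, so the sketch below makes an assumption: any line that does not parse as exactly three numbers is treated as invalid. The function names are hypothetical.

```python
def parse_center(line):
    """Return (x, y, z) as floats, or None if the line is not three numbers."""
    parts = line.split()
    if len(parts) != 3:
        return None
    try:
        return tuple(float(p) for p in parts)
    except ValueError:
        # Assumption: a non-numeric entry marks an invalid frame.
        return None


def load_centers(lines):
    """Parse all lines, keeping one entry per frame.

    Returns the list of centers (None for invalid frames) and the
    indices of the valid frames.
    """
    centers = [parse_center(line) for line in lines]
    valid_indices = [i for i, c in enumerate(centers) if c is not None]
    return centers, valid_indices
```

Keeping `None` placeholders instead of dropping invalid lines means `centers[i]` always refers to frame `i`, and `valid_indices` can be used to select the frames for training.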
The 3D coordinates estimated on the ICVL, NYU and MSRA datasets are pixel coordinates, and the 3D coordinates estimated on the ITOP dataset are world coordinates. The estimated results come from an ensembled model. You can produce results from a single model by downloading the pre-trained model and testing it.
- ICVL Hand Posture Dataset [center_trainset] [center_testset] [estimation] [models]
- NYU Hand Pose Dataset [center_trainset] [center_testset] [estimation] [models]
- MSRA Hand Pose Dataset [center] [estimation] [models]
- ITOP Human Pose Dataset (front-view) [center_trainset] [center_testset] [estimation] [models]
- ITOP Human Pose Dataset (top-view) [center_trainset] [center_testset] [estimation] [models]
We used awesome-hand-pose-estimation to evaluate the accuracy of V2V-PoseNet on the ICVL, NYU and MSRA datasets.