This repository contains scripts to build the YouTube Gesture Dataset. You can download YouTube videos and transcripts, divide the videos into scenes, and extract human poses. Please see the project page and paper for details.
If you have any questions or comments, please feel free to contact me by email (youngwoo@etri.re.kr).
The scripts are tested on Ubuntu 16.04 LTS and Python 3.5.2.
- OpenPose (v1.4) for pose estimation
- PySceneDetect (v0.5) for video scene segmentation
- OpenCV (v3.4) for reading videos
  - We use FFmpeg. Use the latest pip version of opencv-python or build OpenCV with FFmpeg.
- Gentle (Jan. 2019 version) for transcript alignment
  - Download the source code from the Gentle GitHub repository and run ./install.sh. You can then import the gentle library by specifying the path to the library. See run_gentle.py.
  - Add the option -vn to resample.py in gentle as follows:

        cmd = [
            FFMPEG,
            '-loglevel', 'panic',
            '-y',
        ] + offset + [
            '-i', infile,
        ] + duration + [
            '-vn',  # ADDED (it blocks video streams, see the ffmpeg option)
            '-ac', '1', '-ar', '8000',
            '-acodec', 'pcm_s16le',
            outfile
        ]
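
Importing gentle "by specifying the path to the library" can be sketched as follows. This is a minimal illustration, not the code in run_gentle.py: the helper name `add_gentle_to_path` is hypothetical, and the directory you pass is wherever you cloned Gentle and ran install.sh.

```python
import os
import sys


def add_gentle_to_path(gentle_dir):
    """Put a local Gentle checkout on sys.path so `import gentle` can find it.

    gentle_dir is a hypothetical location of the cloned Gentle repository.
    Returns the absolute path that was registered.
    """
    gentle_dir = os.path.abspath(os.path.expanduser(gentle_dir))
    if gentle_dir not in sys.path:
        # Insert at the front so this checkout is found first.
        sys.path.insert(0, gentle_dir)
    return gentle_dir
```

After calling `add_gentle_to_path(...)` with your own checkout directory, a plain `import gentle` should resolve to that checkout, provided the installation step succeeded.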
1. Set config
   - Update the paths and the YouTube developer key in config.py (the directories will be created if they do not exist).
   - Update the target channel ID. The scripts are tested for the TED and LaughFactory channels.
2. Execute download_video.py
   - Download YouTube videos, metadata, and subtitles (./videos/*.mp4, *.json, *.vtt).
3. Execute run_openpose.py
   - Run OpenPose to extract body, hand, and face skeletons for all videos (./skeleton/*.pickle).
4. Execute run_scenedetect.py
   - Run PySceneDetect to divide videos into scene clips (./clip/*.csv).
5. Execute run_gentle.py
   - Run Gentle for word-level alignments (./videos/*_align_results.json).
   - You should skip this step if you use auto-generated subtitles. This step is necessary for the TED Talks channel.
6. Execute run_clip_filtering.py
   - Remove inappropriate clips.
   - Save clips with body skeletons (./clip/*.json).
7. (optional) Execute review_filtered_clips.py
   - Review filtering results.
8. Execute make_ted_dataset.py
   - Do some post-processing and split into train, validation, and test sets (./script/*.pickle).
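
The execution order above can be captured in a small driver script. The sketch below is not part of the repository; it assumes that each script runs without command-line arguments and reads its settings from config.py, and the names `PIPELINE`, `build_pipeline`, and `run_pipeline` are hypothetical.

```python
import subprocess
import sys

# Scripts in the order listed above. The optional review step is added
# only on request.
PIPELINE = [
    "download_video.py",
    "run_openpose.py",
    "run_scenedetect.py",
    "run_gentle.py",
    "run_clip_filtering.py",
    "make_ted_dataset.py",
]


def build_pipeline(use_auto_subtitles=False, review=False):
    """Return the script names to run, in order."""
    steps = list(PIPELINE)
    if use_auto_subtitles:
        # Gentle alignment is skipped for auto-generated subtitles.
        steps.remove("run_gentle.py")
    if review:
        idx = steps.index("run_clip_filtering.py") + 1
        steps.insert(idx, "review_filtered_clips.py")
    return steps


def run_pipeline(steps, script_dir="."):
    """Run each script with the current interpreter; stop at the first failure."""
    for script in steps:
        print("Running", script)
        # ASSUMPTION: the scripts take no arguments and are run from script_dir.
        subprocess.run([sys.executable, script], cwd=script_dir, check=True)
```

Because `check=True` raises on a non-zero exit code, a failed step stops the run before later steps consume incomplete output.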
Running the whole data collection pipeline is complex and takes several days, so we provide a pre-built dataset for the videos in the TED channel.
| Statistic | Value |
|---|---|
| Number of videos | 1,766 |
| Average length of videos | 12.7 min |
| Shots of interest | 35,685 (20.2 per video on average) |
| Ratio of shots of interest | 25% (35,685 / 144,302) |
| Total length of shots of interest | 106.1 h |
- [ted_raw_poses.zip] [z01] [z02] [z03] [z04] [z05] (split zip files, Google Drive or OneDrive links, total 80.9 GB)
  The result of Step 3. It contains the extracted human poses for all frames.
- [ted_shots_of_interest.zip, 13.3 GB]
  The result of Step 6. It contains shot segmentation results ({video_id}.csv files) and shots of interest ({video_id}.json files). 'clip_info' elements in JSON files have start/end frame numbers and a boolean value indicating shots of interest. The JSON files contain the extracted human poses for the shots of interest, so you don't need to download ted_raw_poses.zip unless the human poses for all frames are necessary.
- [ted_gesture_dataset.zip, 1.1 GB]
  The result of Step 8. Train/validation/test sets of speech-motion pairs.
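
As an illustration of how the 'clip_info' elements might be used, the sketch below selects the shots of interest from one {video_id}.json file. The exact JSON layout is an assumption here: the code supposes that each file holds a list of clip objects and that 'clip_info' is a sequence of the form [start_frame, end_frame, is_shot_of_interest]. Inspect a real file and adjust the indexing if it differs. The function names are hypothetical.

```python
import json


def shots_of_interest(clips):
    """Return (start_frame, end_frame) for clips flagged as shots of interest.

    ASSUMPTION: `clips` is a list of dicts and each 'clip_info' value is a
    sequence [start_frame, end_frame, is_shot_of_interest].
    """
    selected = []
    for clip in clips:
        start, end, keep = clip["clip_info"][:3]
        if keep:
            selected.append((start, end))
    return selected


def load_shots_of_interest(json_path):
    """Read one {video_id}.json file and return its shots of interest."""
    with open(json_path) as f:
        return shots_of_interest(json.load(f))


# Made-up data in the assumed layout, for illustration only.
example = [
    {"clip_info": [0, 120, False]},
    {"clip_info": [121, 480, True]},
]
print(shots_of_interest(example))  # [(121, 480)]
```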
We do not provide the videos and transcripts of TED talks due to copyright issues. You should download the actual videos and transcripts yourself as follows:
- Download the [video_ids.txt] file, which contains the video IDs, and copy it into the ./videos directory.
- Run download_video.py. It downloads the videos and transcripts listed in video_ids.txt.

Some videos may not match the extracted poses that we provide if the videos have been re-uploaded. Please compare the numbers of frames, just in case.
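
One way to compare frame counts is sketched below. It is an illustration rather than a script from this repository: `count_frames` and `frames_match` are hypothetical helpers, and the number of pose frames must be taken from the provided pose data for the same video, whose storage layout is not covered here. Frames are counted by decoding, because the frame count reported in container metadata can be an estimate.

```python
def count_frames(video_path):
    """Count frames by decoding the video with OpenCV.

    cv2 is imported here so the comparison helper below works without it.
    """
    import cv2  # provided by opencv-python

    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        raise IOError("cannot open video: " + video_path)
    n = 0
    while True:
        ok, _ = cap.read()
        if not ok:
            break
        n += 1
    cap.release()
    return n


def frames_match(n_video_frames, n_pose_frames, tolerance=0):
    """True if the two counts differ by at most `tolerance` frames."""
    return abs(n_video_frames - n_pose_frames) <= tolerance
```

A call such as `frames_match(count_frames(path), n_pose_frames)` then flags videos whose downloaded version differs from the one used for pose extraction.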
If our code or dataset is helpful, please kindly cite the following paper:
@INPROCEEDINGS{
yoonICRA19,
title={Robots Learn Social Skills: End-to-End Learning of Co-Speech Gesture Generation for Humanoid Robots},
author={Yoon, Youngwoo and Ko, Woo-Ri and Jang, Minsu and Lee, Jaeyeon and Kim, Jaehong and Lee, Geehyuk},
booktitle={Proc. of The International Conference on Robotics and Automation (ICRA)},
year={2019}
}
Speech Gesture Generation from the Trimodal Context of Text, Audio, and Speaker Identity (SIGGRAPH Asia 2020), https://github.com/ai4r/Gesture-Generation-from-Trimodal-Context
- This work was supported by the ICT R&D program of MSIP/IITP. [2017-0-00162, Development of Human-care Robot Technology for Aging Society]
- Thanks to Eun-Sol Cho and Jongwon Kim for contributions during their internships at ETRI.