Generate an image of a human face based on that person's speech
The general aim of this project is to recreate and improve the Speech-to-Face pipeline presented in the paper Speech2Face: Learning the Face Behind a Voice [1].
The whole implementation is based on the PyTorch framework.
Link to Google Drive with trained weights of the face_decoder and ast models - drive
Link to the recently published paper on this project - Face From Voice: PyTorch Adaptation of Speech2Face Framework
In this project you will find implementations of three models:
- Voice Encoder - in this project we used two different models:
  - VE_conv - architecture based on the Speech2Face: Learning the Face Behind a Voice paper [1] (we trained this model from scratch)
  - Audio Spectrogram Transformer (AST) - we used a pre-trained model from Hugging Face and fine-tuned it

  When using the Speech-to-Face pipeline you can choose which of these models is used.
- Face Encoder - architecture based on the Deep Face Recognition paper [2]. We did not implement or train this model ourselves; we used existing trained models from:
  - VGG-face model from github.com/serengil/deepface (in our project it is called VGGFace_serengil) [4]
  - VGG-face (16) model from github.com/rcmalli/keras-vggface (in our project it is called VGGFace16_rcmalli) [5]

  When using the Speech-to-Face pipeline or the Face-to-Face pipeline you can choose which of these models is used.
- Face Decoder - architecture based on the Synthesizing Normalized Faces from Facial Identity Features paper [3]. We trained this model from scratch.
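The model choices described above could be captured in a small configuration object. This is only a sketch: `PipelineConfig` and its field names are hypothetical and are not taken from the project's code. Only the model names (VE_conv, AST, VGGFace_serengil, VGGFace16_rcmalli) come from this README, and the defaults mirror the combination reported here as giving the best results.

```python
from dataclasses import dataclass

# Model names taken from this README; everything else is illustrative.
VOICE_ENCODERS = ("VE_conv", "AST")
FACE_ENCODERS = ("VGGFace_serengil", "VGGFace16_rcmalli")


@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical container for the model choices of one pipeline run."""

    voice_encoder: str = "AST"
    face_encoder: str = "VGGFace_serengil"

    def __post_init__(self) -> None:
        # Reject names that do not correspond to a model listed above.
        if self.voice_encoder not in VOICE_ENCODERS:
            raise ValueError(f"unknown voice encoder: {self.voice_encoder!r}")
        if self.face_encoder not in FACE_ENCODERS:
            raise ValueError(f"unknown face encoder: {self.face_encoder!r}")


# Example: select the convolutional voice encoder, keep the default face encoder.
config = PipelineConfig(voice_encoder="VE_conv")
print(config)
```

Validating the names up front means a typo in a model name fails immediately instead of surfacing later while weights are being loaded.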
To read more about the project, go to the page you are interested in:
- Data preprocessing
  - Converting audio files of speech to spectrograms (separate for VE_conv and AST)
  - Converting images of faces to 4096-D feature vectors (face embeddings)
  - Generating face landmarks for images of faces
  - Resizing images
  - Normalizing directory names
- Dataset classes
  - Dataset structure for VoiceEncoder (the same for VE_conv and AST)
  - Dataset structure for FaceDecoder
- Models
  - VoiceEncoder
  - FaceEncoder
  - FaceDecoder
- Converting FaceEncoder model from TensorFlow to PyTorch
  - Convert FaceEncoder implementation from TensorFlow to PyTorch (model and trained weights from github.com/serengil/deepface repository)
  - Convert FaceEncoder implementation from TensorFlow to PyTorch (model and trained weights from github.com/rcmalli/keras-vggface repository)
- Train VoiceEncoder model and FaceDecoder model
  - Script for training VoiceEncoder model (separate for VE_conv and AST)
  - Script for training FaceDecoder model
- Inference - usage of trained models
  - Speech-to-Face pipeline: use trained models (VE_conv/AST and FaceDecoder) to generate an image of a face based on a person's speech
  - Face-to-Face pipeline: use trained models (FaceEncoder and FaceDecoder) to generate an image of a face based on an image of a face
In the project we used three different datasets:
- VoxCeleb1 - for human speech audio [6]
- VoxCeleb2 - for human speech audio [7]
- HQ-VoxCeleb - for normalized facial images [8]
The HQ-VoxCeleb dataset was used to train the FaceDecoder model. To train the VoiceEncoder model, we filtered the VoxCeleb1 and VoxCeleb2 datasets to keep only the audio files of identities present in HQ-VoxCeleb (because HQ-VoxCeleb does not contain normalized face images for every identity present in VoxCeleb1 or VoxCeleb2).
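The identity filtering described above can be sketched as follows. This is a minimal illustration and not the project's preprocessing script: it assumes that each audio path starts with a folder named after the identity and that the same identity names are used for the face images. The function names, identity names and file names are all made up.

```python
from pathlib import PurePosixPath
from typing import Iterable, List, Set


def identity_of(path: str) -> str:
    """Return the first path component, assumed to be the identity folder."""
    return PurePosixPath(path).parts[0]


def filter_by_identity(audio_paths: Iterable[str], face_identities: Set[str]) -> List[str]:
    """Keep only audio files whose identity also has normalized face images."""
    return [p for p in audio_paths if identity_of(p) in face_identities]


# Toy data: hypothetical identity names and file names.
hq_voxceleb_identities = {"person_a", "person_c"}
voxceleb_audio = [
    "person_a/clip_1/00001.wav",
    "person_b/clip_1/00001.wav",  # no normalized face image, so it is dropped
    "person_c/clip_2/00003.wav",
]

kept = filter_by_identity(voxceleb_audio, hq_voxceleb_identities)
print(kept)
```

In practice the identity names of the audio and image datasets may need to be normalized to a common form before such a comparison is meaningful.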
We achieved the best results using the fine-tuned AST as the VoiceEncoder model. We used VGGFace_serengil as the FaceEncoder when training both the VoiceEncoder and the FaceDecoder models. The results obtained with our VE_conv model, trained from scratch, were much worse. The image below summarizes our work. The left column shows the original image of the person from the HQ-VoxCeleb dataset. The middle column shows the reconstruction of the face from the Face-to-Face pipeline (i.e. convert the image to a face embedding and reconstruct the image from it; voice is not used in this pipeline). Finally, the right column shows the results from the Speech-to-Face pipeline (i.e. convert speech to a spectrogram, calculate a face embedding from that spectrogram, and reconstruct the face).
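The two pipelines compared above differ only in how the face embedding is obtained; both hand it to the same decoder. The sketch below shows that structure with plain functions. It is not the project's inference code: the function names and the toy stand-in stages are hypothetical and exist only so that the example runs without any trained model.

```python
def speech_to_face(audio, to_spectrogram, voice_encoder, face_decoder):
    """Speech -> spectrogram -> face embedding -> reconstructed face."""
    spectrogram = to_spectrogram(audio)
    embedding = voice_encoder(spectrogram)
    return face_decoder(embedding)


def face_to_face(image, face_encoder, face_decoder):
    """Face image -> face embedding -> reconstructed face (voice is not used)."""
    embedding = face_encoder(image)
    return face_decoder(embedding)


# Toy stand-ins for the real stages (assumptions, not real models).
def toy_spectrogram(audio):
    return [abs(sample) for sample in audio]


def toy_voice_encoder(spectrogram):
    return [sum(spectrogram)]


def toy_face_encoder(image):
    return [float(len(image))]


def toy_face_decoder(embedding):
    return {"decoded_from": embedding}


from_speech = speech_to_face([-1.0, 2.0], toy_spectrogram, toy_voice_encoder, toy_face_decoder)
from_face = face_to_face("abc", toy_face_encoder, toy_face_decoder)
print(from_speech, from_face)
```

Because both pipelines share the decoder, the Face-to-Face output indicates what the decoder can reconstruct from an embedding computed directly from the image, which gives context for judging the Speech-to-Face output.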
[1] Oh, Tae-Hyun, et al. "Speech2face: Learning the face behind a voice." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019.
[2] Parkhi, Omkar, Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." BMVC 2015-Proceedings of the British Machine Vision Conference 2015. British Machine Vision Association, 2015.
[3] Cole, Forrester, et al. "Synthesizing normalized faces from facial identity features." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.
[4] github.com/serengil/deepface
[5] github.com/rcmalli/keras-vggface
[6] robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html
[7] robots.ox.ac.uk/~vgg/data/voxceleb/vox2.html
[8] Bai, Yeqi, et al. "Speech Fusion to Face: Bridging the Gap Between Human's Vocal Characteristics and Facial Imaging." Proceedings of the 30th ACM International Conference on Multimedia. 2022.