State-of-the-art Neural Machine Translation systems have become increasingly competent at automatically translating natural languages. These systems have not only become formidable in plain text-to-text translation tasks but have also made a considerable leap in speech-to-speech translation. With the development of such systems, we are getting closer and closer to overcoming language barriers. However, there is still one medium these systems need to tackle: video. As far as videos are concerned, we are still stuck with transcripts, subtitles, and manual dubs. The translation systems that do exist can only handle audiovisual content at the speech-to-speech level, which leads to two flaws: the translated voice sounds significantly different from the original speaker, and the generated audio and the lip movements are unsynchronized.
In their paper “Towards Automatic Face-to-Face Translation”, Prajwal K R et al. tackle both of these issues. They propose a new model, LipGAN, that generates realistic talking-face videos across languages. To personalize the speaker’s voice, they make use of the CycleGAN architecture.
Pipeline for Face-to-Face Translation
In the very first phase of the pipeline, the DeepSpeech 2 Automatic Speech Recognition (ASR) model is used to transcribe the audio. To translate the text from language A to language B, the Transformer-Base model available in fairseq-py is re-implemented and trained as a multiway model to maximize learning across languages. The trained model has parameters that are shared across seven languages, including Hindi, English, Telugu, Malayalam, Tamil, and Urdu.
DeepVoice 3 is employed for the text-to-speech (TTS) conversion, but this model generates the audio in only one voice. A CycleGAN-based model trained on roughly 10 minutes of the target speaker’s audio is then used to personalize the generated speech so that it matches the target speaker’s voice.
This personalized audio is passed to the lip-sync GAN, LipGAN, along with the frames from the original video.
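Putting the stages together, here is a minimal Python sketch of the pipeline described above. The function takes each stage as a callable, since the actual model wrappers are not part of this post; all names below are placeholders, not the authors’ API.

def face_to_face_translate(video_path, load_video, asr, translate, tts, personalize, lip_sync):
    """Chain the pipeline stages; each callable stands in for one model
    (DeepSpeech 2 ASR, the multiway Transformer NMT, DeepVoice 3 TTS,
    the CycleGAN voice-transfer model, and LipGAN respectively)."""
    frames, source_audio = load_video(video_path)   # split the video into frames + audio track
    text = asr(source_audio)                        # transcribe speech in language A
    translated_text = translate(text)               # translate the transcript to language B
    generic_speech = tts(translated_text)           # synthesize speech in a single generic voice
    personal_speech = personalize(generic_speech)   # convert it to the target speaker's voice
    return lip_sync(frames, personal_speech)        # generate the lip-synced output video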
LipGAN
The LipGAN generator network contains three branches (a minimal code sketch follows the list):
- Face Encoder
The encoder consists of residual blocks with intermediate down-sampling layers. Instead of passing a face image of a random pose and its corresponding audio segment to the generator, the LipGAN model inputs the target face with the bottom half masked to act as a pose prior. This allows the generated face crops to be seamlessly pasted back into the original video without further post-processing.
- Audio Encoder
For the audio encoder, LipGAN uses a standard CNN that takes a Mel-frequency cepstral coefficient (MFCC) heatmap as input.
- Face Decoder
This branch takes the concatenated audio and face embeddings and creates a lip-synchronized face by inpainting the masked region of the input image with an appropriate mouth shape. It contains a series of residual blocks with a few intermediate deconvolutional layers that upsample the feature maps. The output layer of the decoder is a sigmoid-activated 1×1 convolutional layer with 3 filters.
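Below is a minimal PyTorch sketch of this three-branch generator. Channel counts, block depths, input sizes, and the exact composition of the face-encoder input are illustrative assumptions, not the configuration released by the authors.

import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Plain residual block: two 3x3 convolutions with a skip connection."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))
    def forward(self, x):
        return torch.relu(x + self.body(x))

def down(cin, cout):
    # residual block followed by a stride-2 down-sampling convolution
    return nn.Sequential(ResBlock(cin),
                         nn.Conv2d(cin, cout, 3, stride=2, padding=1), nn.ReLU(inplace=True))

def up(cin, cout):
    # deconvolution that doubles the spatial resolution, then a residual block
    return nn.Sequential(nn.ConvTranspose2d(cin, cout, 4, stride=2, padding=1),
                         nn.ReLU(inplace=True), ResBlock(cout))

class LipGANGeneratorSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # Face encoder: residual blocks with intermediate down-sampling.
        # 6 input channels (masked target face + a reference face) is an assumption.
        self.face_enc = nn.Sequential(
            nn.Conv2d(6, 32, 7, padding=3), nn.ReLU(inplace=True),
            down(32, 64), down(64, 128), down(128, 256), down(256, 512))
        # Audio encoder: a standard CNN over the MFCC heatmap of the audio window.
        self.audio_enc = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, 512))
        # Face decoder: residual blocks with deconvolutional up-sampling,
        # ending in a sigmoid-activated 1x1 convolution with 3 filters.
        self.decoder = nn.Sequential(
            up(1024, 256), up(256, 128), up(128, 64), up(64, 32),
            nn.Conv2d(32, 3, 1), nn.Sigmoid())

    def forward(self, masked_face, mfcc):
        f = self.face_enc(masked_face)                    # e.g. (B, 512, 6, 6) for 96x96 input
        a = self.audio_enc(mfcc)                          # (B, 512) audio embedding
        a = a[:, :, None, None].expand(-1, -1, f.size(2), f.size(3))
        return self.decoder(torch.cat([f, a], dim=1))     # lip-synced face crop, same size as input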
The generator is trained to minimize the L1 reconstruction loss between the generated frames and the ground-truth frames.
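Written out (the notation here is ours, not copied verbatim from the paper), with G the generator, a_i the audio segment, f_i the masked input face, and g_i the ground-truth face:

L_{recon} = \frac{1}{N}\sum_{i=1}^{N} \left\lVert G(a_i, f_i) - g_i \right\rVert_1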
The discriminator network contains the same audio and face encoder as the generator network. It learns to detect synchronization by minimizing the following contrastive loss:
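The equation from the paper is not reproduced here; a margin-based contrastive loss of the standard form (symbols defined here, not taken from the paper) looks like:

L_c = \frac{1}{N}\sum_{i=1}^{N} \left[ y_i\, d_i^2 + (1 - y_i)\, \max(0,\; m - d_i)^2 \right]

where d_i is the L2 distance between the face and audio embeddings of pair i, y_i is 1 for an in-sync pair and 0 otherwise, and m is the margin.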
Wav2Lip
Since the “Towards Automatic Face-to-Face Translation” paper, the authors have come up with a better lip-sync model, Wav2Lip. The most significant difference between the two is the discriminator: Wav2Lip uses a pre-trained lip-sync expert combined with a visual quality discriminator.
The expert lip-sync discriminator is a modified, deeper SyncNet with residual connections trained on color images. It computes the dot product between the ReLU-activated video and speech embeddings. This yields the probability of the input audio-video pair being in sync:
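This is essentially a normalized dot product (cosine similarity); written roughly (with v the video embedding, s the speech embedding, and \epsilon a small constant for numerical stability; exact handling of \epsilon may differ from the paper):

P_{sync} = \frac{v \cdot s}{\max\left(\lVert v \rVert_2 \cdot \lVert s \rVert_2,\; \epsilon\right)}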
Along with the L1 reconstruction loss, the Wav2Lip generator is also trained to minimize the expert sync-loss:
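which, in the notation above, amounts to a cross-entropy over the expert’s sync probabilities for the generated frames (our paraphrase of the paper’s formulation):

E_{sync} = \frac{1}{N}\sum_{i=1}^{N} -\log\left(P^{\,i}_{sync}\right)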
The visual quality discriminator consists of a stack of convolutional blocks. Each block consists of a convolutional layer followed by a leaky ReLU activation. It is trained to minimize the following objective function:
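The paper’s equation is not reproduced here; in its standard minimization form, a GAN discriminator objective over real frames x and generated frames x̂ looks like:

L_{disc} = \mathbb{E}_{x}\left[-\log D(x)\right] + \mathbb{E}_{\hat{x}}\left[-\log\left(1 - D(\hat{x})\right)\right]

with the corresponding adversarial term for the generator being L_{gen} = \mathbb{E}_{\hat{x}}\left[\log\left(1 - D(\hat{x})\right)\right].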
Combining everything, the Wav2Lip generator minimizes a weighted sum of the reconstruction (L1) loss, the synchronization loss (expert sync-loss), and the adversarial loss L_gen.
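As a sketch, with s_w and s_g the (small) weights on the sync and adversarial terms respectively (the exact values come from the paper’s released configuration and are not reproduced here):

L_{total} = (1 - s_w - s_g)\, L_{recon} + s_w\, E_{sync} + s_g\, L_{gen}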
Speech to Lip Generation using Wav2Lip
- Install ffmpeg
sudo apt-get install ffmpeg
- Create a new environment using either conda or venv
conda create --name myenv
or
python3 -m venv myenv
- Clone the Wav2Lip repository
git clone https://github.com/Rudrabha/Wav2Lip.git
- Move inside the Wav2Lip directory and install the necessary modules from the requirements.txt file
cd Wav2Lip
pip install -r requirements.txt
- Download the pre-trained GAN model (wav2lip_gan.pth) from here and move it into the “Wav2Lip/checkpoints/” folder
- Download the face detection model, put it in the “face_detection/detection/sfd/” folder, and rename it to “s3fd.pth”
wget "https://www.adrianbulat.com/downloads/python-fan/s3fd-619a316812.pth" -O "Wav2Lip/face_detection/detection/sfd/s3fd.pth"
- For the speech-to-lip generation to work, you need a video/image of the target face and a video/audio file containing the raw audio.
python inference.py --checkpoint_path checkpoints/wav2lip_gan.pth --face "input.jpg" --audio "input.mp4"
By default, the output video file, named “result_voice.mp4”, will be stored in the results folder; you can change this using the --outfile argument.
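For example (the output file name below is just an illustration):
python inference.py --checkpoint_path checkpoints/wav2lip_gan.pth --face "input.jpg" --audio "input.mp4" --outfile "translated.mp4"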