
How to Clone Voice and Lip-Sync a Video Like a Pro Using Open-source Tools

Introduction

AI voice cloning has taken social media by storm and opened a world of creative possibilities. You have probably seen memes or AI voice-overs of famous personalities on social media. Have you wondered how they are made? Many platforms, such as Eleven Labs, offer paid APIs, but can we do it for free using open-source software? The short answer is yes. The open-source ecosystem has TTS models and lip-syncing tools that make voice synthesis possible. So, in this article, we will explore open-source tools and models for voice cloning and lip-syncing.

AI voice cloning and lip syncing using open-source tools

Learning Objectives

  • Explore open-source tools for AI voice-cloning and lip-syncing.
  • Use FFmpeg and Whisper to transcribe videos.
  • Use Coqui-ai’s xTTS model to clone a voice.
  • Use Wav2Lip to lip-sync videos.
  • Explore real-world use cases of this technology.

This article was published as a part of the Data Science Blogathon.

Open-Source Stack

As mentioned, our tech stack consists of OpenAI’s Whisper, FFmpeg, Coqui-ai’s xTTS model, and Wav2Lip. Before delving into the code, let’s briefly discuss these tools. Thanks also go to the authors of these projects.

Whisper: Whisper is OpenAI’s ASR (Automatic Speech Recognition) model. It is an encoder-decoder transformer trained on 680k hours of diverse audio data and corresponding transcripts, which makes it very capable at multilingual transcription.

The encoder receives the log-mel spectrogram of 30-second chunks of audio. Each encoder block uses self-attention to relate different parts of the audio signal. The decoder then receives the encoder’s hidden states along with learned positional encodings, and uses self-attention and cross-attention to predict the next token. At the end of the process, it outputs a sequence of tokens representing the recognized text. For more on Whisper, refer to the official repository.
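
To make the 30-second framing concrete, here is a small sketch of the arithmetic. The constants (a 16 kHz sample rate and a 160-sample hop between spectrogram frames) are the values we assume the released Whisper models use, and `chunk_count` is a hypothetical helper, not part of the Whisper API.

```python
# Sketch of Whisper-style 30-second framing.
# The constants below are assumptions about the released models.
SAMPLE_RATE = 16_000    # samples per second after resampling
CHUNK_SECONDS = 30      # each encoder input covers 30 s of audio
HOP_LENGTH = 160        # samples between spectrogram frames (10 ms)

SAMPLES_PER_CHUNK = SAMPLE_RATE * CHUNK_SECONDS     # samples in one window
FRAMES_PER_CHUNK = SAMPLES_PER_CHUNK // HOP_LENGTH  # spectrogram frames per window


def chunk_count(duration_seconds: float) -> int:
    """Rough number of 30-second windows needed to cover a clip."""
    total_samples = int(duration_seconds * SAMPLE_RATE)
    # Ceiling division; even an empty clip is padded to one window
    return max(1, -(-total_samples // SAMPLES_PER_CHUNK))


print(SAMPLES_PER_CHUNK, FRAMES_PER_CHUNK, chunk_count(75))
```

This is only a rough estimate of the workload for a clip: in practice the decoder can advance its window based on the timestamps it predicts, so the real number of passes may differ.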

Coqui TTS: TTS is an open-source library from Coqui-ai that hosts multiple text-to-speech models. It has end-to-end models like Bark, Tortoise, and xTTS, spectrogram models like Glow-TTS and FastSpeech, and vocoders like HiFi-GAN and MelGAN. Moreover, it provides a unified API for inference, fine-tuning, and training of text-to-speech models. In this project, we will use xTTS, an end-to-end multilingual voice-cloning model. It supports 17 languages, including English, Japanese, Hindi, and Mandarin. For more information about TTS, refer to the official TTS repository.

Wav2Lip: Wav2Lip is the Python repository for the paper “A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild.” It uses a lip-sync discriminator to recognize face and lip movements, which works well for dubbing voices. For more information, refer to the official repository. We will use a fork of Wav2Lip (justinjohn0306/Wav2Lip).

Workflow

Now that we are familiar with the tools and models we will use, let’s understand the workflow. This is a simple workflow. So, here is what we will do.

  • Upload a video to the Colab runtime and resize it to 720p for better lip-syncing.
  • Use FFmpeg to extract 24-bit audio from the video, then use Whisper to transcribe the audio file.
  • Use Google Translate or an LLM to translate the transcript into another language.
  • Load the multilingual xTTS model with the TTS library and pass it the script and the reference audio for voice synthesis.
  • Clone the Wav2Lip repository and download the model checkpoints. Run the inference.py file to sync the original video with the synthesized audio.

Now, let’s delve into the code.

Step 1: Install Dependencies

This project requires significant RAM and GPU memory, so it is prudent to use a Colab runtime. The free tier of Colab provides about 12 GB of RAM and a T4 GPU with about 15 GB of memory, which should be enough for this project. So, head over to Colab and connect to a GPU runtime.

Now, install TTS and Whisper.

!pip install TTS
!pip install git+https://github.com/openai/whisper.git 
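
After installing, it can be worth confirming that everything the later steps rely on is actually available. The sketch below is one way to do that with the standard library only; `check_dependencies` is a hypothetical helper, and the module names `TTS` and `whisper` are simply the ones imported later in this article.

```python
import importlib.util
import shutil


def check_dependencies(modules=("TTS", "whisper"), binaries=("ffmpeg",)):
    """Return a dict mapping each dependency name to True if it was found."""
    status = {}
    for name in modules:
        # find_spec returns None when a top-level module is not installed
        status[name] = importlib.util.find_spec(name) is not None
    for name in binaries:
        # shutil.which returns None when the executable is not on PATH
        status[name] = shutil.which(name) is not None
    return status


for name, found in check_dependencies().items():
    print(f"{name}: {'ok' if found else 'MISSING'}")
```

If anything is reported as missing, rerun the install cell (or restart the runtime) before moving on.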

Step 2: Upload Videos to Colab

Now, we will upload a video and optionally resize it to 720p. Wav2Lip tends to perform better on 720p videos. The resizing can be done with FFmpeg.

#@title Upload Video

from google.colab import files
import os
import subprocess

uploaded = None
resize_to_720p = False

def upload_video():
  global uploaded
  global video_path  # Declare video_path as global to modify it
  uploaded = files.upload()
  for filename in uploaded.keys():
    print(f'Uploaded {filename}')
    if resize_to_720p:
        filename = resize_video(filename)  # Get the name of the resized video
    video_path = filename  # Update video_path with either original or resized filename
    return filename


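def resize_video(filename):
    output_filename = f"resized_{filename}"
    # scale=-2:720 keeps the aspect ratio and an even width, which many encoders require;
    # passing a list (no shell) keeps file names with spaces intact
    cmd = ["ffmpeg", "-y", "-i", filename, "-vf", "scale=-2:720", output_filename]
    subprocess.run(cmd, check=True)
    print(f'Resized video saved as {output_filename}')
    return output_filename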

# Create a form button that calls upload_video when clicked and a checkbox for resizing
import ipywidgets as widgets
from IPython.display import display

button = widgets.Button(description="Upload Video")
checkbox = widgets.Checkbox(value=False, description='Resize to 720p (better results)')
output = widgets.Output()

def on_button_clicked(b):
  with output:
    global video_path
    global resize_to_720p
    resize_to_720p = checkbox.value
    video_path = upload_video()

button.on_click(on_button_clicked)
display(checkbox, button, output)

This will output a form button for uploading videos from a local device and a checkbox for enabling 720p resizing. You can also upload a video manually to the current Colab session and resize it using a subprocess.
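
For the manual route, a minimal sketch of building the resize command could look like the following. `build_resize_command` and the file name `my clip.mp4` are hypothetical; the command is assembled as an argument list so that file names with spaces do not need shell quoting.

```python
import shlex
from pathlib import Path


def build_resize_command(src: str, height: int = 720) -> list[str]:
    """Build an FFmpeg argument list that scales a video to the given height."""
    src_path = Path(src)
    dst = src_path.with_name(f"resized_{src_path.name}")
    return [
        "ffmpeg", "-y",                # overwrite the output if it already exists
        "-i", str(src_path),
        "-vf", f"scale=-2:{height}",   # -2 keeps the aspect ratio with an even width
        str(dst),
    ]


cmd = build_resize_command("my clip.mp4")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

The function only builds the command, so you can inspect it first and run it with `subprocess.run` once it looks right.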

Step 3: Audio Extraction and Whisper Transcription

Now that we have our video, the next thing we will do is extract audio using FFmpeg and use Whisper to transcribe.

# @title Audio extraction (24 bit) and whisper conversion
import subprocess

# Ensure video_path variable exists and is not None
if 'video_path' in globals() and video_path is not None:
    # pcm_s24le = 24-bit PCM at 48 kHz; "-map a" keeps only the audio stream, "-y" overwrites
    ffmpeg_command = ["ffmpeg", "-i", video_path, "-acodec", "pcm_s24le",
                      "-ar", "48000", "-map", "a", "-y", "output_audio.wav"]
    subprocess.run(ffmpeg_command, check=True)
else:
    print("No video uploaded. Please upload a video first.")

import whisper

model = whisper.load_model("base")
result = model.transcribe("output_audio.wav")

whisper_text = result["text"]
whisper_language = result['language']

print("Whisper text:", whisper_text)

This extracts the audio from the video in 24-bit format and transcribes it with the Whisper base model. For better transcription, use the Whisper small or medium models.
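
Besides `text` and `language`, the dictionary returned by `transcribe` also carries a `segments` list with start and end times, which is handy for checking the transcript against the video. The sketch below formats such segments; the helper names are hypothetical and `sample_result` is a made-up stand-in for a real `result`.

```python
def format_timestamp(seconds: float) -> str:
    """Format a number of seconds as HH:MM:SS.mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"


def describe_segments(result: dict) -> list[str]:
    """Turn a Whisper-style result dict into one readable line per segment."""
    lines = []
    for seg in result.get("segments", []):
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        lines.append(f"[{start} -> {end}] {seg['text'].strip()}")
    return lines


# Hypothetical stand-in for the `result` returned by model.transcribe(...)
sample_result = {
    "text": " Hello there. Welcome to the demo.",
    "language": "en",
    "segments": [
        {"start": 0.0, "end": 1.5, "text": " Hello there."},
        {"start": 1.5, "end": 3.75, "text": " Welcome to the demo."},
    ],
}

for line in describe_segments(sample_result):
    print(line)
```

In the notebook, you would call `describe_segments(result)` on the real transcription instead of the sample.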

Step 4: Voice Synthesis

Now, on to the voice cloning part. As mentioned before, we will use Coqui-ai’s xTTS model, one of the best open-source models for voice synthesis. Coqui-ai also provides many other TTS models for different purposes; do check them out. For our use case, voice cloning, we will use the xTTS v2 model.

Load the xTTS model. It is a big model, about 1.87 GB, so this will take a while.

# @title Voice synthesis
from TTS.api import TTS
import torch
from IPython.display import Audio, display  # Import the Audio and display modules

device = "cuda" if torch.cuda.is_available() else "cpu"
# Initialize TTS
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

XTTS currently supports 17 languages. Here are the codes of the languages the xTTS model supports.

print(tts.languages)


['en','es','fr','de','it','pt','pl','tr','ru','nl','cs','ar','zh-cn','hu','ko','ja','hi']

Note: Languages like English and French do not have a character limit, while Hindi has a character limit of 250. A few other languages might have a limit as well.
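
Since the language code is passed to the model as a plain string, a small guard against typos can save a failed synthesis run. This is only a sketch: `to_xtts_language` is a hypothetical helper, the list is copied from the output above, and the `zh` to `zh-cn` mapping assumes that other tools (Whisper, for instance) report Chinese as plain `zh`.

```python
# Codes copied from the tts.languages output in this article
XTTS_LANGUAGES = [
    "en", "es", "fr", "de", "it", "pt", "pl", "tr", "ru",
    "nl", "cs", "ar", "zh-cn", "hu", "ko", "ja", "hi",
]


def to_xtts_language(code: str) -> str:
    """Normalize a language code and check it against the xTTS list."""
    normalized = code.strip().lower()
    if normalized == "zh":
        # Assumption: some tools report Chinese as plain "zh"
        normalized = "zh-cn"
    if normalized not in XTTS_LANGUAGES:
        raise ValueError(f"'{code}' is not in the xTTS language list")
    return normalized


print(len(XTTS_LANGUAGES), to_xtts_language("HI"), to_xtts_language("zh"))
```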

For this project, we will use Hindi, but you can experiment with other languages as well.

So, the first thing we need to do is translate the transcribed text into Hindi. This can be done either with a Google Translate package or with an LLM. In my experience, GPT-3.5-Turbo performs much better than Google Translate. We can use the OpenAI API to get our translation.

import openai

client = openai.OpenAI(api_key = "api_key")
completion = client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": f"translate the texts to Hindi {whisper_text}"}
  ]
)
translated_text = completion.choices[0].message.content  # the reply text as a plain string
print(translated_text)
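
Because the reply is fed straight into the TTS model, it helps if the LLM returns nothing but the translation. One way to encourage that is to keep the instruction in the system message and the transcript in the user message. The sketch below only builds the message list; `build_translation_messages` and the prompt wording are hypothetical and may need tuning.

```python
def build_translation_messages(text: str, target_language: str = "Hindi") -> list[dict]:
    """Build chat messages that ask for a translation and nothing else."""
    system_prompt = (
        f"You are a translator. Translate the user's text into {target_language}. "
        "Reply with the translation only, without notes or quotation marks."
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": text},
    ]


messages = build_translation_messages("Hello, how are you?")
for message in messages:
    print(f"{message['role']}: {message['content']}")
```

In the notebook, you would pass `messages=build_translation_messages(whisper_text)` to `client.chat.completions.create`.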

As we know, Hindi has a character limit, so we need to do text pre-processing before passing it to the TTS model. We need to split the text into chunks of less than 250 characters.

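# Split on the Hindi full stop (।) and greedily pack sentences into chunks under 250 characters
sentences = [s.strip() for s in translated_text.split(sep = "।") if s.strip()]
final_chunks = [""]
for sentence in sentences:
  sentence += "।"  # restore the full stop removed by split()
  if not final_chunks[-1] or len(final_chunks[-1])+len(sentence)<250:
    final_chunks[-1]+=sentence
  else:
    final_chunks.append(sentence)
final_chunks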

This is a very simple splitter. You can write a different one or use LangChain’s recursive text splitter. Next, we will pass each chunk to the TTS model and merge the resulting audio files using FFmpeg.
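
As one example of a different splitter, the sketch below also handles a single sentence that is longer than the limit by falling back to whitespace. `split_text` and `_split_words` are hypothetical helpers, and the 250-character limit is the Hindi limit mentioned above.

```python
import re


def _split_words(sentence: str, limit: int) -> list[str]:
    """Break an over-long sentence at whitespace into pieces shorter than `limit`."""
    pieces, current = [], ""
    for word in sentence.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) < limit or not current:
            current = candidate  # a single over-long word is kept as is
        else:
            pieces.append(current)
            current = word
    if current:
        pieces.append(current)
    return pieces


def split_text(text: str, limit: int = 250, delimiter: str = "।") -> list[str]:
    """Greedily pack sentences into chunks shorter than `limit` characters."""
    # The lookbehind keeps each delimiter attached to the sentence it ends
    parts = re.split(f"(?<={re.escape(delimiter)})", text)
    sentences = [part.strip() for part in parts if part.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        # Fall back to word-level pieces when one sentence is too long on its own
        pieces = [sentence] if len(sentence) < limit else _split_words(sentence, limit)
        for piece in pieces:
            candidate = f"{current} {piece}".strip()
            if len(candidate) < limit:
                current = candidate
            else:
                if current:
                    chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks


for chunk in split_text("aa. bb. cccc.", limit=8, delimiter="."):
    print(repr(chunk))
```

In the notebook, `final_chunks = split_text(translated_text)` would replace the simple splitter.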

def audio_synthesis(text, file_name):
  tts.tts_to_file(
      text,
      speaker_wav='output_audio.wav',
      file_path=file_name,
      language="hi"
  )
  return file_name
file_names = []
for i in range(len(final_chunks)):
    file_name = audio_synthesis(final_chunks[i], f"output_synth_audio_{i}.wav")
    file_names.append(file_name)

As all the files have the same codec, we can easily merge them with FFmpeg. To do this, create a text file (my_files.txt in the command below) and list the file paths in it.

# this is a comment
file 'output_synth_audio_0.wav'
file 'output_synth_audio_1.wav'
file 'output_synth_audio_2.wav'
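
Rather than typing this list by hand, you can generate it. The sketch below writes one `file '<path>'` line per chunk; `write_concat_list` is a hypothetical helper, and it assumes the file names contain no single quotes.

```python
from pathlib import Path


def write_concat_list(paths, list_path="my_files.txt"):
    """Write an FFmpeg concat list with one `file '<path>'` line per input."""
    lines = [f"file '{path}'" for path in paths]
    Path(list_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return list_path


# Hypothetical chunk names; in the notebook you would pass `file_names` instead
example_names = [f"output_synth_audio_{i}.wav" for i in range(3)]
write_concat_list(example_names)
print(Path("my_files.txt").read_text(encoding="utf-8"))
```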

Now, run the code below to merge files.

import subprocess

cmd = "ffmpeg -f concat -safe 0 -i my_files.txt -c copy final_output_synth_audio_hi.wav"
subprocess.run(cmd, shell=True)

This will output the final concatenated audio file. You can also play the audio in Colab.

from IPython.display import Audio, display
display(Audio(filename="final_output_synth_audio_hi.wav", autoplay=False))
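
A translated track is often a different length than the original, so it can be useful to check the duration of the merged audio before lip-syncing. The sketch below uses only the standard library and assumes the file is a plain PCM WAV that the `wave` module can read; `wav_duration_seconds` and `demo_silence.wav` are hypothetical names used for the demonstration.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


# Demo: write one second of 16-bit mono silence, then measure it
with wave.open("demo_silence.wav", "wb") as demo:
    demo.setnchannels(1)
    demo.setsampwidth(2)        # 2 bytes per sample = 16-bit
    demo.setframerate(24_000)
    demo.writeframes(b"\x00\x00" * 24_000)

print(wav_duration_seconds("demo_silence.wav"))
```

In the notebook, you would call `wav_duration_seconds("final_output_synth_audio_hi.wav")` and compare the result with the length of the video.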

Step 5: Lip-Syncing

Now, on to the lip-syncing part. To lip-sync our synthetic audio with the original video, we will use the Wav2Lip repository, for which we need to download the model checkpoints. But before that, if you are on a T4 GPU runtime, delete the xTTS and Whisper models from the current Colab session or restart the session.

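import gc
import torch

# Drop the references to the models so their GPU memory can be reclaimed
try:
    del tts
except NameError:
    print("Voice model already deleted")

try:
    del model
except NameError:
    print("Whisper model already deleted")

gc.collect()              # collect the now-unreferenced model objects
torch.cuda.empty_cache()  # release cached GPU memory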

Now, clone the Wav2Lip repository and download the checkpoints.

# @title Dependencies
%cd /content/

!git clone https://github.com/justinjohn0306/Wav2Lip
!cd Wav2Lip && pip install -r requirements_colab.txt

%cd /content/Wav2Lip

!wget 'https://github.com/justinjohn0306/Wav2Lip/releases/download/models/wav2lip.pth' -O 'checkpoints/wav2lip.pth'

!wget 'https://github.com/justinjohn0306/Wav2Lip/releases/download/models/wav2lip_gan.pth' -O 'checkpoints/wav2lip_gan.pth'

!wget 'https://github.com/justinjohn0306/Wav2Lip/releases/download/models/mobilenet.pth' -O 'checkpoints/mobilenet.pth'

!pip install batch-face

Wav2Lip has two models for lip-syncing: wav2lip and wav2lip_gan. According to the authors of the models, the GAN model requires less effort in face detection but produces slightly inferior results. In contrast, the non-GAN model can produce better results with more manual padding and rescaling of the detection box. You can try both and see which one does better.

Run the inference with the model checkpoint path, video, and audio files.

%cd /content/Wav2Lip

# This is the detection box padding; adjust it in case of poor results.
# Usually, the bottom one is the biggest issue
pad_top =  0
pad_bottom =  15
pad_left =  0
pad_right =  0
rescaleFactor =  1

video_path_fix = f"'../{video_path}'"

!python inference.py --checkpoint_path 'checkpoints/wav2lip_gan.pth' \
--face $video_path_fix --audio "/content/final_output_synth_audio_hi.wav" \
--pads $pad_top $pad_bottom $pad_left $pad_right --resize_factor $rescaleFactor --nosmooth \
--outfile '/content/output_video.mp4'

This will output a lip-synced video. If the video doesn’t look good, adjust the parameters and retry.
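
When retrying with different settings, it can help to assemble the command in one place so that only the parameters change between runs. The sketch below builds the argument list from the same flags used above; `build_wav2lip_command` and the input path `../input.mp4` are hypothetical, and the function only builds the command without running it.

```python
import shlex
from pathlib import Path


def build_wav2lip_command(face, audio, outfile,
                          checkpoint="checkpoints/wav2lip_gan.pth",
                          pads=(0, 15, 0, 0), resize_factor=1, nosmooth=True):
    """Assemble the argument list for Wav2Lip's inference.py."""
    top, bottom, left, right = pads
    cmd = [
        "python", "inference.py",
        "--checkpoint_path", checkpoint,
        "--face", face,
        "--audio", audio,
        "--pads", str(top), str(bottom), str(left), str(right),
        "--resize_factor", str(resize_factor),
        "--outfile", outfile,
    ]
    if nosmooth:
        cmd.append("--nosmooth")
    return cmd


# Build one command per checkpoint so both models can be compared
for checkpoint in ("checkpoints/wav2lip.pth", "checkpoints/wav2lip_gan.pth"):
    command = build_wav2lip_command(
        face="../input.mp4",  # hypothetical input video
        audio="/content/final_output_synth_audio_hi.wav",
        outfile=f"/content/output_{Path(checkpoint).stem}.mp4",
        checkpoint=checkpoint,
    )
    print(shlex.join(command))
```

Each printed line can be run from the Wav2Lip directory, or passed as a list to `subprocess.run(command, check=True)`.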

So, here is the repository for the notebook and a few samples.

GitHub Repository: sunilkumardash9/voice-clone-and-lip-sync

Real-world Use Cases

Voice-cloning and lip-syncing technology has many use cases across industries. Here are a few areas where it can be beneficial.

Entertainment: The entertainment industry will be the most affected of all, and we are already witnessing the change. Voices of celebrities of current and bygone eras can be synthesized and reused. This also poses ethical challenges: synthesized voices should be used responsibly and within the bounds of the law.

Marketing: Personalized ad campaigns with familiar and relatable voices can greatly enhance brand appeal.

Communication: Language has always been a barrier to all sorts of activities, and cross-language communication is still a challenge. Real-time end-to-end translation that preserves one’s accent and voice will revolutionize the way we communicate. This might become a reality in a few years.

Content Creation: Content creators will no longer depend on translators to reach a bigger audience. With efficient voice cloning and lip-syncing, cross-language content creation will be easier. Podcast and audiobook narration can also be enhanced with voice synthesis.

Conclusion

Voice synthesis is one of the most sought-after use cases of generative AI. It has the potential to revolutionize the way we communicate. Ever since the creation of civilizations, the language barrier between communities has been a hurdle for forging deeper relationships, culturally and commercially. With AI voice synthesis, this gap can be filled. So, in this article, we explored the open-source way of voice-cloning and lip-syncing.

Key Takeaways

  • TTS, a Python library by Coqui-ai, serves and maintains popular text-to-speech models.
  • xTTS is a multilingual voice-cloning model capable of cloning a voice into 17 different languages.
  • Whisper is an ASR model from OpenAI for efficient transcription and English translation.
  • Wav2lip is an open-source tool for lip-syncing videos.
  • Voice cloning is one of the most active frontiers of generative AI, with a significant potential impact on industries from entertainment to marketing.

Frequently Asked Questions

Q1. Is AI voice cloning legal?

A. Cloning someone’s voice without their consent might be illegal, for example where it infringes on copyright or other rights, and the rules vary by jurisdiction. Getting permission from the person before cloning is the right way to go about it.

Q2. Is AI voice cloning free?

A. Most AI voice cloning API services require fees. However, some open-source models can give fairly decent voice synthesis capability.

Q3. What is the best voice cloning model?

A. This depends on particular use cases. The xTTS model is a good choice for multi-lingual voice synthesis. But for more languages, Meta’s Fairseq models might be preferable.

Q4. Can AI clone celebrity voices?

A. Yes, it is possible to clone the voice of a celebrity. However, be mindful that any potential misuse can land you in legal trouble.

Q5. What is the use of voice cloning?

A. Voice cloning can be beneficial for a range of use cases, such as content creation, narration in games and movies, and ad campaigns.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Sunil Kumar Dash

19 Dec 2023
