How to Clone a Voice (Open-Source)
Here is the video: https://youtu.be/3fg7Ht0DSnE?si=gyqIdakpKI0Qf1Fy
What the junk is this?#
- This is the 3rd video in my Home Assistant series. This video picks up where I left off (and where I failed) in my last video attempting to replace Alexa with my own, fully local, AI voice assistant.
- I did everything but create a custom AI voice, specifically the voice of Terry Crews.
- In this video, I FINALLY figured out a consistent way to clone a voice from pretty much any source. But it has to be LOCAL and FREE.
You can clone my voice#
- I am giving you permission to clone my voice…but not for commercial purposes or for content creation. You are only allowed to use my voice as your personal voice assistant.
- If you would like to clone my voice, here is the voice dataset: https://drive.google.com/drive/folders/1i6mWm8z2K_3rtanrVn1UcNgw5LQ0Zyv6?usp=sharing
What Do You Need?#
…just a computer. But if you have a GPU…even better.
Here is what I am using and demonstrating in this video:
- Laptop with an Nvidia 3080 GPU
- Terry (my AI PC that I built in this video —> https://youtu.be/Wjrdr0NU4Sk?si=_zlFoZq5g3jAcntA) rocking dual Nvidia 4090s
- A cloud GPU (AWS EC2 instance)
- I used a g5.12xlarge instance (4 GPUs)
Step 1 - Data Prep#
- The key to any AI project is getting CLEAN DATA. Let’s do that now.
- For this tutorial, I’m going to show you two options:
- Record your voice using the Piper Recording Studio
- Pull audio data from YouTube videos
The Piper Recording Studio#
This option is for you if you want to clone your own voice (or someone close to you) and you are willing to sit still for an hour while saying nonsensical phrases to a computer. (Kind of what I do for a living.)
What do you need?#
- This will be VERY Linux-based, so you will need macOS, Linux, or WSL2 running on Windows.
- In my demonstration, I’m showing you how to very quickly set up WSL2. (If you have no idea what WSL is but would like to know more, check out this video —-> https://youtu.be/vxTW22y8zV8?si=FpeB_ZiwrOweGZW_)
Quick WSL2 Setup#
wsl --install Ubuntu-22.04
## Why Ubuntu 22.04? It was the most stable in my testing.
Get some things set up#
(Linux and WSL2)
## Update repos
sudo apt update
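## (Heads up: a minimal/fresh Ubuntu install may be missing these packages, and the venv steps below need them; skip if you already have them)
sudo apt install -y git python3-pip python3-venv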
## Clone the Piper Recording Studio repo from GitHub
git clone https://github.com/rhasspy/piper-recording-studio.git
## Jump into the directory
cd piper-recording-studio
## Create a Python Virtual Environment
python3 -m venv .venv
## Activate that Python Virtual Environment
source .venv/bin/activate
## Upgrade PIP
python3 -m pip install --upgrade pip
## Install requirements for this project
python3 -m pip install -r requirements.txt
## Now RUN the Piper Recording Studio
python3 -m piper_recording_studio
## Hit CTRL-C when you are done.
Record some stuff#
- Navigate to http://localhost:8000 to access the Web Interface of the Piper Recording Studio and start recording your lovely voice.
Prepare the stuff you recorded#
## Jump into the directory where your recordings live
cd output/en_US/nameofyourrecordingstuff.chat
## Install ffmpeg
sudo apt install ffmpeg
## Jump back to the piper-recording-studio directory and export your data
cd ../../../
python3 -m export_dataset ~/path/to/your/recorded/files ~/your/output/directory
## Deactivate your virtual environment
deactivate
YouTube Videos#
Download YouTube Videos#
## Make a new directory for this data prep and jump in there
mkdir dataprep
cd dataprep
## Create a new python virtual environment and activate it
python3 -m venv .venv
source .venv/bin/activate
## Install YT-DLP to download YouTube Videos
pip install yt-dlp
## Download the audio from a YouTube video
yt-dlp -x --audio-format wav "youtube URL"
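If you only need part of a long video, yt-dlp can also grab a specific time range (this relies on ffmpeg being installed; the timestamps below are just an example):
## Download only a section of the audio (example: minute 10 to minute 15)
yt-dlp -x --audio-format wav --download-sections "*00:10:00-00:15:00" "youtube URL"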
Cleanup your audio samples#
- Piper TTS will want audio samples between 5 and 15 seconds. Any longer and it will hate you. (A quick way to check your clip lengths is below.)
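- A quick way to spot clips that are too long (or suspiciously short), assuming ffprobe came along with your ffmpeg install:
## Print the duration (in seconds) of every wav in the current directory
for file in ./*.wav; do
  echo -n "$file: "
  ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 "$file"
done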
Use Audacity to trim clips, etc.#
- Audacity (free and open-source audio tool): https://www.audacityteam.org/
- When exporting with Audacity (or any other program), Piper wants the audio in this format:
- Sample Rate: 22050 Hz
- Encoding: Signed 16-bit PCM
- Channels: mono
- Format: wav
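- If you’d rather batch-convert with ffmpeg instead of clicking through Audacity’s export dialog, something like this gets you that exact format (it writes into a converted/ folder so your originals stay untouched, and you carry on with those files):
## Convert every wav to 22050 Hz, mono, signed 16-bit PCM
mkdir -p converted
for file in ./*.wav; do
  ffmpeg -i "$file" -ar 22050 -ac 1 -c:a pcm_s16le "converted/$(basename "$file")"
done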
Remove Silence#
for file in ./*.wav; do
  ffmpeg -i "$file" -af "silenceremove=stop_periods=-1:stop_duration=3:stop_threshold=-20dB" "./${file%.wav}_nosilence.wav"
done
Cut up your files#
## Create another directory
mkdir wav
## Cut it up
for file in *.wav; do
  ffmpeg -i "$file" -f segment -segment_time 15 -c copy "./wav/split_${file%.*}_%03d.wav"
done
Get your samples ready for Piper TTS#
Transcribe your audio samples with Whisper#
Install Whisper#
pip install git+https://github.com/openai/whisper.git
Create a new python script and run it#
nano whispersomethingtome.py
Paste this script
import os
import whisper

# Initialize Whisper model (you can choose between 'tiny', 'base', 'small', 'medium', 'large')
model = whisper.load_model("base")

# Path to the directory containing the audio files
audio_dir = "./wav"
output_csv = "./metadata.csv"

# List all .wav files in the directory
audio_files = [f for f in os.listdir(audio_dir) if f.endswith(".wav")]
audio_files.sort()  # Sort the files alphabetically (optional)

# Open the CSV file for writing
with open(output_csv, "w") as f:
    for audio_file in audio_files:
        # Full path to the audio file
        audio_path = os.path.join(audio_dir, audio_file)

        # Transcribe the audio file
        result = model.transcribe(audio_path)

        # Extract the transcription text
        transcription = result["text"].strip()

        # Write the filename (without .wav extension) and transcription to the CSV
        file_id = os.path.splitext(audio_file)[0]  # Get file name without extension
        f.write(f"{file_id}|{transcription}\n")

print(f"Transcriptions complete! Metadata saved to {output_csv}")
Run the script
python3 whispersomethingtome.py
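When it finishes, metadata.csv should have one pipe-delimited line per clip: the filename without the .wav extension, then the transcription. The filenames and text below are made-up examples; yours will match your own split files.
## Example metadata.csv contents
split_myvideo_000|Welcome back to the channel.
split_myvideo_001|Today we are cloning a voice.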
Clean it up
## Deactivate your python virtual environment
deactivate
## Go back home
cd ~/
Prepare to Train (Piper Recording Studio and YouTube)#
Prep Your Training Environment#
## Make a new directory for training
mkdir training
cd training
## Clone the Piper Repo
git clone https://github.com/rhasspy/piper.git
## Create another python virtual environment and activate it
python3 -m venv .venv
source .venv/bin/activate
Solving your headaches#
Because Piper TTS is picky
## Install a VERY specific version of PIP
python3 -m pip install pip==23.3.1
## Install a VERY specific version of numpy
pip install numpy==1.24.4
## Install a VERY specific version of torchmetrics
pip install torchmetrics==0.11.4
More prep#
## Jump into the Piper directory
cd piper/src/python
## If you have a 4090...this will save you ---> https://github.com/rhasspy/piper/issues/295
## Setup wheel
python3 -m pip install --upgrade wheel setuptools
## Install requirements
pip3 install -e .
## Run this script (it builds the monotonic alignment extension that Piper's training code needs)
./build_monotonic_align.sh
Make Piper LOVE your data#
## This will process your dataset, getting it ready for Piper to train
python3 -m piper_train.preprocess \
--language en \
--input-dir ~/your_training_data_directory/ \
--output-dir ~/your_output_directory \
--dataset-format ljspeech \
--single-speaker \
--sample-rate 22050
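When preprocessing finishes, take a quick look at the output directory. In my runs it contained a config.json and a dataset.jsonl, and that output directory is what you point --dataset-dir at in Step 2.
## Sanity check the preprocessed dataset
ls ~/your_output_directory
## Expect to see (at least): config.json  dataset.jsonl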
Download a starting point#
- You DON’T want to start training from scratch. We will download a pre-trained model that already has a head start.
- Download High, Medium, or Low…depending on what you want —> https://huggingface.co/datasets/rhasspy/piper-checkpoints/tree/main/en/en_US/lessac
## Downloading the Lessac (medium) checkpoint
wget https://huggingface.co/datasets/rhasspy/piper-checkpoints/resolve/main/en/en_US/lessac/medium/epoch%3D2164-step%3D1355540.ckpt
Step 2 - Time to TRAIN!!!#
- Picking up from step 1, we are going to copy and paste this command.
python3 -m piper_train \
--dataset-dir ~/your_output_directory \
--accelerator 'gpu' \
--gpus 1 \
--batch-size 32 \
--validation-split 0.0 \
--num-test-examples 0 \
--max_epochs 6000 \
--resume_from_checkpoint "/place/where/you/downloaded/thecheckpoint/epoch=2218-step=838782.ckpt" \
--checkpoint-epochs 1 \
--precision 16 \
--max-phoneme-ids 400 \
--quality medium
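- Training will happily eat your VRAM. If you hit out-of-memory errors, drop --batch-size and rerun. To keep an eye on GPU usage while it trains (assuming the Nvidia drivers are installed), open a second terminal:
## Watch GPU utilization and memory, refreshing every second
watch -n 1 nvidia-smi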
- If you have DUAL GPUs, this command has a few fun additions (note --gpus 2 and --strategy ddp):
python3 -m piper_train \
--dataset-dir ~/new_chuck/train-me \
--accelerator 'gpu' \
--gpus 2 \
--batch-size 32 \
--validation-split 0.0 \
--num-test-examples 0 \
--max_epochs 6000 \
--resume_from_checkpoint "/home/networkchuck/piper_project/base_checkpoints/high/epoch=2218-step=838782.ckpt" \
--checkpoint-epochs 1 \
--precision 16 \
--max-phoneme-ids 400 \
--quality high \
--strategy ddp
Stopping and Resuming#
- You can interrupt training and pick up where you left off.
- Just hit ctrl-c
- When you are ready to resume training, run the same command but change the checkpoint file to YOUR most recent checkpoint file. You would have seen this in the output of your training process.
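- In my setup, the checkpoints landed under lightning_logs inside the dataset directory (the same place the export path in the next section comes from). If you lose track of the newest one, something like this will find it (adjust the path to your own dataset directory):
## Find the most recent checkpoint file
ls -t ~/your_output_directory/lightning_logs/version_*/checkpoints/*.ckpt | head -n 1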
When your final epoch is reached#
- It’s time to export your voice when training is done. Also know that you can export your voice at ANY point in training by simply targeting the most recent checkpoint file.
## Export the model
python3 -m piper_train.export_onnx \
"/path/to/your/checkpoint/checkpoints/epoch=5999-step=853906.ckpt" \
~/output/directory/voicename.onnx
## Copy the training json file to your model file directory
cp ~/training_directory/config.json ~/output/directory/voicename.onnx.json
Let’s test#
## Make sure you have Piper TTS installed
pip install piper-tts
## Test your voice
echo "Subscribe to NetworkChuck" | piper -m voicename.onnx --output_file test.wav