Tortoise Text-to-Speech (TTS) is a powerful tool that converts text into spoken audio. This blog post will guide you through installing Tortoise TTS on your local machine. Please note that Tortoise TTS requires an NVIDIA GPU for optimal performance.
Step 1: Install PyTorch
Before installing Tortoise TTS, it is essential to install PyTorch. PyTorch is a robust machine learning framework developed by Meta AI and is now part of the Linux Foundation. It is based on the Torch library and is widely used for computer vision and natural language processing applications. PyTorch is free and open-source software, released under the modified BSD license.
One of the critical features of PyTorch is its ability to accelerate the scientific computation of tensors, which generalize vectors and matrices to potentially higher dimensions. PyTorch has various inbuilt functions to handle these tensors, making it a powerful tool for deep learning and AI applications.
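The idea that tensors generalize vectors (1-D) and matrices (2-D) to higher dimensions can be sketched without PyTorch at all. The snippet below is a minimal illustration using nested Python lists; real frameworks store tensors as contiguous arrays with a shape tuple, but the shape concept is the same:

```python
# A "tensor" generalizes vectors (rank 1) and matrices (rank 2) to N dimensions.
# Illustration with nested lists; PyTorch stores these as contiguous arrays.
def shape(t):
    """Recursively compute the shape of a nested-list tensor."""
    if not isinstance(t, list):
        return ()  # a scalar is a rank-0 tensor
    return (len(t),) + shape(t[0])

vector = [1.0, 2.0, 3.0]            # rank-1 tensor, shape (3,)
matrix = [[1.0, 2.0], [3.0, 4.0]]   # rank-2 tensor, shape (2, 2)
tensor3 = [[[0.0] * 4] * 3] * 2     # rank-3 tensor, shape (2, 3, 4)

print(shape(vector))   # (3,)
print(shape(matrix))   # (2, 2)
print(shape(tensor3))  # (2, 3, 4)
```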
- Visit the PyTorch website at https://pytorch.org/get-started/locally/.
- Choose the appropriate installation options for your operating system and GPU setup. Using the Conda installation path is recommended for easier dependency management when using Windows.
- Follow the provided installation instructions to install PyTorch on your system.
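The selector on pytorch.org generates the exact command for your platform. As an illustration only (an assumption — your CUDA version and the current wheel index may differ, so copy the command the site gives you), the commands look like this:

```shell
# Example only -- copy the exact command from pytorch.org for your setup.
# pip, using the CUDA 11.8 wheel index:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# or Conda (the path recommended above on Windows):
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
```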
Step 2: Clone the Tortoise TTS Repository
Once you have installed PyTorch, you can clone the Tortoise TTS repository from GitHub.
Open a terminal or command prompt (in this example, the integrated terminal in Visual Studio Code) and execute the following commands:
git clone https://github.com/neonbjb/tortoise-tts.git
cd tortoise-tts
Step 3: Install Dependencies
Tortoise TTS has a few dependencies that need to be installed. To install these dependencies, run the following command:
python -m pip install -r ./requirements.txt
If you are using Windows, you will also need to install pysoundfile. You can do this by running the following command:
conda install -c conda-forge pysoundfile
Step 4: Install Tortoise TTS
Now that you have installed the necessary dependencies, you can install Tortoise TTS. Run the following command:
python setup.py install
Step 5: Test Tortoise TTS
You can run a test script to verify that Tortoise TTS has been installed successfully. Tortoise TTS provides two scripts: do_tts.py and read.py. Let’s use do_tts.py to speak a single phrase with a randomly selected voice.
python tortoise/do_tts.py --text "I'm going to speak this" --voice random --preset fast
If everything is set up correctly, you should hear the spoken output of the provided text.
Step 6: Using Tortoise TTS Programmatically
Tortoise TTS can also be used programmatically in your own Python scripts. Here is an example of how to use the Tortoise TTS API:
import tortoise.api as api
from tortoise.utils import audio

clips_paths = ["voice1.wav", "voice2.wav"]  # paths to your reference WAV clips
reference_clips = [audio.load_audio(p, 22050) for p in clips_paths]
tts = api.TextToSpeech()
pcm_audio = tts.tts_with_preset("your text here", voice_samples=reference_clips, preset='fast')
This code snippet demonstrates how to generate audio using Tortoise TTS programmatically. You can modify the text and voice samples according to your requirements.
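To listen to the result outside of Python, you will typically want to write pcm_audio to disk. Tortoise's own scripts use torchaudio for this; as a dependency-free sketch (assuming you have flattened the output tensor to a plain list of floats, e.g. with pcm_audio.squeeze().tolist()), the standard-library wave module works too. Tortoise generates audio at 24,000 Hz:

```python
# Sketch: write float samples to a 16-bit PCM mono WAV using only the stdlib.
# 24,000 Hz is Tortoise's output sample rate.
import struct
import wave

def save_wav(path, samples, sample_rate=24000):
    """Clamp floats to [-1, 1], convert to 16-bit PCM, and write a mono WAV."""
    frames = b"".join(
        struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767)) for s in samples
    )
    with wave.open(path, "wb") as w:
        w.setnchannels(1)   # mono
        w.setsampwidth(2)   # 16 bits per sample
        w.setframerate(sample_rate)
        w.writeframes(frames)

save_wav("generated.wav", [0.0, 0.5, -0.5, 1.0])
```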
Congratulations! You have successfully installed Tortoise Text-to-Speech on your local machine. Now you can leverage the power of Tortoise TTS to convert text into high-quality spoken audio. Enjoy exploring the possibilities of this versatile tool!
Guide: Using Tortoise Text-to-Speech
Tortoise Text-to-Speech (TTS) provides various features and customization options for generating high-quality and personalized voice outputs. In this guide, we will explore how to make the most of Tortoise TTS and customize your voice generation experience.
Tortoise TTS allows you to customize voices by providing reference clips of speakers. These clips help determine the properties of the generated speech, such as pitch, tone, speed, and even speaking defects like a lisp or stuttering. Additionally, the clips influence non-voice aspects like volume, background noise, recording quality, and reverb.
Tortoise TTS includes a fascinating feature that generates random voices. These voices are not pre-existing and will be different each time you run the program. To utilize the random voice, simply pass ‘random’ as the voice name, and Tortoise TTS will handle the rest.
For machine learning enthusiasts, the random voice is created by projecting a random vector onto the voice conditioning latent space.
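Conceptually, that projection amounts to drawing a random vector and treating it as a voice. The sketch below is an illustration only, not Tortoise's actual internals; the dimensionality is a hypothetical value chosen for the example:

```python
# Conceptual sketch only -- not Tortoise's real internals. A "random voice"
# amounts to sampling a random vector in the conditioning-latent space.
import random

LATENT_DIM = 1024  # hypothetical dimensionality, for illustration

def random_voice_latent(dim=LATENT_DIM, seed=None):
    """Sample one latent per call; a different seed yields a different voice."""
    rng = random.Random(seed)
    # Standard-normal draw per dimension, as in a typical Gaussian latent prior.
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

latent = random_voice_latent(seed=42)
print(len(latent))  # 1024
```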
The Tortoise TTS repository comes with several pre-packaged voices. Voices starting with “train_” are derived from the training set and generally perform better than others.
If your goal is to obtain high-quality speech, it is recommended to select one of these voices. However, if you want to explore the capabilities of Tortoise TTS for zero-shot mimicking, you can experiment with other voices as well.
Adding a New Voice
To add new voices to Tortoise TTS, follow these steps:
- Gather audio clips of your desired speaker(s). Suitable sources include YouTube interviews (you can use tools like youtube-dl to fetch the audio), audiobooks, or podcasts.
- Cut the clips into approximately 10-second segments. Aim for at least three clips, although more can yield better results. During testing, up to five clips were used.
- Save the clips as WAV files with a floating-point format and a sample rate of 22,050.
- Create a subdirectory within the voices/ directory of the repository, named after your new voice.
- Place the clips in the newly created subdirectory.
- Run the Tortoise TTS utilities with the --voice parameter set to the name of your new subdirectory.
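A quick way to catch unsuitable clips before generation is to script the sample-rate and length checks above. This sketch uses the standard-library wave module, which only parses PCM WAVs — for the floating-point files Tortoise expects you would swap in a library such as soundfile — but the checking logic is the same:

```python
# Sketch: sanity-check a reference clip's sample rate and duration.
# stdlib `wave` reads PCM WAVs only; use e.g. soundfile for float-format WAVs.
import wave

def check_clip(path, expected_rate=22050, target_seconds=10.0, tolerance=5.0):
    """Return (ok, message) for one reference clip."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        seconds = w.getnframes() / rate
    if rate != expected_rate:
        return False, f"{path}: sample rate {rate}, expected {expected_rate}"
    if abs(seconds - target_seconds) > tolerance:
        return False, f"{path}: {seconds:.1f}s long, aim for ~{target_seconds:.0f}s"
    return True, f"{path}: ok ({seconds:.1f}s at {rate} Hz)"
```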
Picking Good Reference Clips
The reference clips you provide significantly impact the output of Tortoise TTS. Here are some tips for selecting suitable clips:
- Avoid clips with background music, noise, or reverb, as these can negatively affect the speech generation process.
- Steer clear of clips from phone calls or those that contain distortions caused by amplification systems.
- Minimize the use of clips with excessive stuttering, stammering, or filler words like “uh” or “like.”
- Try to find clips that exemplify the desired speech style. For instance, if you want the target voice to read an audiobook, look for clips of the speaker reading books.
- The text spoken in the clips does not matter, but using diverse text may yield better results.
Tortoise TTS offers advanced settings and techniques for fine-tuning the voice generation process.
Tortoise TTS combines an autoregressive decoder model with a diffusion model, both of which have adjustable settings. While the default settings are optimized for general usage, specific use cases may benefit from tweaking these settings. Refer to the api.tts module for a full list of available settings.
Please note that these settings are not available in the standard scripts packaged with Tortoise TTS but can be accessed through the API.
Prompt engineering refers to the manipulation of prompts to evoke specific responses from Tortoise TTS. For example, you can introduce emotional context by including phrases like “I am really sad” before the main text. Tortoise TTS provides an automated redaction system that allows you to take advantage of prompt engineering.
By surrounding specific text segments with brackets, you can instruct Tortoise TTS to focus only on the unredacted portion. For instance, the prompt “[I am really sad,] Please feed me” will result in the speech output of “Please feed me” with a sad tonality.
Playing with the Voice Latent
Tortoise TTS processes reference clips individually through a submodel, producing a point latent for each clip. These point latents influence various aspects of the generated speech, such as tone, speaking rate, and abnormalities.
You can experiment with these point latents to achieve unique effects. For example, combining the point latents of two different voices can yield an output that represents the “average” of those voices.
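For instance, "averaging" two voices can be sketched as an element-wise mean of their latents. This is a conceptual illustration only — real Tortoise latents are torch tensors, with plain Python lists standing in here:

```python
# Conceptual sketch: blend two voices by averaging their conditioning latents
# element-wise. Real latents are torch tensors; toy lists stand in here.
def average_latents(latent_a, latent_b):
    return [(a + b) / 2.0 for a, b in zip(latent_a, latent_b)]

voice_a = [0.25, -1.0, 0.5]   # toy 3-dimensional "latents"
voice_b = [0.75, 1.0, 0.0]
print(average_latents(voice_a, voice_b))  # [0.5, 0.0, 0.25]
```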
Generating Conditioning Latents from Voices
You can extract conditioning latents for installed voices using the get_conditioning_latents.py script. This script generates a .pth pickle file containing the conditioning latents. The file includes a tuple with the autoregressive latent and diffusion latent.
Alternatively, you can use api.TextToSpeech.get_conditioning_latents() to fetch the latents programmatically.
Using Raw Conditioning Latents for Speech Generation
After experimenting with conditioning latents, you can utilize them to generate speech by creating a subdirectory within voices/. Place a single “.pth” file in the subdirectory, containing the pickled conditioning latents as a tuple: (autoregressive_latent, diffusion_latent).
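The expected layout can be sketched as follows. Tortoise writes real torch tensors with torch.save; plain lists and the stdlib pickle module stand in here purely to illustrate the (autoregressive_latent, diffusion_latent) tuple structure, and the voice name "myvoice" is hypothetical:

```python
# Sketch of the on-disk layout: one .pth file per voice subdirectory, holding
# a pickled 2-tuple (autoregressive_latent, diffusion_latent).
# Tortoise itself saves torch tensors via torch.save; placeholder lists here.
import os
import pickle

autoregressive_latent = [0.1, 0.2]  # placeholder values
diffusion_latent = [0.3, 0.4]       # placeholder values

voice_dir = os.path.join("tortoise", "voices", "myvoice")  # hypothetical name
os.makedirs(voice_dir, exist_ok=True)
with open(os.path.join(voice_dir, "cond_latents.pth"), "wb") as f:
    pickle.dump((autoregressive_latent, diffusion_latent), f)
```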
More Open-Source AI Tools
Here is a list of open-source AI tools we have reviewed:
DragGAN – DragGAN is an advanced and innovative photo editing tool developed by the Max Planck Institute that utilizes artificial intelligence to transform and modify images.
Aider AI – This is a terminal-based chat tool that enables you to create and modify code utilizing OpenAI’s GPT models. GPT can assist you in initiating a new project or altering code in your current git repository.
MusicGen – MusicGen is an AI music generation tool developed by Meta, capable of creating high-quality music samples from simple text prompts and offering the feature to upload audio clips for additional guidance.
SuperAGI – My personal favorite: a multipurpose autonomous AI agent similar to AutoGPT, but a greatly enhanced version with a GUI.
AutoGPT – One of the first autonomous AI agents to utilise OpenAI’s ChatGPT API models.
In conclusion, Tortoise Text-to-Speech (TTS) is a versatile and powerful tool that converts text into high-quality spoken audio. This comprehensive guide has walked you through the installation process, from setting up PyTorch to cloning the Tortoise TTS repository and installing the necessary dependencies.
It has also delved into the advanced features of Tortoise TTS, such as voice customization, random voice generation, and prompt engineering. Furthermore, it has introduced you to other remarkable open-source AI tools like DragGAN, Aider AI, MusicGen, SuperAGI, and AutoGPT.
Now, you are well-equipped to leverage the capabilities of Tortoise TTS and explore the fascinating world of text-to-speech technology. Keep experimenting, keep learning, and enjoy the journey!