A deep learning text-to-speech library supporting multiple languages and voice styles.
Coqui TTS (github.com/coqui-ai/TTS) is a powerful open-source library for deep learning-based Text-to-Speech (TTS) synthesis. Originally developed by Coqui.ai, the project provides a comprehensive toolkit for researchers, developers, and hobbyists to generate high-quality speech, train new TTS models, fine-tune existing ones, and perform voice cloning. Despite Coqui.ai ceasing operations in late 2023/early 2024, the 🐸TTS library remains available as an open-source project under the MPL-2.0 license, with ongoing community interest and development, including forks and related projects building upon its engine.
The library is known for its wide range of model architectures, support for multiple languages, multi-speaker capabilities, and advanced features like zero-shot voice cloning with models like XTTS.
Coqui TTS offers a rich set of features for advanced speech synthesis: pretrained models across many languages and architectures, multi-speaker models (the tts --list_models command shows the available models, and speaker IDs can be listed per model), zero-shot voice cloning with XTTS, tools for training and fine-tuning, and a single tts command for command-line synthesis.
Following the shutdown of Coqui.ai, the coqui-ai/TTS
GitHub repository remains a valuable open-source resource. While active development and official support from the original Coqui.ai team have ceased, the project continues to see community engagement, including forks, discussions, and issue tracking on GitHub. Some community members and organizations may continue to build upon or maintain aspects of the library. Projects like "AllTalk TTS" have emerged, based on the Coqui TTS engine, aiming to provide continued support and new features. Users should check the GitHub repository's "Discussions" and "Issues" sections for the latest community insights and developments.
Coqui TTS is primarily a Python library and can be installed using pip:
pip install TTS
git clone https://github.com/coqui-ai/TTS.git
cd TTS
pip install -e ".[all,dev,notebooks]" # Choose extras as needed (e.g., 'tf' for TensorFlow)
espeak-ng
is often required for phonemization in some languages:
sudo apt-get install espeak-ng # On Debian/Ubuntu
Always refer to the official README.md
in the GitHub repository for the latest and most detailed installation instructions.
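A quick way to verify the installation from Python is sketched below; TTS.__version__ is exposed by recent releases, and the getattr fallback guards against older ones where it may be absent:
import TTS
from TTS.api import TTS as CoquiTTS  # importing the API confirms the core dependencies are present

print("Coqui TTS version:", getattr(TTS, "__version__", "unknown"))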
Coqui TTS can be used via its command-line interface or Python API.
The tts
command is the primary CLI tool.
List available models:
tts --list_models
This will output a list of model strings like tts_models/en/ljspeech/tacotron2-DDC.
Synthesize speech with a specific model:
tts --text "Hello world, this is a test." \
--model_name "tts_models/en/ljspeech/tacotron2-DDC" \
--out_path output.wav
Synthesize speech using a multi-speaker model (listing speakers first):
tts --model_name "tts_models/en/vctk/vits" --list_speaker_idxs
# After identifying a speaker_id (e.g., p225)
tts --text "This is a test with a specific speaker." \
--model_name "tts_models/en/vctk/vits" \
--speaker_idx "p225" \
--out_path speaker_output.wav
Using XTTS for voice cloning: (Ensure you have an XTTS model downloaded/specified and a reference audio file.)
tts --text "Clone this voice for me." \
--model_name "tts_models/multilingual/multi-dataset/xtts_v2" \
--speaker_wav path/to/your/reference_audio.wav \
--language_idx en \
--out_path cloned_speech.wav
Start the demo TTS server (note the separate tts-server command):
tts-server --model_name "tts_models/en/ljspeech/tacotron2-DDC" --use_cuda true
This starts an HTTP server (on port 5002 by default) with a simple demo page, and you can also send synthesis requests to it directly. Check the documentation for the exact API endpoints.
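Once the server is running, you can request audio over HTTP. The sketch below is a minimal client; the /api/tts endpoint and its query parameters follow the bundled demo server and should be verified against your installed version:
import requests

# Ask the demo server to synthesize the given text and return WAV bytes
resp = requests.get(
    "http://localhost:5002/api/tts",
    params={"text": "Hello from the Coqui TTS demo server."},
    timeout=60,
)
resp.raise_for_status()

with open("server_output.wav", "wb") as f:
    f.write(resp.content)  # the response body is the synthesized WAV audio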
The Python API offers more fine-grained control.
import torch
from TTS.api import TTS
# Get device
device = "cuda" if torch.cuda.is_available() else "cpu"
# List available 🐸TTS models programmatically
# model_manager = TTS().list_models()  # In some versions this returns a ModelManager instance...
# available_models = model_manager.list_models()  # ...whose list_models() returns the model identifiers
# print(available_models)  # The exact listing API varies between versions; refer to the docs
# Example 1: Synthesize speech with a single-speaker model
# Initialize TTS with the target model name
tts_model_name = "tts_models/en/ljspeech/tacotron2-DDC" # Example model
tts = TTS(model_name=tts_model_name).to(device)
# Synthesize speech to a file
tts.tts_to_file(text="Hello, I am a test message generated by Coqui TTS.", file_path="output_api.wav")
# Example 2: Synthesize speech with a multi-speaker model
# tts_model_name_ms = "tts_models/en/vctk/vits" # Example multi-speaker model
# tts_ms = TTS(model_name=tts_model_name_ms).to(device)
# You might need to inspect tts_ms.speakers or use list_speaker_idxs via CLI to get valid speaker IDs
# if hasattr(tts_ms, "speakers") and tts_ms.speakers:
#     tts_ms.tts_to_file("This is a multi-speaker test.", speaker=tts_ms.speakers[0], file_path="output_ms.wav")
# else:
#     # Fallback: if the speaker list is not exposed, use a known speaker_id from the CLI or model card
#     print(f"Speaker IDs for {tts_model_name_ms} are not directly available; check the CLI or model card.")
# Example 3: Voice Cloning with XTTS
# Make sure you have an XTTS model identifier from `tts --list_models`
xtts_model_name = "tts_models/multilingual/multi-dataset/xtts_v2" # Ensure this is a valid identifier
tts_xtts = TTS(model_name=xtts_model_name).to(device)
# Clone voice from an audio file and synthesize text
tts_xtts.tts_to_file(
    text="This is a voice cloned message in English.",
    speaker_wav="path/to/your/reference_audio.wav",  # Provide a 3-10 second audio clip
    language="en",  # Specify the language of the text
    file_path="output_cloned_en.wav"
)
# Example for cross-lingual voice cloning with XTTS
# tts_xtts.tts_to_file(
#     text="Ce message est cloné en français.",  # French text
#     speaker_wav="path/to/your/reference_audio.wav",  # Reference audio can be in English or another supported lang
#     language="fr",  # Target language for synthesis
#     file_path="output_cloned_fr.wav"
# )
Note: Specific model names and API calls for speaker listing in loaded models might vary. Always check tts --list_models
for current model identifiers and the official documentation (https://tts.readthedocs.io/en/latest/) for precise API usage.
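Beyond writing files, the Python API can also return the raw waveform for further processing. Below is a minimal sketch; the synthesizer.output_sample_rate attribute path reflects common usage and may differ between versions:
import numpy as np
import soundfile as sf
from TTS.api import TTS

tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")

# tts() returns the synthesized waveform as a sequence of float samples
wav = tts.tts(text="Raw waveform output for post-processing.")

# The output sample rate is typically exposed by the underlying synthesizer
sample_rate = tts.synthesizer.output_sample_rate

sf.write("output_raw.wav", np.asarray(wav), sample_rate)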
Coqui TTS provides extensive support for training your own models or fine-tuning existing ones. This typically involves preparing a dataset and a configuration file (config.json) that defines the model architecture (e.g., VITS, Tacotron2), audio processing parameters, training parameters (batch size, learning rate), dataset paths, and the chosen vocoder. Training is then launched with the training script provided in the library (the exact script name might vary, e.g., TTS/bin/train_tts.py):
python TTS/bin/train_tts.py --config_path path/to/your/config.json
Detailed guides and recipes are available in the Coqui TTS documentation and often in the recipes
folder of the GitHub repository.
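For script-based training, the recipes typically build the config, dataset, and model in Python and hand them to the trainer. The following is a condensed sketch in the style of the LJSpeech GlowTTS recipe; paths are illustrative and config field names may differ between versions:
import os

from trainer import Trainer, TrainerArgs

from TTS.tts.configs.glow_tts_config import GlowTTSConfig
from TTS.tts.configs.shared_configs import BaseDatasetConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.glow_tts import GlowTTS
from TTS.tts.utils.text.tokenizer import TTSTokenizer
from TTS.utils.audio import AudioProcessor

output_path = "runs/glow_tts_ljspeech"  # illustrative output directory

# Point the dataset config at a local copy of LJSpeech (illustrative path)
dataset_config = BaseDatasetConfig(
    formatter="ljspeech",
    meta_file_train="metadata.csv",
    path="data/LJSpeech-1.1/",
)

# Model, audio, and training hyperparameters live in a single config object
config = GlowTTSConfig(
    batch_size=32,
    eval_batch_size=16,
    run_eval=True,
    epochs=1000,
    text_cleaner="phoneme_cleaners",
    use_phonemes=True,
    phoneme_language="en-us",
    phoneme_cache_path=os.path.join(output_path, "phoneme_cache"),
    output_path=output_path,
    datasets=[dataset_config],
)

# Initialize the audio processor and tokenizer from the config
ap = AudioProcessor.init_from_config(config)
tokenizer, config = TTSTokenizer.init_from_config(config)

# Load the training/eval samples and build the model
train_samples, eval_samples = load_tts_samples(dataset_config, eval_split=True)
model = GlowTTS(config, ap, tokenizer, speaker_manager=None)

# Hand everything to the trainer and start training
trainer = Trainer(
    TrainerArgs(),
    config,
    output_path,
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
)
trainer.fit()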
Coqui TTS offers a variety of pre-trained models for different languages, speakers, and model architectures. You can list these models using the CLI:
tts --list_models
This command provides the identifiers (e.g., tts_models/en/ljspeech/vits
, tts_models/multilingual/multi-dataset/xtts_v2
) needed to use them with the tts
command or the Python API. The models cover many languages. The claim of "1100+ languages" found in some older documentation likely refers to the broad multilingual capabilities and training data potential of models like XTTS, rather than distinct pre-trained models for each individual language. Always refer to the output of tts --list_models
for currently available models.
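Pretrained models can also be listed and downloaded programmatically through the model manager. A small sketch, assuming TTS.get_models_file_path() and the three-value return of download_model found in recent releases:
from TTS.api import TTS
from TTS.utils.manage import ModelManager

# The bundled .models.json registers all released models
manager = ModelManager(models_file=TTS.get_models_file_path(), progress_bar=True)

# Print every registered model identifier
for model_name in manager.list_models():
    print(model_name)

# Download (or reuse a cached copy of) a specific model; the returned paths
# are what the synthesizer loads under the hood
model_path, config_path, model_item = manager.download_model("tts_models/en/ljspeech/vits")
print("Model stored at:", model_path)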
The Coqui TTS library itself is licensed under the Mozilla Public License 2.0 (MPL-2.0), a file-level ("weak") copyleft license that allows use in proprietary software as long as modifications to MPL-licensed files are shared under the same license.
Important: Pre-trained models provided by Coqui.ai (especially newer ones like XTTS) often came with their own licenses, such as the Coqui Public Model License, which may have restrictions, including for commercial use. Always check the license associated with any specific pre-trained model you intend to use, usually found with the model on Hugging Face or the Coqui model zoo.
Even after Coqui.ai ceased operations, community channels, such as the GitHub repository's Discussions and Issues, remain active for users of the open-source TTS library.
Last updated: May 16, 2025