
TTS

A deep learning text-to-speech library supporting multiple languages and voice styles.

Coqui TTS: Deep Learning Toolkit for Text-to-Speech

Introduction

Coqui TTS (github.com/coqui-ai/TTS) is a powerful open-source library for deep learning-based Text-to-Speech (TTS) synthesis. Originally developed by Coqui.ai, the project provides a comprehensive toolkit for researchers, developers, and hobbyists to generate high-quality speech, train new TTS models, fine-tune existing ones, and perform voice cloning. Despite Coqui.ai ceasing operations in late 2023/early 2024, the 🐸TTS library remains available as an open-source project under the MPL-2.0 license, with ongoing community interest and development, including forks and related projects building upon its engine.

The library is known for its wide range of model architectures, support for multiple languages, multi-speaker capabilities, and advanced features like zero-shot voice cloning with models like XTTS.

Key Features

Coqui TTS offers a rich set of features for advanced speech synthesis:

  • Diverse Deep Learning Models:
    • Text-to-Spectrogram Models: Includes architectures like Tacotron, Tacotron2, Glow-TTS, and SpeedySpeech.
    • Vocoders: Provides various neural vocoders to convert mel-spectrograms into high-fidelity audio waveforms, such as MelGAN, Multiband-MelGAN, HiFi-GAN, WaveGrad, and UnivNet (a CLI example pairing a model with a vocoder follows this list).
    • End-to-End Models: Models such as VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) and XTTS generate audio directly from text in a single step, often offering higher quality and more natural prosody.
  • XTTS (Coqui XTTS):
    • A cutting-edge, multilingual, multi-speaker TTS model capable of high-quality voice cloning from very short audio samples (as little as 3-6 seconds).
    • Supports a wide range of languages (e.g., English, Spanish, French, German, Italian, Portuguese, Polish, Turkish, Russian, Dutch, Czech, Arabic, Chinese, Japanese, Hungarian, Korean, Hindi).
    • XTTS-v2 is a notable version. (Note: XTTS models released by Coqui.ai were often under a non-commercial Coqui Public Model License; users should verify licenses for specific pre-trained XTTS models they use).
  • Multi-lingual Support: Many pre-trained models and the underlying architecture support synthesizing speech in numerous languages.
  • Multi-speaker TTS: Ability to generate speech in different voices using speaker embeddings or dedicated multi-speaker models. For a given multi-speaker model, the tts CLI can list its available speaker IDs via --list_speaker_idxs.
  • Voice Cloning: Advanced capabilities, especially with XTTS, to clone a target voice from a short audio reference and synthesize new speech in that voice, even across languages (cross-lingual voice cloning).
  • Training & Fine-tuning: Provides comprehensive scripts, recipes, and utilities for:
    • Training new TTS models from scratch on custom datasets.
    • Fine-tuning existing pre-trained models to adapt to new voices, languages, or speaking styles.
    • Tools for dataset analysis and curation.
  • Pre-trained Models: Offers a wide variety of pre-trained models covering many languages and voices, accessible via the library. These can be listed using the CLI.
  • Command-Line Interface (CLI): A user-friendly tts command for:
    • Listing available models.
    • Synthesizing speech from text to a file.
    • Specifying models, vocoders, speaker IDs (for multi-speaker models), and language.
    • Starting a TTS demo server (via the companion tts-server command).
  • Python API: A flexible Python API for programmatic control over TTS synthesis, model loading, speaker selection, voice cloning, and integration into applications.
  • TTS Server Mode: Ability to run TTS models as an HTTP server, allowing other applications to request speech synthesis via API calls.
  • Speaker Encoder: Includes tools to compute speaker embeddings from audio samples, which are crucial for multi-speaker TTS and voice cloning.
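
As noted in the vocoder bullet above, the two-stage pipeline (text-to-spectrogram model plus neural vocoder) can be exercised directly from the command line by pairing --model_name with --vocoder_name. The sketch below is illustrative; the vocoder identifier shown is an assumption, so confirm it against the output of tts --list_models before use.

    tts --text "A two-stage synthesis example." \
        --model_name "tts_models/en/ljspeech/tacotron2-DDC" \
        --vocoder_name "vocoder_models/en/ljspeech/hifigan_v2" \
        --out_path two_stage_output.wav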

Current Status of the Project

Following the shutdown of Coqui.ai, the coqui-ai/TTS GitHub repository remains a valuable open-source resource. While active development and official support from the original Coqui.ai team have ceased, the project continues to see community engagement, including forks, discussions, and issue tracking on GitHub. Some community members and organizations may continue to build upon or maintain aspects of the library. Projects like "AllTalk TTS" have emerged, based on the Coqui TTS engine, aiming to provide continued support and new features. Users should check the GitHub repository's "Discussions" and "Issues" sections for the latest community insights and developments.

Installation

Coqui TTS is primarily a Python library and can be installed using pip:

  1. Basic Installation (for inference with PyTorch models):
    pip install TTS
    
  2. Installation with all dependencies (for training, development, or specific backends): It's recommended to install within a virtual environment; a minimal sketch follows this list.
    git clone https://github.com/coqui-ai/TTS.git
    cd TTS
    pip install -e ".[all,dev,notebooks]" # Choose extras as needed (e.g., 'tf' for TensorFlow)
    
  3. System Dependencies:
    • espeak-ng is often required for phonemization in some languages:
      sudo apt-get install espeak-ng # On Debian/Ubuntu
      
    • Other system dependencies might be needed depending on your OS and the features you intend to use.
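
A minimal sketch of a clean installation inside a virtual environment (assuming a Unix-like shell and a Python version supported by the library; the environment name is arbitrary):

    python3 -m venv coqui-tts-env
    source coqui-tts-env/bin/activate
    pip install TTS
    tts --list_models   # quick sanity check that the CLI is available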

Always refer to the official README.md in the GitHub repository for the latest and most detailed installation instructions.

Usage

Coqui TTS can be used via its command-line interface or Python API.

Command-Line Interface (CLI)

The tts command is the primary CLI tool.

  • List available models:

    tts --list_models
    

    This will output a list of model strings like tts_models/en/ljspeech/tacotron2-DDC.

  • Synthesize speech with a specific model:

    tts --text "Hello world, this is a test." \
        --model_name "tts_models/en/ljspeech/tacotron2-DDC" \
        --out_path output.wav
    
  • Synthesize speech using a multi-speaker model (listing speakers first):

    tts --model_name "tts_models/en/vctk/vits" --list_speaker_idxs
    # After identifying a speaker_id (e.g., p225)
    tts --text "This is a test with a specific speaker." \
        --model_name "tts_models/en/vctk/vits" \
        --speaker_idx "p225" \
        --out_path speaker_output.wav
    
  • Using XTTS for voice cloning: (Ensure you have an XTTS model downloaded/specified and a reference audio file.)

    tts --text "Clone this voice for me." \
        --model_name "tts_models/multilingual/multi-dataset/xtts_v2" \
        --speaker_wav path/to/your/reference_audio.wav \
        --language_idx en \
        --out_path cloned_speech.wav
    
  • Start the TTS demo server (via the companion tts-server command):

    tts-server --model_name "tts_models/en/ljspeech/tacotron2-DDC" --use_cuda true
    

    This will start an HTTP server (default on port 5002) that you can send synthesis requests to. Check the documentation for the exact API endpoints.
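
    Once running, other applications can request synthesis over HTTP. A hedged example using curl is shown below; the /api/tts endpoint and its text query parameter reflect the demo server shipped with recent releases, so confirm the exact API against the server documentation for your version.

    curl -G --output server_output.wav \
         --data-urlencode "text=Hello from the Coqui TTS server." \
         http://localhost:5002/api/tts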

Python API

The Python API offers more fine-grained control.

import torch
from TTS.api import TTS

# Get device
device = "cuda" if torch.cuda.is_available() else "cpu"

# List available 🐸TTS models programmatically
# (the return type of list_models() has changed between releases; refer to the docs)
# print(TTS().list_models())

# Example 1: Synthesize speech with a single-speaker model
# Initialize TTS with the target model name
tts_model_name = "tts_models/en/ljspeech/tacotron2-DDC" # Example model
tts = TTS(model_name=tts_model_name).to(device)

# Synthesize speech to a file
tts.tts_to_file(text="Hello, I am a test message generated by Coqui TTS.", file_path="output_api.wav")

# Example 2: Synthesize speech with a multi-speaker model
# tts_model_name_ms = "tts_models/en/vctk/vits" # Example multi-speaker model
# tts_ms = TTS(model_name=tts_model_name_ms).to(device)
# You might need to inspect tts_ms.speakers or use list_speaker_idxs via CLI to get valid speaker IDs
# if hasattr(tts_ms, 'speakers') and tts_ms.speakers:
#    tts_ms.tts_to_file("This is a multi-speaker test.", speaker=tts_ms.speakers[0], file_path="output_ms.wav")
# else: # Fallback if direct speaker list is not available, use a known speaker_id if possible
#    print(f"Speaker IDs for {tts_model_name_ms} not directly available in this example, check CLI or model card.")


# Example 3: Voice Cloning with XTTS
# Make sure you have an XTTS model identifier from `tts --list_models`
xtts_model_name = "tts_models/multilingual/multi-dataset/xtts_v2" # Ensure this is a valid identifier
tts_xtts = TTS(model_name=xtts_model_name).to(device)

# Clone voice from an audio file and synthesize text
tts_xtts.tts_to_file(
    text="This is a voice cloned message in English.",
    speaker_wav="path/to/your/reference_audio.wav", # Provide a 3-10 second audio clip
    language="en", # Specify the language of the text
    file_path="output_cloned_en.wav"
)

# Example for cross-lingual voice cloning with XTTS
# tts_xtts.tts_to_file(
#     text="Ce message est cloné en français.", # French text
#     speaker_wav="path/to/your/reference_audio.wav", # Reference audio can be in English or another supported lang
#     language="fr", # Target language for synthesis
#     file_path="output_cloned_fr.wav"
# )

Note: Specific model names and API calls for speaker listing in loaded models might vary. Always check tts --list_models for current model identifiers and the official documentation (https://tts.readthedocs.io/en/latest/) for precise API usage.
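
For applications that need to post-process audio before saving it, the API can also return the waveform in memory via tts.tts(). The snippet below is a small sketch: soundfile is assumed to be installed (it is a dependency of TTS), and the sample rate is read from the model's underlying synthesizer, which recent releases expose as output_sample_rate.

import soundfile as sf
import torch
from TTS.api import TTS

device = "cuda" if torch.cuda.is_available() else "cpu"
tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC").to(device)

# tts() returns the raw waveform samples instead of writing a file
wav = tts.tts(text="Synthesized entirely in memory.")

# Output sample rate of the loaded model (assumed attribute; check your installed version)
sample_rate = tts.synthesizer.output_sample_rate

sf.write("output_in_memory.wav", wav, sample_rate)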

Training Custom Models & Fine-tuning

Coqui TTS provides extensive support for training your own models or fine-tuning existing ones. This typically involves:

  1. Dataset Preparation: Collecting and formatting your audio and text data according to the requirements (e.g., audio clips and corresponding transcripts, often in LJSpeech or a similar format; an example metadata layout follows this list).
  2. Configuration: Setting up a configuration file (config.json) that defines the model architecture (e.g., VITS, Tacotron2), audio processing parameters, training parameters (batch size, learning rate), dataset paths, and chosen vocoder.
  3. Training: Using the training scripts provided in the library (e.g., TTS/bin/train_tts.py) or one of the Python recipes in the recipes folder.
    python TTS/bin/train_tts.py --config_path path/to/your/config.json
    
  4. Monitoring: Using TensorBoard to monitor training progress, loss curves, and synthesized audio samples during training.
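
For reference (as mentioned in step 1), an LJSpeech-style dataset is a folder of audio clips (typically in a wavs/ subfolder) plus a metadata.csv file in which each line is pipe-separated: file ID, raw transcript, and optionally a normalized transcript. The file names and sentences below are placeholders:

    sample_0001|Dr. Smith arrived at 10 a.m.|Doctor Smith arrived at ten a m.
    sample_0002|The quick brown fox jumps over the lazy dog.|The quick brown fox jumps over the lazy dog.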

Detailed guides and recipes are available in the Coqui TTS documentation and often in the recipes folder of the GitHub repository.

Pre-trained Models

Coqui TTS offers a variety of pre-trained models for different languages, speakers, and model architectures. You can list these models using the CLI:

tts --list_models

This command provides the identifiers (e.g., tts_models/en/ljspeech/vits, tts_models/multilingual/multi-dataset/xtts_v2) needed to use them with the tts command or the Python API. The models cover many languages. The "1100+ languages" figure cited in the project README refers to the library's integration of Fairseq (Massively Multilingual Speech) VITS models, which are loaded with identifiers of the form tts_models/<lang-iso_code>/fairseq/vits, rather than to distinct Coqui-trained models for each individual language. Always refer to the output of tts --list_models for currently available models.
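
As a minimal sketch of using one of these Fairseq voices from the Python API (assuming your installed version includes the Fairseq integration and can download the model on first use; "deu" is the ISO 639-3 code for German):

from TTS.api import TTS

# Load a Fairseq/MMS VITS voice by its language code
fairseq_tts = TTS(model_name="tts_models/deu/fairseq/vits")

# German for "Hello, this is a test."
fairseq_tts.tts_to_file(text="Hallo, dies ist ein Test.", file_path="fairseq_deu.wav")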

Hardware Requirements

  • Inference (Synthesizing Speech):
    • CPU: Possible for most models, but can be slow, especially for more complex models (like VITS, XTTS) or longer texts.
    • GPU: Highly recommended for faster inference. A modern NVIDIA GPU with a few GBs of VRAM (e.g., 4-8GB) is often sufficient for many pre-trained models. For XTTS-v2, consumer-grade GPUs can provide good performance. VRAM requirements depend heavily on the specific model (a quick GPU check follows this list).
  • Training/Fine-tuning:
    • GPU: Almost always requires a powerful NVIDIA GPU (or multiple GPUs) with substantial VRAM. For example, training YourTTS (a multi-speaker, multi-lingual model by Coqui) used an NVIDIA V100 32GB, though fine-tuning was reported as possible on 11GB VRAM cards with smaller batch sizes. Modern models like VITS or XTTS will also benefit from high VRAM (16GB+ often recommended for serious training).
    • RAM: Significant system RAM (32GB+, sometimes 64GB or more for large datasets) is usually needed.
    • Storage: Large amounts of disk space for datasets (can be 10s to 100s of GBs), model checkpoints, and environments. SSDs are highly recommended.
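
As referenced above, a quick way to confirm that PyTorch can see a GPU and how much VRAM it offers (a sketch using the standard torch API, not a Coqui-specific utility):

import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1024**3:.1f} GiB")
else:
    print("No CUDA GPU detected; TTS will fall back to CPU.")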

License

The Coqui TTS library itself is licensed under the Mozilla Public License 2.0 (MPL-2.0), a file-level (weak) copyleft license: it allows use in proprietary software as long as modifications to MPL-licensed files are shared.

Important: Pre-trained models provided by Coqui.ai (especially newer ones like XTTS) often came with their own licenses, such as the Coqui Public Model License, which may have restrictions, including for commercial use. Always check the license associated with any specific pre-trained model you intend to use, usually found with the model on Hugging Face or the Coqui model zoo.

Community & Support

Even after Coqui.ai ceased operations, community channels remain active for users of the open-source TTS library:

  • GitHub Discussions: For questions, sharing projects, and general discussion. This is likely the most active place for current information.
  • GitHub Issues: For reporting bugs and tracking technical problems.
  • Archived Community Channels: Coqui.ai previously had official Gitter and later a Discord server. While official support through these may have ended, archives or community-led continuations might exist. Check pinned issues or discussions on GitHub for any current community-preferred channels (e.g., Matrix was also mentioned).

Ethical Considerations & Limitations

  • Voice Cloning Ethics: The powerful voice cloning capabilities (especially with XTTS) raise ethical concerns regarding potential misuse for creating deepfakes, impersonation, or unauthorized use of someone's voice. Responsible use and adherence to ethical guidelines are paramount.
  • Data Privacy: When training models on custom voice data, ensure you have the necessary rights and permissions for that data. Using the library locally for inference can offer good data privacy.
  • Model Quality & Artifacts: While Coqui TTS aims for high quality, synthesized speech can sometimes contain artifacts or sound unnatural depending on the model, data quality, and input text.
  • Bias: Like other machine learning models, TTS models can inherit biases from their training data, potentially affecting accent, prosody, or voice characteristics for underrepresented groups.
  • Computational Resources: Training and even high-quality inference can be computationally expensive, requiring specific hardware (primarily GPUs).
  • License Restrictions on Pre-trained Models: Be aware of the specific licenses for any pre-trained models you use (especially those originally released by Coqui.ai, like XTTS), as they may differ from the MPL-2.0 license of the library itself and could restrict commercial use.

Last updated: May 16, 2025
