Bark

A text-to-audio generation model that can generate highly realistic multi-language speech and sound effects.

Bark: Generative Audio Model by Suno AI

Introduction

Bark (github.com/suno-ai/bark) is a transformer-based text-to-audio model developed by Suno AI, released in April 2023. It stands out for its ability to generate highly realistic, multilingual speech as well as other audio modalities, including music, background noise, and simple sound effects. Unlike traditional Text-to-Speech (TTS) systems that primarily focus on clean speech, Bark is a more general generative audio model, capable of producing a wide array of audio outputs from text prompts, including non-verbal cues like laughter, sighs, and crying.

The model is open-source (MIT License) and has gained popularity for its natural-sounding output and versatility. While Suno AI's primary product focus has shifted more towards music generation (with their Suno music platform), Bark remains a significant open-source contribution to the field of generative audio.

Key Features

Bark offers a unique set of features that distinguish it in the landscape of audio generation models:

  • Transformer-Based Generative Model: Utilizes a GPT-style architecture, processing text prompts to generate audio from scratch. It's composed of four main models:
    1. Text Model (Semantic): A causal auto-regressive transformer that predicts semantic tokens from input text.
    2. Coarse Acoustics Model: A causal auto-regressive transformer that predicts the first two audio codebooks (from EnCodec) based on the semantic tokens.
    3. Fine Acoustics Model: A non-causal auto-encoder transformer that iteratively predicts the subsequent audio codebooks.
    4. Audio Codec (EnCodec): The predicted codebooks are passed to EnCodec, which decodes them into the final audio waveform.
  • Highly Realistic Speech: Capable of generating very natural and expressive speech in multiple languages.
  • Multilingual Support: Supports a variety of languages out-of-the-box, often determining the language automatically from the input text. Quality is generally best for English, with support for other languages continually improving.
  • Non-Speech Sounds: A standout feature is its ability to generate non-verbal communications and other sounds directly from text prompts, such as:
    • Laughter: [laughs]
    • Sighs: [sighs]
    • Crying: [cries]
    • Gasps: [gasps]
    • Clears throat: [clears throat]
    • Music: ♪ [text describing music, e.g., "upbeat techno music"] ♪ or ♪ In the jungle, the mighty jungle... ♪
    • Hesitations: (e.g., uhm..., hmm)
    • Other sound effects: (e.g., [metallic clang], [siren])
  • Voice/Speaker Variety via History Prompts:
    • Speaker Presets: Provides over 100 built-in speaker voice presets across supported languages (e.g., v2/en_speaker_0 through v2/en_speaker_9 for English, with similar conventions for other languages like v2/de_speaker_..., v2/es_speaker_..., etc.).
    • History Prompts (Voice Styling): Users can provide an audio snippet (a "history prompt") to guide the vocal style, tone, and acoustic characteristics of the generated audio. This allows for a form of voice stylization or emulation but is not custom voice cloning in the sense of creating a persistent, replicable model of any specific individual's voice from a few seconds of audio. The model attempts to match the acoustic qualities of the prompt.
  • Implicit Prosody and Emotion Control: Because Bark generates audio end to end, it often picks up emotional nuance and prosody from the text or from the history prompt more naturally than some traditional TTS systems. Emotionally explicit wording in the prompt (e.g., "I am really sad,") can sometimes influence delivery, although the effect is not guaranteed.
  • Model Sizes:
    • Large Model (suno/bark): Offers the highest quality, typically requires around 12GB of GPU VRAM.
    • Small Model (suno/bark-small): A faster, more lightweight version that trades some quality for reduced resource usage, designed to fit in ~8GB VRAM (or less with optimizations). This can be enabled in the bark library by setting the environment variable SUNO_USE_SMALL_MODELS=True.
  • Open Source: Licensed under the MIT License, allowing for broad use and modification.
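
The prompt conventions listed above (bracketed cues, ♪-wrapped lyrics, and v2 speaker presets) can be assembled with a few small helpers. This is a minimal sketch: the helper names are hypothetical and not part of Bark, and the 0 to 9 index range for non-English presets is an assumption extrapolated from the English presets.

```python
# Language codes that have speaker presets (English plus the languages named in the FAQ).
PRESET_LANGUAGES = ("en", "de", "es", "fr", "hi", "it", "ja", "ko", "pl", "pt", "ru", "tr", "zh")


def speaker_preset(lang: str, index: int) -> str:
    """Build a preset name such as 'v2/en_speaker_2' (hypothetical helper)."""
    if lang not in PRESET_LANGUAGES:
        raise ValueError(f"no known presets for language code {lang!r}")
    if not 0 <= index <= 9:
        # Assumption: every language follows the English 0..9 numbering.
        raise ValueError("speaker index must be between 0 and 9")
    return f"v2/{lang}_speaker_{index}"


def cue(name: str) -> str:
    """Wrap a non-speech cue in square brackets, e.g. 'laughs' -> '[laughs]'."""
    return f"[{name}]"


def sung(lyrics: str) -> str:
    """Wrap lyrics in music notes so the model is nudged towards singing."""
    return f"♪ {lyrics.strip()} ♪"


def build_prompt(*parts: str) -> str:
    """Join the non-empty parts of a prompt with single spaces."""
    return " ".join(p.strip() for p in parts if p.strip())


prompt = build_prompt("Well, that went better than expected.", cue("laughs"), sung("la la la"))
preset = speaker_preset("en", 2)
```

The resulting `prompt` and `preset` strings can be passed as the text and `history_prompt` arguments of `generate_audio`. Because the model is generative, cues are hints rather than guaranteed instructions.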

Specific Use Cases

Bark's unique capabilities make it suitable for a range of applications:

  • Expressive Voiceovers: Creating rich voiceovers for videos, audiobooks, and podcasts, complete with emotional nuances and non-verbal cues.
  • Game Development: Generating character voices, ambient sounds, and short musical cues for immersive game experiences.
  • Interactive Applications: Powering voice responses in chatbots or virtual assistants that require more natural and varied audio output.
  • Content Creation: Quickly generating audio snippets, sound effects, or short musical pieces for social media, presentations, or artistic projects.
  • Prototyping Voice UIs: Developing and testing voice-based user interfaces with realistic audio feedback.
  • Research in Generative Audio: Providing a powerful open-source model for researchers studying TTS, audio synthesis, and generative AI.
  • Accessibility: Generating spoken versions of text with more natural intonation and expression.

Installation

Bark can be installed directly from its GitHub repository or used via the Hugging Face Transformers library.

  1. Direct Installation from GitHub (Recommended by Suno for the bark library):

    • Important: Do NOT use pip install bark as this installs a different, unrelated package.
    • Using pip with git:
      pip install git+https://github.com/suno-ai/bark.git
      
    • Or, clone the repository and install locally:
      git clone https://github.com/suno-ai/bark.git
      cd bark
      pip install .
      
  2. Using with Hugging Face Transformers: Bark is also integrated into the Hugging Face Transformers library (version 4.31.0 or later).

    pip install transformers accelerate scipy
    

    You might also need torch and torchaudio if not already installed.

  3. Dependencies:

    • PyTorch (version 2.0+ recommended).
    • Other Python packages listed in requirements.txt of the Bark repository or as dependencies for Hugging Face Transformers.
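
Because an unrelated package on PyPI shares the name `bark`, it can be worth checking which one is installed. The sketch below is a heuristic, not an official check: it assumes that Suno's package can be recognised by its `bark.generation` submodule, and the function name and return strings are hypothetical.

```python
import importlib.util


def bark_install_status() -> str:
    """Return 'missing', 'suno-bark', or 'unrelated-package' (heuristic)."""
    if importlib.util.find_spec("bark") is None:
        return "missing"
    try:
        # Looking up a submodule imports the parent package, which can fail
        # if the installed 'bark' is a different project or is broken.
        has_generation = importlib.util.find_spec("bark.generation") is not None
    except Exception:
        has_generation = False
    return "suno-bark" if has_generation else "unrelated-package"


status = bark_install_status()
print(f"bark install status: {status}")
```

If the result is `unrelated-package`, uninstalling `bark` and reinstalling from the GitHub repository as described above is the likely fix.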

Usage (Python API)

Bark is primarily used via its Python API.

Using the bark library directly:

import os

# Optional settings. Bark reads these when it is imported, so set them before the import below.
# os.environ["SUNO_USE_SMALL_MODELS"] = "True"  # Use the smaller checkpoints (less VRAM, lower quality)
# os.environ["SUNO_OFFLOAD_CPU"] = "True"  # Keep idle sub-models on the CPU to reduce VRAM use

from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav

# Download and load all models
preload_models() # Recommended to run once to download models

text_prompt = "Hello, my name is Bark. And I can sing! ♪ In the moonlight, shadows dancing... ♪ [laughs]"
# Generate audio from text
audio_array = generate_audio(text_prompt, history_prompt="v2/en_speaker_2") # Using a built-in speaker preset

# Save audio to disk
write_wav("bark_generation.wav", SAMPLE_RATE, audio_array)

# Example with a different speaker and non-speech sound
text_prompt_2 = "This is another test. [sighs] I hope it sounds good."
audio_array_2 = generate_audio(text_prompt_2, history_prompt="v2/es_speaker_5") # Spanish speaker preset
write_wav("bark_generation_es.wav", SAMPLE_RATE, audio_array_2)
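
For longer passages, a common approach is to generate one short chunk at a time with the same speaker preset and join the results. The sketch below shows one way to do that. Only `generate_audio` and `SAMPLE_RATE` come from the bark library; `split_sentences`, `generate_long`, the 200-character chunk limit, and the 0.25-second pause are illustrative assumptions.

```python
import re


def split_sentences(text: str, max_chars: int = 200) -> list:
    """Split on sentence-ending punctuation, then pack sentences into chunks
    of at most max_chars characters (a single long sentence stays whole)."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def generate_long(text: str, speaker: str = "v2/en_speaker_6", pause_s: float = 0.25):
    """Generate each chunk separately and concatenate with short silences."""
    # Imported lazily so the splitter can be used without Bark installed.
    import numpy as np
    from bark import SAMPLE_RATE, generate_audio

    silence = np.zeros(int(pause_s * SAMPLE_RATE), dtype=np.float32)
    pieces = []
    for chunk in split_sentences(text):
        pieces.append(generate_audio(chunk, history_prompt=speaker))
        pieces.append(silence)
    return np.concatenate(pieces) if pieces else silence[:0]
```

Reusing one speaker preset for every chunk helps keep the voice consistent, although some variation between chunks should still be expected.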

Using with Hugging Face Transformers:

from transformers import AutoProcessor, BarkModel
import scipy.io.wavfile
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load processor and model (e.g., suno/bark or suno/bark-small)
processor = AutoProcessor.from_pretrained("suno/bark")
model = BarkModel.from_pretrained("suno/bark").to(device)

# Or for the smaller model:
# processor = AutoProcessor.from_pretrained("suno/bark-small")
# model = BarkModel.from_pretrained("suno/bark-small").to(device)

# Optional GPU optimizations (uncomment as needed):
# model = model.to_bettertransformer()  # Requires the optimum package; availability depends on your transformers version
# model = model.half()  # Half precision on CUDA; cast the generated audio to float32 before writing the WAV file

text_prompt = "Hey there! I'm generating audio using Hugging Face. [clears throat] This is pretty cool. ♪ la la la ♪"
voice_preset = "v2/en_speaker_6"  # Example English speaker preset

inputs = processor(text_prompt, voice_preset=voice_preset, return_tensors="pt").to(device)

# Generate audio
# The processor output already includes an attention mask, which generate() receives through **inputs.
# For very long texts, consider generating sentence by sentence to avoid quality degradation or memory issues.
audio_array = model.generate(**inputs, do_sample=True, fine_temperature=0.7, coarse_temperature=0.4, pad_token_id=processor.tokenizer.pad_token_id)

# Convert to a float32 NumPy array and save
audio_np = audio_array.cpu().float().numpy().squeeze()
sample_rate = model.generation_config.sample_rate
scipy.io.wavfile.write("bark_hf_generation.wav", rate=sample_rate, data=audio_np)
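
Writing a float32 array with `scipy.io.wavfile.write` produces a 32-bit float WAV file. If a player or downstream tool expects 16-bit PCM instead, the samples can be converted first. The following is a standard-library sketch under that assumption; `float_to_pcm16` and `write_wav_pcm16` are hypothetical helper names, and the input is assumed to be mono audio in the range [-1.0, 1.0].

```python
import sys
import wave
from array import array


def float_to_pcm16(samples) -> bytes:
    """Clip floats to [-1.0, 1.0] and encode them as little-endian 16-bit PCM."""
    pcm = array("h", (int(max(-1.0, min(1.0, float(s))) * 32767) for s in samples))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV sample data is little-endian
    return pcm.tobytes()


def write_wav_pcm16(path: str, samples, sample_rate: int) -> None:
    """Write mono 16-bit PCM audio to a WAV file."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16 bit
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(float_to_pcm16(samples))
```

With the variables from the example above, a call would look like `write_wav_pcm16("bark_hf_generation_pcm16.wav", audio_np, sample_rate)`.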

Hardware Requirements

  • CPU: Bark can run on CPU, but generation will be significantly slower compared to GPU.
  • GPU: Highly recommended for reasonable generation speeds.
    • VRAM:
      • Large Model (suno/bark): Approximately 12GB of GPU VRAM is typically needed for the full models.
      • Small Model (suno/bark-small or using SUNO_USE_SMALL_MODELS=True): Designed to fit in around 8GB of VRAM. Some community reports and optimizations suggest it might run on GPUs with as little as 4-6GB VRAM, especially with techniques like CPU offloading (SUNO_OFFLOAD_CPU=True) or half-precision, though performance might vary.
  • System RAM: A decent amount of system RAM (e.g., 16GB+) is advisable, especially if running larger models or alongside other applications.
  • PyTorch Version: PyTorch 2.0+ is generally recommended.
  • CUDA Version: CUDA 11.7 or 12.0 (or compatible versions) if using NVIDIA GPUs.
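
The VRAM guidance above can be turned into a simple selection rule. This is a rough sketch, not an official recommendation: the thresholds are the approximate figures quoted in this section, the function names and returned keys are hypothetical, and falling back to the small checkpoint on CPU is an assumption.

```python
from typing import Optional


def choose_bark_setup(vram_gb: Optional[float]) -> dict:
    """Pick a checkpoint and memory-saving options from the available VRAM."""
    if vram_gb is None:
        # No GPU detected: assume the small checkpoint on CPU (slow but lighter).
        return {"checkpoint": "suno/bark-small", "device": "cpu",
                "half_precision": False, "cpu_offload": False}
    if vram_gb >= 12:
        return {"checkpoint": "suno/bark", "device": "cuda",
                "half_precision": False, "cpu_offload": False}
    if vram_gb >= 8:
        return {"checkpoint": "suno/bark-small", "device": "cuda",
                "half_precision": False, "cpu_offload": False}
    # Below ~8GB: small checkpoint plus half precision and CPU offloading.
    return {"checkpoint": "suno/bark-small", "device": "cuda",
            "half_precision": True, "cpu_offload": True}


def detect_vram_gb() -> Optional[float]:
    """Return the first GPU's total memory in decimal GB, or None without CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_properties(0).total_memory / 1e9
```

Calling `choose_bark_setup(detect_vram_gb())` returns a dictionary that can guide which checkpoint to load. Actual memory use depends on prompt length and library versions, so the thresholds should be treated as starting points.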

License

Bark is released under the MIT License. This is a permissive open-source license that allows for broad use, including modification, distribution, and commercial applications. However, Suno AI notes that users are responsible for the content they generate and should use the model ethically.

Frequently Asked Questions (FAQ)

Q1: What is Bark?
A1: Bark is a transformer-based text-to-audio model by Suno AI that can generate realistic multilingual speech, music, sound effects, and other non-verbal sounds from text prompts.

Q2: Can Bark clone any voice from a few seconds of audio?
A2: No, Bark does not support custom voice cloning in the sense of creating a new, persistent model of any specific user's voice from a short sample. It uses "history prompts" (which can be pre-defined speaker presets or user-provided audio snippets) to influence the style, tone, and acoustic characteristics of the generated audio. This allows for voice variety and styling but is different from deep, targeted voice cloning.

Q3: What are non-speech sounds in Bark?
A3: These are audio elements other than spoken words that Bark can generate. Examples include laughter [laughs], sighs [sighs], crying [cries], musical notes or descriptions ♪ ... ♪, and other sound effects. These are typically prompted using special tokens or descriptions within the input text.

Q4: What hardware do I need to run Bark?
A4: While Bark can run on CPU (slowly), a GPU is highly recommended. For the full model, about 12GB of VRAM is ideal. A smaller version targets ~8GB VRAM and might run on less with optimizations like CPU offloading or half-precision.

Q5: How do I install Bark?
A5: The recommended way is pip install git+https://github.com/suno-ai/bark.git. Do not use pip install bark as it's an unrelated package. Alternatively, you can use it via the Hugging Face Transformers library.

Q6: Is Bark free to use?
A6: Yes, the Bark model and code released by Suno AI on GitHub are open-source under the MIT license, making them free to use.

Q7: Which languages does Bark support?
A7: Bark supports multiple languages out-of-the-box and often detects the language automatically from the text. English generally has the highest quality, but many other languages are supported with varying degrees of naturalness. There are speaker presets for languages like German (de), Spanish (es), French (fr), Hindi (hi), Italian (it), Japanese (ja), Korean (ko), Polish (pl), Portuguese (pt), Russian (ru), Turkish (tr), and Chinese (zh).

Ethical Considerations & Safety

  • Responsible Use: Suno AI emphasizes that users are responsible for the content they generate with Bark.
  • Potential for Misuse: As a powerful generative audio model, Bark could be misused to create misleading audio content (deepfakes). Ethical usage is crucial.
  • No Custom Voice Cloning: The lack of deep custom voice cloning for arbitrary voices mitigates some, but not all, risks associated with voice impersonation. The "history prompt" feature should still be used responsibly.
  • Output Variability: Being a generative model, outputs can sometimes be unexpected. The model does not have explicit content moderation beyond what might be inherent in its training data.

Last updated: May 16, 2025
