A text-to-audio generation model that can generate highly realistic multi-language speech and sound effects.
Bark (github.com/suno-ai/bark) is a transformer-based text-to-audio model developed by Suno AI, released in April 2023. It stands out for its ability to generate highly realistic, multilingual speech as well as other audio modalities, including music, background noise, and simple sound effects. Unlike traditional Text-to-Speech (TTS) systems that primarily focus on clean speech, Bark is a more general generative audio model, capable of producing a wide array of audio outputs from text prompts, including non-verbal cues like laughter, sighs, and crying.
The model is open-source (MIT License) and has gained popularity for its natural-sounding output and versatility. While Suno AI's primary product focus has shifted more towards music generation (with their Suno music platform), Bark remains a significant open-source contribution to the field of generative audio.
Bark offers a unique set of features that distinguish it in the landscape of audio generation models:
Non-verbal sounds: bracketed cues such as [laughs], [sighs], [cries], [gasps], and [clears throat].
Music and singing: text wrapped in music notes, either a description such as ♪ [text describing music, e.g., "upbeat techno music"] ♪ or lyrics such as ♪ In the jungle, the mighty jungle... ♪.
Hesitations and fillers (e.g., uhm..., hmm).
Sound effects (e.g., [metallic clang], [siren]).
Speaker presets: v2/en_speaker_0 through v2/en_speaker_9 for English, with similar conventions for other languages (v2/de_speaker_..., v2/es_speaker_..., etc.).
Emotional hints: a bracketed phrase such as [I am really sad,] can sometimes influence emotional delivery.
Full model (suno/bark): Offers the highest quality and typically requires around 12GB of GPU VRAM.
Small model (suno/bark-small): A faster, more lightweight version that trades some quality for reduced resource usage, designed to fit in ~8GB VRAM (or less with optimizations). This can be enabled in the bark library by setting the environment variable SUNO_USE_SMALL_MODELS=True.
Bark's unique capabilities make it suitable for a range of applications.
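The prompt conventions in the feature list (bracketed cues and lyrics wrapped in music notes) can be assembled with a few string helpers. The sketch below is illustrative only: cue, sung, and build_prompt are hypothetical names, not part of the bark library, and Bark itself simply receives the resulting string.

```python
def cue(name: str) -> str:
    """Format a non-speech cue such as laughs or sighs as a bracketed tag."""
    return f"[{name.strip().strip('[]')}]"


def sung(lyrics: str) -> str:
    """Wrap lyrics in music notes, the convention the article uses for singing."""
    return f"♪ {lyrics.strip()} ♪"


def build_prompt(*parts: str) -> str:
    """Join text, cues, and sung segments with single spaces, skipping empty parts."""
    return " ".join(p.strip() for p in parts if p and p.strip())


# Hypothetical helper names; the output is an ordinary text prompt.
prompt = build_prompt(
    "Hello, my name is Bark.",
    cue("laughs"),
    "And I can sing!",
    sung("In the jungle, the mighty jungle..."),
)
print(prompt)
```

Keeping the cues as plain strings makes it easy to experiment, since how reliably a given cue is rendered can vary from one generation to the next.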
Bark can be installed directly from its GitHub repository or used via the Hugging Face Transformers library.
Direct Installation from GitHub (Recommended by Suno for the bark library):
Do not run pip install bark, as this installs a different, unrelated package.
pip install git+https://github.com/suno-ai/bark.git
Alternatively, clone the repository and install it locally:
git clone https://github.com/suno-ai/bark.git
cd bark
pip install .
Using with Hugging Face Transformers: Bark is also integrated into the Hugging Face Transformers library (version 4.31.0 or later).
pip install transformers accelerate scipy
You might also need torch and torchaudio if they are not already installed.
Dependencies: The required packages are listed in the requirements.txt of the Bark repository or are installed as dependencies of Hugging Face Transformers.
Bark is primarily used via its Python API.
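Because pip install bark pulls in an unrelated package, it can be worth confirming which module is actually importable before running the examples. The following is a rough heuristic sketch, not an official check: it only looks for the three names that the article's own example imports from Suno's package, and the function names are hypothetical.

```python
import importlib
from types import ModuleType

# Names the article's usage example imports from Suno's package.
EXPECTED_NAMES = ("SAMPLE_RATE", "generate_audio", "preload_models")


def looks_like_suno_bark(module: ModuleType) -> bool:
    """Heuristic: True if the module exposes all of the expected names."""
    return all(hasattr(module, name) for name in EXPECTED_NAMES)


def check_bark_install() -> str:
    """Report whether an importable module named bark looks like Suno's package."""
    try:
        module = importlib.import_module("bark")
    except ImportError:
        return "bark is not installed"
    if looks_like_suno_bark(module):
        return "Suno's bark package appears to be installed"
    return "a different package named bark appears to be installed"


if __name__ == "__main__":
    print(check_bark_install())
```

If the check reports a different package, uninstalling it and reinstalling from the GitHub URL is the likely fix.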
Using the bark library directly:
import os
# Optional settings. Set them before importing bark, which reads them at import time.
# os.environ["SUNO_USE_SMALL_MODELS"] = "True"  # Use the smaller models
# os.environ["SUNO_OFFLOAD_CPU"] = "True"  # For low VRAM machines: offload models to the CPU when not in use
from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav
# Download and load all models
preload_models() # Recommended to run once to download models
text_prompt = "Hello, my name is Bark. And I can sing! ♪ In the moonlight, shadows dancing... ♪ [laughs]"
# Generate audio from text
audio_array = generate_audio(text_prompt, history_prompt="v2/en_speaker_2") # Using a built-in speaker preset
# Save audio to disk
write_wav("bark_generation.wav", SAMPLE_RATE, audio_array)
# Example with a different speaker and non-speech sound
text_prompt_2 = "This is another test. [sighs] I hope it sounds good."
audio_array_2 = generate_audio(text_prompt_2, history_prompt="v2/es_speaker_5") # Spanish speaker preset
write_wav("bark_generation_es.wav", SAMPLE_RATE, audio_array_2)
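The history_prompt strings used above follow the v2/<language>_speaker_<index> pattern. A small helper can build and validate them; this is a sketch under assumptions: speaker_preset is a hypothetical function, the language codes are the ones this article lists, and the 0 to 9 index range is stated in the article only for English and assumed here for the other languages.

```python
# Language codes the article lists as having speaker presets (English plus twelve others).
PRESET_LANGUAGES = (
    "en", "de", "es", "fr", "hi", "it", "ja", "ko", "pl", "pt", "ru", "tr", "zh",
)


def speaker_preset(language: str, index: int) -> str:
    """Build a preset name such as v2/en_speaker_2 (hypothetical helper)."""
    if language not in PRESET_LANGUAGES:
        raise ValueError(f"no known speaker presets for language code {language!r}")
    # Assumption: every language has presets 0 through 9, as stated for English.
    if not 0 <= index <= 9:
        raise ValueError("speaker index must be between 0 and 9")
    return f"v2/{language}_speaker_{index}"


print(speaker_preset("en", 2))
print(speaker_preset("es", 5))
```

The returned string can be passed as history_prompt to generate_audio, or as voice_preset when using the Hugging Face processor.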
Using Hugging Face Transformers:
from transformers import AutoProcessor, BarkModel
import scipy.io.wavfile
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load processor and model (e.g., suno/bark or suno/bark-small)
processor = AutoProcessor.from_pretrained("suno/bark")
model = BarkModel.from_pretrained("suno/bark").to(device)
# Or for the smaller model:
# processor = AutoProcessor.from_pretrained("suno/bark-small")
# model = BarkModel.from_pretrained("suno/bark-small").to(device)
# Optimize for GPU if available (half-precision, better speed for some operations)
# model = model.to_bettertransformer() # If using older transformers versions with this API
# For newer versions or more direct control:
# model.half() # if device == "cuda"
text_prompt = "Hey there! I'm generating audio using Hugging Face. [clears throat] This is pretty cool. ♪ la la la ♪"
voice_preset = "v2/en_speaker_6" # Example English speaker preset
inputs = processor(text_prompt, voice_preset=voice_preset, return_tensors="pt").to(device)
# Generate audio
# For longer sequences, enabling attention_mask might be beneficial if supported by your transformers version
# For very long generations, consider generating sentence by sentence to avoid quality degradation or memory issues.
audio_array = model.generate(**inputs, do_sample=True, fine_temperature=0.7, coarse_temperature=0.4, pad_token_id=processor.tokenizer.pad_token_id)
# Convert to numpy array and save
audio_np = audio_array.cpu().numpy().squeeze()
sample_rate = model.generation_config.sample_rate  # The sample rate is stored on the generation config
scipy.io.wavfile.write("bark_hf_generation.wav", rate=sample_rate, data=audio_np)
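The comment in the example above suggests generating long text sentence by sentence. One way to sketch that is shown below in pure Python so it runs without Bark installed: generate_fn stands in for whatever call produces samples for one sentence, and the names, the regex-based splitter, and the 0.25 second pause are all assumptions made for illustration.

```python
import re
from typing import Callable, List, Sequence


def split_into_sentences(text: str) -> List[str]:
    """Naive splitter: break on whitespace that follows '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def generate_long_form(
    text: str,
    generate_fn: Callable[[str], Sequence[float]],
    sample_rate: int,
    pause_seconds: float = 0.25,
) -> List[float]:
    """Generate each sentence separately and join the pieces with short silences."""
    silence = [0.0] * int(pause_seconds * sample_rate)
    samples: List[float] = []
    for sentence in split_into_sentences(text):
        samples.extend(generate_fn(sentence))  # one model call per sentence
        samples.extend(silence)                # brief pause between sentences
    return samples


# Stand-in generator for demonstration: one sample per character.
demo = generate_long_form("First part. Second part!", lambda s: [0.1] * len(s), sample_rate=8)
print(len(demo))
```

With real model output, which the examples above handle as arrays, concatenating with NumPy would be the more usual choice, and reusing the same speaker preset for every sentence helps keep the voice consistent. The naive splitter also breaks after fillers such as "uhm...", so a proper sentence tokenizer may work better.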
Full model (suno/bark): Approximately 12GB of GPU VRAM is typically needed.
Small model (suno/bark-small, or SUNO_USE_SMALL_MODELS=True in the bark library): Designed to fit in around 8GB of VRAM. Some community reports and optimizations suggest it might run on GPUs with as little as 4-6GB VRAM, especially with techniques like CPU offloading (SUNO_OFFLOAD_CPU=True) or half-precision, though performance might vary.
Bark is released under the MIT License. This is a permissive open-source license that allows for broad use, including modification, distribution, and commercial applications. However, Suno AI notes that users are responsible for the content they generate and should use the model ethically.
Q1: What is Bark? A1: Bark is a transformer-based text-to-audio model by Suno AI that can generate realistic multilingual speech, music, sound effects, and other non-verbal sounds from text prompts.
Q2: Can Bark clone any voice from a few seconds of audio? A2: No, Bark does not support custom voice cloning in the sense of creating a new, persistent model of any specific user's voice from a short sample. It uses "history prompts" (which can be pre-defined speaker presets or user-provided audio snippets) to influence the style, tone, and acoustic characteristics of the generated audio. This allows for voice variety and styling but is different from deep, targeted voice cloning.
Q3: What are non-speech sounds in Bark? A3: These are audio elements other than spoken words that Bark can generate. Examples include laughter ([laughs]), sighs ([sighs]), crying ([cries]), musical notes or descriptions (♪ ... ♪), and other sound effects. These are typically prompted using special tokens or descriptions within the input text.
Q4: What hardware do I need to run Bark? A4: While Bark can run on CPU (slowly), a GPU is highly recommended. For the full model, about 12GB of VRAM is ideal. A smaller version targets ~8GB VRAM and might run on less with optimizations like CPU offloading or half-precision.
Q5: How do I install Bark? A5: The recommended way is pip install git+https://github.com/suno-ai/bark.git. Do not use pip install bark, as it's an unrelated package. Alternatively, you can use it via the Hugging Face Transformers library.
Q6: Is Bark free to use? A6: Yes, the Bark model and code released by Suno AI on GitHub are open-source under the MIT license, making them free to use.
Q7: Which languages does Bark support? A7: Bark supports multiple languages out-of-the-box and often detects the language automatically from the text. English generally has the highest quality, but many other languages are supported with varying degrees of naturalness. There are speaker presets for languages like German (de), Spanish (es), French (fr), Hindi (hi), Italian (it), Japanese (ja), Korean (ko), Polish (pl), Portuguese (pt), Russian (ru), Turkish (tr), and Chinese (zh).
Hugging Face model repositories: suno/bark and suno/bark-small.
Last updated: May 16, 2025