
Bark

Overview

Bark is a transformer-based text-to-speech model proposed by Suno AI in suno-ai/bark.

Bark is made of four main models:

  • [BarkSemanticModel] (also referred to as the 'text' model): a causal autoregressive transformer that takes tokenized text as input and predicts semantic text tokens capturing the meaning of the text.
  • [BarkCoarseModel] (also referred to as the 'coarse acoustics' model): a causal autoregressive transformer that takes the output of [BarkSemanticModel] as input and predicts the first two audio codebooks required by EnCodec.
  • [BarkFineModel] (the 'fine acoustics' model): this time a non-causal autoencoder transformer, which iteratively predicts the remaining codebooks based on the sum of the previous codebook embeddings.
  • [EncodecModel]: once all the codebook channels have been predicted, Bark uses EnCodec to decode the output audio array.

Note that each of the first three modules can support conditional speaker embeddings to condition the output sound on a specific predefined voice.
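
As a minimal sketch of this pipeline (assuming the sub-models are exposed as the `semantic`, `coarse_acoustics`, `fine_acoustics`, and `codec_model` attributes of [BarkModel], which matches the current Transformers implementation but is an internal detail), you can inspect the four stages directly:

>>> from transformers import BarkModel

>>> model = BarkModel.from_pretrained("suno/bark")

>>> # The four stages, in the order they run during generation
>>> model.semantic.__class__.__name__  # 'text' stage
'BarkSemanticModel'
>>> model.coarse_acoustics.__class__.__name__  # 'coarse acoustics' stage
'BarkCoarseModel'
>>> model.fine_acoustics.__class__.__name__  # 'fine acoustics' stage
'BarkFineModel'
>>> model.codec_model.__class__.__name__  # audio decoder
'EncodecModel'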

Tips:

Suno offers a library of voice presets in a number of languages; these presets are also uploaded to the Hugging Face Hub.

>>> from transformers import AutoProcessor, BarkModel

>>> processor = AutoProcessor.from_pretrained("suno/bark")
>>> model = BarkModel.from_pretrained("suno/bark")

>>> voice_preset = "v2/en_speaker_6"

>>> inputs = processor("Hello, my dog is cute", voice_preset=voice_preset)

>>> audio_array = model.generate(**inputs)
>>> audio_array = audio_array.cpu().numpy().squeeze()
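
If a GPU is available, generation is noticeably faster when both the model and the inputs live on it. A minimal sketch (calling `.to(device)` on the processor output assumes it returns a standard batch encoding that supports device placement, which holds for the current processor but is worth verifying on your version):

>>> import torch

>>> device = "cuda" if torch.cuda.is_available() else "cpu"
>>> model = model.to(device)

>>> # Move the tokenized inputs to the same device as the model
>>> inputs = processor("Hello, my dog is cute", voice_preset=voice_preset).to(device)
>>> audio_array = model.generate(**inputs)
>>> audio_array = audio_array.cpu().numpy().squeeze()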

Bark can generate highly realistic, multilingual speech as well as other audio, including music, background noise, and simple sound effects.

>>> # Multilingual speech - simplified Chinese ("Amazing! I can speak Chinese")
>>> inputs = processor("惊人的!我会说中文")

>>> # Multilingual speech - French ("Incredible! I can generate sound.") - let's use a voice_preset as well
>>> inputs = processor("Incroyable! Je peux générer du son.", voice_preset="fr_speaker_5")

>>> # Bark can also generate music. You can help it out by adding music notes around your lyrics.
>>> inputs = processor("♪ Hello, my dog is cute ♪")

>>> audio_array = model.generate(**inputs)
>>> audio_array = audio_array.cpu().numpy().squeeze()

The model can also produce nonverbal sounds such as laughing, sighing, and crying.

>>> # Adding non-speech cues to the input text
>>> inputs = processor("Hello uh ... [clears throat], my dog is cute [laughter]")

>>> audio_array = model.generate(**inputs)
>>> audio_array = audio_array.cpu().numpy().squeeze()
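
Generation also works on a batch of prompts in a single call. A minimal sketch (this assumes the processor pads all prompts to a common length, which the default tokenizer settings do, and that the returned batch stacks one waveform per prompt):

>>> # Batched generation: pass a list of texts to the processor
>>> inputs = processor(
...     ["Hello, my dog is cute", "[laughter] I can speak in batches too"],
...     voice_preset=voice_preset,
... )

>>> audio_arrays = model.generate(**inputs)
>>> audio_arrays = audio_arrays.cpu().numpy()  # one row per prompt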

To save the audio, take the sample rate from the model's generation config and use a scipy utility:

>>> from scipy.io.wavfile import write as write_wav

>>> # save audio to disk, but first take the sample rate from the generation config
>>> sample_rate = model.generation_config.sample_rate
>>> write_wav("bark_generation.wav", sample_rate, audio_array)
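
In a notebook, you can also listen to the result directly with IPython's `Audio` widget, reusing the same sample rate:

>>> from IPython.display import Audio

>>> sample_rate = model.generation_config.sample_rate
>>> Audio(audio_array, rate=sample_rate)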

This model was contributed by Yoach Lacombe (ylacombe) and Sanchit Gandhi (sanchit-gandhi). The original code can be found in the suno-ai/bark repository.

BarkConfig

[[autodoc]] BarkConfig - all

BarkProcessor

[[autodoc]] BarkProcessor - all - __call__

BarkModel

[[autodoc]] BarkModel - generate

BarkSemanticModel

[[autodoc]] BarkSemanticModel - forward

BarkCoarseModel

[[autodoc]] BarkCoarseModel - forward

BarkFineModel

[[autodoc]] BarkFineModel - forward

BarkCausalModel

[[autodoc]] BarkCausalModel - forward

BarkCoarseConfig

[[autodoc]] BarkCoarseConfig - all

BarkFineConfig

[[autodoc]] BarkFineConfig - all

BarkSemanticConfig

[[autodoc]] BarkSemanticConfig - all