sbapan41 committed
Commit 5359c6a · verified · 1 Parent(s): 9bd4171

Upload 3 files

Files changed (3)
  1. README.md +148 -0
  2. config.json +117 -0
  3. gitattributes +35 -0
README.md ADDED
@@ -0,0 +1,148 @@
---
license: apache-2.0
pipeline_tag: text-to-speech
library_name: zonos
---
# Zonos-v0.1

<div align="center">
<img src="https://github.com/Zyphra/Zonos/blob/main/assets/ZonosHeader.png?raw=true"
     alt="Title card"
     style="width: 500px; height: auto; object-position: center top;">
</div>

---

Zonos-v0.1 is a leading open-weight text-to-speech model trained on more than 200k hours of varied multilingual speech, delivering expressiveness and quality on par with, or even surpassing, top TTS providers.

Our model enables highly natural speech generation from text prompts when given a speaker embedding or audio prefix, and can accurately perform speech cloning when given a reference clip spanning just a few seconds. The conditioning setup also allows for fine control over speaking rate, pitch variation, audio quality, and emotions such as happiness, fear, sadness, and anger. The model outputs speech natively at 44kHz.

##### For more details and speech samples, check out our blog [here](https://www.zyphra.com/post/beta-release-of-zonos-v0-1)

##### We also have a hosted version available at [playground.zyphra.com/audio](https://playground.zyphra.com/audio)

---

Zonos follows a straightforward architecture: text normalization and phonemization via eSpeak, followed by DAC token prediction through a transformer or hybrid backbone. An overview of the architecture can be seen below.

<div align="center">
<img src="https://github.com/Zyphra/Zonos/blob/main/assets/ArchitectureDiagram.png?raw=true"
     alt="Architecture diagram"
     style="width: 1000px; height: auto; object-position: center top;">
</div>

---
## Usage

### Python

```python
import torch
import torchaudio
from zonos.model import Zonos
from zonos.conditioning import make_cond_dict

# Load the hybrid (Mamba2 + attention) checkpoint; the pure-transformer variant is the alternative.
model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-hybrid", device="cuda")
# model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-transformer", device="cuda")

# Build a speaker embedding from a short reference clip (used for voice cloning).
wav, sampling_rate = torchaudio.load("assets/exampleaudio.mp3")
speaker = model.make_speaker_embedding(wav, sampling_rate)

# Prepare conditioning from text, speaker embedding, and language, then generate DAC codes.
cond_dict = make_cond_dict(text="Hello, world!", speaker=speaker, language="en-us")
conditioning = model.prepare_conditioning(cond_dict)

codes = model.generate(conditioning)

# Decode the codes back to a waveform and save it at the model's native sampling rate.
wavs = model.autoencoder.decode(codes).cpu()
torchaudio.save("sample.wav", wavs[0], model.autoencoder.sampling_rate)
```

This should produce a `sample.wav` file in your project root directory.
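The speaking rate, pitch variation, audio quality, and emotion controls mentioned in the introduction are set through the same `make_cond_dict` call. The sketch below is illustrative rather than authoritative: the keyword names and value ranges mirror the conditioner entries declared in the `config.json` added in this commit (`emotion`, `fmax`, `pitch_std`, `speaking_rate`), so check `zonos/conditioning.py` for the exact signature and sensible defaults.

```python
# Illustrative sketch only: keyword names are assumed to match the conditioner
# names in config.json; verify against zonos/conditioning.py before relying on them.
cond_dict = make_cond_dict(
    text="Hello, world!",
    speaker=speaker,
    language="en-us",
    emotion=[0.6, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.1],  # 8-dim vector, "emotion" conditioner
    fmax=22050,          # Hz; the "fmax" conditioner is declared with range 0-24000
    pitch_std=45.0,      # larger values -> more pitch variation; declared range 0-400
    speaking_rate=15.0,  # declared range 0-40
)
conditioning = model.prepare_conditioning(cond_dict)
codes = model.generate(conditioning)
wavs = model.autoencoder.decode(codes).cpu()
```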
### Gradio interface (recommended)

```bash
uv run gradio_interface.py
# python gradio_interface.py
```

_For repeated sampling we highly recommend using the Gradio interface instead, as the minimal Python example above needs to reload the model every time it is run._
## Features

- Zero-shot TTS with voice cloning: input the desired text and a 10-30 second speaker sample to generate high-quality TTS output
- Audio prefix inputs: add text plus an audio prefix for even richer speaker matching. Audio prefixes can be used to elicit behaviours such as whispering, which can otherwise be challenging to replicate when cloning from speaker embeddings alone
- Multilingual support: Zonos-v0.1 supports English, Japanese, Chinese, French, and German
- Audio quality and emotion control: Zonos offers fine-grained control over many aspects of the generated audio, including speaking rate, pitch, maximum frequency, audio quality, and emotions such as happiness, anger, sadness, and fear
- Fast: the model runs with a real-time factor of ~2x on an RTX 4090
- Gradio WebUI: Zonos comes packaged with an easy-to-use Gradio interface for generating speech
- Simple installation and deployment: Zonos can be installed and deployed using the Dockerfile packaged with the repository
## Installation

**At the moment this repository only supports Linux systems (preferably Ubuntu 22.04/24.04) with recent NVIDIA GPUs (3000-series or newer, 6GB+ VRAM).**

See also [Docker Installation](#docker-installation).

#### System dependencies

Zonos depends on the eSpeak library for phonemization. You can install it on Ubuntu with the following command:

```bash
apt install -y espeak-ng
```
#### Python dependencies

We highly recommend using a recent version of [uv](https://docs.astral.sh/uv/#installation) for installation. If you don't have uv installed, you can install it via pip: `pip install -U uv`.

##### Installing into a new uv virtual environment (recommended)

```bash
uv sync
uv sync --extra compile
```

##### Installing into the system/activated environment using uv

```bash
uv pip install -e .
uv pip install -e .[compile]
```

##### Installing into the system/activated environment using pip

```bash
pip install -e .
pip install --no-build-isolation -e .[compile]
```

##### Confirm that it's working

For convenience we provide a minimal example to check that the installation works:

```bash
uv run sample.py
# python sample.py
```
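Since the model requires a recent NVIDIA GPU (see the note at the top of this section), it can also help to confirm that PyTorch actually sees a CUDA device before running the sample. This is a generic PyTorch check, not something shipped with the Zonos repository:

```python
import torch

# Generic PyTorch sanity check (not part of the Zonos repo): the README above
# calls for a 3000-series or newer NVIDIA GPU with 6GB+ VRAM.
assert torch.cuda.is_available(), "PyTorch cannot see a CUDA device"
props = torch.cuda.get_device_properties(0)
print(f"{props.name}: {props.total_memory / 1e9:.1f} GB VRAM")
```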
## Docker installation

```bash
git clone https://github.com/Zyphra/Zonos.git
cd Zonos

# For gradio
docker compose up

# Or for development you can do
docker build -t zonos .
docker run -it --gpus=all --net=host -v /path/to/Zonos:/Zonos -t zonos
cd /Zonos
python sample.py # this will generate a sample.wav in /Zonos
```
config.json ADDED
@@ -0,0 +1,117 @@
{
  "backbone": {
    "d_model": 2048,
    "d_intermediate": 0,
    "attn_mlp_d_intermediate": 8192,
    "n_layer": 46,
    "ssm_cfg": {
      "layer": "Mamba2"
    },
    "attn_layer_idx": [0, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44],
    "attn_cfg": {
      "causal": true,
      "num_heads": 16,
      "num_heads_kv": 4,
      "rotary_emb_dim": 128,
      "qkv_proj_bias": false,
      "out_proj_bias": false
    },
    "rms_norm": false,
    "residual_in_fp32": false,
    "norm_epsilon": 1e-05
  },
  "prefix_conditioner": {
    "conditioners": [
      { "type": "EspeakPhonemeConditioner", "name": "espeak" },
      { "cond_dim": 128, "uncond_type": "learned", "projection": "linear", "type": "PassthroughConditioner", "name": "speaker" },
      { "input_dim": 8, "uncond_type": "learned", "type": "FourierConditioner", "name": "emotion" },
      { "min_val": 0, "max_val": 24000, "uncond_type": "learned", "type": "FourierConditioner", "name": "fmax" },
      { "min_val": 0, "max_val": 400, "uncond_type": "learned", "type": "FourierConditioner", "name": "pitch_std" },
      { "min_val": 0, "max_val": 40, "uncond_type": "learned", "type": "FourierConditioner", "name": "speaking_rate" },
      { "min_val": -1, "max_val": 126, "uncond_type": "learned", "type": "IntegerConditioner", "name": "language_id" },
      { "input_dim": 8, "min_val": 0.5, "max_val": 0.8, "uncond_type": "learned", "type": "FourierConditioner", "name": "vqscore_8" },
      { "min_val": -1.0, "max_val": 1000, "uncond_type": "learned", "type": "FourierConditioner", "name": "ctc_loss" },
      { "min_val": 1, "max_val": 5, "uncond_type": "learned", "type": "FourierConditioner", "name": "dnsmos_ovrl" },
      { "min_val": 0, "max_val": 1, "uncond_type": "learned", "type": "IntegerConditioner", "name": "speaker_noised" }
    ],
    "projection": "linear"
  },
  "eos_token_id": 1024,
  "masked_token_id": 1025
}
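The `prefix_conditioner` block above declares the name and valid range of every conditioning input the model expects (e.g. `pitch_std` in [0, 400], `speaking_rate` in [0, 40], `fmax` in [0, 24000]). Below is a small sketch of how those declared ranges could be read back to sanity-check conditioning values before generation; the helper is hypothetical and not part of the `zonos` package:

```python
import json

# Hypothetical helper (not shipped with zonos): read the conditioner ranges
# declared in config.json and validate proposed conditioning values against them.
with open("config.json") as f:
    cfg = json.load(f)

ranges = {
    c["name"]: (c["min_val"], c["max_val"])
    for c in cfg["prefix_conditioner"]["conditioners"]
    if "min_val" in c and "max_val" in c
}

def check(name: str, value: float) -> None:
    lo, hi = ranges[name]
    if not lo <= value <= hi:
        raise ValueError(f"{name}={value} is outside the declared range [{lo}, {hi}]")

check("pitch_std", 45.0)      # ok: 0 <= 45 <= 400
check("speaking_rate", 15.0)  # ok: 0 <= 15 <= 40
check("fmax", 22050)          # ok: 0 <= 22050 <= 24000
```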
gitattributes ADDED
@@ -0,0 +1,35 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text