---
license: apache-2.0
language:
- en
- zh
- ja
- fr
- de
- ko
pipeline_tag: text-to-speech
tags:
- Speech-Tokenizer
- Text-to-Speech
---

# 🚀 TaDiCodec

We introduce the **T**ext-**a**ware **Di**ffusion Transformer Speech **Codec** (TaDiCodec), a novel approach to speech tokenization that employs end-to-end optimization for quantization and reconstruction through a **diffusion autoencoder**, while integrating **text guidance** into the diffusion decoder to enhance reconstruction quality and achieve **optimal compression**. TaDiCodec achieves an extremely low frame rate of **6.25 Hz** and a corresponding bitrate of **0.0875 kbps** with a single-layer codebook for **24 kHz speech**, while maintaining superior performance on critical speech generation evaluation metrics such as Word Error Rate (WER), speaker similarity (SIM), and speech quality (UTMOS).
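
As a sanity check on those figures: at 6.25 tokens per second, 0.0875 kbps works out to 14 bits per token, which corresponds to a single-layer codebook of 2^14 = 16,384 entries (the codebook size is our inference from those two numbers, not something stated above):

```python
# Relate the stated frame rate and bitrate.
# The 2**14 codebook size is inferred from them, not stated in the README.
frame_rate_hz = 6.25                  # tokens per second
bitrate_bps = 0.0875 * 1000           # 0.0875 kbps = 87.5 bits per second
bits_per_token = bitrate_bps / frame_rate_hz
print(bits_per_token)                 # 14.0
print(2 ** int(bits_per_token))       # 16384 codebook entries
```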

[GitHub Repo](https://github.com/HeCheng0625/Diffusion-Speech-Tokenizer)
[Paper](https://hecheng0625.github.io/assets/pdf/Arxiv_TaDiCodec.pdf)
[Demo Page](https://tadicodec.github.io/)
[Python](https://www.python.org/)
[PyTorch](https://pytorch.org/)
[Hugging Face](https://huggingface.co/amphion/TaDiCodec)

# 🤗 Pre-trained Models

## 📦 Model Zoo - Ready to Use!

*Download our pre-trained models for instant inference*

## 🎵 TaDiCodec

| Model | 🤗 Hugging Face | 🏷️ Status |
|:-----:|:---------------:|:---------:|
| **🚀 TaDiCodec** | [HuggingFace](https://huggingface.co/amphion/TaDiCodec) | ✅ |
| **🚀 TaDiCodec-old** | [HuggingFace](https://huggingface.co/amphion/TaDiCodec-old) | 🚧 |

*Note: TaDiCodec-old is an earlier version of TaDiCodec; TaDiCodec-TTS-AR-Phi-3.5-4B is based on TaDiCodec-old.*

## 🤖 TTS Models

| Model | Type | LLM | 🤗 Hugging Face | 🏷️ Status |
|:-----:|:----:|:---:|:---------------:|:---------:|
| **🤖 TaDiCodec-TTS-AR-Qwen2.5-0.5B** | AR | Qwen2.5-0.5B-Instruct | [HuggingFace](https://huggingface.co/amphion/TaDiCodec-TTS-AR-Qwen2.5-0.5B) | ✅ |
| **🤖 TaDiCodec-TTS-AR-Qwen2.5-3B** | AR | Qwen2.5-3B-Instruct | [HuggingFace](https://huggingface.co/amphion/TaDiCodec-TTS-AR-Qwen2.5-3B) | ✅ |
| **🤖 TaDiCodec-TTS-AR-Phi-3.5-4B** | AR | Phi-3.5-mini-instruct | [HuggingFace](https://huggingface.co/amphion/TaDiCodec-TTS-AR-Phi-3.5-4B) | 🚧 |
| **🚀 TaDiCodec-TTS-MGM** | MGM | - | [HuggingFace](https://huggingface.co/amphion/TaDiCodec-TTS-MGM) | ✅ |

## 🔧 Quick Model Usage

```python
# 🤗 Load from Hugging Face
from models.tts.tadicodec.inference_tadicodec import TaDiCodecPipline
from models.tts.llm_tts.inference_llm_tts import TTSInferencePipeline
from models.tts.llm_tts.inference_mgm_tts import MGMInferencePipeline

# Load the TaDiCodec tokenizer; the checkpoint is downloaded automatically
# from Hugging Face on first use
tokenizer = TaDiCodecPipline.from_pretrained("amphion/TaDiCodec")

# Load the AR TTS model (also auto-downloaded on first use)
tts_model = TTSInferencePipeline.from_pretrained("amphion/TaDiCodec-TTS-AR-Qwen2.5-3B")

# Load the MGM TTS model (also auto-downloaded on first use)
mgm_model = MGMInferencePipeline.from_pretrained("amphion/TaDiCodec-TTS-MGM")
```

# 🚀 Quick Start

## Installation

```bash
# Clone the repository
git clone https://github.com/HeCheng0625/Diffusion-Speech-Tokenizer.git
cd Diffusion-Speech-Tokenizer

# Install dependencies
bash env.sh
```
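
Before downloading any checkpoints, it can be worth confirming that PyTorch sees your GPU (the example pipelines below fall back to CPU, but inference will be much slower). A minimal check:

```python
# Quick environment check; no project code or checkpoints required
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```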

## Basic Usage

**Please refer to the [use_examples](https://github.com/HeCheng0625/Diffusion-Speech-Tokenizer/tree/main/use_examples) folder for more detailed usage examples.**

### Speech Tokenization and Reconstruction

```python
# Example: using TaDiCodec for speech tokenization and reconstruction
import torch
import soundfile as sf
from models.tts.tadicodec.inference_tadicodec import TaDiCodecPipline

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
pipe = TaDiCodecPipline.from_pretrained(ckpt_dir="./ckpt/TaDiCodec", device=device)

# Transcript of the prompt audio
prompt_text = "In short, we embarked on a mission to make America great again, for all Americans."
# Transcript of the target audio
target_text = "But to those who knew her well, it was a symbol of her unwavering determination and spirit."

# Path to the prompt audio
prompt_speech_path = "./use_examples/test_audio/trump_0.wav"
# Path to the target audio to reconstruct
speech_path = "./use_examples/test_audio/trump_1.wav"

rec_audio = pipe(
    text=target_text,
    speech_path=speech_path,
    prompt_text=prompt_text,
    prompt_speech_path=prompt_speech_path,
)
sf.write("./use_examples/test_audio/trump_rec.wav", rec_audio, 24000)
```
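
Because TaDiCodec runs at a fixed 6.25 Hz frame rate, the number of tokens a clip compresses to follows directly from its duration. A small sketch using only `soundfile`:

```python
# Estimate the token count for a clip at TaDiCodec's 6.25 Hz frame rate
import soundfile as sf

info = sf.info("./use_examples/test_audio/trump_1.wav")
duration_s = info.frames / info.samplerate
print(f"{duration_s:.1f} s of speech -> ~{duration_s * 6.25:.0f} tokens")
```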

### Zero-shot TTS with TaDiCodec

```python
import torch
import soundfile as sf
from models.tts.llm_tts.inference_llm_tts import TTSInferencePipeline
# from models.tts.llm_tts.inference_mgm_tts import MGMInferencePipeline

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Create the AR TTS pipeline
pipeline = TTSInferencePipeline.from_pretrained(
    tadicodec_path="./ckpt/TaDiCodec",
    llm_path="./ckpt/TaDiCodec-TTS-AR-Qwen2.5-3B",
    device=device,
)

# Inference on a single sample; code-switching input is supported.
# The mixed-language text below means: "But to those who knew her well,
# it was a symbol of her unwavering determination and spirit."
# The MGM pipeline can be used the same way; see the sketch below.
audio = pipeline(
    text="但是 to those who 知道 her well, it was a 标志 of her unwavering 决心 and spirit.",
    prompt_text="In short, we embarked on a mission to make America great again, for all Americans.",
    prompt_speech_path="./use_examples/test_audio/trump_0.wav",
)

sf.write("./use_examples/test_audio/lm_tts_output.wav", audio, 24000)
```
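
The commented-out import above points at the non-autoregressive alternative. A hedged sketch of the MGM variant, assuming `MGMInferencePipeline` exposes the same call signature as the AR pipeline (only its one-argument `from_pretrained` form appears earlier in this README):

```python
# Sketch: MGM (masked generative) TTS variant.
# Assumes the same __call__ signature as TTSInferencePipeline above.
import soundfile as sf
from models.tts.llm_tts.inference_mgm_tts import MGMInferencePipeline

mgm_pipeline = MGMInferencePipeline.from_pretrained("amphion/TaDiCodec-TTS-MGM")

audio = mgm_pipeline(
    text="But to those who knew her well, it was a symbol of her unwavering determination and spirit.",
    prompt_text="In short, we embarked on a mission to make America great again, for all Americans.",
    prompt_speech_path="./use_examples/test_audio/trump_0.wav",
)
sf.write("./use_examples/test_audio/mgm_tts_output.wav", audio, 24000)
```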

# 📚 Citation

If you find this repository useful, please cite our paper:

TaDiCodec:
```bibtex
@article{tadicodec2025,
  title={TaDiCodec: Text-aware Diffusion Speech Tokenizer for Speech Language Modeling},
  author={Yuancheng Wang and Dekun Chen and Xueyao Zhang and Junan Zhang and Jiaqi Li and Zhizheng Wu},
  journal={arXiv preprint},
  year={2025},
  url={https://hecheng0625.github.io/assets/pdf/Arxiv_TaDiCodec.pdf}
}
```

Amphion:
```bibtex
@inproceedings{amphion,
  author={Xueyao Zhang and Liumeng Xue and Yicheng Gu and Yuancheng Wang and Jiaqi Li and Haorui He and Chaoren Wang and Ting Song and Xi Chen and Zihao Fang and Haopeng Chen and Junan Zhang and Tze Ying Tang and Lexiao Zou and Mingxuan Wang and Jun Han and Kai Chen and Haizhou Li and Zhizheng Wu},
  title={Amphion: An Open-Source Audio, Music and Speech Generation Toolkit},
  booktitle={{IEEE} Spoken Language Technology Workshop, {SLT} 2024},
  year={2024}
}
```

MaskGCT:
```bibtex
@inproceedings{wang2024maskgct,
  author={Wang, Yuancheng and Zhan, Haoyue and Liu, Liwei and Zeng, Ruihong and Guo, Haotian and Zheng, Jiachen and Zhang, Qiang and Zhang, Xueyao and Zhang, Shunsi and Wu, Zhizheng},
  title={MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer},
  booktitle={{ICLR}},
  publisher={OpenReview.net},
  year={2025}
}
```

# 🙏 Acknowledgments

- **MGM-based TTS** is built upon [MaskGCT](https://github.com/open-mmlab/Amphion/tree/main/models/tts/maskgct).
- **Vocos vocoder** is built upon [Vocos](https://github.com/gemelo-ai/vocos).
- **NAR Llama-style transformers** are built upon [transformers](https://github.com/huggingface/transformers).
- **BSQ (Binary Spherical Quantization)** is built upon [vector-quantize-pytorch](https://github.com/lucidrains/vector-quantize-pytorch) and [bsq-vit](https://github.com/zhaoyue-zephyrus/bsq-vit).
- **Training codebase** is built upon [Amphion](https://github.com/open-mmlab/Amphion) and [accelerate](https://github.com/huggingface/accelerate).