---
license: apache-2.0
language:
- en
- zh
- ja
- fr
- de
- ko
pipeline_tag: text-to-speech
tags:
- Speech-Tokenizer
- Text-to-Speech
---
# 🚀 TaDiCodec

We introduce the **T**ext-**a**ware **Di**ffusion Transformer Speech **Codec** (TaDiCodec), a novel approach to speech tokenization that optimizes quantization and reconstruction end to end through a **diffusion autoencoder**, while integrating **text guidance** into the diffusion decoder to enhance reconstruction quality and achieve high compression. TaDiCodec reaches an extremely low frame rate of **6.25 Hz** and a corresponding bitrate of **0.0875 kbps** with a single-layer codebook for **24 kHz speech**, while maintaining strong performance on key speech generation metrics such as Word Error Rate (WER), speaker similarity (SIM), and speech quality (UTMOS).

[![GitHub Stars](https://img.shields.io/github/stars/HeCheng0625/Diffusion-Speech-Tokenizer?style=social)](https://github.com/HeCheng0625/Diffusion-Speech-Tokenizer)
[![arXiv](https://img.shields.io/badge/arXiv-2024.xxxxx-b31b1b.svg)](https://hecheng0625.github.io/assets/pdf/Arxiv_TaDiCodec.pdf)
[![Demo](https://img.shields.io/badge/🎬%20Demo-tadicodec-green)](https://tadicodec.github.io/)
[![Python](https://img.shields.io/badge/Python-3.8+-3776ab.svg)](https://www.python.org/)
[![PyTorch](https://img.shields.io/badge/PyTorch-2.0+-ee4c2c.svg)](https://pytorch.org/)
[![Hugging Face](https://img.shields.io/badge/🤗%20HuggingFace-tadicodec-yellow)](https://huggingface.co/amphion/TaDiCodec)
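
As a quick sanity check on these figures: at 6.25 tokens per second, 0.0875 kbps works out to 14 bits per token, i.e. a single codebook with 2^14 = 16,384 entries (the codebook size is inferred from these numbers here, not quoted from the paper).

```python
# Back-of-the-envelope check of the reported compression figures.
# The implied codebook size is an inference from these numbers, not a quoted spec.
frame_rate_hz = 6.25            # tokens per second of 24 kHz speech
bitrate_bps = 0.0875 * 1000     # 0.0875 kbps -> 87.5 bits per second

bits_per_token = bitrate_bps / frame_rate_hz        # 87.5 / 6.25 = 14.0
implied_codebook_size = 2 ** round(bits_per_token)  # 2**14 = 16384

print(f"{bits_per_token:.0f} bits/token -> {implied_codebook_size}-entry codebook")
```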

# 🤗 Pre-trained Models

## 📦 Model Zoo - Ready to Use!

*Download our pre-trained models for instant inference*

## 🎵 TaDiCodec

| Model | 🤗 Hugging Face | 👷 Status |
|:-----:|:---------------:|:------:|
| **🚀 TaDiCodec** | [![HF](https://img.shields.io/badge/🤗%20HF-TaDiCodec-yellow)](https://huggingface.co/amphion/TaDiCodec) | ✅ |
| **🚀 TaDiCodec-old** | [![HF](https://img.shields.io/badge/🤗%20HF-TaDiCodec--old-yellow)](https://huggingface.co/amphion/TaDiCodec-old) | 🚧 |

*Note: TaDiCodec-old is the previous version of TaDiCodec; TaDiCodec-TTS-AR-Phi-3.5-4B is based on it.*

## 🎤 TTS Models

| Model | Type | LLM | 🤗 Hugging Face | 👷 Status |
|:-----:|:----:|:---:|:---------------:|:-------------:|
| **🤖 TaDiCodec-TTS-AR-Qwen2.5-0.5B** | AR | Qwen2.5-0.5B-Instruct | [![HF](https://img.shields.io/badge/🤗%20HF-TaDiCodec--AR--0.5B-yellow)](https://huggingface.co/amphion/TaDiCodec-TTS-AR-Qwen2.5-0.5B) | ✅ |
| **🤖 TaDiCodec-TTS-AR-Qwen2.5-3B** | AR | Qwen2.5-3B-Instruct | [![HF](https://img.shields.io/badge/🤗%20HF-TaDiCodec--AR--3B-yellow)](https://huggingface.co/amphion/TaDiCodec-TTS-AR-Qwen2.5-3B) | ✅ |
| **🤖 TaDiCodec-TTS-AR-Phi-3.5-4B** | AR | Phi-3.5-mini-instruct | [![HF](https://img.shields.io/badge/🤗%20HF-TaDiCodec--AR--4B-yellow)](https://huggingface.co/amphion/TaDiCodec-TTS-AR-Phi-3.5-4B) | 🚧 |
| **🌊 TaDiCodec-TTS-MGM** | MGM | - | [![HF](https://img.shields.io/badge/🤗%20HF-TaDiCodec--MGM-yellow)](https://huggingface.co/amphion/TaDiCodec-TTS-MGM) | ✅ |

## 🔧 Quick Model Usage

```python
# 🤗 Load from Hugging Face
from models.tts.tadicodec.inference_tadicodec import TaDiCodecPipline
from models.tts.llm_tts.inference_llm_tts import TTSInferencePipeline
from models.tts.llm_tts.inference_mgm_tts import MGMInferencePipeline

# Load the TaDiCodec tokenizer; the checkpoint is downloaded
# automatically from Hugging Face on first use
tokenizer = TaDiCodecPipline.from_pretrained("amphion/TaDiCodec")

# Load the AR TTS model
tts_model = TTSInferencePipeline.from_pretrained("amphion/TaDiCodec-TTS-AR-Qwen2.5-3B")

# Load the MGM TTS model (an alternative to the AR model above)
tts_model = MGMInferencePipeline.from_pretrained("amphion/TaDiCodec-TTS-MGM")
```
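
The `from_pretrained` methods also accept a local checkpoint directory in place of a Hub ID, as in the usage examples below, which pass `./ckpt/...` paths.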

# 🚀 Quick Start

## Installation

```bash
# Clone the repository
git clone https://github.com/HeCheng0625/Diffusion-Speech-Tokenizer.git
cd Diffusion-Speech-Tokenizer

# Install dependencies
bash env.sh
```

## Basic Usage

**Please refer to the [use_examples](https://github.com/HeCheng0625/Diffusion-Speech-Tokenizer/tree/main/use_examples) folder for more detailed usage examples.**

### Speech Tokenization and Reconstruction

```python
# Example: using TaDiCodec for speech tokenization and reconstruction
import torch
import soundfile as sf
from models.tts.tadicodec.inference_tadicodec import TaDiCodecPipline

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
pipe = TaDiCodecPipline.from_pretrained(ckpt_dir="./ckpt/TaDiCodec", device=device)

# Transcript of the prompt audio
prompt_text = "In short, we embarked on a mission to make America great again, for all Americans."
# Transcript of the target audio
target_text = "But to those who knew her well, it was a symbol of her unwavering determination and spirit."

# Path to the prompt audio
prompt_speech_path = "./use_examples/test_audio/trump_0.wav"
# Path to the target audio
speech_path = "./use_examples/test_audio/trump_1.wav"

# Tokenize and reconstruct the target speech
rec_audio = pipe(
    text=target_text,
    speech_path=speech_path,
    prompt_text=prompt_text,
    prompt_speech_path=prompt_speech_path,
)
sf.write("./use_examples/test_audio/trump_rec.wav", rec_audio, 24000)
```
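
Note that reconstruction is text-guided: the pipeline takes the transcript of the target audio alongside the waveform, and the prompt text/audio pair serves as the voice reference for decoding.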

### Zero-shot TTS with TaDiCodec

```python
import torch
import soundfile as sf
from models.tts.llm_tts.inference_llm_tts import TTSInferencePipeline

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Create the AR TTS pipeline
pipeline = TTSInferencePipeline.from_pretrained(
    tadicodec_path="./ckpt/TaDiCodec",
    llm_path="./ckpt/TaDiCodec-TTS-AR-Qwen2.5-3B",
    device=device,
)

# Inference on a single sample; the MGM pipeline (see the sketch below) can be swapped in
audio = pipeline(
    text="但是 to those who 知道 her well, it was a 标志 of her unwavering 决心 and spirit.",  # code-switching input is supported
    prompt_text="In short, we embarked on a mission to make America great again, for all Americans.",
    prompt_speech_path="./use_examples/test_audio/trump_0.wav",
)

sf.write("./use_examples/test_audio/lm_tts_output.wav", audio, 24000)
```
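
The MGM model can be used the same way. The sketch below mirrors the AR call; since only its loading is shown above, the call signature and output path here are assumptions, not a documented interface.

```python
# Hedged sketch: MGM-based zero-shot TTS.
# Assumes MGMInferencePipeline is invoked like the AR pipeline above;
# only its from_pretrained loading is shown in this README.
import soundfile as sf
from models.tts.llm_tts.inference_mgm_tts import MGMInferencePipeline

mgm_pipeline = MGMInferencePipeline.from_pretrained("amphion/TaDiCodec-TTS-MGM")

audio = mgm_pipeline(
    text="But to those who knew her well, it was a symbol of her unwavering determination and spirit.",
    prompt_text="In short, we embarked on a mission to make America great again, for all Americans.",
    prompt_speech_path="./use_examples/test_audio/trump_0.wav",
)

# Output path is illustrative
sf.write("./use_examples/test_audio/mgm_tts_output.wav", audio, 24000)
```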

# 📚 Citation

If you find this repository useful, please cite our paper:

TaDiCodec:
```bibtex
@article{tadicodec2025,
  title={TaDiCodec: Text-aware Diffusion Speech Tokenizer for Speech Language Modeling},
  author={Yuancheng Wang and Dekun Chen and Xueyao Zhang and Junan Zhang and Jiaqi Li and Zhizheng Wu},
  journal={arXiv preprint},
  year={2025},
  url={https://hecheng0625.github.io/assets/pdf/Arxiv_TaDiCodec.pdf}
}
```

Amphion:
```bibtex
@inproceedings{amphion,
  author={Xueyao Zhang and Liumeng Xue and Yicheng Gu and Yuancheng Wang and Jiaqi Li and Haorui He and Chaoren Wang and Ting Song and Xi Chen and Zihao Fang and Haopeng Chen and Junan Zhang and Tze Ying Tang and Lexiao Zou and Mingxuan Wang and Jun Han and Kai Chen and Haizhou Li and Zhizheng Wu},
  title={Amphion: An Open-Source Audio, Music and Speech Generation Toolkit},
  booktitle={{IEEE} Spoken Language Technology Workshop, {SLT} 2024},
  year={2024}
}
```

MaskGCT:
```bibtex
@inproceedings{wang2024maskgct,
  author={Wang, Yuancheng and Zhan, Haoyue and Liu, Liwei and Zeng, Ruihong and Guo, Haotian and Zheng, Jiachen and Zhang, Qiang and Zhang, Xueyao and Zhang, Shunsi and Wu, Zhizheng},
  title={MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer},
  booktitle={{ICLR}},
  publisher={OpenReview.net},
  year={2025}
}
```

# 🙏 Acknowledgments

- **MGM-based TTS** is built upon [MaskGCT](https://github.com/open-mmlab/Amphion/tree/main/models/tts/maskgct).

- **Vocos vocoder** is built upon [Vocos](https://github.com/gemelo-ai/vocos).

- **NAR Llama-style transformers** are built upon [transformers](https://github.com/huggingface/transformers).

- **Binary Spherical Quantization (BSQ)** is built upon [vector-quantize-pytorch](https://github.com/lucidrains/vector-quantize-pytorch) and [bsq-vit](https://github.com/zhaoyue-zephyrus/bsq-vit).

- **Training codebase** is built upon [Amphion](https://github.com/open-mmlab/Amphion) and [accelerate](https://github.com/huggingface/accelerate).