---
license: cc-by-nc-4.0
---
# DistilCodec

The Joint Laboratory of International Digital Economy Academy (IDEA) and Emdoor, in collaboration with Emdoor Information Technology Co., Ltd., has launched DistilCodec: a single-codebook Neural Audio Codec (NAC) with 32768 codes, trained on universal audio.

[arXiv](https://arxiv.org/abs/2408.16532)
[🤗 Hugging Face](https://huggingface.co/IDEA-Emdoor/DistilCodec-v1.0)

# 🔥 News

- *2025.05.25*: We release the code of DistilCodec-v1.0, including training and inference.
- *2025.05.23*: We release UniTTS and DistilCodec on [arXiv](https://arxiv.org/abs/2408.16532).

## Introduction of DistilCodec

The foundational network architecture of DistilCodec adopts an Encoder-VQ-Decoder framework similar to the one proposed in SoundStream. The encoder employs a ConvNeXt-V2 structure, while the vector-quantization module implements the GRFVQ scheme. The decoder employs a ConvTranspose1d-based architecture similar to HiFiGAN; detailed network specifications and layer configurations are provided in Appendix A.1 of the paper. The training methodology of DistilCodec follows a similar approach to HiFiGAN, incorporating three types of discriminators: Multi-Period Discriminator (MPD), Multi-Scale Discriminator (MSD), and Multi-STFT Discriminator (MSTFTD). Here is the architecture of DistilCodec:

![The Architecture of DistilCodec](data/distilcodec.png)
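
To make the data flow concrete, here is a minimal, hypothetical sketch of an Encoder-VQ-Decoder round trip using the `VectorQuantize` module from vector-quantize-pytorch (referenced below). The single conv layers stand in for the real ConvNeXt-V2 encoder and HiFiGAN-style decoder stacks, so treat every layer parameter as illustrative, not as DistilCodec's actual configuration:

```python
import torch
import torch.nn as nn
from vector_quantize_pytorch import VectorQuantize

class EncoderVQDecoderSketch(nn.Module):
    """Toy Encoder-VQ-Decoder skeleton; not the real DistilCodec layers."""

    def __init__(self, dim: int = 256, codebook_size: int = 32768):
        super().__init__()
        # Stand-in encoder (DistilCodec uses a ConvNeXt-V2 stack).
        self.encoder = nn.Conv1d(1, dim, kernel_size=16, stride=8, padding=4)
        # A single codebook with 32768 codes, matching DistilCodec's vocabulary size.
        self.vq = VectorQuantize(dim=dim, codebook_size=codebook_size)
        # Stand-in decoder (DistilCodec uses a HiFiGAN-style ConvTranspose1d stack).
        self.decoder = nn.ConvTranspose1d(dim, 1, kernel_size=16, stride=8, padding=4)

    def forward(self, wav: torch.Tensor):
        z = self.encoder(wav)                                # (B, dim, T')
        zq, codes, commit_loss = self.vq(z.transpose(1, 2))  # VQ expects (B, T', dim)
        recon = self.decoder(zq.transpose(1, 2))             # (B, 1, T)
        return recon, codes, commit_loss

model = EncoderVQDecoderSketch()
wav = torch.randn(1, 1, 24000)  # one second of audio at 24 kHz
recon, codes, loss = model(wav)
print(recon.shape, codes.shape)  # torch.Size([1, 1, 24000]) torch.Size([1, 3000])
```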

Distribution of the DistilCodec training data is shown in the table below:

| **Data Category**    | **Data Size (in hours)** |
|----------------------|--------------------------|
| Chinese Audiobook    | 38000                    |
| Chinese Common Audio | 20000                    |
| English Audio        | 40000                    |
| Music                | 2000                     |
| **Total**            | **100000**               |

## Inference of DistilCodec

The inference code is available in the [DistilCodec](https://github.com/IDEA-Emdoor-Lab/DistilCodec) repository.

### Part 1: Generating discrete audio codes

```python
from distil_codec import DistilCodec, demo_for_generate_audio_codes

codec_model_config_path = 'path_to_model_config'
codec_ckpt_path = 'path_to_codec_ckpt_path'
step = 204000

# Load the pretrained codec (with its generator) and switch it to eval mode.
codec = DistilCodec.from_pretrained(
    config_path=codec_model_config_path,
    model_path=codec_ckpt_path,
    load_steps=step,
    use_generator=True,
    is_debug=False).eval()

# Encode an audio file (resampled to 24 kHz) into discrete audio tokens.
audio_path = 'path_to_audio'
audio_tokens = demo_for_generate_audio_codes(codec, audio_path, target_sr=24000)
print(audio_tokens)
```

### Part 2: Reconstructing audio from a raw wav

```python
from distil_codec import DistilCodec, demo_for_generate_audio_codes

codec_model_config_path = 'path_to_model_config'
codec_ckpt_path = 'path_to_codec_ckpt_path'
step = 204000

# Load the pretrained codec (with its generator) and switch it to eval mode.
codec = DistilCodec.from_pretrained(
    config_path=codec_model_config_path,
    model_path=codec_ckpt_path,
    load_steps=step,
    use_generator=True,
    is_debug=False).eval()

# Encode an audio file (resampled to 24 kHz) into discrete audio tokens.
audio_path = 'path_to_audio'
audio_tokens = demo_for_generate_audio_codes(codec, audio_path, target_sr=24000)
print(audio_tokens)

# Decode the tokens back to a waveform and save it;
# the generated file is f'{gen_audio_save_path}/{audio_name}.wav'.
gen_audio_save_path = 'path_to_save_path'
audio_name = 'your_audio_name'
y_gen = codec.decode_from_codes(audio_tokens, minus_token_offset=True)
codec.save_wav(
    audio_gen_batch=y_gen,
    nhop_lengths=[y_gen.shape[-1]],
    save_path=gen_audio_save_path,
    name_tag=audio_name
)
```
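
A note on `minus_token_offset=True`: judging by the parameter name (this is an inference from the API surface, not documented here), the token IDs produced by `demo_for_generate_audio_codes` carry a fixed offset, which `decode_from_codes` removes before codebook lookup; keep the flag consistent with how the codes were generated.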

## Available DistilCodec models

The 🤗 link points to the Hugging Face model hub.

| Model Version    | Hugging Face | Corpus          | Tokens/s | Domain                           | Open-Source |
|------------------|--------------|-----------------|----------|----------------------------------|-------------|
| DistilCodec-v1.0 | [🤗](https://huggingface.co/IDEA-Emdoor/DistilCodec-v1.0) | Universal Audio | 93 | Audiobook, Speech, Audio Effects | √ |
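
As a rough sanity check on the token rate (assuming the 24 kHz sampling rate used in the inference examples above): 93 tokens per second corresponds to about 24000 / 93 ≈ 258 audio samples per token.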

## References

The overall training pipeline of DistilCodec draws inspiration from AcademiCodec, while its encoder and decoder design is adapted from fish-speech. The Vector Quantization (VQ) component implements GRFVQ using the vector-quantize-pytorch framework (see the sketch after the reference links). These three excellent projects provided invaluable help in our implementation of DistilCodec:

[1] [vector-quantize-pytorch](https://github.com/lucidrains/vector-quantize-pytorch)

[2] [AcademiCodec](https://github.com/moewiee/hificodec)

[3] [fish-speech](https://github.com/fishaudio/fish-speech)

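For readers curious how a grouped-residual quantizer can be instantiated with vector-quantize-pytorch, the sketch below uses the library's documented `GroupedResidualVQ` class; the dimension, group count, quantizer depth, and codebook size are illustrative placeholders, not the configuration DistilCodec's teacher codec was trained with (see the paper for that):

```python
import torch
from vector_quantize_pytorch import GroupedResidualVQ

# Illustrative hyperparameters only -- not DistilCodec's actual settings.
grfvq = GroupedResidualVQ(
    dim=256,            # feature dimension of the latent sequence
    num_quantizers=8,   # residual quantizer depth per group
    groups=2,           # number of groups the feature dim is split into
    codebook_size=1024  # codes per codebook
)

latents = torch.randn(1, 100, 256)  # (batch, time, dim)
quantized, indices, commit_loss = grfvq(latents)
print(quantized.shape)  # (1, 100, 256): quantized latents
print(indices.shape)    # (2, 1, 100, 8): per-group, per-quantizer code indices
```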

## Citation

If you find this code useful in your research, please cite our work:

```bibtex
@article{wang2025unitts,
  title={UniTTS: An end-to-end TTS system without decoupling of acoustic and semantic information},
  author={Rui Wang and Qianguo Sun and Tianrong Chen and Zhiyun Zeng and Junlong Wu and Jiaxing Zhang},
  journal={arXiv preprint arXiv:2408.16532},
  year={2025}
}
```