---
license: cc-by-nc-4.0
---
# DistilCodec

The Joint Laboratory of International Digital Economy Academy (IDEA) and Emdoor, in collaboration with Emdoor Information Technology Co., Ltd., has launched DistilCodec: a single-codebook Neural Audio Codec (NAC) with 32768 codes, trained on universal audio.

[arXiv](https://arxiv.org/abs/2408.16532)
[🤗 Hugging Face](https://huggingface.co/IDEA-Emdoor/DistilCodec-v1.0)

# 🔥 News

- *2025.05.25*: We release the code of DistilCodec-v1.0, including training and inference.
- *2025.05.23*: We release UniTTS and DistilCodec on [arXiv](https://arxiv.org/abs/2408.16532).

## Introduction of DistilCodec

The foundational network architecture of DistilCodec adopts an Encoder-VQ-Decoder framework similar to the one proposed in SoundStream. The encoder employs a ConvNeXt-V2 structure, while the vector-quantization module implements the GRFVQ scheme. The decoder employs a ConvTranspose1d-based architecture similar to HiFiGAN; detailed network specifications and layer configurations are provided in Appendix A.1 of the paper. The training methodology of DistilCodec follows a similar approach to HiFiGAN, incorporating three types of discriminators: Multi-Period Discriminator (MPD), Multi-Scale Discriminator (MSD), and Multi-STFT Discriminator (MSTFTD). Here is the architecture of DistilCodec:

![The Architecture of DistilCodec](data/distilcodec.png)
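
To make the data flow concrete, here is a minimal, hypothetical sketch of an Encoder-VQ-Decoder round trip using the `VectorQuantize` module from vector-quantize-pytorch (referenced below). The single conv layers stand in for the real ConvNeXt-V2 encoder and HiFiGAN-style decoder stacks, so treat every layer parameter as illustrative, not as DistilCodec's actual configuration:

```python
import torch
import torch.nn as nn
from vector_quantize_pytorch import VectorQuantize

class EncoderVQDecoderSketch(nn.Module):
    """Toy Encoder-VQ-Decoder skeleton; not the real DistilCodec layers."""

    def __init__(self, dim: int = 256, codebook_size: int = 32768):
        super().__init__()
        # Stand-in encoder (DistilCodec uses a ConvNeXt-V2 stack).
        self.encoder = nn.Conv1d(1, dim, kernel_size=16, stride=8, padding=4)
        # A single codebook with 32768 codes, matching DistilCodec's vocabulary size.
        self.vq = VectorQuantize(dim=dim, codebook_size=codebook_size)
        # Stand-in decoder (DistilCodec uses a HiFiGAN-style ConvTranspose1d stack).
        self.decoder = nn.ConvTranspose1d(dim, 1, kernel_size=16, stride=8, padding=4)

    def forward(self, wav: torch.Tensor):
        z = self.encoder(wav)                                # (B, dim, T')
        zq, codes, commit_loss = self.vq(z.transpose(1, 2))  # VQ expects (B, T', dim)
        recon = self.decoder(zq.transpose(1, 2))             # (B, 1, T)
        return recon, codes, commit_loss

model = EncoderVQDecoderSketch()
wav = torch.randn(1, 1, 24000)  # one second of audio at 24 kHz
recon, codes, loss = model(wav)
print(recon.shape, codes.shape)  # torch.Size([1, 1, 24000]) torch.Size([1, 3000])
```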

Distribution of the DistilCodec training data is shown in the table below:

| **Data Category**    | **Data Size (in hours)** |
|----------------------|--------------------------|
| Chinese Audiobook    | 38000                    |
| Chinese Common Audio | 20000                    |
| English Audio        | 40000                    |
| Music                | 2000                     |
| **Total**            | **100000**               |

## Inference of DistilCodec

The inference code is available in the [DistilCodec](https://github.com/IDEA-Emdoor-Lab/DistilCodec) repository.

### Part 1: Generating discrete audio codes

```python
from distil_codec import DistilCodec, demo_for_generate_audio_codes

codec_model_config_path = 'path_to_model_config'
codec_ckpt_path = 'path_to_codec_ckpt_path'
step = 204000

# Load the pretrained codec (with its generator) and switch it to eval mode.
codec = DistilCodec.from_pretrained(
    config_path=codec_model_config_path,
    model_path=codec_ckpt_path,
    load_steps=step,
    use_generator=True,
    is_debug=False).eval()

# Encode an audio file (resampled to 24 kHz) into discrete audio tokens.
audio_path = 'path_to_audio'
audio_tokens = demo_for_generate_audio_codes(codec, audio_path, target_sr=24000)
print(audio_tokens)
```

### Part 2: Reconstructing audio from a raw wav

```python
from distil_codec import DistilCodec, demo_for_generate_audio_codes

codec_model_config_path = 'path_to_model_config'
codec_ckpt_path = 'path_to_codec_ckpt_path'
step = 204000

# Load the pretrained codec (with its generator) and switch it to eval mode.
codec = DistilCodec.from_pretrained(
    config_path=codec_model_config_path,
    model_path=codec_ckpt_path,
    load_steps=step,
    use_generator=True,
    is_debug=False).eval()

# Encode an audio file (resampled to 24 kHz) into discrete audio tokens.
audio_path = 'path_to_audio'
audio_tokens = demo_for_generate_audio_codes(codec, audio_path, target_sr=24000)
print(audio_tokens)

# Decode the tokens back to a waveform and save it;
# the generated file is f'{gen_audio_save_path}/{audio_name}.wav'.
gen_audio_save_path = 'path_to_save_path'
audio_name = 'your_audio_name'
y_gen = codec.decode_from_codes(audio_tokens, minus_token_offset=True)
codec.save_wav(
    audio_gen_batch=y_gen,
    nhop_lengths=[y_gen.shape[-1]],
    save_path=gen_audio_save_path,
    name_tag=audio_name
)
```
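
A note on `minus_token_offset=True`: judging by the parameter name (this is an inference from the API surface, not documented here), the token IDs produced by `demo_for_generate_audio_codes` carry a fixed offset, which `decode_from_codes` removes before codebook lookup; keep the flag consistent with how the codes were generated.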

## Available DistilCodec models

The 🤗 link points to the Hugging Face model hub.

| Model Version    | Hugging Face | Corpus          | Tokens/s | Domain                           | Open-Source |
|------------------|--------------|-----------------|----------|----------------------------------|-------------|
| DistilCodec-v1.0 | [🤗](https://huggingface.co/IDEA-Emdoor/DistilCodec-v1.0) | Universal Audio | 93 | Audiobook, Speech, Audio Effects | √ |
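
As a rough sanity check on the token rate (assuming the 24 kHz sampling rate used in the inference examples above): 93 tokens per second corresponds to about 24000 / 93 ≈ 258 audio samples per token.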

## References

The overall training pipeline of DistilCodec draws inspiration from AcademiCodec, while its encoder and decoder design is adapted from fish-speech. The Vector Quantization (VQ) component implements GRFVQ using the vector-quantize-pytorch framework (see the sketch after the reference links). These three excellent projects provided invaluable help in our implementation of DistilCodec:

[1] [vector-quantize-pytorch](https://github.com/lucidrains/vector-quantize-pytorch)

[2] [AcademiCodec](https://github.com/moewiee/hificodec)

[3] [fish-speech](https://github.com/fishaudio/fish-speech)

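For readers curious how a grouped-residual quantizer can be instantiated with vector-quantize-pytorch, the sketch below uses the library's documented `GroupedResidualVQ` class; the dimension, group count, quantizer depth, and codebook size are illustrative placeholders, not the configuration DistilCodec's teacher codec was trained with (see the paper for that):

```python
import torch
from vector_quantize_pytorch import GroupedResidualVQ

# Illustrative hyperparameters only -- not DistilCodec's actual settings.
grfvq = GroupedResidualVQ(
    dim=256,            # feature dimension of the latent sequence
    num_quantizers=8,   # residual quantizer depth per group
    groups=2,           # number of groups the feature dim is split into
    codebook_size=1024  # codes per codebook
)

latents = torch.randn(1, 100, 256)  # (batch, time, dim)
quantized, indices, commit_loss = grfvq(latents)
print(quantized.shape)  # (1, 100, 256): quantized latents
print(indices.shape)    # (2, 1, 100, 8): per-group, per-quantizer code indices
```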

## Citation

If you find this code useful in your research, please cite our work:

```bibtex
@article{wang2025unitts,
  title={UniTTS: An end-to-end TTS system without decoupling of acoustic and semantic information},
  author={Rui Wang and Qianguo Sun and Tianrong Chen and Zhiyun Zeng and Junlong Wu and Jiaxing Zhang},
  journal={arXiv preprint arXiv:2408.16532},
  year={2025}
}
```