---
license: cc-by-nc-sa-4.0

language:
- en
tags:
- audio
---
# Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation

**Auffusion** is a latent diffusion model (LDM) for text-to-audio (TTA) generation. It can generate realistic audio, including human sounds, animal sounds, natural and artificial sounds, and sound effects, from textual prompts. Auffusion adapts text-to-image (T2I) diffusion frameworks to the TTA task, effectively leveraging their inherent generative strengths and precise cross-modal alignment. Our objective and subjective evaluations demonstrate that Auffusion surpasses previous TTA approaches while using limited data and computational resources. We release our model, inference code, and pre-trained checkpoints for the research community.

📣 We are releasing **Auffusion-Full-no-adapter**, which was pre-trained on all datasets described in the paper and is designed for easy audio manipulation.

📣 We are releasing **Auffusion-Full**, which was pre-trained on all datasets described in the paper.

📣 We are releasing **Auffusion**, which was pre-trained on **AudioCaps**.

## Auffusion Model Family

| Model Name                 | Model Path                                                                                                              |
|----------------------------|------------------------------------------------------------------------------------------------------------------------ |
| Auffusion                  | [https://huggingface.co/auffusion/auffusion](https://huggingface.co/auffusion/auffusion)                                |
| Auffusion-Full             | [https://huggingface.co/auffusion/auffusion-full](https://huggingface.co/auffusion/auffusion-full)                      |
| Auffusion-Full-no-adapter  | [https://huggingface.co/auffusion/auffusion-full-no-adapter](https://huggingface.co/auffusion/auffusion-full-no-adapter)|

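All three checkpoints are hosted as standard Hugging Face Hub repositories, so any of them can be fetched programmatically before running inference. The snippet below is a minimal sketch using `huggingface_hub.snapshot_download`; the repo id is the only thing you need to change to pick a different model from the table above.

```python
# Sketch: pre-download one of the Auffusion checkpoints from the Hub.
from huggingface_hub import snapshot_download

# Any repo id from the table above works here.
repo_id = "auffusion/auffusion-full-no-adapter"

# Downloads the repository (or reuses the cached copy) and returns
# the local directory containing the model files.
local_path = snapshot_download(repo_id)
print(local_path)
```

The returned path can be passed directly to the loading code in the Quickstart Guide below.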

## Code

Our code is released here: [https://github.com/happylittlecat2333/Auffusion](https://github.com/happylittlecat2333/Auffusion)

We uploaded several **Auffusion** generated samples here: [https://auffusion.github.io](https://auffusion.github.io)

Please follow the instructions in the repository for installation, usage and experiments.


## Quickstart Guide

We designed **Auffusion-Full-no-adapter** to be compatible with the text-to-image pipeline, so diffusers pipelines such as StableDiffusionPipeline, StableDiffusionImg2ImgPipeline, and StableDiffusionInpaintPipeline can be adapted to audio generation and manipulation. Other audio manipulation examples can be found in [https://github.com/happylittlecat2333/Auffusion/notebooks](https://github.com/happylittlecat2333/Auffusion/notebooks). We only show the default text-to-audio example here.


First, git clone the repository and install the requirements:

```bash
git clone https://github.com/happylittlecat2333/Auffusion/
cd Auffusion
pip install -r requirements.txt
```

Then, download the **Auffusion-Full-no-adapter** model and generate audio from a text prompt:

```python
import os

import IPython
import soundfile as sf
import torch
from diffusers import StableDiffusionPipeline
from huggingface_hub import snapshot_download

from converter import Generator, denormalize_spectrogram

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16

prompt = "A kitten mewing for attention"
seed = 42

# Use a local checkpoint directory if it exists, otherwise download it from the Hub.
pretrained_model_name_or_path = "auffusion/auffusion-full-no-adapter"
if not os.path.isdir(pretrained_model_name_or_path):
    pretrained_model_name_or_path = snapshot_download(pretrained_model_name_or_path)

# Vocoder that converts mel spectrograms back into waveforms.
vocoder = Generator.from_pretrained(pretrained_model_name_or_path, subfolder="vocoder")
vocoder = vocoder.to(device=device, dtype=dtype)

# Diffusion pipeline that generates mel spectrograms from text.
pipe = StableDiffusionPipeline.from_pretrained(pretrained_model_name_or_path, torch_dtype=dtype)
pipe = pipe.to(device)

generator = torch.Generator(device=device).manual_seed(seed)

with torch.autocast("cuda"):
    output_spec = pipe(
        prompt=prompt, num_inference_steps=100, generator=generator, height=256, width=1024, output_type="pt"
    ).images[0]
    # important: set output_type="pt" to get a torch tensor, and keep height=256 with width=1024

# Undo the spectrogram normalization and vocode it into a waveform.
denorm_spec = denormalize_spectrogram(output_spec)
audio = vocoder.inference(denorm_spec)

sf.write(f"{prompt}.wav", audio, samplerate=16000)
IPython.display.Audio(data=audio, rate=16000)
```

The Auffusion model will be downloaded automatically from the Hugging Face Hub and cached locally. Subsequent runs will load the model directly from the cache.
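
Because **Auffusion-Full-no-adapter** keeps the text-to-image interface, other diffusers pipelines can also be pointed at the same checkpoint for audio manipulation. The sketch below is one unofficial example: it continues from the text-to-audio snippet above (reusing `pretrained_model_name_or_path`, `device`, `dtype`, `vocoder`, `denormalize_spectrogram`, and `output_spec`) and feeds the generated spectrogram back through StableDiffusionImg2ImgPipeline under a new prompt to produce a prompt-guided variation. The new prompt, `strength` value, and output filename are illustrative; see the repository notebooks for the intended manipulation workflows.

```python
from diffusers import StableDiffusionImg2ImgPipeline

# Reuses the objects defined in the text-to-audio example above.
img2img_pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    pretrained_model_name_or_path, torch_dtype=dtype
).to(device)

new_prompt = "A kitten mewing while rain falls in the background"  # illustrative prompt

with torch.autocast("cuda"):
    edited_spec = img2img_pipe(
        prompt=new_prompt,
        image=output_spec.unsqueeze(0),  # (1, 3, 256, 1024) spectrogram treated as an image
        strength=0.5,                    # lower values stay closer to the original audio
        num_inference_steps=100,
        output_type="pt",
    ).images[0]

edited_audio = vocoder.inference(denormalize_spectrogram(edited_spec))
sf.write("kitten_with_rain.wav", edited_audio, samplerate=16000)
```

As in standard image-to-image generation, `strength` controls how much of the source spectrogram is preserved: values closer to 0 keep more of the original audio, while values closer to 1 follow the new prompt more freely.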

##  Citation

Please consider citing the following article if you found our work useful:

```bibtex
@article{xue2024auffusion,
  title={Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation}, 
  author={Jinlong Xue and Yayue Deng and Yingming Gao and Ya Li},
  journal={arXiv preprint arXiv:2401.01044},
  year={2024}
}
```