Commit b3aa8aa · Parent(s): 5e1459d
first commit

README.md CHANGED
@@ -1,7 +1,6 @@
 ---
 license: cc-by-nc-sa-4.0
-
-- AudioCaps+others
+
 language:
 - en
 tags:
@@ -12,7 +11,9 @@ tags:
 **Auffusion** is a latent diffusion model (LDM) for text-to-audio (TTA) generation. **Auffusion** can generate realistic audio, including human sounds, animal sounds, natural and artificial sounds, and sound effects, from textual prompts. We introduce Auffusion, a TTA system that adapts text-to-image (T2I) model frameworks to the TTA task, effectively leveraging their inherent generative strengths and precise cross-modal alignment. Our objective and subjective evaluations demonstrate that Auffusion surpasses previous TTA approaches while using limited data and computational resources. We release our model, inference code, and pre-trained checkpoints for the research community.
 
 📣 We are releasing **Auffusion-Full-no-adapter**, which was pre-trained on all datasets described in the paper and created for easy audio manipulation.
+
 📣 We are releasing **Auffusion-Full**, which was pre-trained on all datasets described in the paper.
+
 📣 We are releasing **Auffusion**, which was pre-trained on **AudioCaps**.
 
 ## Auffusion Model Family
@@ -76,7 +77,8 @@ generator = torch.Generator(device=device).manual_seed(seed)
 with torch.autocast("cuda"):
     output_spec = pipe(
         prompt=prompt, num_inference_steps=100, generator=generator, height=256, width=1024, output_type="pt"
-    ).images[0]
+    ).images[0]
+    # important: set output_type="pt" to get a torch tensor output, and set height=256 with width=1024
 
 
 denorm_spec = denormalize_spectrogram(output_spec)
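The hunk context above shows the README fixing a seed via `generator = torch.Generator(device=device).manual_seed(seed)` before calling the pipeline. As a minimal sketch (not code from the Auffusion repo) of what this buys, two identically seeded generators drive `torch.randn` to the same noise, which is why a diffusion pipeline fed such a generator reproduces the same spectrogram on every run:

```python
import torch

# Two generators seeded with the same value produce identical noise.
# A diffusion pipeline's initial latent comes from exactly this kind
# of draw, so fixing the seed makes the generated spectrogram repeatable.
seed = 42
g1 = torch.Generator(device="cpu").manual_seed(seed)
g2 = torch.Generator(device="cpu").manual_seed(seed)

noise_a = torch.randn(4, 8, generator=g1)
noise_b = torch.randn(4, 8, generator=g2)

print(torch.equal(noise_a, noise_b))  # True: same seed, same noise
```

The snippet uses `device="cpu"` for portability; the README's version seeds a generator on the pipeline's device (e.g. `"cuda"`), which works the same way.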