happpylittlecat committed
Commit b3aa8aa · 1 Parent(s): 5e1459d

first commit

Files changed (1)
1. README.md +5 -3
README.md CHANGED
@@ -1,7 +1,6 @@
 ---
 license: cc-by-nc-sa-4.0
-datasets:
-- AudioCaps+others
+
 language:
 - en
 tags:
@@ -12,7 +11,9 @@ tags:
 **Auffusion** is a latent diffusion model (LDM) for text-to-audio (TTA) generation. **Auffusion** can generate realistic audio, including human sounds, animal sounds, natural and artificial sounds, and sound effects, from textual prompts. We introduce Auffusion, a TTA system that adapts text-to-image (T2I) model frameworks to the TTA task by effectively leveraging their inherent generative strengths and precise cross-modal alignment. Our objective and subjective evaluations demonstrate that Auffusion surpasses previous TTA approaches while using limited data and computational resources. We release our model, inference code, and pre-trained checkpoints for the research community.

 📣 We are releasing **Auffusion-Full-no-adapter**, which was pre-trained on all datasets described in the paper and created for easy audio manipulation.
+
 📣 We are releasing **Auffusion-Full**, which was pre-trained on all datasets described in the paper.
+
 📣 We are releasing **Auffusion**, which was pre-trained on **AudioCaps**.

 ## Auffusion Model Family
@@ -76,7 +77,8 @@ generator = torch.Generator(device=device).manual_seed(seed)
 with torch.autocast("cuda"):
     output_spec = pipe(
         prompt=prompt, num_inference_steps=100, generator=generator, height=256, width=1024, output_type="pt"
-    ).images[0] # important to set output_type="pt" to get torch tensor output, and set height=256 with width=1024
+    ).images[0]
+# important to set output_type="pt" to get torch tensor output, and set height=256 with width=1024


 denorm_spec = denormalize_spectrogram(output_spec)
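The last hunk only moves an inline comment onto its own line, but it may help to see how those lines fit into a complete generation call. The sketch below is assembled from the snippet in the hunk plus labeled assumptions: the checkpoint id (`auffusion/auffusion`), loading via diffusers' `StableDiffusionPipeline`, and the import path of `denormalize_spectrogram` are placeholders not confirmed by this commit, and the final spectrogram-to-waveform (vocoder) step is left as a comment because its API does not appear in the diff.

```python
# Minimal sketch of the generation flow around the hunk above.
# Assumptions (not confirmed by this commit): the checkpoint id, loading via
# diffusers' StableDiffusionPipeline, and the import path of
# denormalize_spectrogram (a helper from the Auffusion codebase).
import torch
from diffusers import StableDiffusionPipeline

from converter import denormalize_spectrogram  # Auffusion repo helper; import path is an assumption

device = "cuda"
seed = 42
prompt = "birds singing in the woods"  # example prompt, not from the README

# "auffusion/auffusion" is a placeholder; use the repo id of this model card.
pipe = StableDiffusionPipeline.from_pretrained("auffusion/auffusion", torch_dtype=torch.float16)
pipe = pipe.to(device)

generator = torch.Generator(device=device).manual_seed(seed)

with torch.autocast("cuda"):
    output_spec = pipe(
        prompt=prompt,
        num_inference_steps=100,
        generator=generator,
        height=256,        # spectrogram height the comment says to use
        width=1024,        # spectrogram width the comment says to use
        output_type="pt",  # return a torch tensor instead of a PIL image
    ).images[0]

# Undo the normalization applied when the spectrogram was encoded as an image.
denorm_spec = denormalize_spectrogram(output_spec)

# A neural vocoder shipped with the Auffusion code then converts denorm_spec
# (a mel spectrogram) into a waveform; its exact API is not shown in this diff.
```

The detail the moved comment calls out is unchanged: `output_type="pt"` keeps the generated spectrogram as a torch tensor, and `height=256` with `width=1024` match the spectrogram resolution the pipeline expects before `denormalize_spectrogram` is applied.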