Commit b3aa8aa · Parent(s): 5e1459d
first commit

README.md CHANGED
@@ -1,7 +1,6 @@
 ---
 license: cc-by-nc-sa-4.0
-
-- AudioCaps+others
+
 language:
 - en
 tags:
@@ -12,7 +11,9 @@ tags:
 **Auffusion** is a latent diffusion model (LDM) for text-to-audio (TTA) generation. **Auffusion** can generate realistic audio, including human sounds, animal sounds, natural and artificial sounds, and sound effects, from textual prompts. We introduce Auffusion, a TTA system that adapts text-to-image (T2I) model frameworks to the TTA task, effectively leveraging their inherent generative strengths and precise cross-modal alignment. Our objective and subjective evaluations demonstrate that Auffusion surpasses previous TTA approaches while using limited data and computational resources. We release our model, inference code, and pre-trained checkpoints for the research community.
 
 📣 We are releasing **Auffusion-Full-no-adapter**, which was pre-trained on all datasets described in the paper and created for easy audio manipulation.
+
 📣 We are releasing **Auffusion-Full**, which was pre-trained on all datasets described in the paper.
+
 📣 We are releasing **Auffusion**, which was pre-trained on **AudioCaps**.
 
 ## Auffusion Model Family
@@ -76,7 +77,8 @@ generator = torch.Generator(device=device).manual_seed(seed)
 with torch.autocast("cuda"):
     output_spec = pipe(
         prompt=prompt, num_inference_steps=100, generator=generator, height=256, width=1024, output_type="pt"
-    ).images[0]
+    ).images[0]
+    # important: set output_type="pt" to get a torch tensor output, and set height=256 with width=1024
 
 
 denorm_spec = denormalize_spectrogram(output_spec)
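The hunk context above shows the README fixing a seed via `generator = torch.Generator(device=device).manual_seed(seed)` before calling the pipeline. As a minimal sketch (not code from the Auffusion repo) of what this buys, two identically seeded generators drive `torch.randn` to the same noise, which is why a diffusion pipeline fed such a generator reproduces the same spectrogram on every run:

```python
import torch

# Two generators seeded with the same value produce identical noise.
# A diffusion pipeline's initial latent comes from exactly this kind
# of draw, so fixing the seed makes the generated spectrogram repeatable.
seed = 42
g1 = torch.Generator(device="cpu").manual_seed(seed)
g2 = torch.Generator(device="cpu").manual_seed(seed)

noise_a = torch.randn(4, 8, generator=g1)
noise_b = torch.randn(4, 8, generator=g2)

print(torch.equal(noise_a, noise_b))  # True: same seed, same noise
```

The snippet uses `device="cpu"` for portability; the README's version seeds a generator on the pipeline's device (e.g. `"cuda"`), which works the same way.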