microsoft/VibeVoice-1.5B

text-generation


weiruan

#11

by Gao8 - opened 7 days ago

base: refs/heads/main ← from: refs/pr/11


This PR is in draft mode

Files changed (1)

README.md +4 -5

README.md CHANGED

@@ -1,12 +1,11 @@
 ---
 language:
 - en
 - zh
-license: mit
 pipeline_tag: text-to-speech
 tags:
 - Podcast
-library_name: transformers
 ---
 ## VibeVoice: A Frontier Open-Source Text-to-Speech Model
@@ -27,7 +26,7 @@ The model can synthesize speech up to **90 minutes** long with up to **4 distinc
   <img src="figures/Fig1.png" alt="VibeVoice Overview" height="250px">
 </p>
-## Training Details
 Transformer-based Large Language Model (LLM) integrated with specialized acoustic and semantic tokenizers and a diffusion-based decoding head.
 - LLM: [Qwen2.5-1.5B](https://huggingface.co/Qwen/Qwen2.5-1.5B) for this release.
 - Tokenizers:
@@ -45,7 +44,7 @@ Transformer-based Large Language Model (LLM) integrated with specialized acousti
 |-------|----------------|----------|----------|
 | VibeVoice-0.5B-Streaming | - | - | On the way |
 | VibeVoice-1.5B | 64K | ~90 min | You are here. |
-| VibeVoice-Large| 32K | ~45 min | [HF link](https://huggingface.co/microsoft/VibeVoice-Large) |
 ## Installation and Usage
@@ -53,7 +52,7 @@ Please refer to [GitHub README](https://github.com/microsoft/VibeVoice?tab=readm
 ## Responsible Usage
 ### Direct intended uses
-The VibeVoice model is limited to research purpose use exploring highly realistic audio dialogue generation detailed in the [tech report](https://arxiv.org/pdf/2508.19205).
 ### Out-of-scope uses
 Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in any other way that is prohibited by MIT License. Use to generate any text transcript. Furthermore, this release is not intended or licensed for any of the following scenarios:

 ---
+license: mit
 language:
 - en
 - zh
 pipeline_tag: text-to-speech
 tags:
 - Podcast
 ---
 ## VibeVoice: A Frontier Open-Source Text-to-Speech Model
   <img src="figures/Fig1.png" alt="VibeVoice Overview" height="250px">
 </p>
+## Training details
 Transformer-based Large Language Model (LLM) integrated with specialized acoustic and semantic tokenizers and a diffusion-based decoding head.
 - LLM: [Qwen2.5-1.5B](https://huggingface.co/Qwen/Qwen2.5-1.5B) for this release.
 - Tokenizers:
 |-------|----------------|----------|----------|
 | VibeVoice-0.5B-Streaming | - | - | On the way |
 | VibeVoice-1.5B | 64K | ~90 min | You are here. |
+| VibeVoice-7B-Preview| 32K | ~45 min | [HF link](https://huggingface.co/WestZhang/VibeVoice-Large-pt) |
 ## Installation and Usage
 ## Responsible Usage
 ### Direct intended uses
+The VibeVoice model is limited to research purpose use exploring highly realistic audio dialogue generation detailed in the [tech report](https://github.com/microsoft/VibeVoice/blob/main/report/TechnicalReport.pdf).
 ### Out-of-scope uses
 Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in any other way that is prohibited by MIT License. Use to generate any text transcript. Furthermore, this release is not intended or licensed for any of the following scenarios: