weiruan
#11 by Gao8 - opened
README.md CHANGED
@@ -1,12 +1,11 @@
 ---
+license: mit
 language:
 - en
 - zh
-license: mit
 pipeline_tag: text-to-speech
 tags:
 - Podcast
-library_name: transformers
 ---
 
 ## VibeVoice: A Frontier Open-Source Text-to-Speech Model
@@ -27,7 +26,7 @@ The model can synthesize speech up to **90 minutes** long with up to **4 distinc
 <img src="figures/Fig1.png" alt="VibeVoice Overview" height="250px">
 </p>
 
-## Training
+## Training details
 Transformer-based Large Language Model (LLM) integrated with specialized acoustic and semantic tokenizers and a diffusion-based decoding head.
 - LLM: [Qwen2.5-1.5B](https://huggingface.co/Qwen/Qwen2.5-1.5B) for this release.
 - Tokenizers:
@@ -45,7 +44,7 @@ Transformer-based Large Language Model (LLM) integrated with specialized acousti
 |-------|----------------|----------|----------|
 | VibeVoice-0.5B-Streaming | - | - | On the way |
 | VibeVoice-1.5B | 64K | ~90 min | You are here. |
-| VibeVoice-
+| VibeVoice-7B-Preview| 32K | ~45 min | [HF link](https://huggingface.co/WestZhang/VibeVoice-Large-pt) |
 
 ## Installation and Usage
 
@@ -53,7 +52,7 @@ Please refer to [GitHub README](https://github.com/microsoft/VibeVoice?tab=readm
 
 ## Responsible Usage
 ### Direct intended uses
-The VibeVoice model is limited to research purpose use exploring highly realistic audio dialogue generation detailed in the [tech report](https://
+The VibeVoice model is limited to research purpose use exploring highly realistic audio dialogue generation detailed in the [tech report](https://github.com/microsoft/VibeVoice/blob/main/report/TechnicalReport.pdf).
 
 ### Out-of-scope uses
 Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in any other way that is prohibited by MIT License. Use to generate any text transcript. Furthermore, this release is not intended or licensed for any of the following scenarios: