Commit c9ba2bf by edwko (verified) · 1 Parent(s): 2800f48

Update README.md

  1. README.md +201 -3
README.md CHANGED
@@ -1,3 +1,201 @@
1
- ---
2
- license: cc-by-4.0
3
- ---
1
+ ---
2
+ license: cc-by-4.0
3
+ ---
4
+ <style>
5
+ table {
6
+ border-collapse: collapse;
7
+ width: 100%;
8
+ margin-bottom: 20px;
9
+ }
10
+ th, td {
11
+ border: 1px solid #ddd;
12
+ padding: 8px;
13
+ text-align: center;
14
+ }
15
+ .best {
16
+ font-weight: bold;
17
+ text-decoration: underline;
18
+ }
19
+ </style>
20
+
21
+ <div style="text-align: center; margin: 20px auto; padding: 20px; border: 3px solid #ddd; border-radius: 10px;">
22
+ <h2 style="margin-bottom: 4px; margin-top: 0px;">OuteAI</h2>
23
+ <a href="https://www.outeai.com/" target="_blank" style="margin-right: 10px;">🌎 OuteAI.com</a>
24
+ <a href="https://discord.gg/vyBM87kAmf" target="_blank" style="margin-right: 10px;">🤝 Join our Discord</a>
25
+ <a href="https://x.com/OuteAI" target="_blank">𝕏 @OuteAI</a>
26
+ </div>
27
+
28
+ # OuteTTS-0.1-350M
29
+
30
+ ## Model Description
31
+
32
+ OuteTTS-0.1-350M is a novel text-to-speech synthesis model that leverages pure language modeling without external adapters or complex architectures. Built upon the LLaMa architecture using our Oute3-350M-DEV base model, it demonstrates that high-quality speech synthesis is achievable through a straightforward approach using crafted prompts and audio tokens.
33
+
34
+ ## Key Features
35
+
36
+ - Pure language modeling approach to TTS
37
+ - Voice cloning capabilities
38
+ - LLaMa architecture
39
+ - Compatible with llama.cpp and GGUF format
40
+
41
+ ## Technical Details
42
+
43
+ The model utilizes a three-step approach to audio processing:
44
+ 1. Audio tokenization using WavTokenizer (75 tokens per second of audio)
45
+ 2. CTC forced alignment for precise word-to-audio token mapping
46
+ 3. Structured prompt creation following the format:
47
+ ```
48
+ [full transcription]
49
+ [word] [duration token] [audio tokens]
50
+ ```
51
+
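The three steps above can be sketched as follows. This is a minimal illustration, not the OuteTTS implementation: the token spellings (`<|t_...|>`, `<|a_...|>`), the `WordAlignment` type, and the `build_prompt` helper are hypothetical. Only the 75 tokens-per-second rate and the overall layout (full transcription first, then one line per word with a duration token and audio tokens) come from the description above.

```python
from dataclasses import dataclass
from typing import List

TOKENS_PER_SECOND = 75  # WavTokenizer rate stated in the README


@dataclass
class WordAlignment:
    """One word plus the audio-token ids aligned to it (hypothetical type)."""
    word: str
    audio_tokens: List[int]


def duration_seconds(n_tokens: int) -> float:
    """Convert an audio-token count to seconds at the stated token rate."""
    return n_tokens / TOKENS_PER_SECOND


def build_prompt(words: List[WordAlignment]) -> str:
    """Lay out the prompt: transcription line, then one line per word."""
    lines = [" ".join(w.word for w in words)]
    for w in words:
        # Token spellings below are assumptions made for illustration only.
        duration = f"<|t_{duration_seconds(len(w.audio_tokens)):.2f}|>"
        audio = "".join(f"<|a_{t}|>" for t in w.audio_tokens)
        lines.append(f"{w.word} {duration} {audio}")
    return "\n".join(lines)
```

In this sketch the duration token is derived from the number of audio tokens that forced alignment assigned to each word, so the two stay consistent by construction.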
52
+ ## Technical Blog
53
+ https://www.outeai.com/blog/OuteTTS-0.1-350M
54
+
55
+ ## Limitations
56
+
57
+ - Vocabulary constraints due to training data limitations
58
+ - String-only input support
59
+ - Given its compact 350M parameter size, the model may frequently alter, insert, or omit words, leading to variations in output quality.
60
+ - Sensitivity to the temperature setting varies by use case
61
+ - Performs best with shorter sentences, as accuracy may decrease with longer inputs.
62
+
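Since accuracy may decrease with longer inputs, one possible workaround is to split text into short chunks and synthesize each one separately. The sketch below covers only the splitting step. `split_sentences` is a hypothetical helper, not part of `outetts`. Its regex rule is naive and will mis-split abbreviations such as "Dr.", and the 20-word cap is an arbitrary assumption.

```python
import re
from typing import List


def split_sentences(text: str, max_words: int = 20) -> List[str]:
    """Split text on sentence-ending punctuation, then cap each chunk's length."""
    # Break after '.', '!' or '?' when followed by whitespace (naive rule).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    chunks: List[str] = []
    for sentence in sentences:
        words = sentence.split()
        # Sentences longer than max_words are cut into consecutive word windows.
        for start in range(0, len(words), max_words):
            chunks.append(" ".join(words[start:start + max_words]))
    return chunks
```

Each chunk could then be passed as `text` to the `interface.generate` call shown in the Usage section, with the resulting audio joined afterwards.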
63
+ ### Speech Samples
64
+
65
+ Listen to samples generated by OuteTTS-0.1-350M:
66
+
67
+ <div style="margin-top: 20px;">
68
+ <table style="width: 100%; border-collapse: collapse;">
69
+ <thead>
70
+ <tr>
71
+ <th style="border: 1px solid #ddd; padding: 8px; text-align: left;">Input</th>
72
+ <th style="border: 1px solid #ddd; padding: 8px; text-align: left;">Audio</th>
73
+ <th style="border: 1px solid #ddd; padding: 8px; text-align: left;">Notes</th>
74
+ </tr>
75
+ </thead>
76
+ <tbody>
77
+ <tr>
78
+ <td style="border: 1px solid #ddd; padding: 8px;">Hello, I can speak pretty well, but sometimes I make some mistakes.</td>
79
+ <td style="border: 1px solid #ddd; padding: 8px;">
80
+ <audio controls style="width: 100%;">
81
+ <source src="https://huggingface.co/OuteAI/OuteTTS-0.1-350M/resolve/main/samples/2.wav" type="audio/wav">
82
+ Your browser does not support the audio element.
83
+ </audio>
84
+ <audio controls style="width: 100%;">
85
+ <source src="https://huggingface.co/OuteAI/OuteTTS-0.1-350M/resolve/main/samples/1.wav" type="audio/wav">
86
+ Your browser does not support the audio element.
87
+ </audio>
88
+ </td>
89
+ <td style="border: 1px solid #ddd; padding: 8px;">(temperature=0.1, repetition_penalty=1.1)</td>
90
+ </tr>
91
+ <tr>
92
+ <td style="border: 1px solid #ddd; padding: 8px;">Once upon a time, there was a</td>
93
+ <td style="border: 1px solid #ddd; padding: 8px;">
94
+ <audio controls style="width: 100%;">
95
+ <source src="https://huggingface.co/OuteAI/OuteTTS-0.1-350M/resolve/main/samples/3.wav" type="audio/wav">
96
+ Your browser does not support the audio element.
97
+ </audio>
98
+ </td>
99
+ <td style="border: 1px solid #ddd; padding: 8px;">(temperature=0.1, repetition_penalty=1.1)</td>
100
+ </tr>
101
+ <tr>
102
+ <td style="border: 1px solid #ddd; padding: 8px;">Scientists have discovered a new planet that may be capable of supporting life!</td>
103
+ <td style="border: 1px solid #ddd; padding: 8px;">
104
+ <audio controls style="width: 100%;">
105
+ <source src="https://huggingface.co/OuteAI/OuteTTS-0.1-350M/resolve/main/samples/4.wav" type="audio/wav">
106
+ Your browser does not support the audio element.
107
+ </audio>
108
+ </td>
109
+ <td style="border: 1px solid #ddd; padding: 8px;">The model partially failed to follow the input text. (temperature=0.1, repetition_penalty=1.1) </td>
110
+ </tr>
111
+ <tr>
112
+ <td style="border: 1px solid #ddd; padding: 8px;">Scientists have discovered a new planet that may be capable of supporting life!</td>
113
+ <td style="border: 1px solid #ddd; padding: 8px;">
114
+ <audio controls style="width: 100%;">
115
+ <source src="https://huggingface.co/OuteAI/OuteTTS-0.1-350M/resolve/main/samples/5.wav" type="audio/wav">
116
+ Your browser does not support the audio element.
117
+ </audio>
118
+ </td>
119
+ <td style="border: 1px solid #ddd; padding: 8px;">In this case, raising the temperature from 0.1 to 0.7 produces more consistent output. (temperature=0.7, repetition_penalty=1.1)</td>
120
+ </tr>
121
+ </tbody>
122
+ </table>
123
+ </div>
124
+
125
+ ## Installation
126
+ https://github.com/outeai/OuteTTS
127
+
128
+ ```bash
129
+ pip install outetts
130
+ ```
131
+
132
+ ## Usage
133
+
134
+ ### Interface Usage
135
+ ```python
136
+ from outetts.v0_1.interface import InterfaceHF, InterfaceGGUF
137
+
138
+ # Initialize the interface with the Hugging Face model
139
+ interface = InterfaceHF("OuteAI/OuteTTS-0.1-350M")
140
+
141
+ # Or initialize the interface with a GGUF model
142
+ # interface = InterfaceGGUF("path/to/model.gguf")
143
+
144
+ # Generate TTS output
145
+ # Without a speaker reference, the model generates speech with random speaker characteristics
146
+ output = interface.generate(
147
+ text="Hello, am I working?",
148
+ temperature=0.1,
149
+ repetition_penalty=1.1,
150
+ max_lenght=4096
151
+ )
152
+
153
+ # Play the generated audio
154
+ output.play()
155
+
156
+ # Save the generated audio to a file
157
+ output.save("output.wav")
158
+ ```
159
+
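As a rough sanity check on the `max_lenght=4096` setting: at the 75 audio tokens per second stated under Technical Details, a 4096-token budget corresponds to an upper bound of about 54.6 seconds of audio. This sketch assumes the limit counts every token in the sequence, including the text prompt and per-word tokens, so real output would be shorter. `max_audio_seconds` and its `prompt_tokens` parameter are hypothetical names used for illustration.

```python
TOKENS_PER_SECOND = 75  # WavTokenizer rate stated in the README


def max_audio_seconds(max_tokens: int, prompt_tokens: int = 0) -> float:
    """Upper bound on audio length for a given token budget."""
    # Assumption: prompt tokens share the same budget as audio tokens.
    available = max(max_tokens - prompt_tokens, 0)
    return available / TOKENS_PER_SECOND


if __name__ == "__main__":
    print(f"{max_audio_seconds(4096):.1f} s")
```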
160
+ ### Voice Cloning
161
+ ```python
162
+ # Create a custom speaker from an audio file (reuses the `interface` object from the Interface Usage example)
163
+ speaker = interface.create_speaker(
164
+ "path/to/reference.wav",
165
+ "reference text matching the audio"
166
+ )
167
+
168
+ # Generate TTS with the custom voice
169
+ output = interface.generate(
170
+ text="This is a cloned voice speaking",
171
+ speaker=speaker,
172
+ temperature=0.1,
173
+ repetition_penalty=1.1,
174
+ max_lenght=4096
175
+ )
176
+ ```
177
+
178
+ ## Model Details
179
+ - **Model Type:** LLaMa-based language model
180
+ - **Size:** 350M parameters
181
+ - **Language Support:** English
182
+ - **License:** CC BY 4.0
183
+ - **Speech Datasets Used:**
184
+ - LibriTTS-R (CC BY 4.0)
185
+ - Multilingual LibriSpeech (MLS) (CC BY 4.0)
186
+
187
+ ## Future Improvements
188
+ - Scaling up parameters and training data
189
+ - Exploring alternative alignment methods for better character compatibility
190
+ - Potential expansion into speech-to-speech assistant models
191
+
192
+ ## Credits
193
+
194
+ - WavTokenizer: https://github.com/jishengpeng/WavTokenizer
195
+ - CTC Forced Alignment: https://pytorch.org/audio/stable/tutorials/ctc_forced_alignment_api_tutorial.html
196
+
197
+ ## Disclaimer
198
+ By using this model, you acknowledge that you understand and assume the risks associated with its use.
199
+ You are solely responsible for ensuring compliance with all applicable laws and regulations.
200
+ We disclaim any liability for problems arising from the use of this open-source model, including but not limited to direct, indirect, incidental, consequential, or punitive damages.
201
+ We make no warranties, express or implied, regarding the model's performance, accuracy, or fitness for a particular purpose. Your use of this model is at your own risk, and you agree to hold harmless and indemnify us, our affiliates, and our contributors from any claims, damages, or expenses arising from your use of the model.