Just a few questions about voice cloning, audio samples, etc.
Just curious, is there a (maximum) limit to how long the source audio sample can be?
For example, should the source audio sample always be 30 seconds or less, or can you go up to a 180+ second audio sample for the prompt?
Do longer audio samples make a difference in output quality when voice cloning?
Can the cloned voice be saved for future text-to-speech generations, or would the audio sample need to be provided every time a text-to-speech request is made?
Great questions. I'd also like to know.
Great questions!
Regarding the first question, we suggest using audio prompts shorter than 10 seconds, since the majority of the training data is composed of such examples. Additionally, for better generation quality, keep the prompt within the range of one to three sentences.
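If your reference clip is longer than that, a minimal sketch like the one below can trim it before use. This is just a generic soundfile snippet, not part of this project's codebase, and the file names are placeholders:

```python
import soundfile as sf

MAX_PROMPT_SECONDS = 10  # suggested upper bound for the audio prompt

def trim_prompt(in_path: str, out_path: str, max_seconds: int = MAX_PROMPT_SECONDS) -> None:
    """Keep only the first `max_seconds` of the prompt audio."""
    audio, sample_rate = sf.read(in_path)        # audio shape: (frames,) or (frames, channels)
    max_frames = int(sample_rate * max_seconds)
    sf.write(out_path, audio[:max_frames], sample_rate)

# Example: shorten a long reference clip before passing it as the prompt
trim_prompt("reference_180s.wav", "reference_10s.wav")
```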
For the second question, the model currently operates in a zero-shot manner. Saving clones would require further development to store the extracted global tokens, semantic tokens, and prompt text.
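As a rough illustration of what that caching could look like, here is a hedged sketch. The calls `extract_voice_features` and `generate_speech` are hypothetical placeholders for whatever the model's real extraction and synthesis entry points are; only the save/load plumbing is concrete:

```python
import torch

def save_voice(model, prompt_wav: str, prompt_text: str, out_path: str) -> None:
    # One-time extraction of the speaker representation from the audio prompt.
    global_tokens, semantic_tokens = model.extract_voice_features(prompt_wav)  # hypothetical API
    torch.save(
        {
            "global_tokens": global_tokens,
            "semantic_tokens": semantic_tokens,
            "prompt_text": prompt_text,
        },
        out_path,
    )

def speak_with_saved_voice(model, voice_path: str, text: str):
    # Reuse the cached tokens instead of re-processing the audio prompt each time.
    voice = torch.load(voice_path)
    return model.generate_speech(  # hypothetical API
        text=text,
        global_tokens=voice["global_tokens"],
        semantic_tokens=voice["semantic_tokens"],
        prompt_text=voice["prompt_text"],
    )
```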