Just a few questions about voice cloning, audio samples, etc.
Just curious, is there a (maximum) limit to how long the source audio sample can be?
For example, should the source audio sample always be 30 seconds or less, or can you go up to a 180+ second audio sample for the prompt?
Do longer audio samples make a difference in output quality when voice cloning?
Can the cloned voice be saved for future text-to-speech generations, or would the audio sample need to be provided every time a text-to-speech request is made?
Great questions. I'd also like to know.
Great questions!
Regarding the first question, we suggest using audio prompts shorter than 10 seconds, since the majority of the training data is composed of such examples. Additionally, for better generation quality, keep the prompt within the range of one to three sentences.
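If your reference clip is longer than that, a minimal sketch like the one below can trim it before use. This is just a generic soundfile snippet, not part of this project's codebase, and the file names are placeholders:

```python
import soundfile as sf

MAX_PROMPT_SECONDS = 10  # suggested upper bound for the audio prompt

def trim_prompt(in_path: str, out_path: str, max_seconds: int = MAX_PROMPT_SECONDS) -> None:
    """Keep only the first `max_seconds` of the prompt audio."""
    audio, sample_rate = sf.read(in_path)        # audio shape: (frames,) or (frames, channels)
    max_frames = int(sample_rate * max_seconds)
    sf.write(out_path, audio[:max_frames], sample_rate)

# Example: shorten a long reference clip before passing it as the prompt
trim_prompt("reference_180s.wav", "reference_10s.wav")
```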
For the second question, the model currently operates in a zero-shot manner. Saving clones would require further development to store the extracted global tokens, semantic tokens, and prompt text.
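As a rough illustration of what that caching could look like, here is a hedged sketch. The calls `extract_voice_features` and `generate_speech` are hypothetical placeholders for whatever the model's real extraction and synthesis entry points are; only the save/load plumbing is concrete:

```python
import torch

def save_voice(model, prompt_wav: str, prompt_text: str, out_path: str) -> None:
    # One-time extraction of the speaker representation from the audio prompt.
    global_tokens, semantic_tokens = model.extract_voice_features(prompt_wav)  # hypothetical API
    torch.save(
        {
            "global_tokens": global_tokens,
            "semantic_tokens": semantic_tokens,
            "prompt_text": prompt_text,
        },
        out_path,
    )

def speak_with_saved_voice(model, voice_path: str, text: str):
    # Reuse the cached tokens instead of re-processing the audio prompt each time.
    voice = torch.load(voice_path)
    return model.generate_speech(  # hypothetical API
        text=text,
        global_tokens=voice["global_tokens"],
        semantic_tokens=voice["semantic_tokens"],
        prompt_text=voice["prompt_text"],
    )
```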