Reference Audio

#32
by JoshJarabek - opened

Is there any ability to pass it reference audio or build a voice library with it, or is fine-tuning the only option?

I am interested in using reference audio as well. Sometimes it is hard to describe exactly how the voice you want should sound.

It really doesn't work too well with snac_24khz-based models (like maya and orpheus).

Try putting a reference audio file through this and you'll see what I mean: https://huggingface.co/spaces/Gapeleon/snac_test

But if you want to try it:

  1. Get a reference audio file that sounds decent after running through that ^ space
  2. Truncate the audio to just 1 sentence; 3-8 seconds would be best.
  3. Send the reference audio to gemini-3-pro, along with the example prompts from the maya1 demo space, and tell Gemini to produce a description for the attached audio in the same format as the examples.
  4. Use snac_encoder to encode your reference audio
  5. Send maya1 the [description] + [transcript of reference audio] + [new text you want to say]
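
For step 2, here is a rough sketch of trimming a WAV file with Python's standard `wave` module. The function name `trim_wav` and the 8-second default are my own choices for illustration, and it cuts at a hard time limit rather than at a sentence boundary, so you would still want to pick the cut point by ear.

```python
import wave


def trim_wav(src_path, dst_path, max_seconds=8.0):
    """Copy at most max_seconds of audio from src_path to dst_path.

    Returns the duration (in seconds) actually written.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        # Keep whichever is shorter: the whole file or the time limit.
        n_keep = min(src.getnframes(), int(max_seconds * rate))
        frames = src.readframes(n_keep)

    with wave.open(dst_path, "wb") as dst:
        # Same channels / sample width / rate as the source;
        # the frame count in the header is corrected on close.
        dst.setparams(params)
        dst.writeframes(frames)

    return n_keep / rate
```

Something like `trim_wav("reference.wav", "reference_short.wav", max_seconds=6.0)` would then give you a clip to put through the space from step 1 (the file names are placeholders).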

For step 4, make sure you're using the proper format, and make sure the transcript ends with a full-stop.
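
A minimal sketch of putting the text side together, including the full-stop check. `build_text_prompt` is a hypothetical helper, not part of maya1; it assumes the description is already wrapped in whatever format the model expects and that the three parts are simply joined with spaces, which may not match the real prompt template.

```python
def build_text_prompt(description, ref_transcript, new_text):
    """Join description + reference transcript + new text into one string.

    Assumption: parts are space-separated; adjust to the model's real template.
    """
    ref = ref_transcript.strip()
    # The reference transcript should end with a full stop so the model
    # sees a clean sentence boundary before the new text.
    # Treating "!" and "?" as acceptable endings is my own assumption.
    if not ref.endswith((".", "!", "?")):
        ref += "."
    return f"{description.strip()} {ref} {new_text.strip()}"
```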

The idea is that the model will "continue" generating in the same voice as the reference audio. See here for my implementation with Kani-TTS:

https://huggingface.co/spaces/Gapeleon/KaniTTS_Voice_Cloning/blob/main/app.py
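
To make the "continue" idea concrete, here is a toy sketch of the token layout, not the code from that space. It assumes the text and the reference audio have already been turned into token IDs, and that `generate` is some callable (hypothetical here) that returns the prefix followed by newly generated audio tokens. Model-specific special tokens between the text and audio parts are left out.

```python
from typing import Callable, List, Sequence


def clone_by_continuation(
    text_ids: Sequence[int],
    ref_audio_ids: Sequence[int],
    generate: Callable[[List[int]], List[int]],
) -> List[int]:
    """Return only the newly generated audio tokens.

    text_ids:      tokens for description + reference transcript + new text
    ref_audio_ids: codec tokens of the reference audio
    generate:      assumed to return prefix + continuation
    """
    # The reference audio tokens sit where the model's own audio output
    # would start, so it keeps going in the same voice.
    prefix = list(text_ids) + list(ref_audio_ids)
    full = generate(prefix)
    # Drop the prefix so decoding does not replay the reference clip.
    return full[len(prefix):]
```

The returned tokens are what you would hand to the codec decoder to get audio for the new text only.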

But nemo is a much better codec than snac for this.

You'll know what I mean if you try putting your reference audio through any other codec besides snac: https://huggingface.co/collections/Gapeleon/audio-codec-demos
