⚠️ Initial Checkpoint
This is a Piper TTS model fine-tuned from the Kristin medium voice.
This checkpoint comes after just 5 epochs of training on roughly 30% of the total data I curated (a mix of synthetic and natural speech).
Currently, I'm refining the synthetic dataset because I'm not satisfied with its quality; I will resume fine-tuning once that is done.
I'm also running ablations to find the best ratio of synthetic to natural data.
From initial observations, it seems better to use a large majority of one kind (roughly a 90%/10% split).
I'm trying to push the boundaries of the audio a mere 63 MB model can generate.
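For anyone curious what a 90%/10% split looks like in practice, here is a minimal, hypothetical sketch of sampling such a mix from two utterance lists. The helper, file names, and counts are illustrative only; they are not part of this repo or of the Piper training pipeline.

```python
import random


def sample_mix(synthetic, natural, majority_ratio=0.9, total=1000, seed=0):
    """Draw a training subset where `majority_ratio` of items come from the
    synthetic pool and the remainder from the natural pool (toy helper)."""
    rng = random.Random(seed)
    n_syn = int(total * majority_ratio)
    n_nat = total - n_syn
    mix = rng.sample(synthetic, min(n_syn, len(synthetic))) + \
          rng.sample(natural, min(n_nat, len(natural)))
    rng.shuffle(mix)
    return mix


# Hypothetical file lists; real training data would be (audio, transcript) pairs.
synthetic_utts = [f"synthetic_{i:04d}.wav" for i in range(5000)]
natural_utts = [f"natural_{i:04d}.wav" for i in range(600)]

train_set = sample_mix(synthetic_utts, natural_utts, majority_ratio=0.9, total=1000)
print(len(train_set), "utterances,", sum("synthetic" in u for u in train_set), "synthetic")
```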
Inference
```python
import wave

from src.python_run.piper import PiperVoice  # Or import from the installed package if you used pip

model = PiperVoice.load("en_US-ceylia-medium.onnx")
text = "I have a big plan for today. It involves fine-tuning you."

with wave.open("output.wav", "wb") as output_file:
    output_file.setnchannels(1)      # mono
    output_file.setsampwidth(2)      # 16-bit PCM samples
    output_file.setframerate(22050)  # sample rate of the medium voice
    model.synthesize(text=text, wav_file=output_file, sentence_silence=0.25)
```
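If you want to synthesize several prompts, the same loaded voice can be reused. Below is a small sketch built only from the calls shown above; the texts and output file names are placeholders.

```python
import wave

from src.python_run.piper import PiperVoice

model = PiperVoice.load("en_US-ceylia-medium.onnx")

# Placeholder prompts; one output file is written per line.
lines = [
    "First test sentence.",
    "Second test sentence, a little longer this time.",
]

for i, line in enumerate(lines):
    with wave.open(f"output_{i:02d}.wav", "wb") as output_file:
        output_file.setnchannels(1)      # mono, matching the snippet above
        output_file.setsampwidth(2)      # 16-bit PCM samples
        output_file.setframerate(22050)  # sample rate of the medium voice
        model.synthesize(text=line, wav_file=output_file, sentence_silence=0.25)
```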
🙏 Acknowledgements
Bryce Beattie for training the Kristin model.
Reference audio from datasets by @Jinsaryko.