Best resources for fine tuning
Hello - I love your model and was interested in doing some fine-tuning (on a particular voice I like, which I'd like to use for poetry reading).
I found this resource https://github.com/SWivid/F5-TTS/tree/main/src/f5_tts/train and just wanted to check whether this is the best resource out there for this purpose. The reason I ask is that I'm a bit of a beginner in this field, so I didn't want to miss out on any other useful blog posts or similar that might make things easier - otherwise I'll start working through this.
Thanks so much for this - it looks like a great project, and I was very impressed with the results when I tried the one-shot example (reading the first few lines of The Waste Land - quite challenging, I guess).
Hi,
Thanks for your interest in the model! Yes, the official GitHub repo is probably the best place for fine tuning instructions. You can also check out this Discussion:
https://github.com/SWivid/F5-TTS/discussions/57
Feel free to reach out if you have any questions!
Great - thanks for getting back to me, and I'll give it a go. I'm new to this so it may take a while, but I have plenty of time on my hands and I'll let you know if I manage to produce anything worthwhile!
Hi. I'm kind of new in using these tools. I don't know if this is the correct place to talk about this, so I apologize in advance.
When I use the command you have provided under "Usage", it just loads some base settings and generates a sentence (somewhere), so I couldn't achieve anything with that. Instead, I used the "Custom" option in Gradio to point it at your model and vocab, and it kind of works (I think), but the voice quality is significantly worse than the original model's.
Am I doing something wrong, or is it because your model is still unfinished? Or am I supposed to do some kind of fine-tuning to bring it to the level of the other one?
@voidshaper13 yes, finetuning will likely improve the quality of the model on your specific voice. Feel free to check out the F5-TTS repo for finetuning instructions!
Hi @mrfakename ,
First of all, thank you for your hard work and for making this model commercially usable as well!
I have a few questions regarding the model. I’m currently fine-tuning it on another language. It works well, but the results are not quite as good as F5-TTS. I wanted to clarify a few points:
- This is a PyTorch model, but the original F5-TTS is a safetensors model and much smaller (a pruned version). How can I prune it? The original F5-TTS training code expects a pruned model, and if I pass the PyTorch model directly, I get errors. Should I copy over the `ema_model_state_dict` or the `model_state_dict`?
- Which F5-TTS variant is this model based on? There are several (`F5TTS_Base`, `F5TTS_Base_bigvgan`, `F5TTS_v1_Base`). I suspect it's `F5TTS_v1_Base`. Is that correct?
- The vocabulary in your model is much larger than in the original F5-TTS. Is there a reason for this?
- Do you have any tips or best practices for fine-tuning (e.g., learning rate, LR scheduler, optimal batch size, audio preprocessing)?
Additionally, in the Hugging Face Space for OpenF5-TTS (`mrfakename/OpenF5-TTS`), the model ID is different from this one (`mrfakename/openf5-v2`). It seems that repository is private; do you plan to make it public?
Lastly, I noticed the README has some TODO items listed. Is the project still actively maintained, and is there a rough timeline for those planned features?
Thanks again for your time and effort!
He trained it on the English-only v1 base, so it's better for English but worse for Chinese.