Issues with CSM-1B Finetuning and Inference: Fixed 10s Output Duration & Slow Speaking Rate
Hi,
I recently finetuned the csm-1b model using the provided Colab notebook with a custom dataset. My dataset includes the following columns: ['audio', 'text', 'speaker_id'].
Here’s what I’ve done:
- All audio samples are between 1 and 10 seconds, as the preprocess_example() function sets "max_length": 240001 (i.e., roughly 10 seconds at 24kHz), so I filtered my dataset accordingly.
- Finetuning proceeded without any major issues.
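For reference, the duration filter described above can be sketched like this (a minimal sketch, not code from the notebook; the `short_enough` helper and the commented `dataset` variable are hypothetical, but the 240001-sample cap matches the `preprocess_example()` setting mentioned above):

```python
# ~10 s at 24 kHz, matching preprocess_example()'s "max_length": 240001
MAX_SAMPLES = 240_001

def short_enough(example, max_samples=MAX_SAMPLES):
    """True if the clip's raw waveform fits within the max_length cap."""
    return len(example["audio"]["array"]) <= max_samples

# With a Hugging Face `datasets.Dataset` (variable name hypothetical):
# dataset = dataset.filter(short_enough)
```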
However, I'm encountering the following problems during inference:
Issues
- Fixed 10-second output duration:
Regardless of the text input length (short or long), the model always generates audio that is exactly 10 seconds long.
For short text inputs, the output contains the correct audio followed by noise/silence.
For long text inputs, the speech gets truncated, ending abruptly at the 10-second mark.
- Slow speaking rate:
The generated speech has a noticeably slow speaking pace, which isn’t ideal for natural conversations or general usability.
Questions
- Is there a way to generate audio longer than 10 seconds using csm-1b? If so, how should I modify the training or inference setup?
- How can I adjust the speaking rate to make it faster or more natural?
- Is the fixed 10-second limit due to the model architecture, the dataset preprocessing, or inference parameters?
@dhatta as our guide explains, you need to set max_new_tokens to more than 125 if you want output longer than 10 seconds: https://docs.unsloth.ai/basics/text-to-speech-tts-fine-tuning
"If you notice that the output duration reaches a maximum of 10 seconds, increase `max_new_tokens` from its default value of 125. Since 125 tokens corresponds to 10 seconds of audio, you'll need to set a higher value for longer outputs."
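Converting a target duration into a token budget is just arithmetic (a sketch of the rule above; the variable names are made up, and the commented `model.generate` call stands in for whatever inference code your notebook uses):

```python
# 125 tokens ≈ 10 s of audio, per the Unsloth guide, so ~12.5 tokens/second.
seconds_wanted = 30
tokens_per_second = 12.5
max_new_tokens = int(seconds_wanted * tokens_per_second)  # 375 for 30 s

# Then pass it at inference time (call is illustrative, not verbatim):
# audio = model.generate(**inputs, max_new_tokens=max_new_tokens)
```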
- Making it more natural will depend on your actual dataset, unfortunately :(
@shimmyshimmer
Thank you for your response. I updated unsloth_zoo and ran a sample training; it now correctly generates audio shorter than 10 seconds, without any noise.
Is there a way to manage the speaking rate during inference?
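In the meantime, one post-hoc workaround (a minimal numpy-only sketch; `speed_up` is a hypothetical helper, and note that naive resampling also raises the pitch, unlike a pitch-preserving tool such as librosa.effects.time_stretch):

```python
import numpy as np

def speed_up(wav: np.ndarray, rate: float = 1.15) -> np.ndarray:
    """Speed up a waveform by linear-interpolation resampling.

    rate > 1.0 shortens the audio (faster speech). This naive approach
    also shifts pitch upward; for pitch-preserving stretching, use a
    phase-vocoder method like librosa.effects.time_stretch instead.
    """
    idx = np.arange(0, len(wav), rate)
    return np.interp(idx, np.arange(len(wav)), wav).astype(wav.dtype)
```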