Minimal script to finetune the model on given dataset

#1
by jhauret - opened

Hi there,

I'm trying to fine-tune your model for a speech enhancement task with a strong bandwidth extension component (on the Vibravox dataset).

So far, I've incorporated the FlowMatchingAudioToAudioModel into my repo by simply calling the following methods from my project's LightningModule, using NeMo only to load the model (I know a nested LightningModule is not best practice...). I tried to stick as closely as possible to your configuration with my own data.

from nemo.collections.audio.models import AudioToAudioModel

# in __init__: load the pretrained model and pass its parameters to the optimizer
self.model = AudioToAudioModel.from_pretrained('nvidia/sr_ssl_flowmatching_16k_430m')
optimizer(self.model.parameters())

# compute the loss in training_step
loss = self.model._step(target_signal=target_signal, input_signal=input_signal, input_length=input_length)

# enhance the signal in validation_step
output_signal, _ = self.model.forward(input_signal=input_signal, input_length=input_length)

But I'm getting poor results. Even in a simple overfitting test, the loss does not go straight to zero. Did I miss something in my integration? Or do you think body-conducted speech is too far out-of-distribution for efficient fine-tuning? Otherwise, could you show me a minimal script to fine-tune your model on a Hugging Face dataset?

Best regards,

Had a conversation with Pin-Jui Ku! To sum up:

  • No obvious bugs in my implementation, but pre-training on English-only data might not be suitable for fine-tuning on French.
  • Try running the model without pre-training in my repo.
  • Try replicating the run with the original NeMo training script, building the manifest_filepath as a JSON file with noisy_filepath and clean_filepath keys for each element.
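For reference, here is a minimal sketch of how such a manifest could be built, assuming one JSON object per line as NeMo manifests typically use; the file names and pairing logic are hypothetical placeholders:

```python
import json

def write_manifest(pairs, manifest_path):
    """Write a NeMo-style manifest: one JSON object per line,
    with 'noisy_filepath' and 'clean_filepath' keys per entry."""
    with open(manifest_path, "w") as f:
        for noisy, clean in pairs:
            entry = {"noisy_filepath": str(noisy), "clean_filepath": str(clean)}
            f.write(json.dumps(entry) + "\n")

# hypothetical pairing: each noisy recording matched to its clean reference
pairs = [("noisy/utt_0001.wav", "clean/utt_0001.wav"),
         ("noisy/utt_0002.wav", "clean/utt_0002.wav")]
write_manifest(pairs, "train_manifest.json")
```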

Thanks again πŸ™

EDIT: I've finally achieved very good results with your model! Both starting from scratch and from pretrained. The key was to train longer (the 5k steps described in the fine-tuning stage of the article are just for the warm-up; we should also add max_steps: 50000 to the CosineAnnealing config).
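For anyone hitting the same issue, the scheduler section of the config would look roughly like this; this is a sketch following NeMo's usual optim/sched layout, so treat the surrounding keys as assumptions and adapt them to your own config:

```yaml
optim:
  name: adamw
  sched:
    name: CosineAnnealing
    warmup_steps: 5000   # the 5k steps from the article are only the warm-up
    max_steps: 50000     # without this, the cosine schedule decays too early
```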

jhauret changed discussion status to closed
