Automatic Speech Recognition
Transformers
Safetensors
Portuguese
whisper
contrastive-learning
synthetic-data-filtering
Inference Endpoints
yuriyvnv committed on
Commit
fc442c0
·
verified ·
1 Parent(s): 1157e7e

Update README.md

Files changed (1)
  1. README.md +74 -111
README.md CHANGED
@@ -1,170 +1,133 @@
  ---
  library_name: transformers
- tags:
- - asr
- - portuguese
- - whisper
- license: apache-2.0
- datasets:
- - mozilla-foundation/common_voice_16_1
- - facebook/multilingual_librispeech
- language:
- - pt
- metrics:
- - wer
- - cer
  ---

- # Model Card for Model ID

- Finetuned version of Whisper-Medium. The training data is composed of CV 16.1, MLS (uppercased, with punctuation), Bracarense (Portuguese data from Braga),
- and 150 hours of synthetic data generated by SeamlessM4T Large V2 on the CAPES dataset.

  ## Model Details
- Required memory: ~3.5 GB

- ## Model Evaluation on the test sets
- #### MLS
- Word Error Rate (WER): 0.0795
- Character Error Rate (CER): 0.0258
-
- #### FLEURS
- Word Error Rate (WER): 0.0666
- Character Error Rate (CER): 0.0345
-
- #### CV 16.1
- Word Error Rate (WER): 0.0768
- Character Error Rate (CER): 0.0290
-
- #### BRACARENSE
- Word Error Rate (WER): 0.2261
- Character Error Rate (CER): 0.1324
-
- #### Model Evaluation on the validation set
- Word Error Rate (WER): 0.1385
- Character Error Rate (CER): 0.0674

- ### Model Description
- Finetuned model using a new methodology for generating synthetic data from an external large audio model. Fine-tuning used 4 A10G GPUs for roughly 10 hours.
-
- - **Developed by:** Yuriy Perezhohin & Tiago Moco Santos
- - **Funded by:** MyNorth AI
- - **Model type:** ASR
- - **Language(s) (NLP):** PT
- - **License:** Apache 2.0
- - **Finetuned from model:** openai/whisper-medium
-
- ## Uses
-
- Intended for use on Portuguese audio; audio longer than 30 seconds is split during generation.

  ## How to Get Started with the Model
-
- Use the code below to get started with the model.
  ```
- from transformers import WhisperProcessor, WhisperForConditionalGeneration
- import librosa
-
- # Load the audio at Whisper's expected 16 kHz sampling rate.
- filename = "path_to_audio_file"
- array_audio, sr = librosa.load(filename, sr=16_000)
-
- model_name = "my-north-ai/whisper-medium-pt"
- processor = WhisperProcessor.from_pretrained(model_name)
- # Optionally force Portuguese decoding:
- # forced_decoder_ids = processor.get_decoder_prompt_ids(language="portuguese", task="transcribe")
- model = WhisperForConditionalGeneration.from_pretrained(model_name)
- model.eval()
-
- input_features = processor(
-     array_audio, sampling_rate=sr, return_tensors="pt"
- ).input_features
-
- predicted_ids = model.generate(input_features)
-
- transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
- transcription[0]
- ```

- ### Training Data
-
- **Common Voice:** mozilla-foundation/common_voice_16_1
- **MLS:** facebook/multilingual_librispeech
- **Bracarense:** https://vlo.clarin.eu/record/https_58__47__47_hdl.handle.net_47_21.11129_47_0000-000D-F928-E;jsessionid=B7C7C6C8A3DCCB278B4B66EF51516056?0
 

- #### Preprocessing [optional]
-
- [More Information Needed]

- #### Training Hyperparameters
- **output_dir**=checkpoint_folder,
- **gradient_accumulation_steps**=4,
- **per_device_train_batch_size**=8,
- **per_device_eval_batch_size**=16,
- **learning_rate**=1e-6,
- **warmup_ratio**=0.05,
- **gradient_checkpointing**=True,
- **fp16**=True,
- **num_train_epochs**=3,
- **evaluation_strategy**="steps",
- **generation_max_length**=448,
- **predict_with_generate**=True,
- **save_steps**=int(len(train_dataset) * 3 / (32 * 4) / 10),
- **eval_steps**=int(len(train_dataset) * 3 / (32 * 4) / 10),
- **logging_steps**=10,
- **report_to**=["mlflow"],
- **load_best_model_at_end**=True,
- **gradient_checkpointing_kwargs**={"use_reentrant": False},
- **metric_for_best_model**="wer",
- **greater_is_better**=False,
- **push_to_hub**=False
  ## Environmental Impact

- - **Hardware Type:** 4x A10G
- - **Hours used:** 10
- - **Cloud Provider:** AWS
- - **Compute Region:** eu-central-1
-
- ## Citation [optional]
-
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-
- **BibTeX:**
-
- [More Information Needed]
-
- **APA:**
-
- [More Information Needed]
-
- ## Model Card Authors
-
- @yuriyvnv
- @tiagomosantos
  ---
  library_name: transformers
+ tags: [automatic-speech-recognition, contrastive-learning, synthetic-data-filtering]
  ---

+ # Model Card for a Finetuned Version of Whisper-Small
+
+ This model was trained on synthetically generated data that was subsequently filtered to improve the performance of the Whisper model.
+ The approach aligns representations of synthetic audio with their corresponding text transcripts to identify and remove low-quality samples, improving the overall quality of the training data.
+ For this specific model, 96.08% of the synthetic data generated by SeamlessM4T Large V2 was kept; the rest was removed by the filtering model.
+ The training set also contained the Common Voice dataset, Multilingual LibriSpeech, and Bracarense (a fully Portuguese dialect corpus).

  ## Model Details
 

+ - **Developed by:** Yuriy Perezhohin, Tiago Santos, Victor Costa, Fernando Peres, and Mauro Castelli
+ - **Funded by:** MyNorth AI Research
+ - **Shared by:** MyNorth AI Research
+ - **Model type:** ASR with contrastive-learning-based synthetic data filtering
+ - **Language:** Portuguese
+ - **License:** Apache 2.0
+ - **Finetuned from model:** Whisper Small

+ ### Model Sources
+
+ - **Repository:** https://github.com/my-north-ai/semantic_audio_filtering
+ - **Paper:** Coming soon
 

+ ## Uses
+
+ This model can be used directly to improve ASR systems in Portuguese, particularly in scenarios with limited real-world data or unique linguistic characteristics.
+
+ ### Out-of-Scope Use
+
+ The model is not suitable for languages other than Portuguese without additional fine-tuning and data adjustments.
+
+ ## Bias, Risks, and Limitations
+
+ Users should be aware of potential biases introduced by synthetic data and ensure that data quality aligns with the target application's requirements.
+ It is recommended to evaluate the model's performance on diverse datasets to identify and mitigate biases.

  ## How to Get Started with the Model
 
 
  ```
+ from transformers import pipeline
+
+ model = pipeline("automatic-speech-recognition", model="my-north-ai/semantic_audio_filtering")
+ result = model("path_to_audio_file.wav")
+ print(result)
+ ```
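The pipeline call above handles a single short file. Whisper operates on 30-second windows, so longer recordings must be chunked; the `transformers` ASR pipeline can do this itself via its `chunk_length_s`/`stride_length_s` arguments. As a sketch of the windowing involved (illustrative, not the pipeline's exact internals):

```python
def chunk_bounds(n_samples: int, sr: int = 16_000,
                 chunk_s: int = 30, stride_s: int = 5):
    """Split an audio length into overlapping 30 s windows, similar in
    spirit to chunk_length_s/stride_length_s in the ASR pipeline."""
    size = chunk_s * sr                # samples per window
    step = (chunk_s - stride_s) * sr   # hop between window starts
    return [(start, min(start + size, n_samples))
            for start in range(0, n_samples, step)]

# A 90 s clip at 16 kHz yields four overlapping windows.
print(chunk_bounds(90 * 16_000))
```

With the pipeline itself, the equivalent is simply passing `chunk_length_s=30` to `pipeline("automatic-speech-recognition", ...)`.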

+ ## Training Details
+
+ ### Training Data
+
+ The training data includes 140 hours of synthetically generated Portuguese speech and transcripts, along with real data from the Multilingual LibriSpeech Corpus (MLS), Common Voice (CV) 16.1, and the Perfil Sociolinguístico da Fala Bracarense (PSFB) dataset.
+
+ ### Training Procedure
+
+ The model was fine-tuned using the DDP (Distributed Data Parallel) methodology across 4 NVIDIA A10G GPUs.
+
+ #### Preprocessing
+
+ The preprocessing steps include text normalization, removal of special characters, and ensuring consistent formatting for TTS generation.
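The exact normalization pipeline is not published in this card; a minimal sketch of the steps described above (lowercasing, special-character removal, whitespace cleanup), with an illustrative character set chosen for Portuguese, might look like:

```python
import re

def normalize_pt(text: str) -> str:
    """Illustrative text normalization: lowercase, drop special
    characters, collapse whitespace (not the authors' exact pipeline)."""
    text = text.lower()
    # Keep digits, basic Latin letters, accented Portuguese letters, spaces.
    text = re.sub(r"[^0-9a-zàáâãçéêíóôõú\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(normalize_pt("Olá, Mundo!!"))  # olá mundo
```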

+ #### Training Hyperparameters
+
+ - **Training regime:** fp16 mixed precision
+ - **Learning rate:** 1e-5
+ - **Batch size:** 32
+ - **Epochs:** 3
 

+ ## Evaluation
+
+ ### Testing Data, Factors & Metrics
+
+ #### Testing Data
+
+ The testing data includes subsets from the FLEURS dataset and PSFB, chosen for their linguistic diversity and unique speech patterns.
+
+ #### Metrics
+
+ Models are evaluated using Word Error Rate (WER), reported in both normalized and non-normalized form.


+ ## Evaluation Results
+
+ ### Word Error Rate (WER) Comparison
+
+ | Model Size | Model Type     | WER (Normalized) | WER (Non-Normalized) |
+ |------------|----------------|------------------|----------------------|
+ | Small      | Pretrained     | 10.87            | 15.43                |
+ | Small      | FS-17.68%      | 10.45            | 18.57                |
+ | Small      | FS-3.92%       | 10.34            | 18.53                |
+ | Small      | FS-0.24%       | 10.58            | 18.90                |
+ | Small      | Zero Synthetic | 10.90            | 19.32                |
+ | Medium     | Pretrained     | 8.62             | 12.65                |
+ | Medium     | FS-17.68%      | 6.58             | 14.46                |
+ | Medium     | FS-3.92%       | 6.57             | 14.44                |
+ | Medium     | FS-0.24%       | 6.58             | 14.54                |
+ | Medium     | Zero Synthetic | 6.97             | 14.74                |
+ | Large V3   | Pretrained     | 7.70             | 11.78                |
+ | Large V3   | FS-17.68%      | 4.73             | 10.83                |
+ | Large V3   | FS-3.92%       | 4.65             | 11.09                |
+ | Large V3   | FS-0.24%       | 4.80             | 11.28                |
+ | Large V3   | Zero Synthetic | 4.86             | 10.92                |
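As a quick sanity check on the table, the best Large V3 filtered run (FS-3.92%) cuts normalized WER by roughly 40% relative to the pretrained model:

```python
# Normalized WER values taken from the Large V3 rows above.
pretrained, fs_392 = 7.70, 4.65
reduction = (pretrained - fs_392) / pretrained
print(f"relative WER reduction: {reduction:.1%}")  # 39.6%
```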
  ## Environmental Impact

+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+
+ - **Hardware Type:** NVIDIA A10G
+ - **Hours used:** 15
+ - **Cloud Provider:** AWS
+ - **Compute Region:** US EAST
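Plugging the numbers above into a back-of-the-envelope estimate (the A10G's ~150 W board power is public; the grid carbon intensity below is an assumed placeholder, the 15 h figure is taken as wall-clock time on all four GPUs used for training):

```python
# Rough CO2eq estimate in the spirit of the ML Impact calculator.
gpus, hours, watts = 4, 15, 150       # 4x A10G for 15 h, ~150 W each
kwh = gpus * hours * watts / 1000     # total GPU energy in kWh
kg_per_kwh = 0.4                      # assumed grid carbon intensity
print(f"~{kwh * kg_per_kwh:.1f} kg CO2eq")  # ~3.6 kg CO2eq
```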
 
 
 