Automatic Speech Recognition
Transformers
Safetensors
Portuguese
whisper
contrastive-learning
synthetic-data-filtering
Inference Endpoints
yuriyvnv committed on
Commit
fc442c0
·
verified ·
1 Parent(s): 1157e7e

Update README.md

Files changed (1)
  1. README.md +74 -111
README.md CHANGED
@@ -1,170 +1,133 @@
  ---
  library_name: transformers
- tags:
- - asr
- - portuguese
- - whisper
- license: apache-2.0
- datasets:
- - mozilla-foundation/common_voice_16_1
- - facebook/multilingual_librispeech
- language:
- - pt
- metrics:
- - wer
- - cer
  ---

- # Model Card for Model ID

- Finetuned version of Whisper-Medium. The training data is composed of CV 16.1, MLS (uppercased, with punctuation), Bracarense (Portuguese data from Braga),
- and 150 hours of synthetic data generated by SeamlessM4T Large V2 on the CAPES dataset.

  ## Model Details
- Required memory: ~3.5 GB

- ## Model Evaluation on the test sets
- #### MLS
- Word Error Rate (WER): 0.0795
- Character Error Rate (CER): 0.0258
-
- #### FLEURS
- Word Error Rate (WER): 0.0666
- Character Error Rate (CER): 0.0345
-
- #### CV 16.1
- Word Error Rate (WER): 0.0768
- Character Error Rate (CER): 0.0290
-
- #### BRACARENSE
- Word Error Rate (WER): 0.2261
- Character Error Rate (CER): 0.1324
-
- #### Model Evaluation on the validation set
- Word Error Rate (WER): 0.1385
- Character Error Rate (CER): 0.0674

- ### Model Description
- Finetuned model using a new methodology for generating synthetic data from an external large audio model. Fine-tuning used 4 A10G GPUs for roughly 10 hours.
-
- - **Developed by:** Yuriy Perezhohin & Tiago Moco Santos
- - **Funded by:** MyNorth AI
- - **Model type:** ASR
- - **Language(s) (NLP):** PT
- - **License:** Apache 2.0
- - **Finetuned from model:** openai/whisper-medium
-
- ## Uses
-
- Intended for use on Portuguese audio; audio longer than 30 seconds is split during generation.

  ## How to Get Started with the Model
-
- Use the code below to get started with the model.
  ```
- from transformers import WhisperProcessor, WhisperForConditionalGeneration
- import librosa
-
- # Load the audio at Whisper's expected 16 kHz sampling rate.
- filename = "path_to_audio_file"
- array_audio, sr = librosa.load(filename, sr=16_000)
-
- model_name = "my-north-ai/whisper-medium-pt"
- processor = WhisperProcessor.from_pretrained(model_name)
- # Optionally force Portuguese decoding:
- # forced_decoder_ids = processor.get_decoder_prompt_ids(language="portuguese", task="transcribe")
- model = WhisperForConditionalGeneration.from_pretrained(model_name)
- model.eval()
-
- input_features = processor(
-     array_audio, sampling_rate=sr, return_tensors="pt"
- ).input_features
-
- predicted_ids = model.generate(input_features)
-
- transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
- transcription[0]
- ```

- ### Training Data
-
- **Common Voice:** mozilla-foundation/common_voice_16_1
- **MLS:** facebook/multilingual_librispeech
- **Bracarense:** https://vlo.clarin.eu/record/https_58__47__47_hdl.handle.net_47_21.11129_47_0000-000D-F928-E;jsessionid=B7C7C6C8A3DCCB278B4B66EF51516056?0
 

- #### Preprocessing [optional]
-
- [More Information Needed]

- #### Training Hyperparameters
- **output_dir**=checkpoint_folder,
- **gradient_accumulation_steps**=4,
- **per_device_train_batch_size**=8,
- **per_device_eval_batch_size**=16,
- **learning_rate**=1e-6,
- **warmup_ratio**=0.05,
- **gradient_checkpointing**=True,
- **fp16**=True,
- **num_train_epochs**=3,
- **evaluation_strategy**="steps",
- **generation_max_length**=448,
- **predict_with_generate**=True,
- **save_steps**=int(len(train_dataset) * 3 / (32 * 4) / 10),
- **eval_steps**=int(len(train_dataset) * 3 / (32 * 4) / 10),
- **logging_steps**=10,
- **report_to**=["mlflow"],
- **load_best_model_at_end**=True,
- **gradient_checkpointing_kwargs**={"use_reentrant": False},
- **metric_for_best_model**="wer",
- **greater_is_better**=False,
- **push_to_hub**=False
  ## Environmental Impact

- - **Hardware Type:** 4x A10G
- - **Hours used:** 10
- - **Cloud Provider:** AWS
- - **Compute Region:** eu-central-1
-
- ## Citation [optional]
-
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-
- **BibTeX:**
-
- [More Information Needed]
-
- **APA:**
-
- [More Information Needed]
-
- ## Model Card Authors
-
- @yuriyvnv
- @tiagomosantos
  ---
  library_name: transformers
+ tags: [automatic-speech-recognition, contrastive-learning, synthetic-data-filtering]
  ---

+ # Model Card for a Finetuned Version of Whisper-Small
+
+ This model was trained on synthetically generated data that was subsequently filtered to improve the performance of the Whisper model.
+ The approach aligns representations of synthetic audio with their corresponding text transcripts to identify and remove low-quality samples, improving the overall quality of the training data.
+ For this specific model, 96.08% of the synthetic data generated by SeamlessM4T Large V2 was kept; the rest was removed by the filtering model.
+ The training set also contained the Common Voice dataset, Multilingual LibriSpeech, and Bracarense (a fully Portuguese dialect corpus).

  ## Model Details
 

+ - **Developed by:** Yuriy Perezhohin, Tiago Santos, Victor Costa, Fernando Peres, and Mauro Castelli
+ - **Funded by:** MyNorth AI Research
+ - **Shared by:** MyNorth AI Research
+ - **Model type:** ASR with contrastive-learning-based synthetic data filtering
+ - **Language:** Portuguese
+ - **License:** Apache 2.0
+ - **Finetuned from model:** Whisper Small

+ ### Model Sources
+
+ - **Repository:** https://github.com/my-north-ai/semantic_audio_filtering
+ - **Paper:** Coming soon
 

+ ## Uses
+
+ This model can be used directly to improve ASR systems in Portuguese, particularly in scenarios with limited real-world data or unique linguistic characteristics.
+
+ ### Out-of-Scope Use
+
+ The model is not suitable for languages other than Portuguese without additional fine-tuning and data adjustments.
+
+ ## Bias, Risks, and Limitations
+
+ Users should be aware of potential biases introduced by synthetic data and ensure that data quality aligns with the target application's requirements.
+ It is recommended to evaluate the model's performance on diverse datasets to identify and mitigate biases.

  ## How to Get Started with the Model
 
 
  ```
+ from transformers import pipeline
+
+ model = pipeline("automatic-speech-recognition", model="my-north-ai/semantic_audio_filtering")
+ result = model("path_to_audio_file.wav")
+ print(result)
+ ```
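The pipeline call above handles a single short file. Whisper operates on 30-second windows, so longer recordings must be chunked; the `transformers` ASR pipeline can do this itself via its `chunk_length_s`/`stride_length_s` arguments. As a sketch of the windowing involved (illustrative, not the pipeline's exact internals):

```python
def chunk_bounds(n_samples: int, sr: int = 16_000,
                 chunk_s: int = 30, stride_s: int = 5):
    """Split an audio length into overlapping 30 s windows, similar in
    spirit to chunk_length_s/stride_length_s in the ASR pipeline."""
    size = chunk_s * sr                # samples per window
    step = (chunk_s - stride_s) * sr   # hop between window starts
    return [(start, min(start + size, n_samples))
            for start in range(0, n_samples, step)]

# A 90 s clip at 16 kHz yields four overlapping windows.
print(chunk_bounds(90 * 16_000))
```

With the pipeline itself, the equivalent is simply passing `chunk_length_s=30` to `pipeline("automatic-speech-recognition", ...)`.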

+ ## Training Details
+
+ ### Training Data
+
+ The training data includes 140 hours of synthetically generated Portuguese speech and transcripts, along with real data from the Multilingual LibriSpeech Corpus (MLS), Common Voice (CV) 16.1, and the Perfil Sociolinguístico da Fala Bracarense (PSFB) dataset.
+
+ ### Training Procedure
+
+ The model was fine-tuned using the DDP (Distributed Data Parallel) methodology across 4 NVIDIA A10G GPUs.
+
+ #### Preprocessing
+
+ The preprocessing steps include text normalization, removal of special characters, and ensuring consistent formatting for TTS generation.
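The exact normalization pipeline is not published in this card; a minimal sketch of the steps described above (lowercasing, special-character removal, whitespace cleanup), with an illustrative character set chosen for Portuguese, might look like:

```python
import re

def normalize_pt(text: str) -> str:
    """Illustrative text normalization: lowercase, drop special
    characters, collapse whitespace (not the authors' exact pipeline)."""
    text = text.lower()
    # Keep digits, basic Latin letters, accented Portuguese letters, spaces.
    text = re.sub(r"[^0-9a-zàáâãçéêíóôõú\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(normalize_pt("Olá, Mundo!!"))  # olá mundo
```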

+ #### Training Hyperparameters
+
+ - **Training regime:** fp16 mixed precision
+ - **Learning rate:** 1e-5
+ - **Batch size:** 32
+ - **Epochs:** 3
 

+ ## Evaluation
+
+ ### Testing Data, Factors & Metrics
+
+ #### Testing Data
+
+ The testing data includes subsets from the FLEURS dataset and PSFB, chosen for their linguistic diversity and unique speech patterns.
+
+ #### Metrics
+
+ Models are evaluated using Word Error Rate (WER), reported in both normalized and non-normalized form.


+ ## Evaluation Results
+
+ ### Word Error Rate (WER) Comparison
+
+ | Model Size | Model Type     | WER (Normalized) | WER (Non-Normalized) |
+ |------------|----------------|------------------|----------------------|
+ | Small      | Pretrained     | 10.87            | 15.43                |
+ | Small      | FS-17.68%      | 10.45            | 18.57                |
+ | Small      | FS-3.92%       | 10.34            | 18.53                |
+ | Small      | FS-0.24%       | 10.58            | 18.90                |
+ | Small      | Zero Synthetic | 10.90            | 19.32                |
+ | Medium     | Pretrained     | 8.62             | 12.65                |
+ | Medium     | FS-17.68%      | 6.58             | 14.46                |
+ | Medium     | FS-3.92%       | 6.57             | 14.44                |
+ | Medium     | FS-0.24%       | 6.58             | 14.54                |
+ | Medium     | Zero Synthetic | 6.97             | 14.74                |
+ | Large V3   | Pretrained     | 7.70             | 11.78                |
+ | Large V3   | FS-17.68%      | 4.73             | 10.83                |
+ | Large V3   | FS-3.92%       | 4.65             | 11.09                |
+ | Large V3   | FS-0.24%       | 4.80             | 11.28                |
+ | Large V3   | Zero Synthetic | 4.86             | 10.92                |
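As a quick sanity check on the table, the best Large V3 filtered run (FS-3.92%) cuts normalized WER by roughly 40% relative to the pretrained model:

```python
# Normalized WER values taken from the Large V3 rows above.
pretrained, fs_392 = 7.70, 4.65
reduction = (pretrained - fs_392) / pretrained
print(f"relative WER reduction: {reduction:.1%}")  # 39.6%
```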
  ## Environmental Impact

+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+
+ - **Hardware Type:** NVIDIA A10G
+ - **Hours used:** 15
+ - **Cloud Provider:** AWS
+ - **Compute Region:** US EAST
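Plugging the numbers above into a back-of-the-envelope estimate (the A10G's ~150 W board power is public; the grid carbon intensity below is an assumed placeholder, the 15 h figure is taken as wall-clock time on all four GPUs used for training):

```python
# Rough CO2eq estimate in the spirit of the ML Impact calculator.
gpus, hours, watts = 4, 15, 150       # 4x A10G for 15 h, ~150 W each
kwh = gpus * hours * watts / 1000     # total GPU energy in kWh
kg_per_kwh = 0.4                      # assumed grid carbon intensity
print(f"~{kwh * kg_per_kwh:.1f} kg CO2eq")  # ~3.6 kg CO2eq
```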
 
 
 