Commit 6c6898b (verified) by BeebekBhz · 1 parent: 30a514c

Update README.md

Files changed (1)
  1. README.md +2 -97
README.md CHANGED
@@ -17,7 +17,7 @@ tags:
  - xlsr-fine-tuning-week
  license: apache-2.0
  model-index:
- - name: XLSR Wav2Vec2 English by Jonatas Grosman
+ - name: XLSR Wav2Vec2 English
    results:
    - task:
        name: Automatic Speech Recognition
@@ -59,6 +59,7 @@ model-index:
      - name: Dev CER (+LM)
        type: cer
        value: 11.01
+ library_name: transformers
  ---
 
  # Fine-tuned XLSR-53 large model for speech recognition in English
@@ -66,100 +67,4 @@ model-index:
  Fine-tuned [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on English using the train and validation splits of [Common Voice 6.1](https://huggingface.co/datasets/common_voice).
  When using this model, make sure that your speech input is sampled at 16kHz.
 
- This model has been fine-tuned thanks to the GPU credits generously given by the [OVHcloud](https://www.ovhcloud.com/en/public-cloud/ai-training/) :)
 
- The script used for training can be found here: https://github.com/jonatasgrosman/wav2vec2-sprint
-
- ## Usage
-
- The model can be used directly (without a language model) as follows...
-
- Using the [HuggingSound](https://github.com/jonatasgrosman/huggingsound) library:
-
- ```python
- from huggingsound import SpeechRecognitionModel
-
- model = SpeechRecognitionModel("jonatasgrosman/wav2vec2-large-xlsr-53-english")
- audio_paths = ["/path/to/file.mp3", "/path/to/another_file.wav"]
-
- transcriptions = model.transcribe(audio_paths)
- ```
-
- Writing your own inference script:
-
- ```python
- import torch
- import librosa
- from datasets import load_dataset
- from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
-
- LANG_ID = "en"
- MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-english"
- SAMPLES = 10
-
- test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")
-
- processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
- model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
-
- # Preprocessing the datasets.
- # We need to read the audio files as arrays
- def speech_file_to_array_fn(batch):
-     speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
-     batch["speech"] = speech_array
-     batch["sentence"] = batch["sentence"].upper()
-     return batch
-
- test_dataset = test_dataset.map(speech_file_to_array_fn)
- inputs = processor(test_dataset["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
-
- with torch.no_grad():
-     logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
-
- predicted_ids = torch.argmax(logits, dim=-1)
- predicted_sentences = processor.batch_decode(predicted_ids)
-
- for i, predicted_sentence in enumerate(predicted_sentences):
-     print("-" * 100)
-     print("Reference:", test_dataset[i]["sentence"])
-     print("Prediction:", predicted_sentence)
- ```
-
- | Reference | Prediction |
- | ------------- | ------------- |
- | "SHE'LL BE ALL RIGHT." | SHE'LL BE ALL RIGHT |
- | SIX | SIX |
- | "ALL'S WELL THAT ENDS WELL." | ALL AS WELL THAT ENDS WELL |
- | DO YOU MEAN IT? | DO YOU MEAN IT |
- | THE NEW PATCH IS LESS INVASIVE THAN THE OLD ONE, BUT STILL CAUSES REGRESSIONS. | THE NEW PATCH IS LESS INVASIVE THAN THE OLD ONE BUT STILL CAUSES REGRESSION |
- | HOW IS MOZILLA GOING TO HANDLE AMBIGUITIES LIKE QUEUE AND CUE? | HOW IS MOSLILLAR GOING TO HANDLE ANDBEWOOTH HIS LIKE Q AND Q |
- | "I GUESS YOU MUST THINK I'M KINDA BATTY." | RUSTIAN WASTIN PAN ONTE BATTLY |
- | NO ONE NEAR THE REMOTE MACHINE YOU COULD RING? | NO ONE NEAR THE REMOTE MACHINE YOU COULD RING |
- | SAUCE FOR THE GOOSE IS SAUCE FOR THE GANDER. | SAUCE FOR THE GUICE IS SAUCE FOR THE GONDER |
- | GROVES STARTED WRITING SONGS WHEN SHE WAS FOUR YEARS OLD. | GRAFS STARTED WRITING SONGS WHEN SHE WAS FOUR YEARS OLD |
-
- ## Evaluation
-
- 1. To evaluate on `mozilla-foundation/common_voice_6_0` with split `test`
-
- ```bash
- python eval.py --model_id jonatasgrosman/wav2vec2-large-xlsr-53-english --dataset mozilla-foundation/common_voice_6_0 --config en --split test
- ```
-
- 2. To evaluate on `speech-recognition-community-v2/dev_data`
-
- ```bash
- python eval.py --model_id jonatasgrosman/wav2vec2-large-xlsr-53-english --dataset speech-recognition-community-v2/dev_data --config en --split validation --chunk_length_s 5.0 --stride_length_s 1.0
- ```
-
- ## Citation
- If you want to cite this model you can use this:
-
- ```bibtex
- @misc{grosman2021xlsr53-large-english,
-   title={Fine-tuned {XLSR}-53 large model for speech recognition in {E}nglish},
-   author={Grosman, Jonatas},
-   howpublished={\url{https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-english}},
-   year={2021}
- }
- ```
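
After this commit the README keeps only the 16 kHz requirement and no longer shows how to meet it. The block below is a minimal sketch, not part of the commit or the model card, of bringing a mono signal to 16 kHz with plain linear interpolation. The helper `resample_linear` and the constant `TARGET_RATE` are hypothetical names introduced for illustration.

```python
from typing import List, Sequence

TARGET_RATE = 16_000  # sampling rate the README asks for


def resample_linear(samples: Sequence[float], src_rate: int,
                    dst_rate: int = TARGET_RATE) -> List[float]:
    """Resample a mono signal by linear interpolation (illustrative helper)."""
    if src_rate <= 0 or dst_rate <= 0:
        raise ValueError("sampling rates must be positive")
    if src_rate == dst_rate or len(samples) < 2:
        return list(samples)

    # Preserve the clip duration: n_out / dst_rate == n_in / src_rate.
    n_out = max(1, round(len(samples) * dst_rate / src_rate))
    step = src_rate / dst_rate  # source samples advanced per output sample
    last = len(samples) - 1

    out = []
    for i in range(n_out):
        pos = min(i * step, last)  # fractional index into the source signal
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        # Blend the two neighbouring source samples.
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


if __name__ == "__main__":
    ramp_48k = [float(i) for i in range(48)]  # 1 ms of a ramp at 48 kHz
    print(len(resample_linear(ramp_48k, 48_000)))  # number of 16 kHz samples
```

Linear interpolation applies no anti-aliasing filter, so it is only a rough stand-in when downsampling real speech. The removed inference script relied on `librosa.load(path, sr=16_000)`, which resamples while loading and is likely the better choice in practice.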