---
license: apache-2.0
datasets:
- mozilla-foundation/common_voice_15_0
language:
- fr
metrics:
- wer
base_model:
- LeBenchmark/wav2vec2-FR-7K-large
pipeline_tag: automatic-speech-recognition
library_name: speechbrain
tags:
- Transformer
- wav2vec2
- CTC
- inference
---

# asr-wav2vec2-commonvoice-15-fr : LeBenchmark/wav2vec2-FR-7K-large fine-tuned on CommonVoice 15.0 French

*asr-wav2vec2-commonvoice-15-fr* is an Automatic Speech Recognition model fine-tuned on the French set of CommonVoice 15.0, with *LeBenchmark/wav2vec2-FR-7K-large* as the pretrained wav2vec 2.0 model.

The fine-tuned model achieves the following performance:

| Release | Valid WER (%) | Test WER (%) | GPUs | Epochs |
|:-------------:|:--------------:|:--------------:|:--------:|:--------:|
| 2023-09-08 | 9.14 | 11.21 | 4xV100 32GB | 30 |
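
For readers unfamiliar with the metric, here is a minimal sketch of the standard word error rate definition: the word-level edit distance between reference and hypothesis, divided by the number of reference words. The function name `word_error_rate` is illustrative, and this is not necessarily the exact scoring code used to produce the numbers above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the WER in percent for one reference/hypothesis pair."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return 100.0 * prev[-1] / len(ref)
```

With this definition, one wrong word in a three-word reference gives a WER of about 33.3.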

## 📝 Model Details

The ASR system is composed of:
- the **Tokenizer** (char), which transforms the input text into a sequence of characters ("cat" into ["c", "a", "t"]) and is trained on the train transcriptions (train.tsv).
- the **Acoustic model** (wav2vec 2.0 + DNN + CTC greedy decoding). The pretrained wav2vec 2.0 model [LeBenchmark/wav2vec2-FR-7K-large](https://huggingface.co/LeBenchmark/wav2vec2-FR-7K-large) is combined with two DNN layers and fine-tuned on CommonVoice FR. The final acoustic representation is passed to the CTC greedy decoder.
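
The two components can be illustrated with a small, library-free sketch. It assumes a toy vocabulary and a hypothetical `<blank>` symbol; the real model uses SpeechBrain's own tokenizer and decoder, so the names below are illustrative only. Greedy CTC decoding picks the best-scoring symbol for each frame, merges consecutive repeats, then drops blanks.

```python
BLANK = "<blank>"  # hypothetical name for the CTC blank symbol


def char_tokenize(text):
    """Character tokenization: "cat" -> ["c", "a", "t"]."""
    return list(text)


def ctc_greedy_decode(frame_scores, vocab, blank=BLANK):
    """Decode per-frame scores (one score per vocab entry) into a string."""
    # 1. Pick the index of the highest score in every frame.
    best = [max(range(len(vocab)), key=frame.__getitem__) for frame in frame_scores]
    # 2. Merge consecutive repeats, then 3. drop blank symbols.
    output, previous = [], None
    for index in best:
        if index != previous and vocab[index] != blank:
            output.append(vocab[index])
        previous = index
    return "".join(output)


vocab = [BLANK] + char_tokenize("cat")
frames = [
    [0.1, 0.7, 0.1, 0.1],    # c
    [0.1, 0.6, 0.2, 0.1],    # c (repeat, merged)
    [0.8, 0.1, 0.05, 0.05],  # blank (dropped)
    [0.1, 0.1, 0.7, 0.1],    # a
    [0.1, 0.1, 0.1, 0.7],    # t
    [0.2, 0.1, 0.1, 0.6],    # t (repeat, merged)
]
decoded = ctc_greedy_decode(frames, vocab)  # "cat"
```

A blank between two identical characters keeps them separate, which is how doubled letters survive the merge step.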

We used recordings sampled at 16 kHz (single channel).
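
As a precaution, one can check that a WAV file matches this format before transcribing it. The sketch below uses only Python's standard `wave` module; the helper name `is_16khz_mono` is illustrative, and whether the inference interface converts other formats on load is not stated here.

```python
import wave


def is_16khz_mono(path):
    """Return True if the WAV file at `path` is 16 kHz, single channel."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate() == 16000 and wav.getnchannels() == 1
```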

## 💻 How to transcribe a file with the model

### Install and import speechbrain

```bash
pip install speechbrain
```

```python
from speechbrain.inference.ASR import EncoderASR
```

### Pipeline

```python
def transcribe(audio, model):
    return model.transcribe_file(audio).lower()


def save_transcript(transcript, audio, output_file):
    with open(output_file, 'w', encoding='utf-8') as file:
        file.write(f"{audio}\t{transcript}\n")


def main(audio):
    # `audio` is the path to the audio file to transcribe.
    model = EncoderASR.from_hparams("Propicto/asr-wav2vec2-commonvoice-15-fr", savedir="tmp/")
    transcript = transcribe(audio, model)
    save_transcript(transcript, audio, "out.txt")
```
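
To transcribe several files at once, the same idea can be extended to write one tab-separated line per file. This is a sketch, not part of the original pipeline: `transcribe_to_tsv` is a hypothetical helper, and `transcribe_fn` stands for any callable that maps an audio path to text, such as the `transcribe_file` method of the loaded model.

```python
import csv


def transcribe_to_tsv(audio_paths, transcribe_fn, output_file):
    """Write "<audio path>\t<lowercased transcript>" for every input file."""
    with open(output_file, "w", encoding="utf-8", newline="") as file:
        writer = csv.writer(file, delimiter="\t", lineterminator="\n")
        for path in audio_paths:
            writer.writerow([path, transcribe_fn(path).lower()])
```

For example, `transcribe_to_tsv(paths, model.transcribe_file, "out.tsv")`, where `paths` is a list of audio files and `model` is loaded as shown above.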

## ⚙️ Training Details

### Training Data

We use the train / valid / test splits provided by CommonVoice, which correspond to:

| | Train | Valid | Test |
|:-------------:|:-------------:|:--------------:|:--------------:|
| # utterances | 527,554 | 16,132 | 16,132 |
| # hours | 756.19 | 25.84 | 26.11 |

### Training Procedure

We follow the training procedure provided in the [ASR-CTC speechbrain recipe](https://github.com/speechbrain/speechbrain/tree/develop/recipes/CommonVoice/ASR/CTC).
The `common_voice_prepare.py` script handles the preprocessing of the dataset.

#### Training Hyperparameters

See the `hyperparams.yaml` file for the full list of hyperparameters.

#### Training time

On 4xV100 32GB GPUs, training took about 81 hours.

#### Libraries

[Speechbrain](https://speechbrain.github.io/):
```bibtex
@misc{SB2021,
    author = {Ravanelli, Mirco and Parcollet, Titouan and Rouhe, Aku and Plantinga, Peter and Rastorgueva, Elena and Lugosch, Loren and Dawalatabad, Nauman and Ju-Chieh, Chou and Heba, Abdel and Grondin, Francois and Aris, William and Liao, Chien-Feng and Cornell, Samuele and Yeh, Sung-Lin and Na, Hwidong and Gao, Yan and Fu, Szu-Wei and Subakan, Cem and De Mori, Renato and Bengio, Yoshua },
    title = {SpeechBrain},
    year = {2021},
    publisher = {GitHub},
    journal = {GitHub repository},
    howpublished = {\url{https://github.com/speechbrain/speechbrain}},
  }
```

## 💡 Information

- **Developed by:** Cécile Macaire
- **Funded by:** GENCI-IDRIS (Grant 2023-AD011013625R1) and PROPICTO (ANR-20-CE93-0005)
- **Language(s) (NLP):** French
- **License:** Apache-2.0
- **Finetuned from model:** LeBenchmark/wav2vec2-FR-7K-large

## 📌 Citation

```bibtex
@inproceedings{macaire24_interspeech,
  title     = {Towards Speech-to-Pictograms Translation},
  author    = {Cécile Macaire and Chloé Dion and Didier Schwab and Benjamin Lecouteux and Emmanuelle Esperança-Rodier},
  year      = {2024},
  booktitle = {Interspeech 2024},
  pages     = {857--861},
  doi       = {10.21437/Interspeech.2024-490},
  issn      = {2958-1796},
}
```