Commit d5ea348 (parent: 4b774d8): usage, performance

README.md
language:
- uk
library_name: nemo
---

## Usage

The model is available for use in the NeMo toolkit [1] and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

To train, fine-tune, or play with the model you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend installing it after you have installed the latest PyTorch version.

```bash
pip install nemo_toolkit['all']
```

### Automatically instantiate the model

```python
from nemo.collections.asr.models import EncDecCTCModelBPE

asr_model = EncDecCTCModelBPE.from_pretrained("taras-sereda/uk-pods-conformer")
```

### Transcribing using Python

First, let's get a sample:

```bash
wget "https://huggingface.co/datasets/taras-sereda/uk-pods/resolve/main/example/e934c3e4-c37b-4607-98a8-22cdff933e4a_0266.wav?download=true" -O e934c3e4-c37b-4607-98a8-22cdff933e4a_0266.wav
```

Then simply do:

```python
asr_model.transcribe(['e934c3e4-c37b-4607-98a8-22cdff933e4a_0266.wav'])
```

### Input

This model accepts 16 kHz mono-channel audio (WAV files) as input.
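A quick way to sanity-check a file against this format before transcription is Python's standard-library `wave` module. This is our own minimal sketch, not part of NeMo; the helper name and the 16-bit PCM assumption are ours:

```python
import wave

def is_nemo_ready(path: str) -> bool:
    """Check that a WAV file is 16 kHz mono (assuming 16-bit PCM samples)."""
    with wave.open(path, "rb") as w:
        return (
            w.getframerate() == 16000  # 16 kHz sample rate
            and w.getnchannels() == 1  # mono
            and w.getsampwidth() == 2  # 16-bit samples
        )
```

Files in other formats can be converted first, e.g. with `ffmpeg -i in.mp3 -ar 16000 -ac 1 out.wav`.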

### Output

This model provides transcribed speech as a string for a given audio sample.

## Model Architecture

Conformer-CTC is a non-autoregressive variant of the Conformer model [2] for automatic speech recognition which uses CTC loss/decoding instead of a Transducer. You can find more details on this model here: [Conformer-CTC Model](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#conformer-ctc).
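To illustrate what greedy CTC decoding means here: the decoder takes the most likely token per frame, collapses consecutive repeats, and drops blanks. A minimal sketch of the standard procedure (the token IDs and blank index are illustrative; NeMo's real decoder operates on the model's BPE vocabulary):

```python
def ctc_greedy_decode(frame_ids, blank_id=0):
    """Collapse repeated per-frame predictions, then remove CTC blanks."""
    out = []
    prev = None
    for t in frame_ids:
        if t != prev and t != blank_id:
            out.append(t)
        prev = t
    return out

# e.g. frame predictions [0, 5, 5, 0, 5, 7, 7] decode to tokens [5, 5, 7]:
# repeats within a run collapse, but tokens separated by a blank are kept.
```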

### Datasets

This model has been trained on a combination of two datasets:

- UK-PODS [3] train set: 46 hours of conversational speech collected from Ukrainian podcasts.
- Validated Mozilla Common Voice Corpus 10.0 (excluding dev and test data): 50.1 hours of Ukrainian speech.

## Performance

Performance of the ASR model is reported in terms of Word Error Rate (WER) with greedy decoding.

| Tokenizer     | Vocabulary Size | UK-PODS test | MCV-10 test |
|:-------------:|:---------------:|:------------:|:-----------:|
| SentencePiece | 1024            | 0.093        | 0.116       |
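For reference, WER is the word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words. A minimal sketch of the standard computation (not NeMo's own implementation):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)
```

So a WER of 0.093 on the UK-PODS test set means roughly one word error per eleven reference words.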

## References

- [1] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)
- [2] [Conformer: Convolution-augmented Transformer for Speech Recognition](https://arxiv.org/abs/2005.08100)
- [3] [UK-PODS](https://huggingface.co/datasets/taras-sereda/uk-pods)