NeMo
Ukrainian
taras-sereda committed
Commit d5ea348 · 1 Parent(s): 4b774d8

usage, performance

Files changed (1): README.md +65 -1
README.md CHANGED
language:
- uk
library_name: nemo
---

## Usage

The model is available for use in the NeMo toolkit [1] and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

To train, fine-tune, or play with the model you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend you install it after you have installed the latest PyTorch version.

```bash
pip install "nemo_toolkit[all]"
```

### Automatically instantiate the model

```python
from nemo.collections.asr.models import EncDecCTCModelBPE

asr_model = EncDecCTCModelBPE.from_pretrained("taras-sereda/uk-pods-conformer")
```

### Transcribing using Python

First, let's get a sample:

```bash
wget "https://huggingface.co/datasets/taras-sereda/uk-pods/resolve/main/example/e934c3e4-c37b-4607-98a8-22cdff933e4a_0266.wav?download=true" -O e934c3e4-c37b-4607-98a8-22cdff933e4a_0266.wav
```

Then simply do:

```python
asr_model.transcribe(['e934c3e4-c37b-4607-98a8-22cdff933e4a_0266.wav'])
```

### Input

This model accepts 16 kHz mono-channel audio (WAV files) as input.

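Before transcribing your own recordings, it can help to confirm they match this format. Below is a minimal standard-library sketch (not part of NeMo); the file name `sample.wav` is illustrative, and the synthetic tone exists only to make the example self-contained:

```python
import math
import struct
import wave

SAMPLE_RATE = 16_000

# Write a short 16 kHz mono test tone so the example is self-contained.
# In practice you would skip this step and check your own recording.
with wave.open("sample.wav", "wb") as w:
    w.setnchannels(1)           # mono
    w.setsampwidth(2)           # 16-bit PCM
    w.setframerate(SAMPLE_RATE)
    frames = (int(32767 * 0.1 * math.sin(2 * math.pi * 440 * t / SAMPLE_RATE))
              for t in range(SAMPLE_RATE))  # 1 second of a 440 Hz tone
    w.writeframes(b"".join(struct.pack("<h", f) for f in frames))

# Verify the format the model expects: 16 kHz, single channel.
with wave.open("sample.wav", "rb") as w:
    assert w.getframerate() == 16_000, "resample to 16 kHz first"
    assert w.getnchannels() == 1, "convert to mono first"
```

If either assertion fails, resample or downmix the audio before passing it to `transcribe`.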
### Output

This model provides transcribed speech as a string for a given audio sample.

## Model Architecture

Conformer-CTC is a non-autoregressive variant of the Conformer model [2] for Automatic Speech Recognition, which uses CTC loss/decoding instead of a Transducer. You may find more details on this model here: [Conformer-CTC Model](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#conformer-ctc).

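To make "CTC decoding" concrete, here is a toy sketch of greedy CTC decoding (not NeMo's implementation): take the argmax token at each frame, collapse consecutive repeats, and drop the blank symbol. The vocabulary and frame sequence below are invented for illustration.

```python
BLANK = 0  # conventional CTC blank id (illustrative)

def ctc_greedy_decode(frame_argmax: list[int]) -> list[int]:
    """Collapse repeated tokens, then remove the blank id."""
    decoded = []
    prev = None
    for tok in frame_argmax:
        if tok != prev and tok != BLANK:
            decoded.append(tok)
        prev = tok
    return decoded

# Frames "h h _ e _ l l _ l o" decode to "hello"; the blank between
# the two runs of "l" is what lets a doubled letter survive collapsing.
vocab = {1: "h", 2: "e", 3: "l", 4: "o"}
frames = [1, 1, 0, 2, 0, 3, 3, 0, 3, 4]
print("".join(vocab[i] for i in ctc_greedy_decode(frames)))  # hello
```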
### Datasets

This model has been trained on a combination of two datasets:

- UK-PODS [3] train set: 46 hours of conversational speech collected from Ukrainian podcasts.
- Validated Mozilla Common Voice Corpus 10.0 (excluding dev and test data): 50.1 hours of Ukrainian speech.

## Performance

Performance of the ASR model is reported in terms of Word Error Rate (WER) with greedy decoding.

| Tokenizer     | Vocabulary Size | UK-PODS test | MCV-10 test |
|:-------------:|:---------------:|:------------:|:-----------:|
| SentencePiece | 1024            | 0.093        | 0.116       |

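For reference, WER is the word-level edit distance (substitutions, insertions, deletions) between hypothesis and reference, divided by the number of reference words. A minimal sketch, not the scoring code used for the table above:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edits to turn the first i reference words into the first j
    # hypothesis words (classic dynamic-programming edit distance).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("a b c d", "a x c"))  # 1 substitution + 1 deletion over 4 words = 0.5
```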
## References

- [1] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)
- [2] [Conformer: Convolution-augmented Transformer for Speech Recognition](https://arxiv.org/abs/2005.08100)
- [3] [UK-PODS](https://huggingface.co/datasets/taras-sereda/uk-pods)