Commit d5ea348 (parent: 4b774d8): usage, performance

README.md
language:
- uk
library_name: nemo
---

## Usage

The model is available for use in the NeMo toolkit [1] and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

To train, fine-tune, or play with the model you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend installing it after you have installed the latest PyTorch version.

```bash
pip install nemo_toolkit['all']
```

### Automatically instantiate the model

```python
from nemo.collections.asr.models import EncDecCTCModelBPE

asr_model = EncDecCTCModelBPE.from_pretrained("taras-sereda/uk-pods-conformer")
```

### Transcribing using Python

First, let's get a sample:

```bash
wget "https://huggingface.co/datasets/taras-sereda/uk-pods/resolve/main/example/e934c3e4-c37b-4607-98a8-22cdff933e4a_0266.wav?download=true" -O e934c3e4-c37b-4607-98a8-22cdff933e4a_0266.wav
```

Then simply do:

```python
asr_model.transcribe(['e934c3e4-c37b-4607-98a8-22cdff933e4a_0266.wav'])
```

### Input

This model accepts 16 kHz mono-channel audio (WAV files) as input.
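A quick way to sanity-check a file against this format before transcription is Python's standard-library `wave` module. This is our own minimal sketch, not part of NeMo; the helper name and the 16-bit PCM assumption are ours:

```python
import wave

def is_nemo_ready(path: str) -> bool:
    """Check that a WAV file is 16 kHz mono (assuming 16-bit PCM samples)."""
    with wave.open(path, "rb") as w:
        return (
            w.getframerate() == 16000  # 16 kHz sample rate
            and w.getnchannels() == 1  # mono
            and w.getsampwidth() == 2  # 16-bit samples
        )
```

Files in other formats can be converted first, e.g. with `ffmpeg -i in.mp3 -ar 16000 -ac 1 out.wav`.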

### Output

This model provides transcribed speech as a string for a given audio sample.

## Model Architecture

Conformer-CTC is a non-autoregressive variant of the Conformer model [2] for automatic speech recognition which uses CTC loss/decoding instead of a Transducer. You can find more details on this model here: [Conformer-CTC Model](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#conformer-ctc).
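To illustrate what greedy CTC decoding means here: the decoder takes the most likely token per frame, collapses consecutive repeats, and drops blanks. A minimal sketch of the standard procedure (the token IDs and blank index are illustrative; NeMo's real decoder operates on the model's BPE vocabulary):

```python
def ctc_greedy_decode(frame_ids, blank_id=0):
    """Collapse repeated per-frame predictions, then remove CTC blanks."""
    out = []
    prev = None
    for t in frame_ids:
        if t != prev and t != blank_id:
            out.append(t)
        prev = t
    return out

# e.g. frame predictions [0, 5, 5, 0, 5, 7, 7] decode to tokens [5, 5, 7]:
# repeats within a run collapse, but tokens separated by a blank are kept.
```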

### Datasets

This model has been trained on a combination of two datasets:

- UK-PODS [3] train set: 46 hours of conversational speech collected from Ukrainian podcasts.
- Validated Mozilla Common Voice Corpus 10.0 (excluding dev and test data): 50.1 hours of Ukrainian speech.

## Performance

Performance of the ASR model is reported in terms of Word Error Rate (WER) with greedy decoding.

| Tokenizer     | Vocabulary Size | UK-PODS test | MCV-10 test |
|:-------------:|:---------------:|:------------:|:-----------:|
| SentencePiece | 1024            | 0.093        | 0.116       |
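For reference, WER is the word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words. A minimal sketch of the standard computation (not NeMo's own implementation):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)
```

So a WER of 0.093 on the UK-PODS test set means roughly one word error per eleven reference words.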

## References

- [1] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)
- [2] [Conformer: Convolution-augmented Transformer for Speech Recognition](https://arxiv.org/abs/2005.08100)
- [3] [UK-PODS](https://huggingface.co/datasets/taras-sereda/uk-pods)