---
license: cc-by-nc-4.0
datasets:
- taras-sereda/uk-pods
language:
- uk
library_name: nemo
---
|
|
|
## Usage
|
|
|
The model is available for use in the NeMo toolkit [1], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
|
|
|
To train, fine-tune or play with the model you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend you install it after you've installed the latest PyTorch version.
|
|
|
```
pip install nemo_toolkit['all']
```
|
|
|
### Automatically instantiate the model
|
|
|
```python
from nemo.collections.asr.models import EncDecCTCModelBPE

asr_model = EncDecCTCModelBPE.from_pretrained("taras-sereda/uk-pods-conformer")
```
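
Optionally, you can move the model to a GPU and switch it to evaluation mode before inference. A minimal sketch, assuming a CUDA-capable device may (or may not) be available:

```python
import torch

# NeMo ASR models are regular PyTorch modules, so standard device handling applies.
if torch.cuda.is_available():
    asr_model = asr_model.to("cuda")
asr_model.eval()  # disable dropout etc. for inference
```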
|
|
|
### Transcribing using Python
|
First, let's get a sample:
|
```
wget "https://huggingface.co/datasets/taras-sereda/uk-pods/resolve/main/example/e934c3e4-c37b-4607-98a8-22cdff933e4a_0266.wav?download=true" -O e934c3e4-c37b-4607-98a8-22cdff933e4a_0266.wav
```
|
Then simply do:
|
```python
asr_model.transcribe(['e934c3e4-c37b-4607-98a8-22cdff933e4a_0266.wav'])
```
|
|
|
### Input
|
|
|
This model accepts 16 kHz mono-channel audio (WAV files) as input.
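
If your recordings are not already 16 kHz mono, a minimal resampling sketch is shown below. It assumes `librosa` and `soundfile` are available (both are pulled in by NeMo's ASR dependencies); the file names are placeholders.

```python
import librosa
import soundfile as sf

# Load an arbitrary recording, resampling to 16 kHz and downmixing to mono.
audio, sr = librosa.load("my_recording.wav", sr=16000, mono=True)

# Write a WAV file in the format the model expects.
sf.write("my_recording_16k_mono.wav", audio, 16000)

asr_model.transcribe(["my_recording_16k_mono.wav"])
```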
|
|
|
### Output
|
|
|
This model provides transcribed speech as a string for a given audio sample.
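
For example, with the sample downloaded above (note that, depending on the NeMo version, `transcribe()` may return plain strings or hypothesis objects carrying the text):

```python
transcriptions = asr_model.transcribe(["e934c3e4-c37b-4607-98a8-22cdff933e4a_0266.wav"])
print(transcriptions[0])  # the transcription of the first (and only) input file
```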
|
|
|
## Model Architecture
|
|
|
Conformer-CTC is a non-autoregressive variant of the Conformer model [2] for Automatic Speech Recognition that uses CTC loss/decoding instead of a Transducer. You may find more details about this model here: [Conformer-CTC Model](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#conformer-ctc).
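
To inspect the exact encoder/decoder layout and hyperparameters stored in the checkpoint, a quick sketch using the model's config (an OmegaConf object attached to every NeMo model):

```python
from omegaconf import OmegaConf

# Conformer encoder hyperparameters (number of layers, attention heads, etc.)
print(OmegaConf.to_yaml(asr_model.cfg.encoder))

# Full module tree: Conformer encoder followed by the CTC decoder
print(asr_model)
```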
|
|
|
|
|
|
|
### Datasets
|
|
|
This model has been trained on a combination of two datasets:
|
|
|
- UK-PODS [3] train dataset: this dataset comprises 46 hours of conversational speech collected from Ukrainian podcasts.
|
- Validated Mozilla Common Voice Corpus 10.0 (excluding dev and test data): this dataset includes 50.1 hours of Ukrainian speech.
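
If you want to fine-tune this checkpoint on your own data (as mentioned in Usage), NeMo's ASR training scripts read JSON-lines manifests with `audio_filepath`, `duration` and `text` fields. A minimal sketch for writing such a manifest; the file paths, durations and transcripts below are placeholders:

```python
import json

examples = [
    {"audio_filepath": "clips/sample_0001.wav", "duration": 3.2, "text": "перший приклад транскрипції"},
    {"audio_filepath": "clips/sample_0002.wav", "duration": 5.7, "text": "другий приклад транскрипції"},
]

# One JSON object per line, as expected by NeMo's dataset readers.
with open("train_manifest.json", "w", encoding="utf-8") as f:
    for example in examples:
        f.write(json.dumps(example, ensure_ascii=False) + "\n")
```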
|
|
|
## Performance
|
|
|
Performance of the ASR model is reported in terms of Word Error Rate (WER) with greedy decoding.
|
|
|
| Tokenizer | Vocabulary Size | UK-PODS test | MCV-10 test |
|:-------------:|:--------------:|:----------:|:---------:|
| SentencePiece | 1024 | 0.093 | 0.116 |
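
A sketch of how such a WER number can be reproduced with greedy decoding, assuming a test manifest in the format shown above (and that `transcribe()` returns plain strings on your NeMo version):

```python
import json

from nemo.collections.asr.metrics.wer import word_error_rate

# Collect audio paths and reference transcripts from a JSON-lines manifest.
audio_files, references = [], []
with open("test_manifest.json", encoding="utf-8") as f:
    for line in f:
        entry = json.loads(line)
        audio_files.append(entry["audio_filepath"])
        references.append(entry["text"])

# Greedy CTC decoding is the default for transcribe().
hypotheses = asr_model.transcribe(audio_files)

print("WER:", word_error_rate(hypotheses=hypotheses, references=references))
```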
|
|
|
## References
|
|
|
- [1] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)
|
|
|
- [2] [Conformer: Convolution-augmented Transformer for Speech Recognition](https://arxiv.org/abs/2005.08100)
|
|
|
- [3] [UK-PODS](https://huggingface.co/datasets/taras-sereda/uk-pods)
|
|