diarray committed (verified)
Commit 23c848f · 1 parent: ff03d79

Push model using huggingface_hub.

Files changed (3)
  1. .gitattributes +1 -0
  2. README.md +121 -0
  3. stt-bm-quartznet15x5.nemo +3 -0
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+stt-bm-quartznet15x5.nemo filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,121 @@
---
language:
- bm
library_name: nemo
datasets:
- RobotsMali/bam-asr-all
thumbnail: null
tags:
- automatic-speech-recognition
- speech
- audio
- CTC
- QuartzNet
- pytorch
- Bambara
- NeMo
license: cc-by-4.0
base_model: stt_fr_quartznet15x5
model-index:
- name: stt-bm-quartznet15x5
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: bam-asr-all
      type: RobotsMali/bam-asr-all
      split: test
      args:
        language: bm
    metrics:
    - name: Test WER
      type: wer
      value: 46.5
metrics:
- wer
pipeline_tag: automatic-speech-recognition
---

# QuartzNet 15x5 CTC Bambara

<style>
img {
  display: inline;
}
</style>

[![Model architecture](https://img.shields.io/badge/Model_Arch-QuartzNet-lightgrey#model-badge)](#model-architecture)
| [![Model size](https://img.shields.io/badge/Params-19M-lightgrey#model-badge)](#model-architecture)
| [![Language](https://img.shields.io/badge/Language-bm-lightgrey#model-badge)](#datasets)

`stt-bm-quartznet15x5` is a fine-tuned version of NVIDIA's [`stt_fr_quartznet15x5`](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_fr_quartznet15x5) optimized for **Bambara ASR**. The model does not produce **punctuation or capitalization**: it uses a character-level encoding scheme and transcribes text in the standard character set found in the training split of the bam-asr-all dataset.

The model was fine-tuned with **NVIDIA NeMo** and trained with **CTC (Connectionist Temporal Classification) loss**.

## NVIDIA NeMo: Training

To fine-tune or use the model, install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend installing it after setting up the latest PyTorch version.

```bash
pip install "nemo_toolkit[asr]"
```

## How to Use This Model

### Load Model with NeMo

```python
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="RobotsMali/stt-bm-quartznet15x5")
```

### Transcribe Audio

```python
# Assuming you have a test audio file named sample_audio.wav
asr_model.transcribe(['sample_audio.wav'])
```

### Input

This model accepts **16 kHz mono-channel audio (WAV files)** as input.
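
Files that are not already 16 kHz mono need conversion before transcription. As a hypothetical, standard-library-only sketch (not part of this model card), the snippet below writes a short synthetic 16 kHz mono WAV and checks that it matches the expected format; the names `write_tone` and `is_model_ready` are illustrative:

```python
import math
import struct
import wave

def write_tone(path: str, rate: int = 16000, seconds: float = 0.1) -> None:
    """Write a 440 Hz sine tone as 16-bit mono PCM (a stand-in for real audio)."""
    n = int(rate * seconds)
    frames = b"".join(
        struct.pack("<h", int(32767 * 0.3 * math.sin(2 * math.pi * 440 * i / rate)))
        for i in range(n)
    )
    with wave.open(path, "wb") as f:
        f.setnchannels(1)     # mono
        f.setsampwidth(2)     # 16-bit samples
        f.setframerate(rate)  # 16 kHz
        f.writeframes(frames)

def is_model_ready(path: str) -> bool:
    """True if the WAV file is 16 kHz mono, as the model expects."""
    with wave.open(path, "rb") as f:
        return f.getnchannels() == 1 and f.getframerate() == 16000

write_tone("sample_audio.wav")
print(is_model_ready("sample_audio.wav"))  # True
```

For real recordings, a tool such as ffmpeg or a resampling library can produce the same 16 kHz mono layout.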

### Output

This model provides transcribed speech as a string for a given speech sample.

## Model Architecture

QuartzNet is a convolutional architecture built from **1D time-channel separable convolutions**, optimized for speech recognition. More information on QuartzNet can be found here: [QuartzNet Model](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/models.html#quartznet).

## Training

The NeMo toolkit was used to fine-tune this model for **25,939 steps** from the `stt_fr_quartznet15x5` checkpoint, using this [base config](https://github.com/diarray-hub/bambara-asr/blob/main/configs/quartznet-20m-config-v2.yaml). The full training configurations, scripts, and experiment logs are available here:

🔗 [Bambara-ASR Experiments](https://github.com/diarray-hub/bambara-asr)

## Dataset

This model was fine-tuned on the [bam-asr-all](https://huggingface.co/datasets/RobotsMali/bam-asr-all) dataset, which consists of **37 hours of transcribed Bambara speech**. The dataset is primarily derived from the **Jeli-ASR dataset** (~87%).

## Performance

The performance of automatic speech recognition models is measured using **Word Error Rate (WER %)**.

| **Version** | **Tokenizer** | **Vocabulary Size** | **bam-asr-all (test set)** |
|-------------|----------------|---------------------|----------------------------|
| V2          | Character-wise | 45                  | 46.5                       |

These are **greedy WER numbers without an external LM**.
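
For reference, WER is the minimum number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. A minimal, self-contained sketch (illustrative only; real evaluations typically use a library such as `jiwer`):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate = (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = word-level edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# One deleted word out of four reference words -> 25.0 % WER
print(round(100 * wer("i ka kɛnɛ wa", "i ka kɛnɛ"), 1))  # 25.0
```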

## License

This model is released under the **CC-BY-4.0** license. By using this model, you agree to the terms of the license.

---

More details are available in the **Experimental Technical Report**:
📄 [Draft Technical Report - Weights & Biases](https://wandb.ai/yacoudiarra-wl/bam-asr-nemo-training/reports/Draft-Technical-Report-V1--VmlldzoxMTIyOTMzOA)

Feel free to open a discussion on Hugging Face or [file an issue](https://github.com/diarray-hub/bambara-asr/issues) on GitHub if you have questions or contributions.

---
stt-bm-quartznet15x5.nemo ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:6ed228132a92c5bf804011a9a443b23fcc8f751b859859ace501063b2aad8737
size 76400640