Update README.md
Browse files
README.md
CHANGED
@@ -4,20 +4,28 @@ language:
|
|
4 |
- en
|
5 |
- zh
|
6 |
- ta
|
|
|
|
|
|
|
7 |
---
|
8 |
|
9 |
# Malaysian Finetune Whisper Small
|
10 |
|
11 |
Finetune Whisper Small on [Malaysian STT Whisper](https://huggingface.co/datasets/mesolitica/malaysian-stt-whisper)
|
12 |
|
13 |
-
WanDB at https://wandb.ai/huseinzol05/malaysian-whisper-small-v2, **still on training**
|
14 |
-
|
15 |
## Improvement
|
16 |
|
17 |
1. Distilled from Whisper Large V3 on Malaysian and Science context.
|
18 |
2. Better translation for Malay, Manglish, Mandarin, Tamil and Science context.
|
19 |
3. Word level timestamp, introduced `<|transcribeprecise|>` token, **a new task!**
|
20 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
21 |
## how to
|
22 |
|
23 |
Load the model,
|
@@ -99,4 +107,4 @@ tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(r.sequences[0
|
|
99 |
<|startoftranscript|><|ms|><|transcribeprecise|><|0.02|> Assembly<|1.20|><|1.56|> on<|1.64|><|1.74|> Aging<|2.04|><|2.14|> di<|2.22|><|2.26|> Vienna<|2.50|><|2.72|> Australia<|3.12|><|4.26|> yang<|4.38|><|4.42|> telah<|4.58|><|4.62|> diadakan<|5.08|><|5.16|> pada<|5.30|><|5.36|> tahun<|5.60|><|5.62|> 1982<|6.92|><|7.12|> dan<|7.24|><|7.32|> berasaskan<|7.88|><|7.98|> unjuran<|8.36|><|8.42|> tersebut<|8.80|><|8.88|> maka<|9.06|><|9.12|> Jabatan<|9.48|><|9.56|> Perangkaan<|9.98|><|10.04|> Malaysia<|10.36|><|10.84|> menganggarkan<|11.56|><|11.98|> menjelang<|12.34|><|12.40|> tahun<|12.64|><|12.66|> 2035<|14.08|><|14.50|> sejumlah<|14.96|><|14.98|> 15%<|16.14|><|16.26|> penduduk<|16.62|><|16.68|> kita<|16.90|><|17.02|> adalah<|17.30|><|17.40|> daripada<|17.80|><|17.86|> kalangan<|18.16|><|18.22|> warga<|18.40|><|18.46|> emas.<|18.68|><|19.24|> Untuk<|19.40|><|19.46|> makluman<|19.86|><|20.64|> Tuan<|20.76|><|20.82|> Yang<|20.90|><|20.94|> Pertua<|21.14|><|21.20|> dan<|21.28|><|21.34|> juga<|21.50|><|21.58|> Alia<|21.70|><|21.76|> Mbah<|21.88|><|21.92|> Ahmad,<|22.08|><|22.22|> pembangunan<|22.66|><|22.72|> sistem<|23.00|><|23.06|> pendaftaran<|23.48|><|23.54|> warga<|23.72|><|23.78|> emas<|23.98|><|24.06|> ataupun<|24.36|><|24.42|> kita<|24.56|><|24.62|> sebutkan<|24.94|><|25.08|> event<|25.38|><|25.86|> adalah<|26.10|><|26.18|> usaha<|26.46|><|26.60|> kerajaan<|27.06|><|27.16|> kearah<|27.44|><|27.50|> merealisasikan<|28.36|><|28.86|> objektif<|29.36|><|29.42|> yang<|29.52|><|29.56|> telah<|29.72|><|29.76|> digarakan<|30.00|><|endoftext|>
|
100 |
```
|
101 |
|
102 |
-
**Make sure you already monkey patched `tokenization_whisper.TASK_IDS = ["translate", "transcribe", "transcribeprecise"]` at starting of your script**.
|
|
|
4 |
- en
|
5 |
- zh
|
6 |
- ta
|
7 |
+
datasets:
|
8 |
+
- mesolitica/Malaysian-STT-Whisper
|
9 |
+
- malaysia-ai/STT-Whisper
|
10 |
---
|
11 |
|
12 |
# Malaysian Finetune Whisper Small
|
13 |
|
14 |
Finetune Whisper Small on [Malaysian STT Whisper](https://huggingface.co/datasets/mesolitica/malaysian-stt-whisper)
|
15 |
|
|
|
|
|
16 |
## Improvement
|
17 |
|
18 |
1. Distilled from Whisper Large V3 on Malaysian and Science context.
|
19 |
2. Better translation for Malay, Manglish, Mandarin, Tamil and Science context.
|
20 |
3. Word level timestamp, introduced `<|transcribeprecise|>` token, **a new task!**
|
21 |
|
22 |
+
## how we finetuned it?
|
23 |
+
|
24 |
+
We done 2 phases,
|
25 |
+
|
26 |
+
1. Finetune on [mesolitica/Malaysian-STT-Whisper](https://huggingface.co/datasets/mesolitica/Malaysian-STT-Whisper), WanDB at https://wandb.ai/huseinzol05/malaysian-whisper-small-v2
|
27 |
+
2. Annealing on 5% from [mesolitica/Malaysian-STT-Whisper](https://huggingface.co/datasets/mesolitica/Malaysian-STT-Whisper) and 100% from [malaysia-ai/STT-Whisper](https://huggingface.co/datasets/malaysia-ai/STT-Whisper), **still on training**
|
28 |
+
|
29 |
## how to
|
30 |
|
31 |
Load the model,
|
|
|
107 |
<|startoftranscript|><|ms|><|transcribeprecise|><|0.02|> Assembly<|1.20|><|1.56|> on<|1.64|><|1.74|> Aging<|2.04|><|2.14|> di<|2.22|><|2.26|> Vienna<|2.50|><|2.72|> Australia<|3.12|><|4.26|> yang<|4.38|><|4.42|> telah<|4.58|><|4.62|> diadakan<|5.08|><|5.16|> pada<|5.30|><|5.36|> tahun<|5.60|><|5.62|> 1982<|6.92|><|7.12|> dan<|7.24|><|7.32|> berasaskan<|7.88|><|7.98|> unjuran<|8.36|><|8.42|> tersebut<|8.80|><|8.88|> maka<|9.06|><|9.12|> Jabatan<|9.48|><|9.56|> Perangkaan<|9.98|><|10.04|> Malaysia<|10.36|><|10.84|> menganggarkan<|11.56|><|11.98|> menjelang<|12.34|><|12.40|> tahun<|12.64|><|12.66|> 2035<|14.08|><|14.50|> sejumlah<|14.96|><|14.98|> 15%<|16.14|><|16.26|> penduduk<|16.62|><|16.68|> kita<|16.90|><|17.02|> adalah<|17.30|><|17.40|> daripada<|17.80|><|17.86|> kalangan<|18.16|><|18.22|> warga<|18.40|><|18.46|> emas.<|18.68|><|19.24|> Untuk<|19.40|><|19.46|> makluman<|19.86|><|20.64|> Tuan<|20.76|><|20.82|> Yang<|20.90|><|20.94|> Pertua<|21.14|><|21.20|> dan<|21.28|><|21.34|> juga<|21.50|><|21.58|> Alia<|21.70|><|21.76|> Mbah<|21.88|><|21.92|> Ahmad,<|22.08|><|22.22|> pembangunan<|22.66|><|22.72|> sistem<|23.00|><|23.06|> pendaftaran<|23.48|><|23.54|> warga<|23.72|><|23.78|> emas<|23.98|><|24.06|> ataupun<|24.36|><|24.42|> kita<|24.56|><|24.62|> sebutkan<|24.94|><|25.08|> event<|25.38|><|25.86|> adalah<|26.10|><|26.18|> usaha<|26.46|><|26.60|> kerajaan<|27.06|><|27.16|> kearah<|27.44|><|27.50|> merealisasikan<|28.36|><|28.86|> objektif<|29.36|><|29.42|> yang<|29.52|><|29.56|> telah<|29.72|><|29.76|> digarakan<|30.00|><|endoftext|>
|
108 |
```
|
109 |
|
110 |
+
**Make sure you already monkey patched `tokenization_whisper.TASK_IDS = ["translate", "transcribe", "transcribeprecise"]` at starting of your script**.
|