Safetensors
whisper
huseinzol05 commited on
Commit
de84a67
·
verified ·
1 Parent(s): e9b987f

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +11 -3
README.md CHANGED
@@ -4,20 +4,28 @@ language:
4
  - en
5
  - zh
6
  - ta
 
 
 
7
  ---
8
 
9
  # Malaysian Finetune Whisper Small
10
 
11
  Finetune Whisper Small on [Malaysian STT Whisper](https://huggingface.co/datasets/mesolitica/malaysian-stt-whisper)
12
 
13
- WanDB at https://wandb.ai/huseinzol05/malaysian-whisper-small-v2, **still on training**
14
-
15
  ## Improvement
16
 
17
  1. Distilled from Whisper Large V3 on Malaysian and Science context.
18
  2. Better translation for Malay, Manglish, Mandarin, Tamil and Science context.
19
  3. Word level timestamp, introduced `<|transcribeprecise|>` token, **a new task!**
20
 
 
 
 
 
 
 
 
21
  ## how to
22
 
23
  Load the model,
@@ -99,4 +107,4 @@ tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(r.sequences[0
99
  <|startoftranscript|><|ms|><|transcribeprecise|><|0.02|> Assembly<|1.20|><|1.56|> on<|1.64|><|1.74|> Aging<|2.04|><|2.14|> di<|2.22|><|2.26|> Vienna<|2.50|><|2.72|> Australia<|3.12|><|4.26|> yang<|4.38|><|4.42|> telah<|4.58|><|4.62|> diadakan<|5.08|><|5.16|> pada<|5.30|><|5.36|> tahun<|5.60|><|5.62|> 1982<|6.92|><|7.12|> dan<|7.24|><|7.32|> berasaskan<|7.88|><|7.98|> unjuran<|8.36|><|8.42|> tersebut<|8.80|><|8.88|> maka<|9.06|><|9.12|> Jabatan<|9.48|><|9.56|> Perangkaan<|9.98|><|10.04|> Malaysia<|10.36|><|10.84|> menganggarkan<|11.56|><|11.98|> menjelang<|12.34|><|12.40|> tahun<|12.64|><|12.66|> 2035<|14.08|><|14.50|> sejumlah<|14.96|><|14.98|> 15%<|16.14|><|16.26|> penduduk<|16.62|><|16.68|> kita<|16.90|><|17.02|> adalah<|17.30|><|17.40|> daripada<|17.80|><|17.86|> kalangan<|18.16|><|18.22|> warga<|18.40|><|18.46|> emas.<|18.68|><|19.24|> Untuk<|19.40|><|19.46|> makluman<|19.86|><|20.64|> Tuan<|20.76|><|20.82|> Yang<|20.90|><|20.94|> Pertua<|21.14|><|21.20|> dan<|21.28|><|21.34|> juga<|21.50|><|21.58|> Alia<|21.70|><|21.76|> Mbah<|21.88|><|21.92|> Ahmad,<|22.08|><|22.22|> pembangunan<|22.66|><|22.72|> sistem<|23.00|><|23.06|> pendaftaran<|23.48|><|23.54|> warga<|23.72|><|23.78|> emas<|23.98|><|24.06|> ataupun<|24.36|><|24.42|> kita<|24.56|><|24.62|> sebutkan<|24.94|><|25.08|> event<|25.38|><|25.86|> adalah<|26.10|><|26.18|> usaha<|26.46|><|26.60|> kerajaan<|27.06|><|27.16|> kearah<|27.44|><|27.50|> merealisasikan<|28.36|><|28.86|> objektif<|29.36|><|29.42|> yang<|29.52|><|29.56|> telah<|29.72|><|29.76|> digarakan<|30.00|><|endoftext|>
100
  ```
101
 
102
- **Make sure you already monkey patched `tokenization_whisper.TASK_IDS = ["translate", "transcribe", "transcribeprecise"]` at starting of your script**.
 
4
  - en
5
  - zh
6
  - ta
7
+ datasets:
8
+ - mesolitica/Malaysian-STT-Whisper
9
+ - malaysia-ai/STT-Whisper
10
  ---
11
 
12
  # Malaysian Finetune Whisper Small
13
 
14
  Finetune Whisper Small on [Malaysian STT Whisper](https://huggingface.co/datasets/mesolitica/malaysian-stt-whisper)
15
 
 
 
16
  ## Improvement
17
 
18
  1. Distilled from Whisper Large V3 on Malaysian and Science context.
19
  2. Better translation for Malay, Manglish, Mandarin, Tamil and Science context.
20
  3. Word level timestamp, introduced `<|transcribeprecise|>` token, **a new task!**
21
 
22
+ ## how we finetuned it?
23
+
24
+ We done 2 phases,
25
+
26
+ 1. Finetune on [mesolitica/Malaysian-STT-Whisper](https://huggingface.co/datasets/mesolitica/Malaysian-STT-Whisper), WanDB at https://wandb.ai/huseinzol05/malaysian-whisper-small-v2
27
+ 2. Annealing on 5% from [mesolitica/Malaysian-STT-Whisper](https://huggingface.co/datasets/mesolitica/Malaysian-STT-Whisper) and 100% from [malaysia-ai/STT-Whisper](https://huggingface.co/datasets/malaysia-ai/STT-Whisper), **still on training**
28
+
29
  ## how to
30
 
31
  Load the model,
 
107
  <|startoftranscript|><|ms|><|transcribeprecise|><|0.02|> Assembly<|1.20|><|1.56|> on<|1.64|><|1.74|> Aging<|2.04|><|2.14|> di<|2.22|><|2.26|> Vienna<|2.50|><|2.72|> Australia<|3.12|><|4.26|> yang<|4.38|><|4.42|> telah<|4.58|><|4.62|> diadakan<|5.08|><|5.16|> pada<|5.30|><|5.36|> tahun<|5.60|><|5.62|> 1982<|6.92|><|7.12|> dan<|7.24|><|7.32|> berasaskan<|7.88|><|7.98|> unjuran<|8.36|><|8.42|> tersebut<|8.80|><|8.88|> maka<|9.06|><|9.12|> Jabatan<|9.48|><|9.56|> Perangkaan<|9.98|><|10.04|> Malaysia<|10.36|><|10.84|> menganggarkan<|11.56|><|11.98|> menjelang<|12.34|><|12.40|> tahun<|12.64|><|12.66|> 2035<|14.08|><|14.50|> sejumlah<|14.96|><|14.98|> 15%<|16.14|><|16.26|> penduduk<|16.62|><|16.68|> kita<|16.90|><|17.02|> adalah<|17.30|><|17.40|> daripada<|17.80|><|17.86|> kalangan<|18.16|><|18.22|> warga<|18.40|><|18.46|> emas.<|18.68|><|19.24|> Untuk<|19.40|><|19.46|> makluman<|19.86|><|20.64|> Tuan<|20.76|><|20.82|> Yang<|20.90|><|20.94|> Pertua<|21.14|><|21.20|> dan<|21.28|><|21.34|> juga<|21.50|><|21.58|> Alia<|21.70|><|21.76|> Mbah<|21.88|><|21.92|> Ahmad,<|22.08|><|22.22|> pembangunan<|22.66|><|22.72|> sistem<|23.00|><|23.06|> pendaftaran<|23.48|><|23.54|> warga<|23.72|><|23.78|> emas<|23.98|><|24.06|> ataupun<|24.36|><|24.42|> kita<|24.56|><|24.62|> sebutkan<|24.94|><|25.08|> event<|25.38|><|25.86|> adalah<|26.10|><|26.18|> usaha<|26.46|><|26.60|> kerajaan<|27.06|><|27.16|> kearah<|27.44|><|27.50|> merealisasikan<|28.36|><|28.86|> objektif<|29.36|><|29.42|> yang<|29.52|><|29.56|> telah<|29.72|><|29.76|> digarakan<|30.00|><|endoftext|>
108
  ```
109
 
110
+ **Make sure you already monkey patched `tokenization_whisper.TASK_IDS = ["translate", "transcribe", "transcribeprecise"]` at starting of your script**.