xyzDivergence committed · Commit 2c3c35a · verified · 1 Parent(s): 0876a12

Update README.md

Files changed (1)
  1. README.md +54 -53
README.md CHANGED
@@ -1,54 +1,55 @@
- ---
- license: apache-2.0
- language:
- - vi
- metrics:
- - wer
- - cer
- base_model:
- - openai/whisper-small
- pipeline_tag: automatic-speech-recognition
- tags:
- - vietnamese
- - lyrics
- - alt
- - song
- - pytorch
- - whisper
- - transformers
- ---
- # Vietnamese Automatic Lyrics Transcription
- This project aims to perform automatic lyrics transcription on Vietnamese songs, the pre-trained model used for this task is Whisper from [Robust Speech Recognition via Large-Scale Weak Supervision](https://arxiv.org/abs/2212.04356).
-
- ## Fine-Tuning
- The model is fine-tuned on 8,000 Vietnamese songs scraped from zingmp3.vn (Vietnamese version of Spotify). The average song duration is 4.7 minutes, with a word per minute of 90.7.
-
- 7,000 Songs are used as training and 1,000 songs are used as validation. The reported metrics below are for the 1,000 validation songs.
-
- ## Evaluation
- | **Model** | **WER (Lowercase)** | **WER (Case-Sensitive)** | **CER (Lowercase)** | **CER (Case-Sensitive)** |
- |----------------------|--------------------|--------------------------|--------------------|--------------------------|
- | whisper-small | | | | |
- | whisper-medium | 23.15 | 26.42 | 17.01 | 17.03 |
- | whisper-large-v2 | 20.52 | 24.61 | 16.09 | 17.14 |
-
- ## Lyrics Transcription
- To generate the transcription for a song, we can use the Transformers [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline). Chunking is enabled by setting `chunk_length_s=30` when instantiating the pipeline. With chunking enabled, the pipeline can be run with batched inference. It can also be extended to predict sequence level timestamps by passing `return_timestamps=True`: In the following example we are passing `return_timestamps="word"` that provides precise timestamps for when each individual word in the audio starts and ends.
- ```python
- >>> from transformers import pipeline
- >>> asr_pipeline = pipeline(
- >>> "automatic-speech-recognition",
- >>> model="xyzDivergence/whisper-small-vietnamese-lyrics-transcription", chunk_length_s=30, device='cuda',
- >>> tokenizer="xyzDivergence/whisper-small-vietnamese-lyrics-transcription"
- >>> )
- >>> transcription = asr_pipeline("sample_audio.mp3", return_timestamps="word")
- ```
-
- ## Training Data
- The training dataset consists of 7,000 Vietnamese songs, in total of roughly 550 hours of audio, across various Vietnamese music genres, dialects and accents. Due to IP concerns, the data is not publicly available. Each song includes lyrics along with corresponding line-level timestamps, enabling precise mapping of audio segments to their respective lyrics based on the provided timestamp information.
-
- Technical report coming soon.
- This project was made through equal contributions from:
- - [Kevin Soh](https://github.com/kelvinbksoh)
- - [Bernard Cheng Zheng Zhuan](https://github.com/bernardcheng)
+ ---
+ license: apache-2.0
+ language:
+ - vi
+ metrics:
+ - wer
+ - cer
+ base_model:
+ - openai/whisper-small
+ pipeline_tag: automatic-speech-recognition
+ tags:
+ - vietnamese
+ - lyrics
+ - alt
+ - song
+ - pytorch
+ - whisper
+ - transformers
+ ---
+ # Vietnamese Automatic Lyrics Transcription
+ This project aims to perform automatic lyrics transcription on Vietnamese songs, the pre-trained model used for this task is Whisper from [Robust Speech Recognition via Large-Scale Weak Supervision](https://arxiv.org/abs/2212.04356).
+
+ ## Fine-Tuning
+ The model is fine-tuned on 8,000 Vietnamese songs scraped from zingmp3.vn (Vietnamese version of Spotify). The average song duration is 4.7 minutes, with a word per minute of 90.7.
+
+ 7,000 Songs are used as training and 1,000 songs are used as validation. The reported metrics below are for the 1,000 validation songs.
+
+ ## Evaluation
+ | **Model** | **WER (%) Case-Sensitive** | **WER (%) Lowercase** | **CER (%) Case-Sensitive** | **CER (%) Lowercase** |
+ |----------------------|----------------------------|-----------------------|---------------------------|-----------------------|
+ | whisper-small | 34.91 | 30.73 | 24.82 | 23.65 |
+ | whisper-medium | 26.42 | 23.15 | 17.03 | 17.01 |
+ | whisper-large-v2 | 24.61 | 20.52 | 17.14 | 16.09 |
+
+
+ ## Lyrics Transcription
+ To generate the transcription for a song, we can use the Transformers [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline). Chunking is enabled by setting `chunk_length_s=30` when instantiating the pipeline. With chunking enabled, the pipeline can be run with batched inference. It can also be extended to predict sequence level timestamps by passing `return_timestamps=True`: In the following example we are passing `return_timestamps="word"` that provides precise timestamps for when each individual word in the audio starts and ends.
+ ```python
+ >>> from transformers import pipeline
+ >>> asr_pipeline = pipeline(
+ >>> "automatic-speech-recognition",
+ >>> model="xyzDivergence/whisper-small-vietnamese-lyrics-transcription", chunk_length_s=30, device='cuda',
+ >>> tokenizer="xyzDivergence/whisper-small-vietnamese-lyrics-transcription"
+ >>> )
+ >>> transcription = asr_pipeline("sample_audio.mp3", return_timestamps="word")
+ ```
+
+ ## Training Data
+ The training dataset consists of 7,000 Vietnamese songs, in total of roughly 550 hours of audio, across various Vietnamese music genres, dialects and accents. Due to IP concerns, the data is not publicly available. Each song includes lyrics along with corresponding line-level timestamps, enabling precise mapping of audio segments to their respective lyrics based on the provided timestamp information.
+
+ Technical report coming soon.
+ This project was made through equal contributions from:
+ + - [Kevin Soh](https://github.com/kelvinbksoh)
+ - [Bernard Cheng Zheng Zhuan](https://github.com/bernardcheng)
  - [Nguyen Quoc Anh](https://github.com/BatmanofZuhandArrgh)
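
Note on the Evaluation table changed in this commit: the README reports WER and CER in both case-sensitive and lowercase variants but does not show how they are computed. The following is a minimal sketch of one common way to do it, not the authors' evaluation script. The function names `edit_distance` and `error_rate` are hypothetical, and the NFC normalization, whitespace tokenization, and counting of spaces as characters in CER are assumptions made for illustration.

```python
import unicodedata
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution (or match when cost == 0)
            ))
        prev = curr
    return prev[-1]


def error_rate(reference: str, hypothesis: str,
               unit: str = "word", lowercase: bool = False) -> float:
    """Return WER (unit="word") or CER (unit="char") as a percentage.

    Assumptions (not taken from the README): text is NFC-normalized so that
    Vietnamese diacritics compare consistently, words are split on whitespace,
    and spaces count as characters for CER.
    """
    reference = unicodedata.normalize("NFC", reference)
    hypothesis = unicodedata.normalize("NFC", hypothesis)
    if lowercase:
        reference, hypothesis = reference.lower(), hypothesis.lower()
    if unit == "word":
        ref, hyp = reference.split(), hypothesis.split()
    elif unit == "char":
        ref, hyp = list(reference), list(hypothesis)
    else:
        raise ValueError("unit must be 'word' or 'char'")
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


# A hypothesis that differs from the reference only in capitalization.
reference = "Em ơi Hà Nội phố"
hypothesis = "em ơi hà nội phố"
print(error_rate(reference, hypothesis, unit="word"))                  # 60.0
print(error_rate(reference, hypothesis, unit="word", lowercase=True))  # 0.0
print(error_rate(reference, hypothesis, unit="char"))                  # 18.75
```

In this toy example, capitalization mismatches alone account for the whole error, which is consistent with the lowercase figures in the table being lower than or equal to the case-sensitive ones for each model.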