---
license: cc-by-4.0
---
# Whisper-Large-v2-hindi

This is a version of [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2), fine-tuned on the following datasets:

| Dataset | Hours (Hi) | License | Source |
|----------------------------------------|------------|-----------------------------------|------------------------------------------------------------------------|
| **Shrutilipi** | ~1,558 h | CC BY 4.0 | [ai4bharat/shrutilipi](https://huggingface.co/datasets/ai4bharat/Shrutilipi) |
| **IITM Madras SpringLab** | ~900 h | CC BY 4.0 | [SpringLab](https://asr.iitm.ac.in/dataset) |
| **Common Voice 11.0 (Mozilla)** | ~20 h | CC0 1.0 (public domain) | [mozilla/commonvoice](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) |
| **IndicSUPERB** | 150 h | Apache License 2.0 | [ai4bharat/indic-superb](https://github.com/AI4Bharat/IndicSUPERB) |
| **snow-mountain** | 67.6 h | CC BY-SA 4.0 | [bridgeconn/snow-mountain](https://huggingface.co/datasets/bridgeconn/snow-mountain/) |
| **yodas** | ~200 h | CC BY 3.0 | [espnet/yodas](https://huggingface.co/datasets/espnet/yodas) |
| **IndicVoices-R_Hindi** | 75 h | CC BY 4.0 | [SPRINGLab/IndicVoices-R_Hindi](https://huggingface.co/datasets/SPRINGLab/IndicVoices-R_Hindi) |
| **Lahaja** | 12.5 h | CC BY 4.0 | [ai4bharat/lahaja](https://ai4bharat.iitm.ac.in/datasets/lahaja) |
| **fleurs** | 30.0 h | CC BY 4.0 | [google/fleurs](https://huggingface.co/datasets/google/fleurs) |

The model was trained on around 3,000 hours of Hindi speech and is optimized for Hindi ASR, with a particular focus on high-accuracy transcription.

## How to use

The Whisper model is intrinsically designed to work on audio samples of up to 30 s in duration. However, with a chunking algorithm it can transcribe audio of arbitrary length. This is possible through the Transformers `pipeline` method: chunking is enabled by setting `chunk_length_s=30` when instantiating the pipeline. With chunking enabled, the pipeline can be run with batched inference, and it can also predict sequence-level timestamps by passing `return_timestamps=True`:

```python
>>> import torch
>>> from transformers import pipeline
>>> from datasets import load_dataset

>>> device = "cuda:0" if torch.cuda.is_available() else "cpu"

>>> asr_pipe = pipeline(
...     "automatic-speech-recognition",
...     model="collabora/whisper-large-v2-hindi",
...     chunk_length_s=30,
...     device=device,
... )

>>> ds = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="validation")
>>> sample = ds[0]["audio"]
>>> prediction = asr_pipe(sample.copy(), return_timestamps=True)
>>> prediction
{'text': ' हमने उस उम्मीदवार को चुना।', 'chunks': [{'timestamp': (0.0, 4.42), 'text': ' हमने उस उम्मीदवार को चुना।'}]}
```

## Intended Use
- The model is designed for high-quality transcription in Hindi.
- It is suitable for academic use in ASR-related tasks.

## Limitations
- May not perform well on noisy or low-quality audio.
- Focused primarily on Hindi.

### Model Performance
Whisper normalization is counter-productive for Hindi, since it strips the meaning out of a sentence. For example, consider the Hindi phrase:
```
'क्षेत्रफल बढ़ने से उत्पादन बढ़ा।'
```

After Whisper normalization:
```
'कषतरफल बढन स उतप दन बढ'
```

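Much of the damage comes from dropping Unicode combining marks, which in Devanagari are exactly the vowel signs (matras) and the virama. A rough illustration of the effect, using `unicodedata` categories (a simplified sketch, not Whisper's exact normalizer):

```python
import unicodedata

def strip_marks_and_punct(text: str) -> str:
    """Drop combining marks (category M*) and punctuation (P*), roughly
    mimicking how a category-based normalizer mangles Devanagari."""
    return "".join(
        c for c in text
        if not unicodedata.category(c).startswith(("M", "P"))
    )

print(strip_marks_and_punct("क्षेत्रफल"))  # matras and viramas are gone: कषतरफल
```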
So, we use [indic-normalization](https://github.com/anoopkunchukuttan/indic_nlp_library/blob/4cead0ae6c78fe9a19a51ef679f586206df9c476/indicnlp/normalize/indic_normalize.py#L325) for evaluation. Indic normalization produces the following output:
```
'क्षेत्रफल बढ़ने से उत्पादन बढ़ा।'
```

`openai-whisper/large-v2` baseline results on `google/fleurs -- hindi`:
```
Word Error Rate (WER) with whisper norm: 21.45 %
Word Error Rate (WER) with indic norm: 38.46 %
```

The model achieves the following benchmarks on the held-out test set `google/fleurs -- hindi`:
```
Word Error Rate (WER) with whisper norm: 5.33 %
Word Error Rate (WER) with indic norm: 13.06 %
```

Indic normalization retains diacritics and complex characters in Hindi text, which can increase the Word Error Rate (WER) compared to Whisper's default normalization, but produces more semantically accurate transcriptions.

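WER itself is just a word-level edit distance divided by the reference length, which is why a normalizer that deletes characters can lower the score while hurting meaning. A minimal pure-Python sketch (real evaluations typically use a library such as `jiwer` or `evaluate`):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits turning ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = ref[i - 1] != hyp[j - 1]
            d[i][j] = min(d[i - 1][j - 1] + cost,  # substitution / match
                          d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1)         # insertion
    return d[-1][-1] / len(ref)

# One substituted word out of five -> WER 0.2
print(wer("क्षेत्रफल बढ़ने से उत्पादन बढ़ा।",
          "क्षेत्रफल घटने से उत्पादन बढ़ा।"))
```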
### Acknowledgments

We thank the contributors and organizations behind the datasets:

- [AI4Bharat](https://ai4bharat.iitm.ac.in/datasets/shrutilipi) for the Shrutilipi dataset.
- [IIT Madras SpringLab](https://asr.iitm.ac.in/dataset) for their springx-hindi dataset.
- [IndicNLP Library](https://github.com/anoopkunchukuttan/indic_nlp_library) by Anoop Kunchukuttan for providing the normalization tools that were crucial for evaluation.

### BibTeX entry and citation info

#### Model Citation
```bibtex
@misc{whisper-large-v2-hindi,
  title = {Whisper-Large-v2 Fine-Tuned on Hindi},
  author = {Collabora Ltd.},
  year = {2025},
  publisher = {Hugging Face},
  note = {Fine-tuned using Shrutilipi and IITM Madras SpringLab datasets},
  howpublished = {\url{https://huggingface.co/collabora/whisper-large-v2-hindi/}},
}
```

#### IndicNLP Library Citation
```bibtex
@misc{kunchukuttan2020indicnlp,
  author = {Anoop Kunchukuttan},
  title = {{The IndicNLP Library}},
  year = {2020},
  howpublished = {\url{https://github.com/anoopkunchukuttan/indic_nlp_library/blob/master/docs/indicnlp.pdf}},
}
```

#### AI4Bharat - Shrutilipi dataset
```bibtex
@misc{https://doi.org/10.48550/arxiv.2208.12666,
  doi = {10.48550/ARXIV.2208.12666},
  url = {https://arxiv.org/abs/2208.12666},
  author = {Bhogale, Kaushal Santosh and Raman, Abhigyan and Javed, Tahir and Doddapaneni, Sumanth and Kunchukuttan, Anoop and Kumar, Pratyush and Khapra, Mitesh M.},
  title = {Effectiveness of Mining Audio and Text Pairs from Public Data for Improving ASR Systems for Low-Resource Languages},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
```