---
license: cc-by-4.0
---

# Whisper-Large-v2-hindi

This is a version of [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) fine-tuned on the following datasets:

| Dataset | Hours (Hi) | License | Source |
|---------|------------|---------|--------|
| **Shrutilipi** | ~1,558 h | CC BY 4.0 | [ai4bharat/shrutilipi](https://huggingface.co/datasets/ai4bharat/Shrutilipi) |
| **IIT Madras SpringLab** | ~900 h | CC BY 4.0 | [SpringLab](https://asr.iitm.ac.in/dataset) |
| **Common Voice 11.0 (Mozilla)** | ~20 h | CC0 1.0 (public domain) | [mozilla/commonvoice](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) |
| **IndicSUPERB** | 150 h | Apache License 2.0 | [ai4bharat/indic-superb](https://github.com/AI4Bharat/IndicSUPERB) |
| **snow-mountain** | 67.6 h | CC BY-SA 4.0 | [bridgeconn/snow-mountain](https://huggingface.co/datasets/bridgeconn/snow-mountain/) |
| **yodas** | ~200 h | CC BY 3.0 | [espnet/yodas](https://huggingface.co/datasets/espnet/yodas) |
| **IndicVoices-R_Hindi** | 75 h | CC BY 4.0 | [SPRINGLab/IndicVoices-R_Hindi](https://huggingface.co/datasets/SPRINGLab/IndicVoices-R_Hindi) |
| **Lahaja** | 12.5 h | CC BY 4.0 | [ai4bharat/lahaja](https://ai4bharat.iitm.ac.in/datasets/lahaja) |
| **fleurs** | 30.0 h | CC BY 4.0 | [google/fleurs](https://huggingface.co/datasets/google/fleurs) |
18 |
+
|
19 |
+
The model is trained on around 3000 hours of hindi speech & optimized for ASR tasks in hindi, with a particular focus on high-accuracy transcription.

## How to use

The Whisper model is intrinsically designed to work on audio samples of up to 30 s in duration. However, by using a chunking algorithm, it can transcribe audio samples of arbitrary length. This is possible through the Transformers `pipeline` method: chunking is enabled by setting `chunk_length_s=30` when instantiating the pipeline. With chunking enabled, the pipeline can be run with batched inference. It can also be extended to predict sequence-level timestamps by passing `return_timestamps=True`:

```python
>>> import torch
>>> from transformers import pipeline
>>> from datasets import load_dataset

>>> device = "cuda:0" if torch.cuda.is_available() else "cpu"

>>> asr_pipe = pipeline(
...     "automatic-speech-recognition",
...     model="collabora/whisper-large-v2-hindi",
...     chunk_length_s=30,
...     device=device,
... )

>>> ds = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="validation")
>>> sample = ds[0]["audio"]
>>> prediction = asr_pipe(sample.copy(), return_timestamps=True)
{'text': ' हमने उस उम्मीदवार को चुना।', 'chunks': [{'timestamp': (0.0, 4.42), 'text': ' हमने उस उम्मीदवार को चुना।'}]}
```
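
Since long inputs are split into 30 s chunks, the chunks can also be batched for faster inference. A minimal sketch reusing `asr_pipe` and `sample` from above (the batch size of 8 is an assumption; tune it to your GPU memory):

```python
# batch_size=8 is an assumed value -- raise or lower it to fit your GPU.
prediction = asr_pipe(sample.copy(), batch_size=8, return_timestamps=True)
```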

## Intended Use

- The model is designed for high-quality transcription in Hindi.
- It is suitable for academic use in ASR-related tasks.

## Limitations

- It may not perform well on noisy or low-quality audio.
- It is focused primarily on Hindi.

### Model Performance

Whisper's default normalization is counter-productive for Hindi, since it strips the meaning out of a sentence. For example, consider the Hindi phrase:

```
'क्षेत्रफल बढ़ने से उत्पादन बढ़ा।'
```

After Whisper normalization:

```
'कषतरफल बढन स उतप दन बढ'
```

So we use [indic-normalization](https://github.com/anoopkunchukuttan/indic_nlp_library/blob/4cead0ae6c78fe9a19a51ef679f586206df9c476/indicnlp/normalize/indic_normalize.py#L325) for evaluation instead. Indic normalization leaves the sentence intact:

```
'क्षेत्रफल बढ़ने से उत्पादन बढ़ा।'
```
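
For illustration, this is a minimal sketch of how the evaluation-time normalization can be wired up with the IndicNLP library; the exact options used for this model card are an assumption:

```python
from indicnlp.normalize.indic_normalize import IndicNormalizerFactory

# Build a Hindi normalizer. remove_nuktas=False keeps nukta characters
# such as ड़/ढ़ intact (an assumed setting; the card does not pin options).
factory = IndicNormalizerFactory()
normalizer = factory.get_normalizer("hi", remove_nuktas=False)

print(normalizer.normalize("क्षेत्रफल बढ़ने से उत्पादन बढ़ा।"))
# -> क्षेत्रफल बढ़ने से उत्पादन बढ़ा। (diacritics and conjuncts preserved)
```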

`openai/whisper-large-v2` baseline results on `google/fleurs -- hindi`:

```
Word Error Rate (WER) with whisper norm: 21.45 %
Word Error Rate (WER) with indic norm:   38.46 %
```

The fine-tuned model achieves the following results on the held-out test set `google/fleurs -- hindi`:

```
Word Error Rate (WER) with whisper norm: 5.33 %
Word Error Rate (WER) with indic norm:   13.06 %
```
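
As a reference for reproducing such scores, here is a minimal sketch of computing WER after indic normalization. It reuses `asr_pipe` and `sample` from the "How to use" snippet, and `jiwer` is just one of several libraries that implement WER:

```python
import jiwer  # pip install jiwer
from indicnlp.normalize.indic_normalize import IndicNormalizerFactory

normalizer = IndicNormalizerFactory().get_normalizer("hi")

# Normalize both the ground-truth transcript and the model output
# before scoring, as described above.
reference = normalizer.normalize("हमने उस उम्मीदवार को चुना।")
hypothesis = normalizer.normalize(asr_pipe(sample.copy())["text"])

print(f"WER with indic norm: {100 * jiwer.wer(reference, hypothesis):.2f} %")
```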

Indic normalization retains diacritics and complex characters in Hindi text, which can increase the Word Error Rate (WER) compared to Whisper's default normalization, but it produces more semantically accurate transcriptions.

### Acknowledgments

We thank the contributors and organizations behind the datasets:

- [AI4Bharat](https://ai4bharat.iitm.ac.in/datasets/shrutilipi) for the Shrutilipi dataset.
- [IIT Madras SpringLab](https://asr.iitm.ac.in/dataset) for their springx-hindi dataset.
- [IndicNLP Library](https://github.com/anoopkunchukuttan/indic_nlp_library) by Anoop Kunchukuttan for providing the normalization tools that were crucial for evaluation.

### BibTeX entry and citation info

#### Model Citation

```bibtex
@misc{whisper-large-v2-hindi,
  title        = {Whisper-Large-v2 Fine-Tuned on Hindi},
  author       = {Collabora Ltd.},
  year         = {2025},
  publisher    = {Hugging Face},
  note         = {Fine-tuned using Shrutilipi and IIT Madras SpringLab datasets},
  howpublished = {\url{https://huggingface.co/collabora/whisper-large-v2-hindi/}},
}
```

#### IndicNLP Library Citation

```bibtex
@misc{kunchukuttan2020indicnlp,
  author       = {Anoop Kunchukuttan},
  title        = {{The IndicNLP Library}},
  year         = {2020},
  howpublished = {\url{https://github.com/anoopkunchukuttan/indic_nlp_library/blob/master/docs/indicnlp.pdf}}
}
```

#### AI4Bharat - Shrutilipi dataset

```bibtex
@misc{https://doi.org/10.48550/arxiv.2208.12666,
  doi       = {10.48550/ARXIV.2208.12666},
  url       = {https://arxiv.org/abs/2208.12666},
  author    = {Bhogale, Kaushal Santosh and Raman, Abhigyan and Javed, Tahir and Doddapaneni, Sumanth and Kunchukuttan, Anoop and Kumar, Pratyush and Khapra, Mitesh M.},
  title     = {Effectiveness of Mining Audio and Text Pairs from Public Data for Improving ASR Systems for Low-Resource Languages},
  publisher = {arXiv},
  year      = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
```