---
license: cc-by-4.0
---
# Whisper-Large-v2-hindi

This is a version of [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2), fine-tuned on the following datasets:

| Dataset | Hours (Hi) | License | Source |
|----------------------------------------|------------|-----------------------------------|------------------------------------------------------------------------|
| **Shrutilipi** | ~1,558 h | CC BY 4.0 | [ai4bharat/shrutilipi](https://huggingface.co/datasets/ai4bharat/Shrutilipi) |
| **IITM Madras SpringLab** | ~900 h | CC BY 4.0 | [SpringLab](https://asr.iitm.ac.in/dataset) |
| **Common Voice 11.0 (Mozilla)** | ~20 h | CC0 1.0 (public domain) | [mozilla/commonvoice](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) |
| **IndicSUPERB** | 150 h | Apache License 2.0 | [ai4bharat/indic-superb](https://github.com/AI4Bharat/IndicSUPERB) |
| **snow-mountain** | 67.6 h | CC BY-SA 4.0 | [bridgeconn/snow-mountain](https://huggingface.co/datasets/bridgeconn/snow-mountain/) |
| **yodas** | ~200 h | CC BY 3.0 | [espnet/yodas](https://huggingface.co/datasets/espnet/yodas) |
| **IndicVoices-R_Hindi** | 75 h | CC BY 4.0 | [SPRINGLab/IndicVoices-R_Hindi](https://huggingface.co/datasets/SPRINGLab/IndicVoices-R_Hindi) |
| **Lahaja** | 12.5 h | CC BY 4.0 | [ai4bharat/lahaja](https://ai4bharat.iitm.ac.in/datasets/lahaja) |
| **fleurs** | 30.0 h | CC BY 4.0 | [google/fleurs](https://huggingface.co/datasets/google/fleurs) |

The model was trained on around 3,000 hours of Hindi speech and is optimized for Hindi ASR, with a particular focus on high-accuracy transcription.

## How to use

The Whisper model is intrinsically designed to work on audio samples of up to 30 s in duration. However, with a chunking algorithm it can transcribe audio of arbitrary length. This is possible through the Transformers `pipeline` method: chunking is enabled by setting `chunk_length_s=30` when instantiating the pipeline. With chunking enabled, the pipeline can be run with batched inference, and it can also predict sequence-level timestamps by passing `return_timestamps=True`:

```python
>>> import torch
>>> from transformers import pipeline
>>> from datasets import load_dataset

>>> device = "cuda:0" if torch.cuda.is_available() else "cpu"

>>> asr_pipe = pipeline(
...     "automatic-speech-recognition",
...     model="collabora/whisper-large-v2-hindi",
...     chunk_length_s=30,
...     device=device,
... )

>>> ds = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="validation")
>>> sample = ds[0]["audio"]
>>> prediction = asr_pipe(sample.copy(), return_timestamps=True)
>>> prediction
{'text': ' हमने उस उम्मीदवार को चुना।', 'chunks': [{'timestamp': (0.0, 4.42), 'text': ' हमने उस उम्मीदवार को चुना।'}]}
```

## Intended Use
- The model is designed for high-quality transcription in Hindi.
- It is suitable for academic use in ASR-related tasks.

## Limitations
- May not perform well on noisy or low-quality audio.
- Focused primarily on Hindi.

### Model Performance
Whisper normalization is counter-productive for Hindi, since it strips the meaning out of a sentence. For example, consider the Hindi phrase:
```
'क्षेत्रफल बढ़ने से उत्पादन बढ़ा।'
```

After Whisper normalization:
```
'कषतरफल बढन स उतप दन बढ'
```

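Much of the damage comes from dropping Unicode combining marks, which in Devanagari are exactly the vowel signs (matras) and the virama. A rough illustration of the effect, using `unicodedata` categories (a simplified sketch, not Whisper's exact normalizer):

```python
import unicodedata

def strip_marks_and_punct(text: str) -> str:
    """Drop combining marks (category M*) and punctuation (P*), roughly
    mimicking how a category-based normalizer mangles Devanagari."""
    return "".join(
        c for c in text
        if not unicodedata.category(c).startswith(("M", "P"))
    )

print(strip_marks_and_punct("क्षेत्रफल"))  # matras and viramas are gone: कषतरफल
```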
So, we use [indic-normalization](https://github.com/anoopkunchukuttan/indic_nlp_library/blob/4cead0ae6c78fe9a19a51ef679f586206df9c476/indicnlp/normalize/indic_normalize.py#L325) for evaluation. Indic normalization produces the following output:
```
'क्षेत्रफल बढ़ने से उत्पादन बढ़ा।'
```

`openai-whisper/large-v2` baseline results on `google/fleurs -- hindi`:
```
Word Error Rate (WER) with whisper norm: 21.45 %
Word Error Rate (WER) with indic norm: 38.46 %
```

The model achieves the following benchmarks on the held-out test set `google/fleurs -- hindi`:
```
Word Error Rate (WER) with whisper norm: 5.33 %
Word Error Rate (WER) with indic norm: 13.06 %
```

Indic normalization retains diacritics and complex characters in Hindi text, which can increase the Word Error Rate (WER) compared to Whisper's default normalization, but produces more semantically accurate transcriptions.

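WER itself is just a word-level edit distance divided by the reference length, which is why a normalizer that deletes characters can lower the score while hurting meaning. A minimal pure-Python sketch (real evaluations typically use a library such as `jiwer` or `evaluate`):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits turning ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = ref[i - 1] != hyp[j - 1]
            d[i][j] = min(d[i - 1][j - 1] + cost,  # substitution / match
                          d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1)         # insertion
    return d[-1][-1] / len(ref)

# One substituted word out of five -> WER 0.2
print(wer("क्षेत्रफल बढ़ने से उत्पादन बढ़ा।",
          "क्षेत्रफल घटने से उत्पादन बढ़ा।"))
```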
### Acknowledgments

We thank the contributors and organizations behind the datasets:

- [AI4Bharat](https://ai4bharat.iitm.ac.in/datasets/shrutilipi) for the Shrutilipi dataset.
- [IIT Madras SpringLab](https://asr.iitm.ac.in/dataset) for their springx-hindi dataset.
- [IndicNLP Library](https://github.com/anoopkunchukuttan/indic_nlp_library) by Anoop Kunchukuttan for providing the normalization tools that were crucial for evaluation.

### BibTeX entry and citation info

#### Model Citation
```bibtex
@misc{whisper-large-v2-hindi,
  title = {Whisper-Large-v2 Fine-Tuned on Hindi},
  author = {Collabora Ltd.},
  year = {2025},
  publisher = {Hugging Face},
  note = {Fine-tuned using Shrutilipi and IITM Madras SpringLab datasets},
  howpublished = {\url{https://huggingface.co/collabora/whisper-large-v2-hindi/}},
}
```

#### IndicNLP Library Citation
```bibtex
@misc{kunchukuttan2020indicnlp,
  author = {Anoop Kunchukuttan},
  title = {{The IndicNLP Library}},
  year = {2020},
  howpublished = {\url{https://github.com/anoopkunchukuttan/indic_nlp_library/blob/master/docs/indicnlp.pdf}},
}
```

#### AI4Bharat - Shrutilipi dataset
```bibtex
@misc{https://doi.org/10.48550/arxiv.2208.12666,
  doi = {10.48550/ARXIV.2208.12666},
  url = {https://arxiv.org/abs/2208.12666},
  author = {Bhogale, Kaushal Santosh and Raman, Abhigyan and Javed, Tahir and Doddapaneni, Sumanth and Kunchukuttan, Anoop and Kumar, Pratyush and Khapra, Mitesh M.},
  title = {Effectiveness of Mining Audio and Text Pairs from Public Data for Improving ASR Systems for Low-Resource Languages},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
```