---
license: apache-2.0
datasets:
- mozilla-foundation/common_voice_17_0
- mozilla-foundation/common_voice_13_0
language:
- hi
metrics:
- wer
base_model:
- theainerd/Wav2Vec2-large-xlsr-hindi
pipeline_tag: automatic-speech-recognition
library_name: transformers
---
# Model Improvement
This model card highlights the improvement over the base model: a reduction in WER from 72% to 54%, reflecting the efficacy of fine-tuning on Hindi speech data.
# Wav2Vec2-Large-XLSR-Hindi-Finetuned - Yash_Ratnaker
This model is a fine-tuned version of [theainerd/Wav2Vec2-large-xlsr-hindi](https://huggingface.co/theainerd/Wav2Vec2-large-xlsr-hindi) on the Common Voice 13 and 17 datasets. It is specifically optimized for Hindi speech recognition, with a notable improvement in transcription accuracy, achieving a **Word Error Rate (WER) of 54%**, compared to the base model’s WER of 72% on the same dataset.
## Model description
This Wav2Vec2 model, originally developed by Facebook AI, utilizes self-supervised learning on large unlabeled speech datasets and is then fine-tuned on labeled data. This approach enables the model to learn intricate linguistic features and transcribe speech in Hindi with high accuracy. Fine-tuning on Common Voice Hindi data allows the model to better capture the language's nuances, improving transcription quality.
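Concretely, the model emits per-frame logits over a character vocabulary and is trained with a CTC objective; greedy decoding merges consecutive repeats and drops the blank token. A toy illustration of that decoding step (the token IDs and vocabulary here are made up purely for demonstration):

```python
import torch

# Toy per-frame argmax predictions from a CTC head; ID 0 stands in for the
# blank token (hypothetical IDs, for illustration only).
frame_ids = torch.tensor([0, 5, 5, 0, 12, 12, 12, 0, 0, 7])

# Greedy CTC decoding: merge consecutive repeats, then drop blanks.
merged = torch.unique_consecutive(frame_ids)
token_ids = [t.item() for t in merged if t.item() != 0]

print(token_ids)  # [5, 12, 7]
```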
## Intended uses & limitations
This model is ideal for automatic speech recognition (ASR) applications in Hindi, such as media transcription, accessibility services, and educational content transcription, where audio quality is controlled.
## Usage
The model can be used directly (without a language model) as follows:
```python
import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load the Hindi Common Voice dataset
test_dataset = load_dataset("common_voice", "hi", split="test[:2%]")

# Load the processor and model
processor = Wav2Vec2Processor.from_pretrained("yash072/wav2vec2-large-xlsr-YashHindi-4")
model = Wav2Vec2ForCTC.from_pretrained("yash072/wav2vec2-large-xlsr-YashHindi-4")

# Common Voice audio is 48 kHz; the model expects 16 kHz input
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Function to process the dataset
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

# Perform inference
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)
print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
```
## Evaluation
The model can be evaluated as follows on the Hindi test data of Common Voice.
```python
import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import re

# Load the dataset and metrics
test_dataset = load_dataset("common_voice", "hi", split="test")
wer = load_metric("wer")

# Initialize processor and model
processor = Wav2Vec2Processor.from_pretrained("yash072/wav2vec2-large-xlsr-YashHindi-4")
model = Wav2Vec2ForCTC.from_pretrained("yash072/wav2vec2-large-xlsr-YashHindi-4")
model.to("cuda")

resampler = torchaudio.transforms.Resample(48_000, 16_000)
chars_to_ignore_regex = '[\,\?\.\!\-\;\:\"\“]'

# Function to preprocess the data
def speech_file_to_array_fn(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Evaluation function
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)
print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
```
### Limitations:
- The model may face challenges with dialectal or regional variations within Hindi.
- Performance can degrade with noisy audio or overlapping speech.
- It is not intended for real-time transcription due to latency considerations.
## Training and evaluation data
The model was fine-tuned on the Hindi portions of the Common Voice 13 and 17 datasets, which contain speech samples from native Hindi speakers. This data captures a range of accents, pronunciations, and recording conditions, enhancing the model’s ability to generalize across different speech patterns. Evaluation was performed on the Common Voice Hindi test split, providing a consistent benchmark for ASR performance in Hindi.
## Training procedure
### Hyperparameters and setup:
The following hyperparameters were used during training (a `TrainingArguments` sketch follows the list):
- **Learning rate**: 1e-4
- **Batch size**: 16 (per device)
- **Gradient accumulation steps**: 2
- **Evaluation strategy**: steps
- **Max steps**: 2500
- **Mixed precision**: FP16
- **Save steps**: 500
- **Evaluation steps**: 500
- **Logging steps**: 500
- **Warmup steps**: 500
- **Save total limit**: 1
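The listing below sketches how these values map onto `transformers.TrainingArguments`; `output_dir` is an assumption, since the card does not report one.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./wav2vec2-large-xlsr-hindi-finetuned",  # assumed, not reported in the card
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,
    learning_rate=1e-4,
    max_steps=2500,
    fp16=True,
    evaluation_strategy="steps",
    eval_steps=500,
    save_steps=500,
    logging_steps=500,
    warmup_steps=500,
    save_total_limit=1,
)
```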
### Training output
- **Global step**: 2500
- **Training runtime**: Approximately 1 hour 21 minutes
- **Epochs**: 5-6
### Training results
| Step | Training Loss | Validation Loss | WER |
|------|---------------|-----------------|--------|
| 500 | 5.603000 | 0.987691 | 0.7556 |
| 1000 | 0.720300 | 0.667561 | 0.6196 |
| 1500 | 0.507000 | 0.592814 | 0.5844 |
| 2000 | 0.431100 | 0.549786 | 0.5439 |
| 2500 | 0.395600 | 0.537703 | 0.5428 |
### Framework versions
- Transformers: 4.42.4
- PyTorch: 2.3.1+cu121
- Datasets: 2.20.0
- Tokenizers: 0.19.1