---
license: apache-2.0
datasets:
- mozilla-foundation/common_voice_17_0
- mozilla-foundation/common_voice_13_0
language:
- hi
metrics:
- wer
base_model:
- theainerd/Wav2Vec2-large-xlsr-hindi
pipeline_tag: automatic-speech-recognition
library_name: transformers
---
# Model's Improvement

Fine-tuning reduced the word error rate (WER) from 72% for the base model to 54% for this model. This improvement reflects the efficacy of the fine-tuning process on Hindi speech data.

# Wav2Vec2-Large-XLSR-Hindi-Finetuned - Yash_Ratnaker

This model is a fine-tuned version of [theainerd/Wav2Vec2-large-xlsr-hindi](https://huggingface.co/theainerd/Wav2Vec2-large-xlsr-hindi) on the Common Voice 13 and 17 datasets. It is specifically optimized for Hindi speech recognition, with a notable improvement in transcription accuracy, achieving a **Word Error Rate (WER) of 54%**, compared to the base model’s WER of 72% on the same dataset.
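
As a quick check of what the quoted figures imply, the arithmetic below converts the two WER values into an absolute and a relative reduction. The variable names are illustrative; only the 72% and 54% figures come from this card.

```python
# WER values quoted in this model card, in percent.
base_wer = 72.0       # base model
finetuned_wer = 54.0  # this fine-tuned model

# Absolute reduction in percentage points.
absolute_reduction = base_wer - finetuned_wer
# Relative reduction, as a percentage of the base model's WER.
relative_reduction = 100 * absolute_reduction / base_wer

print(f"Absolute reduction: {absolute_reduction:.1f} points")
print(f"Relative reduction: {relative_reduction:.1f}%")
```

In other words, the fine-tuned model makes roughly a quarter fewer word errors than the base model on the same data.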

## Model description

Wav2Vec2, originally developed by Facebook AI, is pretrained with self-supervised learning on large unlabeled speech datasets and then fine-tuned on labeled data. This approach lets the model learn acoustic and linguistic features before it ever sees a transcript. Fine-tuning on Common Voice Hindi data allows the model to better capture the language's nuances, improving transcription quality over the base model.

## Intended uses & limitations

This model is suited to automatic speech recognition (ASR) applications in Hindi, such as media transcription, accessibility services, and educational content transcription, where audio quality is controlled.


## Usage

The model can be used directly (without a language model) as follows:

```python
import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load the Hindi Common Voice dataset
test_dataset = load_dataset("common_voice", "hi", split="test[:2%]")

# Load the processor and model
processor = Wav2Vec2Processor.from_pretrained("yash072/wav2vec2-large-xlsr-YashHindi-4")
model = Wav2Vec2ForCTC.from_pretrained("yash072/wav2vec2-large-xlsr-YashHindi-4")
# Common Voice clips are sampled at 48 kHz; the model expects 16 kHz input
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Function to process the dataset
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

# Perform inference
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)
print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
```
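
The `argmax` followed by `batch_decode` above amounts to greedy CTC decoding: take the most likely token per audio frame, merge consecutive repeats, and drop the blank token. The following is a minimal pure-Python sketch of that collapsing step, not the library's implementation. The function name `ctc_greedy_collapse` and the default `blank_id=0` are assumptions for illustration; the real blank ID depends on the model's vocabulary.

```python
def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Collapse per-frame argmax IDs: merge consecutive repeats, then drop blanks."""
    collapsed = []
    previous = None
    for token_id in frame_ids:
        # Keep a token only when it starts a new run and is not the blank.
        if token_id != previous and token_id != blank_id:
            collapsed.append(token_id)
        previous = token_id
    return collapsed


# A blank between two identical IDs keeps them as two separate tokens.
frames = [0, 3, 3, 0, 3, 5, 5, 0]
print(ctc_greedy_collapse(frames))  # [3, 3, 5]
```

Because no language model is involved, each frame is decoded independently, which is why this mode is described as usage "without a language model".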

## Evaluation
The model can be evaluated as follows on the Hindi test data of Common Voice.

```python
import re

import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load the dataset and metrics
test_dataset = load_dataset("common_voice", "hi", split="test")
wer = load_metric("wer")

# Initialize processor and model
processor = Wav2Vec2Processor.from_pretrained("yash072/wav2vec2-large-xlsr-YashHindi-4")
model = Wav2Vec2ForCTC.from_pretrained("yash072/wav2vec2-large-xlsr-YashHindi-4")
model.to("cuda")

resampler = torchaudio.transforms.Resample(48_000, 16_000)
# Raw string so the backslash escapes reach the regex engine unchanged
chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"\“]'

# Function to preprocess the data: strip punctuation, load and resample audio
def speech_file_to_array_fn(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Evaluation function: greedy decoding on the GPU, one batch at a time
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)
print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
```
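
To make the metric concrete, here is a minimal sketch of how WER can be computed for a single sentence pair: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the example sentences are illustrative and are not part of the evaluation script above, which relies on the `wer` metric from the library.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j]: edit distance between the reference words processed
    # so far and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1] / len(ref_words)


# One word missing from a five-word reference.
print(word_error_rate("मैं घर जा रहा हूँ", "मैं घर जा रहा"))  # 0.2
```

Note that a per-sentence average of this value is not necessarily the same as a corpus-level WER, which usually pools errors and reference word counts across the whole test set.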



### Limitations:
- The model may face challenges with dialectal or regional variations within Hindi.
- Performance can degrade with noisy audio or overlapping speech.
- It is not intended for real-time transcription due to latency considerations.

## Training and evaluation data

The model was fine-tuned on the Hindi portions of the Common Voice 13 and 17 datasets, which contain speech samples from native Hindi speakers. This data captures a range of accents, pronunciations, and recording conditions, enhancing the model’s ability to generalize across different speech patterns. Evaluation was performed on a carefully curated subset, ensuring a reliable benchmark for ASR performance in Hindi.

## Training procedure

### Hyperparameters and setup:

The following hyperparameters were used during training:
- **Learning rate**: 1e-4
- **Batch size**: 16 (per device)
- **Gradient accumulation steps**: 2
- **Evaluation strategy**: steps
- **Max steps**: 2500
- **Mixed precision**: FP16
- **Save steps**: 500
- **Evaluation steps**: 500
- **Logging steps**: 500
- **Warmup steps**: 500
- **Save total limit**: 1
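
For readers who want to reproduce a similar run, the settings above could be collected as keyword arguments along the following lines. This is a sketch under assumptions: the keyword names follow `transformers.TrainingArguments`, but the card does not show the actual training script, so the mapping is a best guess rather than the author's configuration.

```python
# Values copied from the hyperparameter list above. The keyword names are
# assumed to match transformers.TrainingArguments; the original training
# script is not shown in this card.
training_kwargs = {
    "learning_rate": 1e-4,
    "per_device_train_batch_size": 16,
    "gradient_accumulation_steps": 2,
    "eval_strategy": "steps",  # named `evaluation_strategy` in older releases
    "max_steps": 2500,
    "fp16": True,
    "save_steps": 500,
    "eval_steps": 500,
    "logging_steps": 500,
    "warmup_steps": 500,
    "save_total_limit": 1,
}

# Samples contributing to each optimizer update on a single device.
effective_batch_per_device = (
    training_kwargs["per_device_train_batch_size"]
    * training_kwargs["gradient_accumulation_steps"]
)
# Share of the run spent warming up the learning rate.
warmup_fraction = training_kwargs["warmup_steps"] / training_kwargs["max_steps"]

print(effective_batch_per_device)  # 32
print(warmup_fraction)             # 0.2
```

With gradient accumulation, each optimizer update therefore sees 32 samples per device, and the learning rate warms up over the first fifth of training.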

### Training output

- **Global step**: 2500
- **Training runtime**: Approximately 1 hour 21 minutes
- **Epochs**: 5-6
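
These figures can be cross-checked with some rough arithmetic. The sketch below assumes training ran on a single device, which the card does not state, so the per-epoch sample counts are estimates only.

```python
max_steps = 2500
runtime_seconds = (1 * 60 + 21) * 60  # approximately 1 hour 21 minutes
seconds_per_step = runtime_seconds / max_steps

# Assumption: a single device, so each optimizer step consumes
# 16 samples x 2 gradient accumulation steps.
samples_per_step = 16 * 2
samples_seen = max_steps * samples_per_step

print(f"About {seconds_per_step:.2f} s per optimizer step")
for epochs in (5, 6):
    print(f"{epochs} epochs -> about {samples_seen / epochs:,.0f} samples per epoch")
```

Under that assumption, the reported 5-6 epochs would correspond to a training set of very roughly 13,000 to 16,000 clips.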

### Training results

| Step | Training Loss | Validation Loss | WER    |
|------|---------------|-----------------|--------|
| 500  | 5.603000      | 0.987691       | 0.7556 |
| 1000 | 0.720300      | 0.667561       | 0.6196 |
| 1500 | 0.507000      | 0.592814       | 0.5844 |
| 2000 | 0.431100      | 0.549786       | 0.5439 |
| 2500 | 0.395600      | 0.537703       | 0.5428 |

### Framework versions
- **Transformers**: 4.42.4
- **PyTorch**: 2.3.1+cu121
- **Datasets**: 2.20.0
- **Tokenizers**: 0.19.1
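
To compare a local environment against these versions, a small check such as the following may help. It is a sketch that uses only the standard library; the distribution names (for example `torch` for PyTorch) and the helper name `compare_versions` are assumptions.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed in this card, keyed by assumed PyPI distribution name.
EXPECTED = {
    "transformers": "4.42.4",
    "torch": "2.3.1+cu121",
    "datasets": "2.20.0",
    "tokenizers": "0.19.1",
}


def compare_versions(expected=EXPECTED):
    """Return {package: (expected, installed or None, versions_match)}."""
    report = {}
    for package, wanted in expected.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[package] = (wanted, installed, installed == wanted)
    return report


for package, (wanted, installed, matches) in compare_versions().items():
    status = "ok" if matches else "differs"
    print(f"{package}: expected {wanted}, found {installed} ({status})")
```

A mismatch does not necessarily mean the model will fail to load; it only flags where the environment differs from the one used for training.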