File size: 6,211 Bytes
beca360
 
 
 
306715c
beca360
 
 
 
 
 
 
 
306715c
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
---
license: apache-2.0
datasets:
- mozilla-foundation/common_voice_17_0
- mozilla-foundation/common_voice_13_0
language:
- hi
metrics:
- wer
base_model:
- theainerd/Wav2Vec2-large-xlsr-hindi
pipeline_tag: automatic-speech-recognition
library_name: transformers
---
# Model's Improvment

This model card highlights the improvements from the base model, specifically a reduction in WER from 72% to 54%. This improvement reflects the efficacy of the fine-tuning process on Hindi speech data.

# Wav2Vec2-Large-XLSR-Hindi-Finetuned - Yash_Ratnaker

This model is a fine-tuned version of [theainerd/Wav2Vec2-large-xlsr-hindi](https://huggingface.co/theainerd/Wav2Vec2-large-xlsr-hindi) on the Common Voice 13 and 17 datasets. It is specifically optimized for Hindi speech recognition, with a notable improvement in transcription accuracy, achieving a **Word Error Rate (WER) of 54%**, compared to the base model’s WER of 72% on the same dataset.

## Model description

This Wav2Vec2 model, originally developed by Facebook AI, utilizes self-supervised learning on large unlabeled speech datasets and is then fine-tuned on labeled data. This approach enables the model to learn intricate linguistic features and transcribe speech in Hindi with high accuracy. Fine-tuning on Common Voice Hindi data allows the model to better capture the language's nuances, improving transcription quality.

## Intended uses & limitations

This model is ideal for automatic speech recognition (ASR) applications in Hindi, such as media transcription, accessibility services, and educational content transcription, where audio quality is controlled. 


## Usage

The model can be used directly (without a language model) as follows:

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load the Hindi Common Voice dataset
test_dataset = load_dataset("common_voice", "hi", split="test[:2%]")

# Load the processor and model
processor = Wav2Vec2Processor.from_pretrained("yash072/wav2vec2-large-xlsr-YashHindi-4")
model = Wav2Vec2ForCTC.from_pretrained("yash072/wav2vec2-large-xlsr-YashHindi-4")
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Function to process the dataset
def speech_file_to_array_fn(batch):
  speech_array, sampling_rate = torchaudio.load(batch["path"])
  batch["speech"] = resampler(speech_array).squeeze().numpy()
  return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

# Perform inference
with torch.no_grad():
  logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)
print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])

# Evaluation
The model can be evaluated as follows on the Hindi test data of Common Voice.

import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import re

# Load the dataset and metrics
test_dataset = load_dataset("common_voice", "hi", split="test")
wer = load_metric("wer")

# Initialize processor and model
processor = Wav2Vec2Processor.from_pretrained("yash072/wav2vec2-large-xlsr-YashHindi-4")
model = Wav2Vec2ForCTC.from_pretrained("yash072/wav2vec2-large-xlsr-YashHindi-4")
model.to("cuda")

resampler = torchaudio.transforms.Resample(48_000, 16_000)
chars_to_ignore_regex = '[\,\?\.\!\-\;\:\"\“]'

# Function to preprocess the data
def speech_file_to_array_fn(batch):
  batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
  speech_array, sampling_rate = torchaudio.load(batch["path"])
  batch["speech"] = resampler(speech_array).squeeze().numpy()
  return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Evaluation function
def evaluate(batch):
  inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
  with torch.no_grad():
      logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits
      pred_ids = torch.argmax(logits, dim=-1)
      batch["pred_strings"] = processor.batch_decode(pred_ids)
      return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)
print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))



### Limitations:
- The model may face challenges with dialectal or regional variations within Hindi.
- Performance can degrade with noisy audio or overlapping speech.
- It is not intended for real-time transcription due to latency considerations.

## Training and evaluation data

The model was fine-tuned on the Hindi portions of the Common Voice 13 and 17 datasets, which contain speech samples from native Hindi speakers. This data captures a range of accents, pronunciations, and recording conditions, enhancing the model’s ability to generalize across different speech patterns. Evaluation was performed on a carefully curated subset, ensuring a reliable benchmark for ASR performance in Hindi.

## Training procedure

### Hyperparameters and setup:

The following hyperparameters were used during training:
- **Learning rate**: 1e-4
- **Batch size**: 16 (per device)
- **Gradient accumulation steps**: 2
- **Evaluation strategy**: steps
- **Max steps**: 2500
- **Mixed precision**: FP16
- **Save steps**: 500
- **Evaluation steps**: 500
- **Logging steps**: 500
- **Warmup steps**: 500
- **Save total limit**: 1

### Training output

- **Global step**: 2500
- **Training runtime**: Approximately 1 hour 21 minutes
- **Epochs**: 5-6

### Training results

| Step | Training Loss | Validation Loss | WER    |
|------|---------------|-----------------|--------|
| 500  | 5.603000      | 0.987691       | 0.7556 |
| 1000 | 0.720300      | 0.667561       | 0.6196 |
| 1500 | 0.507000      | 0.592814       | 0.5844 |
| 2000 | 0.431100      | 0.549786       | 0.5439 |
| 2500 | 0.395600      | 0.537703       | 0.5428 |

### Framework versions
Transformers: 4.42.4
PyTorch: 2.3.1+cu121
Datasets: 2.20.0
Tokenizers: 0.19.1