|
---
language: en
datasets:
- patrickvonplaten/librispeech_asr_dummy
metrics:
- wer
tags:
- audio
- automatic-speech-recognition
- en
- speech
---
|
|
|
# Fine-tuned facebook/wav2vec2-base model for speech recognition in English
|
|
|
Fine-tuned [facebook/wav2vec2-base](https://huggingface.co/facebook/wav2vec2-base) on English using the train and validation splits of [zodata](https://www.kaggle.com/datasets/mohamedk0emad/zodata). |
|
The dataset contains 307,912 transcribed voice samples; 6,158 samples were used for training and 6,036 samples for testing.

The word error rate (WER) on the test set is:

Test WER: 0.340
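
For reference, WER is the number of word-level substitutions, deletions, and insertions needed to turn the prediction into the reference transcript, divided by the number of reference words. A quick illustration with [jiwer](https://github.com/jitsi/jiwer), the library used in the Evaluation section below (the example sentences are made up):

```python
from jiwer import wer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

# 2 substitutions ("jumps" -> "jumped", "the" -> "a"), 0 deletions,
# 0 insertions, 9 reference words -> WER = 2 / 9 ≈ 0.222
print(wer(reference, hypothesis))
```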
|
|
|
When using this model, make sure that your speech input is sampled at 16 kHz.
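
If your recordings use a different sampling rate, resample them before running the model. A minimal sketch using torchaudio; the filename is a placeholder:

```python
import torchaudio

# load the file at its native sampling rate ("audio.wav" is a placeholder)
waveform, sample_rate = torchaudio.load("audio.wav")

# resample to the 16 kHz rate the model expects
if sample_rate != 16_000:
    waveform = torchaudio.functional.resample(waveform, orig_freq=sample_rate, new_freq=16_000)
```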
|
|
|
This model was fine-tuned thanks to GPU credits provided by [Kaggle](https://www.kaggle.com/).
|
|
|
|
|
# Usage |
|
|
|
To transcribe audio files, the model can be used as a standalone acoustic model as follows:
|
|
|
```python
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from datasets import load_dataset
import torch

# load model and processor
processor = Wav2Vec2Processor.from_pretrained("souzan/zomodel")
model = Wav2Vec2ForCTC.from_pretrained("souzan/zomodel")

# load dummy dataset and read sound files
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

# preprocess the raw waveform (pass the 16 kHz sampling rate explicitly)
input_values = processor(ds[0]["audio"]["array"], sampling_rate=16_000, return_tensors="pt", padding="longest").input_values  # Batch size 1

# retrieve logits (no gradients needed for inference)
with torch.no_grad():
    logits = model(input_values).logits

# take argmax and decode (greedy CTC decoding)
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
```
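
To transcribe one of your own recordings instead of the dummy dataset, load the waveform yourself. A minimal sketch using the soundfile library, reusing `processor` and `model` from the snippet above; the filename is a placeholder and the file is assumed to be 16 kHz mono:

```python
import soundfile as sf
import torch

# read a local 16 kHz mono recording ("my_recording.wav" is a placeholder)
speech, sample_rate = sf.read("my_recording.wav")

input_values = processor(speech, sampling_rate=16_000, return_tensors="pt").input_values
with torch.no_grad():
    logits = model(input_values).logits
transcription = processor.batch_decode(torch.argmax(logits, dim=-1))[0]
print(transcription)
```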
|
## Evaluation |
|
|
|
This code snippet shows how to evaluate **souzan/zomodel** on LibriSpeech's "clean" test data.
|
|
|
```python
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
from jiwer import wer

librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")

model = Wav2Vec2ForCTC.from_pretrained("souzan/zomodel").to("cuda")
processor = Wav2Vec2Processor.from_pretrained("souzan/zomodel")

def map_to_pred(batch):
    # with batched=True, batch["audio"] is a list of decoded audio dicts
    audio_arrays = [audio["array"] for audio in batch["audio"]]
    input_values = processor(audio_arrays, sampling_rate=16_000, return_tensors="pt", padding="longest").input_values
    with torch.no_grad():
        logits = model(input_values.to("cuda")).logits

    predicted_ids = torch.argmax(logits, dim=-1)
    batch["transcription"] = processor.batch_decode(predicted_ids)
    return batch

result = librispeech_eval.map(map_to_pred, batched=True, batch_size=1, remove_columns=["audio"])

print("WER:", wer(result["text"], result["transcription"]))
```
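
Note that decoding here is greedy: `torch.argmax` picks the most likely token at each frame, and `batch_decode` collapses repeated tokens and removes CTC blanks. With `batch_size=1` the `padding="longest"` argument is a no-op; larger batch sizes would pad shorter clips up to the longest one in each batch.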
|
|
|
|