Fine-tuned Wav2Vec2 XLS-R 1B model for ASR in French
This model is a fine-tuned version of facebook/wav2vec2-xls-r-1b on the French (fr) subset of the mozilla-foundation/common_voice_9_0 dataset.
Usage
- To use on a local audio file without the language model
import torch
import torchaudio
from transformers import AutoModelForCTC, Wav2Vec2Processor
processor = Wav2Vec2Processor.from_pretrained("bhuang/wav2vec2-xls-r-1b-cv9-fr")
model = AutoModelForCTC.from_pretrained("bhuang/wav2vec2-xls-r-1b-cv9-fr").cuda()
# path to your audio file
wav_path = "example.wav"
waveform, sample_rate = torchaudio.load(wav_path)
waveform = waveform.mean(dim=0)  # downmix to mono (also drops the channel dimension)
# resample to the 16 kHz rate the model expects
if sample_rate != 16_000:
    resampler = torchaudio.transforms.Resample(sample_rate, 16_000)
    waveform = resampler(waveform)
# normalize
input_dict = processor(waveform, sampling_rate=16_000, return_tensors="pt")
with torch.inference_mode():
    logits = model(input_dict.input_values.to("cuda")).logits
# decode
predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentence = processor.batch_decode(predicted_ids)[0]
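The argmax-then-decode step above is greedy CTC decoding: take the most likely token for each frame, merge consecutive repeats, then drop the blank token. Here is a minimal pure-Python sketch of that collapse rule. The toy vocabulary, the frame IDs, and the choice of ID 0 as the blank are assumptions for illustration only; the real vocabulary and blank ID come from the processor's tokenizer.

```python
from itertools import groupby


def ctc_greedy_collapse(ids, blank_id=0):
    """Collapse frame-level token IDs the way greedy CTC decoding does."""
    # groupby merges runs of identical consecutive IDs; blanks are then dropped.
    return [token for token, _ in groupby(ids) if token != blank_id]


# Hypothetical toy vocabulary (not the model's real one).
vocab = {0: "<pad>", 1: "s", 2: "a", 3: "l", 4: "u", 5: "t"}

# One predicted ID per audio frame, as torch.argmax(logits, dim=-1) would give.
frame_ids = [0, 1, 1, 0, 2, 2, 2, 3, 0, 0, 4, 4, 5, 0]

collapsed = ctc_greedy_collapse(frame_ids)
text = "".join(vocab[i] for i in collapsed)
print(text)
```

A blank between two identical IDs keeps them as separate tokens, which is how CTC can emit doubled letters.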
- To use on a local audio file with the language model
import torch
import torchaudio
from transformers import AutoModelForCTC, Wav2Vec2ProcessorWithLM
processor_with_lm = Wav2Vec2ProcessorWithLM.from_pretrained("bhuang/wav2vec2-xls-r-1b-cv9-fr")
model = AutoModelForCTC.from_pretrained("bhuang/wav2vec2-xls-r-1b-cv9-fr").cuda()
model_sampling_rate = processor_with_lm.feature_extractor.sampling_rate
# path to your audio file
wav_path = "example.wav"
waveform, sample_rate = torchaudio.load(wav_path)
waveform = waveform.mean(dim=0)  # downmix to mono (also drops the channel dimension)
# resample to the rate expected by the feature extractor
if sample_rate != model_sampling_rate:
    resampler = torchaudio.transforms.Resample(sample_rate, model_sampling_rate)
    waveform = resampler(waveform)
# normalize
input_dict = processor_with_lm(waveform, sampling_rate=model_sampling_rate, return_tensors="pt")
with torch.inference_mode():
    logits = model(input_dict.input_values.to("cuda")).logits
predicted_sentence = processor_with_lm.batch_decode(logits.cpu().numpy()).text[0]
Evaluation
- To evaluate on mozilla-foundation/common_voice_9_0
python eval.py \
--model_id "bhuang/wav2vec2-xls-r-1b-cv9-fr" \
--dataset "mozilla-foundation/common_voice_9_0" \
--config "fr" \
--split "test" \
--log_outputs \
--outdir "outputs/results_mozilla-foundatio_common_voice_9_0_with_lm"
- To evaluate on speech-recognition-community-v2/dev_data
python eval.py \
--model_id "bhuang/wav2vec2-xls-r-1b-cv9-fr" \
--dataset "speech-recognition-community-v2/dev_data" \
--config "fr" \
--split "validation" \
--chunk_length_s 5.0 \
--stride_length_s 1.0 \
--log_outputs \
--outdir "outputs/results_speech-recognition-community-v2_dev_data_with_lm"
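The --chunk_length_s and --stride_length_s flags suggest that long recordings are transcribed in overlapping windows. The sketch below shows one common way such windows are laid out, assuming eval.py forwards these flags to a chunked inference pipeline (the script itself is not shown here, so this is an assumption). The function name chunk_spans is hypothetical. Each chunk is chunk_length_s long, consecutive chunks overlap by twice the stride, and the stride on each inner edge is discarded after inference so the kept regions tile the audio.

```python
def chunk_spans(num_samples, sampling_rate=16_000,
                chunk_length_s=5.0, stride_length_s=1.0):
    """Return (start, end, left_stride, right_stride) tuples, in samples."""
    chunk_len = int(round(chunk_length_s * sampling_rate))
    stride = int(round(stride_length_s * sampling_rate))
    step = chunk_len - 2 * stride  # how far each new chunk advances
    if step <= 0:
        raise ValueError("chunk length must exceed twice the stride")

    spans = []
    for start in range(0, num_samples, step):
        end = min(start + chunk_len, num_samples)
        is_last = end >= num_samples
        # No stride is dropped at the very beginning or very end of the audio.
        left = 0 if start == 0 else stride
        right = 0 if is_last else stride
        spans.append((start, end, left, right))
        if is_last:
            break
    return spans


# 12 seconds of 16 kHz audio, with the values used in the command above.
spans = chunk_spans(12 * 16_000)
# Region of each chunk whose predictions are kept after dropping the strides.
kept = [(start + left, end - right) for start, end, left, right in spans]
for span, region in zip(spans, kept):
    print(span, "->", region)
```

The overlap gives the model acoustic context on both sides of each kept region, which tends to reduce errors at chunk boundaries.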
Evaluation results
- Test WER on Common Voice 9 (self-reported): 12.72
- Test WER (+LM) on Common Voice 9 (self-reported): 10.60
- Test WER on Robust Speech Event - Dev Data (self-reported): 24.28
- Test WER (+LM) on Robust Speech Event - Dev Data (self-reported): 20.85
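For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between the reference and the predicted transcript, divided by the number of reference words. The sketch below is a minimal pure-Python version for illustration, not the scoring code used to produce the figures above. The sentences are made-up examples, and any text normalization the evaluation script applies is left out.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference prefix processed so far
    # and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one substitution (dort -> dors) and one deletion (le).
reference = "le chat dort sur le tapis"
hypothesis = "le chat dors sur tapis"
wer = word_error_rate(reference, hypothesis)
print(f"WER: {100 * wer:.2f}")  # 2 errors over 6 reference words
```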